LLMs & Vision LLMs for GeoAI — Part 1: Foundations
Course: GeoAI / Multimodal Geospatial Reasoning
Duration: ~45 minutes
Instructor notebook (complete reference version)
What you’ll learn
By the end of this notebook, you will be able to:
Connect to a locally-served LLM via Ollama’s OpenAI-compatible API.
Prompt an LLM for geographic reasoning (zero-shot, few-shot, chain-of-thought).
Reason about OpenStreetMap (OSM) tags using LLMs.
Extract structured geographic entities (POIs, coordinates, admin units) into JSON using Pydantic schemas.
Use a Vision LLM to interpret satellite imagery and map screenshots.
Why Ollama?
Ollama is a local LLM runtime that:
Exposes an OpenAI-compatible HTTP API at http://localhost:11434/v1, so the same openai Python SDK works.
Pulls and runs open-weight models (Llama 3.1, Qwen2.5-VL, Gemma, Mistral, …) with one command.
Handles GPU/CPU placement and quantization automatically.
Runs comfortably on a laptop — perfect for class.
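To make "OpenAI-compatible" concrete, here is a minimal sketch of the JSON request that the openai SDK assembles on your behalf. The helper name `build_chat_request` and the constant `DEFAULT_BASE_URL` are hypothetical, introduced only for illustration; nothing is sent over the network, so you can run it without a server.

```python
import json

DEFAULT_BASE_URL = "http://localhost:11434/v1"  # the local Ollama endpoint named above


def build_chat_request(model: str, prompt: str, base_url: str = DEFAULT_BASE_URL):
    """Return (url, body) for a single-turn chat completion request (hypothetical helper)."""
    # Chat completions live under <base_url>/chat/completions.
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, body


url, body = build_chat_request("llama3.1:8b", "Name the capital of Finland.")
print(url)
print(json.dumps(body, indent=2))
```

Because the URL and body follow the OpenAI layout, switching between a hosted endpoint and a local Ollama server should only require a different base URL and model name.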
Assumed setup
We assume the Ollama daemon is already running on your machine (or on a course server) at:
http://localhost:11434/v1
If you do not have Ollama installed yet:
# macOS / Linux — one-line install
curl -fsSL https://ollama.com/install.sh | sh
# Pull the models we will use in this lab (one-off, ~5–10 GB each)
ollama pull llama3.1:8b
ollama pull qwen2.5vl:7b
# Start the server (usually started automatically on macOS / Linux service)
ollama serve
💡 You do not need an OpenAI API key. Ollama accepts any non-empty string; by convention we use "ollama".
§1 — Setup (≈ 5 min)
We’ll use the standard openai SDK pointed at our Ollama endpoint, plus a few helpers for images and maps.
# Install dependencies (uncomment if needed)
# !pip install --quiet openai pydantic pillow requests numpy
!pip install json_repair
import json_repair
import os
import json
import base64
from io import BytesIO
from typing import List, Optional, Tuple
from pathlib import Path
import requests
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field
# ---- Configure endpoint ---------------------------------------------------
# Ollama exposes a single OpenAI-compatible endpoint that serves *all*
# pulled models — text and vision share the same base URL.
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://g3201.mahti.csc.fi:11434/v1")
TEXT_MODEL = os.environ.get("OLLAMA_TEXT_MODEL", "qwen3.5")
VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "qwen3.5")
# Ollama accepts any non-empty API key
client = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
text_client = client # alias for clarity downstream
vision_client = client
print(f"Endpoint : {OLLAMA_BASE_URL}")
print(f"Text model : {TEXT_MODEL}")
print(f"Vision model : {VISION_MODEL}")
Endpoint : http://g3201.mahti.csc.fi:11434/v1
Text model : qwen3.5
Vision model : qwen3.5
1.1 Reasoning-aware helpers
Some Ollama models are thinking models (Gemma 3 thinking variants, GPT-OSS, DeepSeek-R1, Qwen 3 / 3.5). They emit a chain-of-thought in a separate reasoning field on the message, before giving you the final answer in content. The raw response object looks like:
ChatCompletionMessage(
    content='The answer is 391.',
    reasoning="17 × 20 = 340, 17 × 3 = 51, 340 + 51 = 391",
    role='assistant', ...
)
Two practical consequences:
Reasoning eats output tokens. A max_tokens=512 call that finishes with finish_reason='length' and an empty content field usually means the model spent all its budget thinking. Bump max_tokens to 1500–4000 for thinking models.
Some buggy models put everything in reasoning and leave content empty (a known Ollama issue with certain Gemma 4 / Qwen 3.5 builds). Our helper falls back to the reasoning text in that case so you never see an empty-string answer.
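The two consequences above can be turned into a small triage function. This is a sketch only: `diagnose` and its label strings are hypothetical names, and it works on plain strings so it can be tried without a model.

```python
from typing import Optional


def diagnose(content: Optional[str], reasoning: Optional[str], finish_reason: str) -> str:
    """Classify a reply using the failure patterns described above (hypothetical helper)."""
    has_content = bool((content or "").strip())
    has_reasoning = bool((reasoning or "").strip())
    if has_content and finish_reason != "length":
        return "ok"
    if has_content:
        return "answer_truncated"           # some answer text, but cut off by max_tokens
    if has_reasoning and finish_reason == "length":
        return "budget_spent_on_reasoning"  # raise max_tokens or lower reasoning_effort
    if has_reasoning:
        return "reasoning_only"             # content left empty; fall back to reasoning
    return "empty"


print(diagnose("The answer is 391.", "17 × 20 = 340 ...", "stop"))
print(diagnose("", "Thinking Process: ...", "length"))
print(diagnose("", "Final: 391", "stop"))
```

In a real call you would feed it `message.content`, the reasoning field, and `choice.finish_reason` from the response object.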
You can control thinking with reasoning_effort:
| Value | Meaning |
|---|---|
| "none" | Disable thinking (where supported); fastest |
| "low" | Minimal thinking |
| "medium" | Default |
| "high" | Deep thinking; slowest, best for hard reasoning |
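A typo in reasoning_effort is easy to miss, so it can help to validate the value before it is sent. The sketch below mirrors how the helpers in this notebook merge reasoning_effort into extra_body; `build_extra_body` and `ALLOWED_EFFORTS` are hypothetical names, and the allowed values are the four accepted by the chat() helper ("none", "low", "medium", "high").

```python
from typing import Optional

ALLOWED_EFFORTS = ("none", "low", "medium", "high")


def build_extra_body(reasoning_effort: Optional[str] = None,
                     extra_body: Optional[dict] = None) -> dict:
    """Merge reasoning_effort into a copy of extra_body, rejecting unknown values."""
    merged = dict(extra_body or {})  # copy, so the caller's dict is never mutated
    if reasoning_effort is not None:
        if reasoning_effort not in ALLOWED_EFFORTS:
            raise ValueError(
                f"reasoning_effort must be one of {ALLOWED_EFFORTS}, got {reasoning_effort!r}"
            )
        merged["reasoning_effort"] = reasoning_effort
    return merged


print(build_extra_body("low"))
print(build_extra_body(None, {"format": "json"}))
```

Passing None leaves the field out entirely, which lets the server apply its own default.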
Below we define chat() and vision_chat() that return both fields and a print_response() pretty-printer.
def _extract_parts(message) -> Tuple[str, str]:
    """Return (content, reasoning) from a ChatCompletionMessage, handling all field aliases."""
    content = (getattr(message, "content", None) or "")
    # Ollama uses `reasoning`; other SDKs may use `reasoning_content` or `thinking`.
    reasoning = (getattr(message, "reasoning", None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking", None)
                 or "")
    return content.strip(), reasoning.strip()
def print_response(resp, show_reasoning: bool = True, max_reasoning_chars: int = 1200):
    """Pretty-print a ChatCompletion response, including reasoning if present."""
    choice = resp.choices[0]
    content, reasoning = _extract_parts(choice.message)
    finish = choice.finish_reason
    usage = resp.usage
    # Banner
    print(f"┌─ model={resp.model} finish_reason={finish}", end="")
    if usage is not None:
        print(f" tokens={usage.prompt_tokens}+{usage.completion_tokens}={usage.total_tokens}")
    else:
        print()
    if show_reasoning and reasoning:
        print("│")
        print("│ 🧠 REASONING")
        print("│ " + "─" * 50)
        snippet = reasoning if len(reasoning) <= max_reasoning_chars \
            else reasoning[:max_reasoning_chars] + f"\n… [truncated, +{len(reasoning) - max_reasoning_chars} chars]"
        for line in snippet.splitlines():
            print(f"│ {line}")
    print("│")
    print("│ 💬 ANSWER")
    print("│ " + "─" * 50)
    if content:
        for line in content.splitlines():
            print(f"│ {line}")
    elif reasoning:
        # Some thinking models leave content empty, so point the reader at the reasoning.
        print("│ (content was empty — model emitted only reasoning; showing it as the answer)")
    else:
        print("│ (empty response)")
    if finish == "length":
        print("│")
        print("│ ⚠️ Output was truncated by max_tokens. Increase it (reasoning consumes tokens).")
    print("└" + "─" * 60)
def chat(prompt: str,
         system: str = "You are a helpful geographic assistant.",
         model: Optional[str] = None,
         temperature: float = 0.2,
         max_tokens: int = 2048,  # bumped from 512 because reasoning eats tokens
         reasoning_effort: Optional[str] = "none",  # None | "none" | "low" | "medium" | "high"
         return_full: bool = False,
         extra_body: Optional[dict] = None):
    """
    Single-turn chat. Returns the content string by default.
    If the model is a thinking model and content is empty, falls back to reasoning.
    Set return_full=True to get the raw ChatCompletion object.
    """
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort
    resp = client.chat.completions.create(
        model=model or TEXT_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning  # graceful fallback
def vision_chat(prompt: str, image: str,
                model: Optional[str] = None,
                temperature: float = 0.2,
                max_tokens: int = 2048,
                reasoning_effort: Optional[str] = None,
                return_full: bool = False,
                extra_body: Optional[dict] = None):
    """Vision counterpart of chat(). `image` may be a URL, local path, or data URI."""
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort
    img_url = encode_image_b64(image)
    resp = client.chat.completions.create(
        model=model or VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": img_url}},
            ],
        }],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning
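def encode_image_b64(path_or_url: str) -> str:
    """Return a JPEG data URI for an image given as a URL, local path, or existing data URI."""
    if path_or_url.startswith("data:"):
        # Already a data URI (vision_chat() promises to accept these): pass it through.
        return path_or_url
    if path_or_url.startswith("http"):
        r = requests.get(
            path_or_url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30
        )
        r.raise_for_status()
        data = r.content
    else:
        data = Path(path_or_url).read_bytes()
    # Normalise to an RGB JPEG no larger than 1024 px on either side to keep the payload small.
    img = Image.open(BytesIO(data))
    img = img.convert("RGB")
    img.thumbnail((1024, 1024))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return f"data:image/jpeg;base64,{base64.b64encode(buf.getvalue()).decode()}"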
print("Helpers loaded: chat(), vision_chat(), print_response(), encode_image_b64()")
Helpers loaded: chat(), vision_chat(), print_response(), encode_image_b64()
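If the data-URI format returned by encode_image_b64() is new to you, the following stdlib-only sketch shows the round trip on a few stand-in bytes. `to_data_uri` and `from_data_uri` are hypothetical helper names, and the four bytes are just the start of a typical JPEG file, not a real image.

```python
import base64
from typing import Tuple


def to_data_uri(data: bytes, mime: str = "image/jpeg") -> str:
    """Wrap raw bytes as data:<mime>;base64,<payload>."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def from_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a base64 data URI back into (mime type, raw bytes)."""
    header, payload = uri.split(",", 1)
    mime = header[len("data:"):].split(";", 1)[0]
    return mime, base64.b64decode(payload)


jpeg_magic = b"\xff\xd8\xff\xe0"  # stand-in data: first bytes of a typical JPEG file
uri = to_data_uri(jpeg_magic)
print(uri)
print(from_data_uri(uri))
```

Base64 inflates the payload by roughly a third, which is one reason the helper above downsizes images before encoding them.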
# Quick sanity check — list the models the Ollama server has pulled.
# If this fails, the server is unreachable: check that `ollama serve` is running.
try:
    models = client.models.list()
    print("Models available locally on Ollama:")
    for m in models.data:
        print(f" - {m.id}")
except Exception as e:
    print(f"⚠️ Cannot reach Ollama: {e}")
    # `ollama pull` takes one model per invocation, so pull the two models separately.
    print(" Run `ollama serve`, then `ollama pull llama3.1:8b` and `ollama pull qwen2.5vl:7b`")
Models available locally on Ollama:
- qwen3.5:latest
- gemma4:latest
- llama3.1:8b
§2 — Geographic Q&A and spatial reasoning (≈ 10 min)
LLMs encode a surprising amount of world geographic knowledge in their parameters: country borders, capitals, rough coordinates of major cities, climate zones, landmarks. But they also hallucinate confidently — especially about coordinates and small places.
We’ll use the chat() helper defined above. For thinking models, set reasoning_effort="medium" and pass return_full=True if you want to inspect the reasoning trace via print_response().
# Zero-shot factual recall — content only
print(chat("What is the capital of Finland, and roughly at what latitude does it sit?"))
The capital of Finland is **Helsinki**.
It sits at a latitude of approximately **60° 10′ N** (or about **60.17° North**).
This places Helsinki slightly south of the Arctic Circle (which is at 66° 33′ N), but still well within the subarctic zone, giving it a distinct northern European climate with long, dark winters and bright, long summers.
# Same call, but ask for the FULL response so we can see reasoning (if the model emits any).
resp = chat(
"Is Tallinn north or south of Helsinki? "
"Answer in one sentence and give the approximate latitude of each.",
return_full=True,
)
print_response(resp)
┌─ model=qwen3.5 finish_reason=stop tokens=45+38=83
│
│ 💬 ANSWER
│ ──────────────────────────────────────────────────
│ Tallinn is located south of Helsinki, with Tallinn at approximately 59.44° N latitude and Helsinki at approximately 60.17° N latitude.
└────────────────────────────────────────────────────────────
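The calls above are zero-shot. For few-shot prompting, one common approach is to place a handful of worked question/answer pairs in the message list before the real question, so the model copies the answer format. The sketch below only assembles the messages; `FEW_SHOT_EXAMPLES` and `build_few_shot_messages` are hypothetical names, and the latitudes in the demonstrations are rounded values.

```python
from typing import Dict, List, Tuple

# (question, answer) demonstrations in the exact format we want back.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    ("Is Oslo north or south of Stockholm?",
     "north | Oslo 59.91 N | Stockholm 59.33 N"),
    ("Is Berlin north or south of Copenhagen?",
     "south | Berlin 52.52 N | Copenhagen 55.68 N"),
]


def build_few_shot_messages(
    question: str,
    system: str = "You are a helpful geographic assistant.",
) -> List[Dict[str, str]]:
    """Return system prompt, alternating user/assistant demonstrations, then the question."""
    messages = [{"role": "system", "content": system}]
    for q, a in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages


messages = build_few_shot_messages("Is Tallinn north or south of Helsinki?")
for m in messages:
    print(f"{m['role']:9s} {m['content']}")
```

The resulting list can be passed as the messages argument of client.chat.completions.create() in place of the two-message list that chat() builds.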
2.1 Chain-of-thought and reasoning models
There are two ways to get a model to “think”:
Prompt-level CoT: tell any model to “think step by step” in the prompt. Works on Llama, Gemma 2, Qwen 2.5, etc.
Native reasoning: for thinking models (DeepSeek-R1, Qwen3, Gemma 3 thinking, GPT-OSS), Ollama returns the reasoning trace in a separate reasoning field, and reasoning_effort controls how much the model thinks.
The cell below uses return_full=True + print_response() so you can see whichever the model produced.
cot_prompt = (
"I am in Ghent, Belgium and want to drive to Aalto University in Espoo, Finland.\n"
"Which of these countries will I most plausibly cross or take a ferry through, "
"and in what order: Germany, Sweden, Denmark, Estonia, Norway?\n"
"Think step by step about the geography, then give the final ordered list."
)
# Crank max_tokens because reasoning eats them. reasoning_effort is silently
# ignored by non-thinking models, so this is safe to leave on by default.
resp = chat(cot_prompt, max_tokens=5000, reasoning_effort="low", return_full=True)
print_response(resp, show_reasoning=True)
┌─ model=qwen3.5 finish_reason=length tokens=92+5000=5092
│
│ 🧠 REASONING
│ ──────────────────────────────────────────────────
│ Thinking Process:
│
│ 1. **Analyze the Request:**
│ * Starting point: Ghent, Belgium.
│ * Destination: Aalto University, Espoo, Finland.
│ * Task: Identify which countries from the provided list (Germany, Sweden, Denmark, Estonia, Norway) will be crossed or taken a ferry through, and determine the order.
│ * Constraint: Think step by step about the geography.
│ * Output: Final ordered list.
│
│ 2. **Geographical Analysis:**
│ * **Start:** Ghent, Belgium (Western Europe).
│ * **Destination:** Espoo, Finland (Northern Europe, across the Gulf of Finland from Helsinki).
│ * **Route:** Driving from Belgium to Finland.
│ * **Neighboring Countries:**
│ * Belgium borders France, Germany, Netherlands, Luxembourg.
│ * To get to Finland, one must cross the Baltic Sea.
│ * The main land route involves crossing the Baltic Sea via ferry or driving through the Baltic states.
│ * **Countries in the List:** Germany, Sweden, Denmark, Estonia, Norway.
│
│ 3. **Evaluate Each Country:**
│ * **Germany:** Belgium borders Germany (via Netherlands or directly). Driving from Belgium to Finland usually goes through the Netherlands, then Germany (o
│ … [truncated, +16983 chars]
│
│ 💬 ANSWER
│ ──────────────────────────────────────────────────
│ (content was empty — model emitted only reasoning; showing it as the answer)
│
│ ⚠️ Output was truncated by max_tokens. Increase it (reasoning consumes tokens).
└────────────────────────────────────────────────────────────
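The run above ended with finish_reason=length and an empty answer. One way to handle that, sketched here under stated assumptions, is to retry with a doubled token budget until an answer appears or a cap is reached. `call_with_growing_budget` is a hypothetical helper, and `fake_model` is a stand-in that replaces the real model call so the logic can be run offline; its answer string is a placeholder, not a verified route.

```python
from typing import Callable, Tuple


def call_with_growing_budget(call: Callable[[int], Tuple[str, str]],
                             start: int = 2048,
                             limit: int = 16384) -> Tuple[str, int]:
    """Retry call(max_tokens) with a doubled budget while the reply is empty and truncated.

    `call` must return (content, finish_reason). Returns (content, budget used).
    """
    budget = start
    while True:
        content, finish_reason = call(budget)
        if content.strip() or finish_reason != "length":
            return content, budget
        if budget * 2 > limit:
            raise RuntimeError(f"No answer within {limit} tokens; try a lower reasoning_effort.")
        budget *= 2


def fake_model(max_tokens: int) -> Tuple[str, str]:
    """Stand-in for a real call: only 'answers' once it has at least 8000 tokens."""
    if max_tokens >= 8000:
        return "placeholder ordered list", "stop"
    return "", "length"


answer, used = call_with_growing_budget(fake_model)
print(answer, used)
```

To use it with the notebook helpers, wrap chat(..., return_full=True) in a function that returns the message content (or an empty string) together with choices[0].finish_reason. Doubling the budget costs time on every retry, so lowering reasoning_effort is often the cheaper fix.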
2.2 ⚠️ Hallucination probe
Always verify coordinates that come out of an LLM. The model will happily produce a plausible-looking lat/lon for an obscure village that is wrong by tens of kilometres. For production geocoding, use Nominatim, the Google Geocoding API, or a vetted gazetteer — and use the LLM only for normalization and disambiguation.
# This will likely return a plausible but unreliable number — a teaching moment.
print(chat(
"Give the latitude and longitude of the village 'Sotkamo, Finland' "
"to four decimal places. Just the numbers."
))
63.5167 23.4833
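A quick way to quantify such an answer is to measure its great-circle distance from a trusted reference point. The sketch below uses the haversine formula; `haversine_km` and `parse_lat_lon` are hypothetical helper names, and the reference coordinates are an assumption (an approximate position for Sotkamo) that you should replace with a value looked up in a gazetteer.

```python
import math
from typing import Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two WGS84 points (spherical approximation)."""
    r = 6371.0088  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def parse_lat_lon(text: str) -> Tuple[float, float]:
    """Parse 'lat lon' or 'lat, lon' and reject values outside the valid ranges."""
    lat, lon = (float(x) for x in text.replace(",", " ").split())
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return lat, lon


llm_answer = "63.5167 23.4833"  # the model output recorded above
reference = (64.13, 28.39)      # ASSUMPTION: approximate location of Sotkamo; verify in a gazetteer
error_km = haversine_km(*parse_lat_lon(llm_answer), *reference)
print(f"LLM answer is about {error_km:.0f} km from the reference point")
```

Under that assumed reference the recorded answer is off by roughly 250 km, even though it looks plausible and still falls inside Finland, which is why a simple bounding-box check would not catch it.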
§4 — Structured outputs: geo-entity extraction with Pydantic (≈ 10 min)
For pipelines, you almost never want free-form prose. Ollama supports JSON-schema-constrained decoding by passing the schema to the format= parameter (we send it through extra_body={"format": schema} so the OpenAI SDK forwards it to Ollama unchanged).
We’ll define a Pydantic schema for points of interest (POIs) and extract them from a paragraph — a recurring need in tasks like building a navigation knowledge base from Wikipedia, tourist guides, or trip reports.
class POI(BaseModel):
    "A single point of interest mentioned in the text."
    name: str = Field(..., description="Proper name of the place.")
    category: str = Field(..., description="OSM-style category, e.g. museum, park, restaurant.")
    city: Optional[str] = Field(None, description="The city the POI is in, if mentioned.")
    country: Optional[str] = Field(None, description="The country, if it can be inferred.")
class POIList(BaseModel):
    pois: List[POI]
# JSON schema for guided decoding
schema = POIList.model_json_schema()
print(json.dumps(schema, indent=2)[:600], "...")
{
"$defs": {
"POI": {
"description": "A single point of interest mentioned in the text.",
"properties": {
"name": {
"description": "Proper name of the place.",
"title": "Name",
"type": "string"
},
"category": {
"description": "OSM-style category, e.g. museum, park, restaurant.",
"title": "Category",
"type": "string"
},
"city": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
...
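def extract_pois(text: str) -> POIList:
    resp = text_client.chat.completions.create(
        model=TEXT_MODEL,
        messages=[
            {"role": "system",
             "content": "Extract every point of interest from the user text into the given JSON schema. "
                        "Only include places, not abstract concepts."},
            {"role": "user", "content": text},
        ],
        temperature=0.0,
        # A thinking model can spend a 600-token budget before it emits any JSON,
        # so give it headroom and switch thinking off.
        max_tokens=2048,
        # Ollama-specific: pass the schema via `format` to constrain decoding
        # to a JSON object that matches the schema exactly.
        extra_body={"format": schema, "reasoning_effort": "none"},
    )
    raw = resp.choices[0].message.content or ""
    if not raw.strip():
        raise ValueError(f"Model returned empty content (finish_reason={resp.choices[0].finish_reason!r}).")
    # json_repair tolerates minor syntax damage, e.g. a missing closing brace.
    decoded_object = json_repair.repair_json(raw, return_objects=True)
    return POIList.model_validate(decoded_object)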
paragraph = (
"On our trip to Helsinki we started at Senate Square, walked to the Ateneum art museum, "
"and had coffee at Café Regatta near Sibelius Park. The next day we took the ferry from "
"the South Harbour to Suomenlinna fortress, then flew from Helsinki-Vantaa to Tallinn, "
"where we visited Kadriorg Palace and the Estonian Open Air Museum."
)
result = extract_pois(paragraph)
for p in result.pois:
    print(f"• {p.name:35s} {p.category:20s} {p.city or '-':10s} {p.country or '-'}")
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
Cell In[13], line 24
---> 24 result = extract_pois(paragraph)
Cell In[13], line 16, in extract_pois(text)
14 # extra_body={"format": schema},
15 )
---> 16 return POIList.model_validate_json(resp.choices[0].message.content)
(...)
ValidationError: 1 validation error for POIList
Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/json_invalid
Note that this recorded traceback comes from an earlier run of the cell, in which the format argument was commented out and the raw content was validated directly. The key detail is input_value='': the model returned an empty content string, most likely because it spent its whole max_tokens budget on reasoning (see §1.1).
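The input_value='' in the traceback above shows that an empty reply reaches the validator as an opaque JSON error. A small guard in front of validation gives a clearer message and also copes with replies that wrap the JSON in extra prose. This is a stdlib-only sketch; `parse_model_json` is a hypothetical helper and a much simpler stand-in for what json_repair does.

```python
import json


def parse_model_json(content):
    """Parse the first JSON object found in a model reply; fail loudly on empty replies."""
    text = (content or "").strip()
    if not text:
        raise ValueError("empty content: raise max_tokens or disable thinking before retrying")
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON document and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


reply = 'Sure! {"pois": [{"name": "Ateneum", "category": "museum"}]} Hope this helps.'
print(parse_model_json(reply))
```

The returned dict can then be handed to a Pydantic model for validation, so schema errors and transport errors stay clearly separated.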
🛠️ Why this matters for GeoAI. This is the same pattern you would use to:
Build a structured knowledge base from a corpus of trip reports for a navigation agent.
Convert reviewer comments into structured fields for a meta-review pipeline.
Pre-process tourist guides into geocodable entities before passing them to Nominatim.
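As an illustration of the last point, here is a minimal sketch that turns an extracted POI into a Nominatim search URL. `PlainPOI`, `geocoder_query`, and `nominatim_url` are hypothetical names (a plain dataclass stands in for the Pydantic model so the example has no dependencies), and no request is sent. If you do call the public Nominatim service, its usage policy asks for an identifying User-Agent and at most about one request per second.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


@dataclass
class PlainPOI:
    name: str
    category: str
    city: Optional[str] = None
    country: Optional[str] = None


def geocoder_query(poi: PlainPOI) -> str:
    """Join the known fields into a free-text query, skipping missing ones."""
    return ", ".join(part for part in (poi.name, poi.city, poi.country) if part)


def nominatim_url(poi: PlainPOI, limit: int = 1) -> str:
    """Build (but do not send) a Nominatim free-text search URL."""
    params = {"q": geocoder_query(poi), "format": "jsonv2", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


poi = PlainPOI("Kadriorg Palace", "palace", city="Tallinn", country="Estonia")
print(geocoder_query(poi))
print(nominatim_url(poi))
```

Adding the city and country to the query is what lets the geocoder disambiguate common names; the LLM supplies the normalized text, and the gazetteer supplies the coordinates.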
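§5 — Vision LLMs for satellite & map understanding (≈ 10 min)
Vision-language models extend the same chat interface with image inputs. The OpenAI chat format allows an image to be given either as a URL or as a base64 data URI, but Ollama's endpoint only accepts base64 data: a plain image URL is rejected with a 400 error. That is why encode_image_b64() downloads the image and sends it as a data URI.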
We’ll do three things:
Describe a satellite image of an urban area.
Read text and symbols from a map screenshot.
Combine vision + structured output to extract an inventory of visible features.
(vision_chat() and encode_image_b64() are already defined in §1.1 — we just reuse them.)
5.1 Describing a satellite image
We’ll use a public Wikimedia Commons satellite image. Replace with any image URL or local path; for class we recommend pre-downloading 2–3 images to avoid wifi issues.
# A public-domain satellite image of central Helsinki from Wikimedia Commons.
# If you have local sample images, point at /mnt/data/<file>.jpg instead.
SAT_IMAGE = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Helsinki_by_Sentinel-2%2C_2020-06-26.jpg/1280px-Helsinki_by_Sentinel-2%2C_2020-06-26.jpg"
img = Image.open(BytesIO(requests.get(SAT_IMAGE, headers={"User-Agent": "Mozilla/5.0"}).content))
img.show()
prompt = (
"This is a satellite/aerial photograph. Describe in 4–6 bullets what you can see, "
"focusing on land use (water, built-up areas, vegetation, transport infrastructure). "
"Do NOT guess the city name; only describe what is visible."
)
print(vision_chat(prompt, SAT_IMAGE))
* **Water:** A large, deep blue body of water dominates the lower right portion of the image, featuring a coastline dotted with numerous small islands and skerries.
* **Built-up Areas:** A dense, sprawling urban area occupies the central and upper-left landmass, characterized by a mix of residential neighborhoods and commercial development with a visible grid-like street pattern.
* **Airport:** A distinct airport complex with intersecting runways is clearly visible just north of the main urban cluster, surrounded by open fields.
* **Vegetation:** The northern and eastern peripheries are dominated by dense, dark green forests, while the upper left quadrant shows patches of lighter green agricultural fields and pastures.
* **Transport Infrastructure:** A network of roads and highways connects the city center to the surrounding rural areas, with infrastructure extending towards the coastal water and inland lakes.
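Free-text bullets like these are easy to read but awkward to process. As a stopgap before moving to schema-constrained output, a regular expression can split the labelled bullets into a dictionary. `parse_labelled_bullets` is a hypothetical helper, and the sample text is abbreviated from the answer above.

```python
import re
from typing import Dict

# Matches lines such as "* **Water:** some description".
BULLET = re.compile(r"^\s*[*-]\s*\*\*(?P<label>[^:*]+):\*\*\s*(?P<text>.+)$")


def parse_labelled_bullets(answer: str) -> Dict[str, str]:
    """Map each bold bullet label (lower-cased) to its description; other lines are skipped."""
    out: Dict[str, str] = {}
    for line in answer.splitlines():
        m = BULLET.match(line)
        if m:
            out[m.group("label").strip().lower()] = m.group("text").strip()
    return out


sample = (
    "* **Water:** A large, deep blue body of water dominates the lower right portion of the image.\n"
    "* **Built-up Areas:** A dense, sprawling urban area occupies the central landmass.\n"
    "Some trailing remark without a label."
)
parsed = parse_labelled_bullets(sample)
for label, text in parsed.items():
    print(f"{label}: {text}")
```

This only works while the model keeps the same bullet style, which is the main argument for the schema-based approach in the next subsection.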
5.2 Vision + structured output
We can combine guided JSON decoding with vision inputs. Below we ask the model to output a structured land-cover inventory.
class LandCover(BaseModel):
    class_name: str = Field(..., description="One of: water, vegetation, built_up, road, rail, bare_ground, other")
    coverage: str = Field(..., description="One of: dominant, frequent, rare, absent")
    notes: Optional[str] = None
class LandCoverReport(BaseModel):
    items: List[LandCover]
    overall_setting: str = Field(..., description="One short phrase: e.g. coastal city, rural farmland, dense forest.")
prompt = (
"Inventory the land-cover classes visible in this aerial image. "
"Use the schema strictly; cover every class, marking absent ones as 'absent'."
)
resp = vision_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": encode_image_b64(SAT_IMAGE)}},
        ],
    }],
    temperature=0.0,
    # Leave headroom and switch thinking off so the budget goes to the JSON itself.
    max_tokens=2048,
    extra_body={"format": LandCoverReport.model_json_schema(), "reasoning_effort": "none"},
)
decoded_object = json_repair.repair_json(resp.choices[0].message.content or "", return_objects=True)
report = LandCoverReport.model_validate(decoded_object)
print("Overall setting:", report.overall_setting)
print()
for it in report.items:
    print(f" {it.class_name:12s} {it.coverage:10s} {it.notes or ''}")
---------------------------------------------------------------------------
BadRequestError Traceback (most recent call last)
Cell In[23], line 15
---> 15 resp = vision_client.chat.completions.create(
16 model=VISION_MODEL,
17 messages=[{
18 "role": "user",
19 "content": [
20 {"type": "text", "text": prompt},
21 {"type": "image_url", "image_url": {"url": SAT_IMAGE}},
22 ],
23 }],
(...)
BadRequestError: Error code: 400 - {'error': {'message': 'image URLs are not currently supported, please use base64 encoded data instead', 'type': 'invalid_request_error', 'param': None, 'code': None}}
Note that this recorded traceback comes from an earlier run of the cell, in which the raw SAT_IMAGE URL was passed as image_url (line 21 of the traceback). Ollama rejected it because it only accepts base64-encoded images; the cell above avoids the error by passing encode_image_b64(SAT_IMAGE) instead.
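One caveat about the schema in this subsection: the "One of: …" lists live only in the field descriptions, so they are hints to the model rather than constraints. A post-check such as the sketch below can catch unknown or missing classes. `check_report` is a hypothetical helper that works on plain dicts, and both sample reports are hand-written for illustration, not model output. Declaring the fields with typing.Literal in the Pydantic models would be another way to enforce the same vocabulary.

```python
from typing import List

CLASSES = {"water", "vegetation", "built_up", "road", "rail", "bare_ground", "other"}
COVERAGE = {"dominant", "frequent", "rare", "absent"}


def check_report(report: dict) -> List[str]:
    """Return a list of problems; an empty list means the report uses the agreed vocabulary."""
    problems: List[str] = []
    seen = set()
    for item in report.get("items", []):
        name, cov = item.get("class_name"), item.get("coverage")
        if name not in CLASSES:
            problems.append(f"unknown class_name: {name!r}")
        elif name in seen:
            problems.append(f"duplicate class_name: {name!r}")
        else:
            seen.add(name)
        if cov not in COVERAGE:
            problems.append(f"unknown coverage for {name!r}: {cov!r}")
    # The prompt asks for every class to be covered, so report the ones left out.
    for missing in sorted(CLASSES - seen):
        problems.append(f"missing class: {missing}")
    return problems


complete = {"items": [{"class_name": c, "coverage": "rare"} for c in sorted(CLASSES)]}
partial = {"items": [{"class_name": "water", "coverage": "dominant"},
                     {"class_name": "forest", "coverage": "frequent"}]}
print(check_report(complete))
print(check_report(partial))
```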
5.3 Map-screenshot reading
Vision LLMs are particularly useful on rendered maps — they can read labels, identify symbols, and follow routes. Below we feed an OSM-style map tile and ask it to enumerate the labels.
⚠️ Reading text from low-resolution map tiles is genuinely hard. Expect partial answers. This is a great place to discuss failure modes with the class — what kinds of labels does the model miss? Tiny streets? Non-Latin scripts? Ambiguous icons? These failure patterns are exactly the kind of thing studied in vision-language navigation evaluation work.
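If you need to know which tile covers a given place, the standard slippy-map (Web Mercator) tile scheme used by OpenStreetMap converts a latitude/longitude and zoom level into tile indices. The sketch below implements that conversion; `latlon_to_tile` is a hypothetical helper name and the Helsinki coordinates are rounded. It only computes indices and does not download anything; any tile server you then use has its own usage policy.

```python
import math
from typing import Tuple


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> Tuple[int, int]:
    """Slippy-map tile indices (x, y) of the tile containing a WGS84 point at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid range (relevant at lon = 180 and very close to the poles).
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


x, y = latlon_to_tile(60.17, 24.94, 10)  # central Helsinki, coordinates rounded
print(f"zoom 10 tile: x={x}, y={y}")
```

Tile y indices grow southwards from the top of the map, which is why northern cities such as Helsinki get small y values.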
# An OSM tile around central Helsinki; replace as needed.
MAP_IMAGE = "./osm.png"
img = Image.open(MAP_IMAGE)  # local file: open it directly, requests.get() only works for URLs
img.show()
prompt = (
"This is a screenshot of an OpenStreetMap tile. "
"List every street name, place label, and POI symbol you can read. "
"Group your answer as: STREETS, PLACES, POIs."
)
print(vision_chat(prompt, MAP_IMAGE))
Note on the recorded output: what follows is the model's reasoning trace rather than a final answer, because vision_chat() falls back to the reasoning text when content is empty. The trace also shows that the image the model received was not a map tile but an “Access blocked” notice (code 418) issued under the tile usage policy of OpenStreetMap's volunteer-run servers, so osm.png most likely holds that notice instead of a map.
The user wants me to identify text and symbols from the provided image.
1. **Analyze the image:**
* The image is a screenshot of a web page or application interface.
* It has a yellow and black hazard stripe on the left.
* There is a number "418" in the top right.
* There is a small map graphic in the top left with a magnifying glass over it.
* The main text is a large heading "Access blocked".
* Below that is a paragraph of text explaining the block: "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked".
* There is a magnifying glass icon over a map.
2. **Categorize the findings:**
* **STREETS:** I don't see any street names. The map graphic is too small and blurry to read any street names.
* **PLACES:** I don't see any specific place labels (like city names, park names) other than the generic "OpenStreetMap" text which is part of the error message, not a place label on a map. Wait, let me look closer at the map graphic. It's just a generic map tile. No readable place names.
* **POIs:** There are no Point of Interest symbols (like a red pin for a restaurant, a blue pin for a shop). The magnifying glass is an icon, not a POI symbol on a map.
3. **Refine the extraction:**
* The prompt asks for "street name, place label, and POI symbol".
* Looking strictly at the text provided:
* "418" (Number)
* "Access blocked" (Heading)
* "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked" (Body text)
* Looking at the map graphic: It's a green map with blue water. No readable text.
* Looking at the magnifying glass: It's an icon.
* Wait, the prompt implies this *is* a screenshot of an OpenStreetMap tile, but the image provided is actually an error page *about* OSM tiles. The user says "This is a screenshot of an OpenStreetMap tile." This is a bit of a trick or a misunderstanding by the user. The image is clearly an error message page (likely from a tile server like `tile.openstreetmap.org` or a similar service that blocks unauthorized access).
* Let's re-read the prompt: "List every street name, place label, and POI symbol you can read."
* Since there are no readable street names, place labels, or POI symbols on the map part (it's too small/blurry), I should state that or list the text that *is* there if it fits the categories loosely.
* Actually, looking very closely at the map graphic in the top left... it's just a generic map. No text.
* The text "OpenStreetMap" is present. Is that a place label? No, it's a brand name.
* The text "osm.wiki/Blocked" is a URL.
* Let's look at the categories again.
* **STREETS:** None visible.
* **PLACES:** None visible (except maybe "OpenStreetMap" as a proper noun, but it's not a place label).
* **POIs:** None visible.
* However, usually, these prompts want me to extract *any* text. But the prompt is specific: "street name, place label, and POI symbol".
* If I can't find them, I should say "None found" or list the text that is there and explain.
* Let's look really closely at the map. Is there a street name? No.
* Is there a POI symbol? The magnifying glass is an icon.
* The prompt might be a trick question because the image is an error page, not a map tile.
* Let's reconsider the user's statement: "This is a screenshot of an OpenStreetMap tile." The user is mistaken. It's an error page.
* I need to be helpful. I will list the text I see, but clarify that there are no street names or POIs because it's an error message.
* Wait, maybe I missed something. Let's look at the map again. It's a green landmass and blue water. No text.
* So, the answer is essentially:
* STREETS: None.
* PLACES: None.
* POIs: None.
* But I should probably transcribe the text that *is* there just in case, or explain why.
* Actually, looking at the prompt again, it asks to "List every street name, place label, and POI symbol you can read."
* If I can't read them, I list nothing for those categories.
* But maybe the user considers "OpenStreetMap" a place? No.
* Maybe the user considers the magnifying glass a POI symbol? No, it's a UI element.
* Let's try to interpret the prompt as "Extract text from the image".
* Text: "418", "Access blocked", "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked".
* But the specific categories are STREETS, PLACES, POIs.
* Since there are none, I will state that.
* Let's look at the image one more time. Is it possible that the map in the top left *does* have text?
* It looks like a standard OSM tile. It has a green area (land) and blue area (water). There might be tiny text, but it's illegible.
* So, the correct answer is that there are no readable street names, place labels, or POI symbols. The image is an error message.
* However, to be most helpful, I will list the text present in the image, categorized as best as possible, or just state there are none.
* Let's assume the user wants me to identify the text elements.
* "418" -> Number.
* "Access blocked" -> Heading.
* "App is not following..." -> Body text.
* "OpenStreetMap" -> Brand name.
* Okay, I will formulate the response.
* STREETS: None visible.
* PLACES: None visible.
* POIs: None visible.
* (Self-correction): The prompt asks me to list them. If there are none, I say none.
* Wait, is "OpenStreetMap" a place? No.
* Is "osm.wiki" a place? No.
* Let's look at the prompt again. "This is a screenshot of an OpenStreetMap tile."
* The user is asserting this. I should gently correct them or just answer based on what is visible.
* What is visible is an error message.
* So, "No street names, place labels, or POI symbols are visible. The image is an error message page."
* Let's try to find *any* text that could be construed as a place.
* "OpenStreetMap" is the name of the project.
* "osm.wiki" is a URL.
* Okay, I'll stick to the facts.
* STREETS: [Empty]
* PLACES: [Empty]
* POIs: [Empty]
* And maybe add a note about the text.
* Actually, looking at the very top left, there is a map. Is there a street name there? It's too blurry.
* Is there a POI symbol? The magnifying glass is an icon.
* So, the answer is effectively empty for the requested categories.
* Let's write the response.
* "Based on the image provided, which appears to be an error message rather than a map tile, there are no readable street names, place labels, or POI symbols."
* Then I can list the text that *is* there.
* "Text visible: '418', 'Access blocked', 'App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked'."
* Wait, the prompt says "Group your answer as: STREETS, PLACES, POIs."
* So I should follow that format.
* STREETS: None.
* PLACES: None.
* POIs: None.
* Is it possible that "OpenStreetMap" counts as a place? No.
* Is it possible that the user considers the magnifying glass a POI? Unlikely.
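The trace above describes an "Access blocked" notice with the number 418, which suggests the tile request was refused before the image ever reached the model. A cheap guard is to check the HTTP status and content type before saving a downloaded tile. This is a sketch under stated assumptions: `check_tile_response` is a hypothetical helper, it works on plain values so it runs offline, and the commented usage shows how it might be wired to requests with an identifying User-Agent string (the one shown is a made-up placeholder).

```python
def check_tile_response(status_code: int, content_type: str) -> None:
    """Raise if an HTTP response does not look like a usable map tile."""
    if status_code != 200:
        raise RuntimeError(f"tile request failed with HTTP {status_code}")
    if not content_type.lower().startswith("image/"):
        raise RuntimeError(f"expected an image, got Content-Type {content_type!r}")


# Hypothetical usage with requests (not executed here; tile_url is yours to supply):
#   r = requests.get(tile_url,
#                    headers={"User-Agent": "geoai-course-notebook/0.1 (you@example.org)"},
#                    timeout=30)
#   check_tile_response(r.status_code, r.headers.get("Content-Type", ""))

check_tile_response(200, "image/png")  # passes silently
try:
    check_tile_response(418, "image/png")
except RuntimeError as err:
    print(err)
```

The status check matters more than the content-type check here, since a block notice can itself be served as an image, as the model's description of this one indicates.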
§6 — Wrap-up
You now have working code patterns for:
✅ Talking to an Ollama server with the openai SDK.
✅ Zero-shot, few-shot, and CoT prompting on geographic tasks.
✅ OSM tag prediction from natural language.
✅ JSON-schema-guided extraction of structured POIs.
✅ Vision LLM calls on satellite images and map tiles.
✅ Combining vision + guided JSON for structured remote-sensing-style outputs.
Key takeaways
| Theme | Lesson |
|---|---|
| Knowledge | LLMs know a lot of geography but hallucinate coordinates. |
| Prompting | Few-shot + CoT beats zero-shot for OSM-style structured tasks. |
| Reliability | Always pair LLM output with a downstream validator (Pydantic, gazetteer lookup). |
| Vision | VLMs describe land cover well; reading map labels is fragile. |
| Architecture | The same |
Next
In Part 2 you’ll get a notebook with the same skeleton but with # TODO blocks. You will build:
A natural-language → coordinate geocoder shim.
An OSM-tag classifier with confidence scores.
A vision-LLM landmark identifier.
A small Geo-RAG system over Wikipedia city descriptions.
Good luck! 🗺️