LLMs & Vision LLMs for GeoAI, Part 1: Foundations

Course: GeoAI / Multimodal Geospatial Reasoning
Duration: ~45 minutes
Instructor notebook (complete reference version)

What you’ll learn

By the end of this notebook, you will be able to:

Connect to a locally-served LLM via Ollama’s OpenAI-compatible API.
Prompt an LLM for geographic reasoning (zero-shot, few-shot, chain-of-thought).
Reason about OpenStreetMap (OSM) tags using LLMs.
Extract structured geographic entities (POIs, coordinates, admin units) into JSON using Pydantic schemas.
Use a Vision LLM to interpret satellite imagery and map screenshots.

Why Ollama?

Ollama is a local LLM runtime that:

Exposes an OpenAI-compatible HTTP API at http://localhost:11434/v1 — so the same openai Python SDK works.
Pulls and runs open-weight models (Llama 3.1, Qwen2.5-VL, Gemma, Mistral, …) with one command.
Handles GPU/CPU placement and quantization automatically.
Runs comfortably on a laptop — perfect for class.

Assumed setup

We assume the Ollama daemon is already running on your machine (or on a course server) at:

http://localhost:11434/v1

If you do not have Ollama installed yet:

# macOS / Linux — one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Pull the models we will use in this lab (one-off, ~5–10 GB each)
ollama pull llama3.1:8b
ollama pull qwen2.5vl:7b

# Start the server (usually started automatically on macOS / Linux service)
ollama serve

💡 You do not need an OpenAI API key. Ollama accepts any non-empty string — by convention we use "ollama".

§1: Setup (≈ 5 min)

We’ll use the standard openai SDK pointed at our Ollama endpoint, plus a few helpers for images and maps.

# Install dependencies (uncomment if needed)
# !pip install --quiet openai pydantic pillow requests numpy
!pip install json_repair

import json_repair
import os
import json
import base64
from io import BytesIO
from typing import List, Optional, Tuple
from pathlib import Path

import requests
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field

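# ---- Configure endpoint ---------------------------------------------------
# Ollama exposes a single OpenAI-compatible endpoint that serves *all*
# pulled models; text and vision share the same base URL.
# The default below points at a remote host (the server used for the recorded
# run). For a local daemon, set OLLAMA_BASE_URL=http://localhost:11434/v1.
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://g3201.mahti.csc.fi:11434/v1")

# Override these if your server has different models pulled, e.g. the
# llama3.1:8b (text) and qwen2.5vl:7b (vision) models from the setup section.
TEXT_MODEL   = os.environ.get("OLLAMA_TEXT_MODEL",   "qwen3.5")
VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "qwen3.5")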

# Ollama accepts any non-empty API key
client        = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
text_client   = client   # alias for clarity downstream
vision_client = client

print(f"Endpoint     : {OLLAMA_BASE_URL}")
print(f"Text model   : {TEXT_MODEL}")
print(f"Vision model : {VISION_MODEL}")

Endpoint     : http://g3201.mahti.csc.fi:11434/v1
Text model   : qwen3.5
Vision model : qwen3.5

1.1 Reasoning-aware helpers

Some Ollama models are thinking models (Gemma 3 thinking variants, GPT-OSS, DeepSeek-R1, Qwen 3 / 3.5). They emit a chain-of-thought in a separate reasoning field on the message, before giving you the final answer in content. The raw response object looks like:

ChatCompletionMessage(
    content='The answer is 391.',
    reasoning="17 × 20 = 340, 17 × 3 = 51, 340 + 51 = 391",
    role='assistant', ...
)

Two practical consequences:

Reasoning eats output tokens. A max_tokens=512 call that finishes with finish_reason='length' and an empty content field usually means the model spent all its budget thinking. Bump max_tokens to 1500–4000 for thinking models.
Some buggy models put everything in reasoning and leave content empty (a known Ollama issue with certain Gemma 4 / Qwen 3.5 builds). Our helper falls back to the reasoning text in that case so you never see an empty-string answer.
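The first consequence can be handled mechanically. The sketch below is one possible approach, not part of the helpers that follow: `call_with_growing_budget` and `fake_call` are hypothetical names, and the "double the budget up to a cap" policy is an assumption. It retries a request while the response was cut off (`finish_reason == "length"`) with an empty answer. A stand-in function replaces the real model call so that the example runs without a server.

```python
from typing import Callable, Tuple

def call_with_growing_budget(
    call: Callable[[int], Tuple[str, str]],
    start_tokens: int = 512,
    cap_tokens: int = 4096,
) -> Tuple[str, int]:
    """Retry `call(max_tokens)` with a doubled budget while the answer is empty and truncated.

    `call` must return (content, finish_reason). Returns (content, budget_used).
    """
    budget = start_tokens
    while True:
        content, finish_reason = call(budget)
        truncated_while_thinking = finish_reason == "length" and not content.strip()
        # Stop when we got an answer, or when the cap is reached.
        if not truncated_while_thinking or budget >= cap_tokens:
            return content, budget
        budget = min(budget * 2, cap_tokens)

def fake_call(max_tokens: int) -> Tuple[str, str]:
    # Stand-in for a real request: pretends the model needs 1500 tokens of reasoning.
    if max_tokens < 1500:
        return "", "length"
    return "The answer is 391.", "stop"

answer, used = call_with_growing_budget(fake_call)
print(answer, f"(max_tokens={used})")
```

To use it against a real server, `call` would wrap a request made with the given `max_tokens` and return the message content together with `choices[0].finish_reason`.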

You can control thinking with reasoning_effort:

Value	Meaning
`"none"`	Disable thinking (where supported) — fastest
`"low"`	Minimal thinking
`"medium"`	Default
`"high"`	Deep thinking — slowest, best for hard reasoning

Below we define chat() and vision_chat() that return both fields and a print_response() pretty-printer.

def _extract_parts(message) -> Tuple[str, str]:
    """Return (content, reasoning) from a ChatCompletionMessage, handling all field aliases."""
    content   = (getattr(message, "content", None) or "")
    # Ollama uses `reasoning`; other SDKs may use `reasoning_content` or `thinking`.
    reasoning = (getattr(message, "reasoning",          None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking",         None)
                 or "")
    return content.strip(), reasoning.strip()

def print_response(resp, show_reasoning: bool = True, max_reasoning_chars: int = 1200):
    """Pretty-print a ChatCompletion response, including reasoning if present."""
    choice  = resp.choices[0]
    content, reasoning = _extract_parts(choice.message)
    finish  = choice.finish_reason
    usage   = resp.usage

    # Banner
    print(f"┌─ model={resp.model}  finish_reason={finish}", end="")
    if usage is not None:
        print(f"  tokens={usage.prompt_tokens}+{usage.completion_tokens}={usage.total_tokens}")
    else:
        print()

    if show_reasoning and reasoning:
        print("│")
        print("│  🧠 REASONING")
        print("│  " + "─" * 50)
        snippet = reasoning if len(reasoning) <= max_reasoning_chars \
                            else reasoning[:max_reasoning_chars] + f"\n… [truncated, +{len(reasoning) - max_reasoning_chars} chars]"
        for line in snippet.splitlines():
            print(f"│  {line}")

    print("│")
    print("│  💬 ANSWER")
    print("│  " + "─" * 50)
    if content:
        for line in content.splitlines():
            print(f"│  {line}")
    elif reasoning:
        # Some thinking models leave content empty — fall back to reasoning.
        print("│  (content was empty — model emitted only reasoning; showing it as the answer)")
    else:
        print("│  (empty response)")

    if finish == "length":
        print("│")
        print("│  ⚠️  Output was truncated by max_tokens. Increase it (reasoning consumes tokens).")
    print("└" + "─" * 60)

def chat(prompt: str,
         system: str = "You are a helpful geographic assistant.",
         model: Optional[str] = None,
         temperature: float = 0.2,
         max_tokens: int = 2048,           # bumped from 512 — reasoning eats tokens
         reasoning_effort: Optional[str] = "none",  # None | "none" | "low" | "medium" | "high"
         return_full: bool = False,
         extra_body: Optional[dict] = None):
    """
    Single-turn chat. Returns the content string by default.
    If the model is a thinking model and content is empty, falls back to reasoning.
    Set return_full=True to get the raw ChatCompletion object.
    """
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort

    resp = client.chat.completions.create(
        model=model or TEXT_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning  # graceful fallback

def vision_chat(prompt: str, image: str,
                model: Optional[str] = None,
                temperature: float = 0.2,
                max_tokens: int = 2048,
                reasoning_effort: Optional[str] = None,
                return_full: bool = False,
                extra_body: Optional[dict] = None):
    """Vision counterpart of chat(). `image` may be a URL, local path, or data URI."""
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort
    
    img_url = encode_image_b64(image)
    resp = client.chat.completions.create(
        model=model or VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": img_url}},
            ],
        }],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning
    
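def encode_image_b64(path_or_url: str) -> str:
    """Return a base64 JPEG data URI for an http(s) URL, a local path, or an existing data URI."""
    if path_or_url.startswith("data:"):
        return path_or_url  # already a data URI: pass it through unchanged
    if path_or_url.startswith(("http://", "https://")):
        r = requests.get(
            path_or_url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30
        )
        r.raise_for_status()
        data = r.content
    else:
        data = Path(path_or_url).read_bytes()

    # Normalise to RGB, cap the longest side at 1024 px, and re-encode as JPEG
    # so the request payload stays small.
    img = Image.open(BytesIO(data))
    img = img.convert("RGB")
    img.thumbnail((1024, 1024))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)

    return f"data:image/jpeg;base64,{base64.b64encode(buf.getvalue()).decode()}"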

print("Helpers loaded: chat(), vision_chat(), print_response(), encode_image_b64()")

Helpers loaded: chat(), vision_chat(), print_response(), encode_image_b64()

# Quick sanity check — list the models the Ollama server has pulled.
# If this fails, the server is unreachable: check that `ollama serve` is running.
try:
    models = client.models.list()
    print("Models available locally on Ollama:")
    for m in models.data:
        print(f"  - {m.id}")
except Exception as e:
    print(f"⚠️  Cannot reach Ollama: {e}")
    print("   Run `ollama serve`, then `ollama pull llama3.1:8b` and `ollama pull qwen2.5vl:7b`")

Models available locally on Ollama:
  - qwen3.5:latest
  - gemma4:latest
  - llama3.1:8b

§2: Geographic Q&A and spatial reasoning (≈ 10 min)

LLMs encode a surprising amount of world geographic knowledge in their parameters: country borders, capitals, rough coordinates of major cities, climate zones, landmarks. But they also hallucinate confidently — especially about coordinates and small places.

We’ll use the chat() helper defined above. For thinking models, set reasoning_effort="medium" and pass return_full=True if you want to inspect the reasoning trace via print_response().

# Zero-shot factual recall — content only
print(chat("What is the capital of Finland, and roughly at what latitude does it sit?"))

The capital of Finland is **Helsinki**.

It sits at a latitude of approximately **60° 10′ N** (or about **60.17° North**).

This places Helsinki slightly south of the Arctic Circle (which is at 66° 33′ N), but still well within the subarctic zone, giving it a distinct northern European climate with long, dark winters and bright, long summers.

# Same call, but ask for the FULL response so we can see reasoning (if the model emits any).
resp = chat(
    "Is Tallinn north or south of Helsinki? "
    "Answer in one sentence and give the approximate latitude of each.",
    return_full=True,
)
print_response(resp)

┌─ model=qwen3.5  finish_reason=stop  tokens=45+38=83
│
│  💬 ANSWER
│  ──────────────────────────────────────────────────
│  Tallinn is located south of Helsinki, with Tallinn at approximately 59.44° N latitude and Helsinki at approximately 60.17° N latitude.
└────────────────────────────────────────────────────────────

2.1 Chain-of-thought and reasoning models

There are two ways to get a model to “think”:

Prompt-level CoT: tell any model to “think step by step” in the prompt. Works on Llama, Gemma 2, Qwen 2.5, etc.
Native reasoning: thinking models (DeepSeek-R1, Qwen3, Gemma 3 thinking, GPT-OSS) emit their trace in a separate reasoning field on the message, and reasoning_effort controls how much they think.

The cell below uses return_full=True + print_response() so you can see whichever the model produced.

cot_prompt = (
    "I am in Ghent, Belgium and want to drive to Aalto University in Espoo, Finland.\n"
    "Which of these countries will I most plausibly cross or take a ferry through, "
    "and in what order: Germany, Sweden, Denmark, Estonia, Norway?\n"
    "Think step by step about the geography, then give the final ordered list."
)

# Crank max_tokens because reasoning eats them. reasoning_effort is silently
# ignored by non-thinking models, so this is safe to leave on by default.
resp = chat(cot_prompt, max_tokens=5000, reasoning_effort="low", return_full=True)
print_response(resp, show_reasoning=True)

┌─ model=qwen3.5  finish_reason=length  tokens=92+5000=5092
│
│  🧠 REASONING
│  ──────────────────────────────────────────────────
│  Thinking Process:
│  
│  1.  **Analyze the Request:**
│      *   Starting point: Ghent, Belgium.
│      *   Destination: Aalto University, Espoo, Finland.
│      *   Task: Identify which countries from the provided list (Germany, Sweden, Denmark, Estonia, Norway) will be crossed or taken a ferry through, and determine the order.
│      *   Constraint: Think step by step about the geography.
│      *   Output: Final ordered list.
│  
│  2.  **Geographical Analysis:**
│      *   **Start:** Ghent, Belgium (Western Europe).
│      *   **Destination:** Espoo, Finland (Northern Europe, across the Gulf of Finland from Helsinki).
│      *   **Route:** Driving from Belgium to Finland.
│      *   **Neighboring Countries:**
│          *   Belgium borders France, Germany, Netherlands, Luxembourg.
│          *   To get to Finland, one must cross the Baltic Sea.
│          *   The main land route involves crossing the Baltic Sea via ferry or driving through the Baltic states.
│      *   **Countries in the List:** Germany, Sweden, Denmark, Estonia, Norway.
│  
│  3.  **Evaluate Each Country:**
│      *   **Germany:** Belgium borders Germany (via Netherlands or directly). Driving from Belgium to Finland usually goes through the Netherlands, then Germany (o
│  … [truncated, +16983 chars]
│
│  💬 ANSWER
│  ──────────────────────────────────────────────────
│  (content was empty — model emitted only reasoning; showing it as the answer)
│
│  ⚠️  Output was truncated by max_tokens. Increase it (reasoning consumes tokens).
└────────────────────────────────────────────────────────────

2.2 ⚠️ Hallucination probe

Always verify coordinates that come out of an LLM. The model will happily produce a plausible-looking lat/lon for an obscure village that is wrong by tens of kilometres. For production geocoding, use Nominatim, the Google Geocoding API, or a vetted gazetteer — and use the LLM only for normalization and disambiguation.

# This will likely return a plausible but unreliable number — a teaching moment.
print(chat(
    "Give the latitude and longitude of the village 'Sotkamo, Finland' "
    "to four decimal places. Just the numbers."
))

63.5167 23.4833
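To make the verification advice concrete, the sketch below measures how far the model's answer lies from a reference position using the haversine great-circle distance. The reference coordinates are an approximate, hand-entered assumption for Sotkamo (about 64.13° N, 28.39° E); in practice they would come from a geocoder or gazetteer. `haversine_km` is a helper name introduced here for illustration.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 points (spherical Earth)."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Coordinates returned by the model in the cell above.
llm_lat, llm_lon = 63.5167, 23.4833
# ASSUMPTION: approximate reference position, entered by hand for illustration.
ref_lat, ref_lon = 64.13, 28.39

error_km = haversine_km(llm_lat, llm_lon, ref_lat, ref_lon)
print(f"LLM answer is about {error_km:.0f} km from the reference point")
```

Under that assumed reference, the error is in the hundreds of kilometres, far beyond what four decimal places suggest. A simple threshold on this distance is one way to flag LLM-produced coordinates for review.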

§3: Few-shot reasoning over OSM tags (≈ 10 min)

OpenStreetMap features are described by key=value tags such as amenity=cafe, highway=residential, building=yes. Mapping natural language to OSM tags is a recurring sub-task in geospatial NLP — it powers natural-language search over OSM and is central to vision-free navigation systems built on OSM graphs.

We’ll build a tiny few-shot tag predictor.

FEW_SHOT_EXAMPLES = [
    ("a small bakery on the corner",
     "shop=bakery"),
    ("the main railway station",
     "railway=station, public_transport=station"),
    ("a paved cycle path along the river",
     "highway=cycleway, surface=paved"),
    ("a Lutheran church from the 19th century",
     "amenity=place_of_worship, religion=christian, denomination=lutheran"),
    ("a roundabout with three exits",
     "highway=primary, junction=roundabout"),
]

def build_fewshot_prompt(query):
    blocks = [
        "You are an OpenStreetMap tagging expert. "
        "Given a natural-language description, output the most likely OSM tags "
        "as a comma-separated list of key=value pairs. Output ONLY the tags, no prose.\n"
    ]
    for desc, tags in FEW_SHOT_EXAMPLES:
        blocks.append(f"Description: {desc}\nTags: {tags}\n")
    blocks.append(f"Description: {query}\nTags:")
    return "\n".join(blocks)

query = "an outdoor swimming pool in a public park"
print(build_fewshot_prompt(query))
print("---")
print("Predicted:", chat(build_fewshot_prompt(query), temperature=0.0))

You are an OpenStreetMap tagging expert. Given a natural-language description, output the most likely OSM tags as a comma-separated list of key=value pairs. Output ONLY the tags, no prose.

Description: a small bakery on the corner
Tags: shop=bakery

Description: the main railway station
Tags: railway=station, public_transport=station

Description: a paved cycle path along the river
Tags: highway=cycleway, surface=paved

Description: a Lutheran church from the 19th century
Tags: amenity=place_of_worship, religion=christian, denomination=lutheran

Description: a roundabout with three exits
Tags: highway=primary, junction=roundabout

Description: an outdoor swimming pool in a public park
Tags:
---
Predicted: leisure=swimming_pool, outdoor=yes

# A handful of test queries — the model should generalise from 5 examples.
test_queries = [
    "a 24-hour pharmacy",
    "a wooden footbridge over a small stream",
    "an electric vehicle charging station with 4 plugs",
    "an unpaved hiking trail in a national park",
]
for q in test_queries:
    print(f"• {q!r}")
    print(f"  → {chat(build_fewshot_prompt(q), temperature=0.0).strip()}\n")

• 'a 24-hour pharmacy'
  → shop=pharmacy, opening_hours=24/7

• 'a wooden footbridge over a small stream'
  → highway=footway, bridge=yes, material=wood

• 'an electric vehicle charging station with 4 plugs'
  → highway=charging_station, charging_station:capacity=4

• 'an unpaved hiking trail in a national park'
  → highway=footway, surface=unpaved, leisure=hiking
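Some of these predictions look plausible but do not follow OSM conventions: pharmacies are tagged amenity=pharmacy, and charging stations are tagged amenity=charging_station. A minimal sketch of a sanity check follows. `KNOWN_TAGS` is a tiny hand-written sample rather than the real OSM tag list, and `parse_tags` and `check_tags` are hypothetical helpers; a production check would compare against a fuller reference such as the OSM wiki's Map Features page or taginfo.

```python
from typing import Dict, List, Set, Tuple

# ASSUMPTION: a tiny illustrative allowlist, not a complete OSM tag reference.
KNOWN_TAGS: Dict[str, Set[str]] = {
    "amenity": {"pharmacy", "charging_station", "cafe", "place_of_worship"},
    "shop": {"bakery"},
    "highway": {"footway", "cycleway", "path", "residential", "primary"},
    "leisure": {"swimming_pool", "park"},
}

def parse_tags(text: str) -> List[Tuple[str, str]]:
    """Split 'k1=v1, k2=v2' into (key, value) pairs, skipping malformed chunks."""
    pairs = []
    for chunk in text.split(","):
        key, sep, value = chunk.strip().partition("=")
        if sep and key.strip() and value.strip():
            pairs.append((key.strip(), value.strip()))
    return pairs

def check_tags(text: str) -> List[Tuple[str, str]]:
    """Label each tag 'ok', 'suspect' (known key, unknown value), or 'unchecked' (unknown key)."""
    report = []
    for key, value in parse_tags(text):
        if key not in KNOWN_TAGS:
            status = "unchecked"
        elif value in KNOWN_TAGS[key]:
            status = "ok"
        else:
            status = "suspect"
        report.append((f"{key}={value}", status))
    return report

for prediction in ["shop=pharmacy, opening_hours=24/7",
                   "highway=charging_station, charging_station:capacity=4"]:
    print(check_tags(prediction))
```

Tags marked "suspect" are candidates for a corrective few-shot example or a second prompt; "unchecked" means only that the sample allowlist has no opinion about that key.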

§4: Structured outputs: geo-entity extraction with Pydantic (≈ 10 min)

For pipelines, you almost never want free-form prose. Ollama supports JSON-schema-constrained decoding by passing the schema to the format= parameter (we send it through extra_body={"format": schema} so the OpenAI SDK forwards it to Ollama unchanged).
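For comparison, the OpenAI SDK's own way of requesting schema-constrained output is the response_format parameter with type "json_schema". Whether a given Ollama version honours it on the /v1 endpoint is an assumption to check against your server. The sketch below only builds the payload, so it runs without a server; `json_schema_response_format` is a hypothetical helper and the schema is a hand-written stand-in, not the Pydantic-generated one used in this section.

```python
import json

def json_schema_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON Schema dict in the OpenAI-style `response_format` envelope."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }

# Hand-written stand-in for a POI-list schema (illustrative only).
poi_list_schema = {
    "type": "object",
    "properties": {
        "pois": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "category": {"type": "string"},
                },
                "required": ["name", "category"],
            },
        }
    },
    "required": ["pois"],
}

response_format = json_schema_response_format("poi_list", poi_list_schema)
print(json.dumps(response_format, indent=2))

# Hypothetical usage (not executed here):
#   client.chat.completions.create(..., response_format=response_format)
```

If one route yields empty or unconstrained output on your server, trying the other is a cheap experiment.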

We’ll define a Pydantic schema for points of interest (POIs) and extract them from a paragraph — a recurring need in tasks like building a navigation knowledge base from Wikipedia, tourist guides, or trip reports.

class POI(BaseModel):
    "A single point of interest mentioned in the text."
    name:     str           = Field(..., description="Proper name of the place.")
    category: str           = Field(..., description="OSM-style category, e.g. museum, park, restaurant.")
    city:     Optional[str] = Field(None, description="The city the POI is in, if mentioned.")
    country:  Optional[str] = Field(None, description="The country, if it can be inferred.")

class POIList(BaseModel):
    pois: List[POI]

# JSON schema for guided decoding
schema = POIList.model_json_schema()
print(json.dumps(schema, indent=2)[:600], "...")

{
  "$defs": {
    "POI": {
      "description": "A single point of interest mentioned in the text.",
      "properties": {
        "name": {
          "description": "Proper name of the place.",
          "title": "Name",
          "type": "string"
        },
        "category": {
          "description": "OSM-style category, e.g. museum, park, restaurant.",
          "title": "Category",
          "type": "string"
        },
        "city": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
 ...

def extract_pois(text):
    resp = text_client.chat.completions.create(
        model=TEXT_MODEL,
        messages=[
            {"role": "system",
             "content": "Extract every point of interest from the user text into the given JSON schema. "
                        "Only include places, not abstract concepts."},
            {"role": "user", "content": text},
        ],
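        temperature=0.0,
        max_tokens=2048,  # thinking models can exhaust a small budget before emitting any JSON
        # Ollama-specific: pass the schema via `format` to constrain decoding
        # to a JSON object that matches the schema exactly.
        extra_body={"format": schema},
    )
    content = resp.choices[0].message.content or ""
    if not content.strip():
        raise ValueError("Model returned empty content; increase max_tokens or reduce reasoning.")
    # repair_json(..., return_objects=True) returns Python objects, so validate
    # them directly instead of round-tripping through json.dumps.
    decoded_object = json_repair.repair_json(content, return_objects=True)
    return POIList.model_validate(decoded_object)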

paragraph = (
    "On our trip to Helsinki we started at Senate Square, walked to the Ateneum art museum, "
    "and had coffee at Café Regatta near Sibelius Park. The next day we took the ferry from "
    "the South Harbour to Suomenlinna fortress, then flew from Helsinki-Vantaa to Tallinn, "
    "where we visited Kadriorg Palace and the Estonian Open Air Museum."
)
result = extract_pois(paragraph)
for p in result.pois:
    print(f"• {p.name:35s}  {p.category:20s}  {p.city or '-':10s}  {p.country or '-'}")

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[13], line 24
---> 24 result = extract_pois(paragraph)

Cell In[13], line 16, in extract_pois(text)
     14         # extra_body={"format": schema},
     15     )
---> 16     return POIList.model_validate_json(resp.choices[0].message.content)

ValidationError: 1 validation error for POIList
  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid

⚠️ The recorded run above failed. It comes from an earlier version of the cell (the extra_body line is commented out and the raw content is validated directly). input_value='' means the model returned an empty content string, so there was no JSON to parse. With thinking models this usually means the token budget was spent on reasoning (see §1.1).

🛠️ Why this matters for GeoAI. This is the same pattern you would use to:

Build a structured knowledge base from a corpus of trip reports for a navigation agent.

Convert reviewer comments into structured fields for a meta-review pipeline.

Pre-process tourist guides into geocodable entities before passing them to Nominatim.

§5: Vision LLMs for satellite & map understanding (≈ 10 min)

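Vision-language models extend the same chat interface with image inputs. The OpenAI message format allows an image to be given either as a URL or as a base64 data URI, but the Ollama endpoint used here rejects plain URLs ("image URLs are not currently supported, please use base64 encoded data instead"), so our helpers always download the image and send it as a base64 data URI.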

We’ll do three things:

Describe a satellite image of an urban area.
Read text and symbols from a map screenshot.
Combine vision + structured output to extract an inventory of visible features.

(vision_chat() and encode_image_b64() are already defined in §1.1 — we just reuse them.)

5.1 Describing a satellite image

We’ll use a public Wikimedia Commons satellite image. Replace with any image URL or local path; for class we recommend pre-downloading 2–3 images to avoid wifi issues.

# A public-domain satellite image of central Helsinki from Wikimedia Commons.
# If you have local sample images, point at /mnt/data/<file>.jpg instead.
SAT_IMAGE = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/Helsinki_by_Sentinel-2%2C_2020-06-26.jpg/1280px-Helsinki_by_Sentinel-2%2C_2020-06-26.jpg"
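img_resp = requests.get(SAT_IMAGE, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
img_resp.raise_for_status()  # fail early on HTTP errors instead of decoding an error page
img = Image.open(BytesIO(img_resp.content))
img.show()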

prompt = (
    "This is a satellite/aerial photograph. Describe in 4–6 bullets what you can see, "
    "focusing on land use (water, built-up areas, vegetation, transport infrastructure). "
    "Do NOT guess the city name; only describe what is visible."
)

print(vision_chat(prompt, SAT_IMAGE))

../_images/9c99c0b530b55638eb23d13ac732f84fe2b6aaf93297236941fd0d70286d2a2e.png

*   **Water:** A large, deep blue body of water dominates the lower right portion of the image, featuring a coastline dotted with numerous small islands and skerries.
*   **Built-up Areas:** A dense, sprawling urban area occupies the central and upper-left landmass, characterized by a mix of residential neighborhoods and commercial development with a visible grid-like street pattern.
*   **Airport:** A distinct airport complex with intersecting runways is clearly visible just north of the main urban cluster, surrounded by open fields.
*   **Vegetation:** The northern and eastern peripheries are dominated by dense, dark green forests, while the upper left quadrant shows patches of lighter green agricultural fields and pastures.
*   **Transport Infrastructure:** A network of roads and highways connects the city center to the surrounding rural areas, with infrastructure extending towards the coastal water and inland lakes.

5.2 Vision + structured output

We can combine guided JSON decoding with vision inputs. Below we ask the model to output a structured land-cover inventory.

class LandCover(BaseModel):
    class_name: str = Field(..., description="One of: water, vegetation, built_up, road, rail, bare_ground, other")
    coverage:   str = Field(..., description="One of: dominant, frequent, rare, absent")
    notes:      Optional[str] = None

class LandCoverReport(BaseModel):
    items:           List[LandCover]
    overall_setting: str = Field(..., description="One short phrase: e.g. coastal city, rural farmland, dense forest.")

prompt = (
    "Inventory the land-cover classes visible in this aerial image. "
    "Use the schema strictly; cover every class, marking absent ones as 'absent'."
)

resp = vision_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": encode_image_b64(SAT_IMAGE)}},
        ],
    }],
    temperature=0.0,
    max_tokens=700,
    extra_body={"format": LandCoverReport.model_json_schema()},
)

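# repair_json(..., return_objects=True) returns Python objects, so validate
# them directly instead of round-tripping through json.dumps.
decoded_object = json_repair.repair_json(resp.choices[0].message.content or "", return_objects=True)
report = LandCoverReport.model_validate(decoded_object)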
print("Overall setting:", report.overall_setting)
print()
for it in report.items:
    print(f"  {it.class_name:12s}  {it.coverage:10s}  {it.notes or ''}")

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[23], line 15
   overall_setting: str = Field(..., description="One short phrase: e.g. coastal city, rural farmland, dense forest.")
prompt = (
   "Inventory the land-cover classes visible in this aerial image. "
   "Use the schema strictly; cover every class, marking absent ones as 'absent'."
)
---> 15 resp = vision_client.chat.completions.create(
   model=VISION_MODEL,
   messages=[{
       "role": "user",
       "content": [
           {"type": "text", "text": prompt},
           {"type": "image_url", "image_url": {"url": SAT_IMAGE}},
       ],
   }],
   temperature=0.0,
   max_tokens=700,
   extra_body={"format": LandCoverReport.model_json_schema()},
)
report = LandCoverReport.model_validate_json(resp.choices[0].message.content)
print("Overall setting:", report.overall_setting)

File /projappl/project_2018216/LLM/lib/python3.12/site-packages/openai/_utils/_utils.py:287, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
           msg = f"Missing required argument: {quote(missing[0])}"
   raise TypeError(msg)
--> 287 return func(*args, **kwargs)

File /projappl/project_2018216/LLM/lib/python3.12/site-packages/openai/resources/chat/completions/completions.py:1211, in Completions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, prompt_cache_key, prompt_cache_retention, reasoning_effort, response_format, safety_identifier, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, verbosity, web_search_options, extra_headers, extra_query, extra_body, timeout)
@required_args(["messages", "model"], ["messages", "model", "stream"])
def create(
   self,
   (...)
   timeout: float | httpx.Timeout | None | NotGiven = not_given,
) -> ChatCompletion | Stream[ChatCompletionChunk]:
   validate_response_format(response_format)
-> 1211     return self._post(
       "/chat/completions",
       body=maybe_transform(
           {
               "messages": messages,
               "model": model,
               "audio": audio,
               "frequency_penalty": frequency_penalty,
               "function_call": function_call,
               "functions": functions,
               "logit_bias": logit_bias,
               "logprobs": logprobs,
               "max_completion_tokens": max_completion_tokens,
               "max_tokens": max_tokens,
               "metadata": metadata,
               "modalities": modalities,
               "n": n,
               "parallel_tool_calls": parallel_tool_calls,
               "prediction": prediction,
               "presence_penalty": presence_penalty,
               "prompt_cache_key": prompt_cache_key,
               "prompt_cache_retention": prompt_cache_retention,
               "reasoning_effort": reasoning_effort,
               "response_format": response_format,
               "safety_identifier": safety_identifier,
               "seed": seed,
               "service_tier": service_tier,
               "stop": stop,
               "store": store,
               "stream": stream,
               "stream_options": stream_options,
               "temperature": temperature,
               "tool_choice": tool_choice,
               "tools": tools,
               "top_logprobs": top_logprobs,
               "top_p": top_p,
               "user": user,
               "verbosity": verbosity,
               "web_search_options": web_search_options,
           },
           completion_create_params.CompletionCreateParamsStreaming
           if stream
           else completion_create_params.CompletionCreateParamsNonStreaming,
       ),
       options=make_request_options(
           extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
       ),
       cast_to=ChatCompletion,
       stream=stream or False,
       stream_cls=Stream[ChatCompletionChunk],
   )

File /projappl/project_2018216/LLM/lib/python3.12/site-packages/openai/_base_client.py:1314, in SyncAPIClient.post(self, path, cast_to, body, content, options, files, stream, stream_cls)
   warnings.warn(
       "Passing raw bytes as `body` is deprecated and will be removed in a future version. "
       "Please pass raw bytes via the `content` parameter instead.",
       DeprecationWarning,
       stacklevel=2,
   )
opts = FinalRequestOptions.construct(
   method="post", url=path, json_data=body, content=content, files=to_httpx_files(files), **options
)
-> 1314 return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File /projappl/project_2018216/LLM/lib/python3.12/site-packages/openai/_base_client.py:1087, in SyncAPIClient.request(self, cast_to, options, stream, stream_cls)
           err.response.read()
       log.debug("Re-raising status error")
-> 1087         raise self._make_status_error_from_response(err.response) from None
   break
assert response is not None, "could not resolve response (should never happen)"

BadRequestError: Error code: 400 - {'error': {'message': 'image URLs are not currently supported, please use base64 encoded data instead', 'type': 'invalid_request_error', 'param': None, 'code': None}}
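
The 400 error above says the endpoint rejects remote image URLs and expects base64-encoded data instead. Below is a minimal sketch of one way to meet that requirement, assuming the server accepts OpenAI-style `image_url` content parts that carry a `data:` URL. The helper names `to_data_url`, `file_to_data_url`, and `build_vision_messages` are hypothetical and are not the notebook's own `vision_chat` implementation.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a local image file and encode it, guessing the MIME type from the suffix."""
    mime_type, _ = mimetypes.guess_type(path)
    return to_data_url(Path(path).read_bytes(), mime_type or "image/png")


def build_vision_messages(prompt: str, data_url: str) -> list:
    """Build one OpenAI-style user message with a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Placeholder bytes for illustration only (just the PNG signature, not a full image).
# In practice, call file_to_data_url() on a real image file.
demo_url = to_data_url(b"\x89PNG\r\n\x1a\n", "image/png")
messages = build_vision_messages("Describe this image.", demo_url)
print(demo_url)
```

The resulting `messages` list has the shape that a `chat.completions.create(...)` call expects, so it can replace a message that previously carried an `http(s)` image URL.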

5.3 Map-screenshot reading#

Vision LLMs are particularly useful on rendered maps: they can read labels, identify symbols, and follow routes. Below we pass an OSM-style map tile to the model and ask it to list the street names, place labels, and POI symbols it can read.

⚠️ Reading text from low-resolution map tiles is hard, so expect partial answers. This is a good place to discuss failure modes with the class: which kinds of labels does the model miss? Tiny streets? Non-Latin scripts? Ambiguous icons? Evaluation work on vision-language navigation studies these same failure patterns.

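# A locally saved OSM tile around central Helsinki; replace the path as needed.
# If you download tiles yourself, follow the OSM tile usage policy
# (identify your app in the User-Agent instead of imitating a browser).
MAP_IMAGE = "./osm.png"
# MAP_IMAGE is a local file path, so open it directly; requests.get() only handles URLs.
img = Image.open(MAP_IMAGE)
img.show()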
prompt = (
    "This is a screenshot of an OpenStreetMap tile. "
    "List every street name, place label, and POI symbol you can read. "
    "Group your answer as: STREETS, PLACES, POIs."
)
print(vision_chat(prompt, MAP_IMAGE))

[Image: the picture loaded from MAP_IMAGE, as displayed by img.show()]

The user wants me to identify text and symbols from the provided image.

1.  **Analyze the image:**
    *   The image is a screenshot of a web page or application interface.
    *   It has a yellow and black hazard stripe on the left.
    *   There is a number "418" in the top right.
    *   There is a small map graphic in the top left with a magnifying glass over it.
    *   The main text is a large heading "Access blocked".
    *   Below that is a paragraph of text explaining the block: "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked".
    *   There is a magnifying glass icon over a map.

2.  **Categorize the findings:**
    *   **STREETS:** I don't see any street names. The map graphic is too small and blurry to read any street names.
    *   **PLACES:** I don't see any specific place labels (like city names, park names) other than the generic "OpenStreetMap" text which is part of the error message, not a place label on a map. Wait, let me look closer at the map graphic. It's just a generic map tile. No readable place names.
    *   **POIs:** There are no Point of Interest symbols (like a red pin for a restaurant, a blue pin for a shop). The magnifying glass is an icon, not a POI symbol on a map.

3.  **Refine the extraction:**
    *   The prompt asks for "street name, place label, and POI symbol".
    *   Looking strictly at the text provided:
        *   "418" (Number)
        *   "Access blocked" (Heading)
        *   "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked" (Body text)
    *   Looking at the map graphic: It's a green map with blue water. No readable text.
    *   Looking at the magnifying glass: It's an icon.

    *   Wait, the prompt implies this *is* a screenshot of an OpenStreetMap tile, but the image provided is actually an error page *about* OSM tiles. The user says "This is a screenshot of an OpenStreetMap tile." This is a bit of a trick or a misunderstanding by the user. The image is clearly an error message page (likely from a tile server like `tile.openstreetmap.org` or a similar service that blocks unauthorized access).

    *   Let's re-read the prompt: "List every street name, place label, and POI symbol you can read."
    *   Since there are no readable street names, place labels, or POI symbols on the map part (it's too small/blurry), I should state that or list the text that *is* there if it fits the categories loosely.
    *   Actually, looking very closely at the map graphic in the top left... it's just a generic map. No text.
    *   The text "OpenStreetMap" is present. Is that a place label? No, it's a brand name.
    *   The text "osm.wiki/Blocked" is a URL.

    *   Let's look at the categories again.
        *   **STREETS:** None visible.
        *   **PLACES:** None visible (except maybe "OpenStreetMap" as a proper noun, but it's not a place label).
        *   **POIs:** None visible.

    *   However, usually, these prompts want me to extract *any* text. But the prompt is specific: "street name, place label, and POI symbol".
    *   If I can't find them, I should say "None found" or list the text that is there and explain.
    *   Let's look really closely at the map. Is there a street name? No.
    *   Is there a POI symbol? The magnifying glass is an icon.
    *   The prompt might be a trick question because the image is an error page, not a map tile.

    *   Let's reconsider the user's statement: "This is a screenshot of an OpenStreetMap tile." The user is mistaken. It's an error page.
    *   I need to be helpful. I will list the text I see, but clarify that there are no street names or POIs because it's an error message.
    *   Wait, maybe I missed something. Let's look at the map again. It's a green landmass and blue water. No text.
    *   So, the answer is essentially:
        *   STREETS: None.
        *   PLACES: None.
        *   POIs: None.
    *   But I should probably transcribe the text that *is* there just in case, or explain why.
    *   Actually, looking at the prompt again, it asks to "List every street name, place label, and POI symbol you can read."
    *   If I can't read them, I list nothing for those categories.
    *   But maybe the user considers "OpenStreetMap" a place? No.
    *   Maybe the user considers the magnifying glass a POI symbol? No, it's a UI element.

    *   Let's try to interpret the prompt as "Extract text from the image".
    *   Text: "418", "Access blocked", "App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked".
    *   But the specific categories are STREETS, PLACES, POIs.
    *   Since there are none, I will state that.

    *   Let's look at the image one more time. Is it possible that the map in the top left *does* have text?
    *   It looks like a standard OSM tile. It has a green area (land) and blue area (water). There might be tiny text, but it's illegible.
    *   So, the correct answer is that there are no readable street names, place labels, or POI symbols. The image is an error message.

    *   However, to be most helpful, I will list the text present in the image, categorized as best as possible, or just state there are none.
    *   Let's assume the user wants me to identify the text elements.
    *   "418" -> Number.
    *   "Access blocked" -> Heading.
    *   "App is not following..." -> Body text.
    *   "OpenStreetMap" -> Brand name.

    *   Okay, I will formulate the response.
    *   STREETS: None visible.
    *   PLACES: None visible.
    *   POIs: None visible.
    *   (Self-correction): The prompt asks me to list them. If there are none, I say none.
    *   Wait, is "OpenStreetMap" a place? No.
    *   Is "osm.wiki" a place? No.

    *   Let's look at the prompt again. "This is a screenshot of an OpenStreetMap tile."
    *   The user is asserting this. I should gently correct them or just answer based on what is visible.
    *   What is visible is an error message.
    *   So, "No street names, place labels, or POI symbols are visible. The image is an error message page."

    *   Let's try to find *any* text that could be construed as a place.
    *   "OpenStreetMap" is the name of the project.
    *   "osm.wiki" is a URL.

    *   Okay, I'll stick to the facts.
    *   STREETS: [Empty]
    *   PLACES: [Empty]
    *   POIs: [Empty]
    *   And maybe add a note about the text.

    *   Actually, looking at the very top left, there is a map. Is there a street name there? It's too blurry.
    *   Is there a POI symbol? The magnifying glass is an icon.
    *   So, the answer is effectively empty for the requested categories.

    *   Let's write the response.
    *   "Based on the image provided, which appears to be an error message rather than a map tile, there are no readable street names, place labels, or POI symbols."
    *   Then I can list the text that *is* there.
    *   "Text visible: '418', 'Access blocked', 'App is not following the tile usage policy of OpenStreetMap's volunteer-run servers: osm.wiki/Blocked'."

    *   Wait, the prompt says "Group your answer as: STREETS, PLACES, POIs."
    *   So I should follow that format.
    *   STREETS: None.
    *   PLACES: None.
    *   POIs: None.

    *   Is it possible that "OpenStreetMap" counts as a place? No.
    *   Is it possible that the user considers the magnifying glass a POI? Unlikely.
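
To make the failure-mode discussion concrete, one option is to score a reply against a hand-made list of labels for the tile. The sketch below assumes the model answers in the STREETS / PLACES / POIs grouping requested by the prompt; the function names, the sample reply, and the expected labels are all hypothetical and chosen only for illustration.

```python
import re

HEADER = re.compile(r"^(streets|places|pois)\s*(?::\s*(.*))?$", re.IGNORECASE)
EMPTY_MARKERS = {"none", "none visible", "none found"}


def parse_grouped_answer(answer: str) -> dict:
    """Split a reply into STREETS / PLACES / POIS sets of case-folded labels."""
    groups = {"STREETS": set(), "PLACES": set(), "POIS": set()}
    current = None
    for raw_line in answer.splitlines():
        line = raw_line.replace("*", "").strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            current = header.group(1).upper()
            line = header.group(2) or ""
        if current is None:
            continue  # ignore any preamble before the first group header
        for item in re.split(r"[,;]", line):
            label = item.strip(" -•\t.").casefold()
            if label and label not in EMPTY_MARKERS:
                groups[current].add(label)
    return groups


def recall_by_group(predicted: dict, expected: dict) -> dict:
    """Fraction of expected labels found per group (None when nothing is expected)."""
    scores = {}
    for group, truth in expected.items():
        truth_norm = {t.casefold() for t in truth}
        if not truth_norm:
            scores[group] = None
            continue
        found = predicted.get(group, set()) & truth_norm
        scores[group] = len(found) / len(truth_norm)
    return scores


# Hypothetical reply and hand-made expected labels, for illustration only.
reply = """STREETS: Aleksanterinkatu, Mannerheimintie
PLACES:
- Kamppi
POIs: None"""
expected = {
    "STREETS": ["Aleksanterinkatu", "Mannerheimintie", "Pohjoisesplanadi"],
    "PLACES": ["Kamppi", "Kluuvi"],
    "POIS": [],
}
scores = recall_by_group(parse_grouped_answer(reply), expected)
print(scores)
```

Exact string matching is deliberately strict here. Partially read or misspelled labels count as misses, which keeps the score easy to interpret in class but understates near-misses.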

§6 — Wrap-up#

You now have working code patterns for:

✅ Talking to an Ollama server with the openai SDK.
✅ Zero-shot, few-shot, and CoT prompting on geographic tasks.
✅ OSM tag prediction from natural language.
✅ JSON-schema-guided extraction of structured POIs.
✅ Vision LLM calls on satellite images and map tiles.
✅ Combining vision + guided JSON for structured remote-sensing-style outputs.

Key takeaways#

Theme	Lesson
Knowledge	LLMs know a lot of geography but hallucinate coordinates.
Prompting	Few-shot + CoT beats zero-shot for OSM-style structured tasks.
Reliability	Always pair LLM output with a downstream validator (Pydantic, gazetteer lookup).
Vision	VLMs describe land cover well; reading map labels is fragile.
Architecture	The same `chat.completions` call covers text and vision — easy to swap models.
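
The "Knowledge" and "Reliability" rows can be combined into a simple check: compare an LLM-proposed coordinate against a trusted reference before using it. The sketch below is one possible shape for such a validator, not the course's implementation. The two-entry `GAZETTEER` dictionary, the `check_coordinate` helper, and the 25 km tolerance are illustrative assumptions; a real pipeline would query a proper gazetteer service.

```python
import math

# Tiny stand-in gazetteer (hypothetical); values are (latitude, longitude) in degrees.
GAZETTEER = {
    "helsinki": (60.1699, 24.9384),
    "tampere": (61.4978, 23.7610),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def check_coordinate(name, lat, lon, tolerance_km=25.0):
    """Return (ok, reason) for an LLM-proposed coordinate of a named place."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False, "out of range"
    ref = GAZETTEER.get(name.casefold())
    if ref is None:
        return False, "not in gazetteer"
    dist = haversine_km(lat, lon, *ref)
    if dist > tolerance_km:
        return False, f"{dist:.0f} km from gazetteer entry"
    return True, "ok"


# A plausible answer, then the same answer with latitude and longitude swapped.
print(check_coordinate("Helsinki", 60.17, 24.94))
print(check_coordinate("Helsinki", 24.94, 60.17))
```

The range check catches impossible values, while the distance check catches the more common case of a coordinate that is valid but in the wrong place, such as swapped latitude and longitude.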

Next#

In Part 2 you’ll get a notebook with the same skeleton but with # TODO blocks. You will build:

A natural-language → coordinate geocoder shim.
An OSM-tag classifier with confidence scores.
A vision-LLM landmark identifier.
A small Geo-RAG system over Wikipedia city descriptions.

Good luck! 🗺️
