LLMs & Vision LLMs for GeoAI — Part 2: Hands-on Lab

Course: GeoAI / Multimodal Geospatial Reasoning

Duration: ~45 minutes

Student notebook: fill in the # TODO blocks


What you’ll build

Four mini-exercises, finishing with a small Geo-RAG system:

| # | Exercise | What you practise |
|---|----------|-------------------|
| 1 | NL → coordinate geocoder | Prompting + validating LLM output |
| 2 | OSM-tag classifier with confidence | Few-shot + guided JSON |
| 3 | Vision-LLM landmark identifier | Multimodal structured output |
| 4 | Geo-RAG over city descriptions | Retrieval + grounded generation |

Ground rules

  • Every # TODO is small (1–10 lines). If your block grows past ~15 lines, ask for help.

  • After each TODO, run the provided test cell below it. It should print ✅.

  • If a test fails, read the error carefully — most failures are schema / prompt format issues, not Ollama problems.

  • Reference the Part 1 instructor notebook freely.

💡 Tip. Set temperature=0.0 for everything in this lab so that repeated runs are as reproducible as possible.

§0 — Setup (provided, just run)

# !pip install --quiet openai pydantic pillow requests numpy scikit-learn
import os, json, base64, math
from io import BytesIO
from typing import List, Optional, Tuple
from pathlib import Path

import requests
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field

# ---- Ollama endpoint ------------------------------------------------------
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1")
TEXT_MODEL      = os.environ.get("OLLAMA_TEXT_MODEL",   "llama3.1:8b")
VISION_MODEL    = os.environ.get("OLLAMA_VISION_MODEL", "qwen2.5vl:7b")

client        = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
text_client   = client
vision_client = client

# ---- Reasoning-aware helpers ---------------------------------------------
# Some Ollama models (DeepSeek-R1, Qwen3, Gemma 3 thinking, GPT-OSS) put their
# chain-of-thought into a separate `reasoning` field. The helpers below extract
# both fields and gracefully fall back when one is empty.

def _extract_parts(message) -> Tuple[str, str]:
    "Return (content, reasoning) from a message, handling all field aliases."
    content   = (getattr(message, "content", None) or "")
    reasoning = (getattr(message, "reasoning",          None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking",         None)
                 or "")
    return content.strip(), reasoning.strip()

def print_response(resp, show_reasoning: bool = True, max_reasoning_chars: int = 1200):
    "Pretty-print a ChatCompletion, including reasoning if present."
    choice = resp.choices[0]
    content, reasoning = _extract_parts(choice.message)
    finish, usage = choice.finish_reason, resp.usage
    print(f"┌─ model={resp.model}  finish_reason={finish}", end="")
    if usage is not None:
        print(f"  tokens={usage.prompt_tokens}+{usage.completion_tokens}={usage.total_tokens}")
    else:
        print()
    if show_reasoning and reasoning:
        print("│\n│  🧠 REASONING\n│  " + "─" * 50)
        snippet = reasoning if len(reasoning) <= max_reasoning_chars \
                            else reasoning[:max_reasoning_chars] + f"\n… [truncated, +{len(reasoning) - max_reasoning_chars} chars]"
        for line in snippet.splitlines():
            print(f"│  {line}")
    print("│\n│  💬 ANSWER\n│  " + "─" * 50)
    if content:
        for line in content.splitlines():
            print(f"│  {line}")
    elif reasoning:
        print("│  (content was empty — model emitted only reasoning; showing it as the answer)")
    else:
        print("│  (empty response)")
    if finish == "length":
        print("│\n│  ⚠️  Output truncated by max_tokens — increase it (reasoning consumes tokens).")
    print("└" + "─" * 60)

def chat(prompt, system="You are a concise geographic assistant.",
         model=None, temperature=0.0, max_tokens=2048,
         reasoning_effort=None,    # None | "none" | "low" | "medium" | "high"
         return_full=False, extra_body=None):
    "Single-turn chat. Returns content (or reasoning if content is empty)."
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort
    resp = client.chat.completions.create(
        model=model or TEXT_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning

def encode_image_b64(path_or_url):
    "Return a data URI for an image given a local path or HTTP URL."
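    if str(path_or_url).startswith(("http://", "https://")):
        resp = requests.get(path_or_url, timeout=30)
        resp.raise_for_status()  # fail loudly on 4xx/5xx instead of handing an error page to PIL
        data = resp.content
    else:
        data = Path(path_or_url).read_bytes()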
    img = Image.open(BytesIO(data)).convert("RGB")
    img.thumbnail((1024, 1024))
    buf = BytesIO(); img.save(buf, format="JPEG", quality=85)
    return f"data:image/jpeg;base64,{base64.b64encode(buf.getvalue()).decode()}"

print(f"Ollama @ {OLLAMA_BASE_URL}")
print(f"Text model   : {TEXT_MODEL}")
print(f"Vision model : {VISION_MODEL}")
print("Helpers      : chat(), print_response(), encode_image_b64()")

💡 Reasoning models. If your assigned model is a thinking model (e.g. deepseek-r1, qwen3:8b, gpt-oss:20b, a Gemma 3 thinking variant), you can:

  • Pass return_full=True to chat() to get the raw ChatCompletion object.

  • Then call print_response(resp) to print both the reasoning trace and the final answer.

  • Set reasoning_effort="medium" (or "low"/"high"/"none") to control how hard the model thinks.

Non-thinking models silently ignore reasoning_effort — it is safe to leave on.
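To see the content/reasoning fallback without a running Ollama server, the sketch below repeats the helper's logic and feeds it stand-in message objects. The `SimpleNamespace` objects and the `answer_text` name are assumptions made for illustration; in the lab the message comes from `resp.choices[0].message`.

```python
from types import SimpleNamespace

def extract_parts(message):
    """Same logic as the lab's _extract_parts, repeated so this cell runs on its own."""
    content = getattr(message, "content", None) or ""
    reasoning = (getattr(message, "reasoning", None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking", None)
                 or "")
    return content.strip(), reasoning.strip()

def answer_text(message):
    """Mirror chat(): prefer content, fall back to reasoning when content is empty."""
    content, reasoning = extract_parts(message)
    return content if content else reasoning

# Stand-in messages (hypothetical): a plain model, a thinking model, and a
# thinking model that ran out of tokens before writing its answer.
plain = SimpleNamespace(content=" Helsinki ")
thinking = SimpleNamespace(content="Tallinn", reasoning_content="Capital of Estonia is ...")
only_trace = SimpleNamespace(content=None, thinking="Still deliberating")

print(extract_parts(plain))
print(extract_parts(thinking))
print(answer_text(only_trace))
```

The third case is the one to watch for in practice: if `chat()` returns something that reads like a thought process rather than an answer, the model probably hit `max_tokens` while reasoning.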

Exercise 1 — Build a natural-language → coordinate geocoder shim (≈ 7 min)

In Part 1 we saw that LLMs hallucinate coordinates. A common pattern is:

  1. Ask the LLM for a (name, country, lat, lon) tuple.

  2. Validate the output: country plausibility + lat/lon inside a per-country bounding box.

  3. If validation fails, return None and let a downstream geocoder (Nominatim, etc.) take over.

Your job: implement step 1 with a Pydantic-guided prompt, then step 2.
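The three steps above can be wired together roughly as follows. This is a minimal sketch with stand-ins: `Guess`, `within_globe`, `fake_llm` and `fake_gazetteer` are hypothetical names, and the validator here is only a global range check, not the per-country bounding box you will write in TODO 1.4.

```python
from typing import Callable, NamedTuple, Optional

class Guess(NamedTuple):
    """Hypothetical stand-in for the lab's GeoGuess, plus a source label."""
    name: str
    country: str
    lat: float
    lon: float
    source: str

def within_globe(g: Guess) -> bool:
    """Crude validator: coordinates must at least be legal WGS84 values."""
    return -90.0 <= g.lat <= 90.0 and -180.0 <= g.lon <= 180.0

def geocode_with_fallback(
    query: str,
    llm: Callable[[str], Optional[Guess]],
    validate: Callable[[Guess], bool],
    fallback: Callable[[str], Optional[Guess]],
) -> Optional[Guess]:
    """Step 1: ask the LLM. Step 2: validate. Step 3: fall back if needed."""
    guess = llm(query)
    if guess is not None and validate(guess):
        return guess
    return fallback(query)

def fake_llm(query: str) -> Optional[Guess]:
    """Canned answers in place of a model call; the Helsinki latitude is deliberately wrong."""
    if "Helsinki" in query:
        return Guess("Helsinki", "Finland", 160.17, 24.94, "llm")
    if "Tampere" in query:
        return Guess("Tampere", "Finland", 61.50, 23.76, "llm")
    return None

def fake_gazetteer(query: str) -> Optional[Guess]:
    """Canned lookup in place of a real geocoder such as Nominatim."""
    table = {"Helsinki, Finland": Guess("Helsinki", "Finland", 60.17, 24.94, "gazetteer")}
    return table.get(query)

for q in ["Helsinki, Finland", "Tampere, Finland", "Atlantis"]:
    print(q, "->", geocode_with_fallback(q, fake_llm, within_globe, fake_gazetteer))
```

Passing the validator in as a parameter keeps the orchestration unchanged when you later swap the range check for your bounding-box version.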

class GeoGuess(BaseModel):
    name:    str   = Field(..., description="The canonical place name (English).")
    country: str   = Field(..., description="The country, English name.")
    lat:     float = Field(..., description="Latitude in decimal degrees, WGS84.")
    lon:     float = Field(..., description="Longitude in decimal degrees, WGS84.")

# Approximate per-country bounding boxes as (min_lon, min_lat, max_lon, max_lat); extend as needed.
COUNTRY_BBOX = {
    "Finland":  (19.0, 59.5, 32.0, 70.5),
    "Estonia":  (21.5, 57.5, 28.5, 60.0),
    "Belgium":  ( 2.5, 49.5,  6.5, 51.5),
    "Germany":  ( 5.5, 47.0, 15.5, 55.5),
    "Italy":    ( 6.5, 35.0, 19.0, 47.5),
    "Iran":     (44.0, 25.0, 64.0, 40.0),
}

def llm_geocode(query: str) -> Optional[GeoGuess]:
    """
    Ask the LLM for a (name, country, lat, lon) for `query`.
    Use guided JSON so the output strictly matches the GeoGuess schema.
    Return a GeoGuess on success, or None if parsing fails.
    """
    # TODO 1.1
    # ----- Build a clear system prompt that tells the model:
    #       (a) it must output JSON matching the schema,
    #       (b) if it doesn't know the place, it should refuse by returning empty
    #           strings for name and country (lat/lon are required floats, so
    #           they still need a number such as 0.0).
    system = ...

    # TODO 1.2
    # ----- Call text_client.chat.completions.create(...)
    #       Pass extra_body={"format": GeoGuess.model_json_schema()} to enforce the schema.
    raise NotImplementedError("Implement llm_geocode")  # delete once TODO 1.2 and 1.3 are done

    # TODO 1.3
    # ----- Parse the response with GeoGuess.model_validate_json(...) and return it.
    #       Wrap the parse in try/except and return None on failure.
    #       Also return None when the model refused (empty name or country).

def validate_guess(g: GeoGuess) -> bool:
    """Return True iff (lat, lon) is plausible for the country."""
    # TODO 1.4
    # ----- 1. Look up the country's bounding box in COUNTRY_BBOX.
    #         If the country isn't in the dict, return True (give the LLM the benefit of the doubt).
    #       2. Otherwise, check min_lon <= g.lon <= max_lon AND min_lat <= g.lat <= max_lat.
    raise NotImplementedError("Implement validate_guess")

Test cell for Exercise 1

def _test_geocode():
    queries = [
        ("Helsinki, Finland",          True),
        ("Tampere, Finland",           True),
        ("Padova, Italy",              True),
        ("Ghent, Belgium",             True),
        ("Definitely-not-a-place-XYZ", False),  # we expect None or invalid
    ]
    for q, should_be_valid in queries:
        g = llm_geocode(q)
        ok = (g is not None) and validate_guess(g)
        status = "✅" if ok == should_be_valid else "❌"
        info = f"({g.lat:.3f},{g.lon:.3f}) in {g.country}" if g else "None"
        print(f"  {status}  {q:35s}{info}")

_test_geocode()

Exercise 2 — OSM-tag classifier with confidence scores (≈ 8 min)

Extend the few-shot tagger from Part 1 so that, for each predicted tag, the model also returns a confidence score in [0,1] and a one-sentence rationale.

This is useful when you ingest user-provided descriptions into OSM and want to flag low-confidence predictions for human review (compare the auto-feedback pipelines used in computing education).
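Flagging for human review can be as simple as a threshold on the confidence score. The sketch below assumes predictions are plain `(tag, confidence)` pairs and uses a hypothetical cut-off of 0.6; a sensible value would have to be tuned on labelled data, since LLM confidences are not calibrated probabilities.

```python
REVIEW_THRESHOLD = 0.6  # hypothetical cut-off, not a recommended value

def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split (tag, confidence) pairs into auto-accepted and needs-review lists."""
    accepted = [p for p in predictions if p[1] >= threshold]
    review = [p for p in predictions if p[1] < threshold]
    return accepted, review

# Illustrative predictions; the last tag is deliberate nonsense.
preds = [
    ("amenity=pharmacy", 0.93),
    ("opening_hours=24/7", 0.81),
    ("man_made=teleporter", 0.12),
]
accepted, review = triage(preds)
print("accepted:", accepted)
print("review:  ", review)
```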

class TagPrediction(BaseModel):
    tag:        str   = Field(..., description="A single OSM tag in key=value form.")
    confidence: float = Field(..., ge=0.0, le=1.0)
    rationale:  str   = Field(..., description="One short sentence justifying this tag.")

class TagPredictionSet(BaseModel):
    predictions: List[TagPrediction]

def classify_osm(description: str) -> TagPredictionSet:
    """Return a list of (tag, confidence, rationale) predictions for the description."""
    # TODO 2.1
    # ----- Build a system prompt that:
    #         - Explains OSM tagging.
    #         - Asks for 1–4 tags max.
    #         - Says confidence is the model's subjective probability that the tag is correct.
    system = ...

    # TODO 2.2
    # ----- Build a few-shot user prompt. Hint: re-use FEW_SHOT_EXAMPLES from Part 1
    #       OR define 3–5 of your own here.
    examples = [
        # (description, list of (tag, confidence, rationale))
        # e.g. ("a small bakery on the corner",
        #       [("shop=bakery", 0.95, "The phrase 'bakery' maps directly to shop=bakery.")]),
    ]
    # TODO: format `examples` into a few-shot string.
    fewshot_str = ...

    user = f"{fewshot_str}\n\nDescription: {description}\nPredictions:"

    # TODO 2.3
    # ----- Call the LLM with extra_body={"format": TagPredictionSet.model_json_schema()}
    #       and return TagPredictionSet.model_validate_json(...).
    raise NotImplementedError("Implement classify_osm")
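One possible way to turn `examples` into a few-shot string is to render each example's answer as the same JSON shape the schema expects, so the demonstrations and the guided output agree. This is a sketch, not the reference solution: `format_fewshot` is a hypothetical helper name and the second example is made up for illustration.

```python
import json

EXAMPLES = [
    ("a small bakery on the corner",
     [("shop=bakery", 0.95, "The phrase 'bakery' maps directly to shop=bakery.")]),
    ("a public drinking fountain in a park",
     [("amenity=drinking_water", 0.90, "A drinking fountain supplies drinking water.")]),
]

def format_fewshot(examples):
    """Render (description, predictions) pairs as Description/Predictions blocks."""
    blocks = []
    for description, preds in examples:
        payload = {"predictions": [
            {"tag": tag, "confidence": conf, "rationale": why}
            for tag, conf, why in preds
        ]}
        blocks.append(f"Description: {description}\nPredictions: {json.dumps(payload)}")
    return "\n\n".join(blocks)

fewshot_str = format_fewshot(EXAMPLES)
print(fewshot_str)
```

Because each block ends with a complete `Predictions:` line, the final `Description: ...\nPredictions:` in the user prompt reads as one more example for the model to complete.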

Test cell for Exercise 2

def _test_classify_osm():
    cases = [
        "a 24-hour pharmacy",
        "a wooden footbridge over a small stream",
        "an electric vehicle charging station with 4 plugs",
        "an unpaved hiking trail in a national park",
        "a polka-dot teleporter that hums in the rain",   # nonsense — confidence should be low
    ]
    for c in cases:
        out = classify_osm(c)
        print(f"\n{c!r}")
        for p in out.predictions:
            print(f"    {p.tag:35s}  conf={p.confidence:.2f}  {p.rationale}")

_test_classify_osm()

Exercise 3 — Vision-LLM landmark identifier (≈ 10 min)

Given a photograph, return:

  • The most likely landmark name.

  • The country/city if recognisable.

  • A confidence score.

  • A list of visual evidence strings (architectural features, signage, etc.) that justify the guess.

This is the same pattern used in geo-aware multimodal retrieval and in evaluating vision-only navigation pipelines: you want the model to explain what it saw, not just output a label.

class LandmarkID(BaseModel):
    landmark:   str           = Field(..., description="Best-guess landmark name, or 'unknown'.")
    city:       Optional[str] = None
    country:    Optional[str] = None
    confidence: float         = Field(..., ge=0.0, le=1.0)
    evidence:   List[str]     = Field(..., description="Visual cues supporting the guess.")
    
def identify_landmark(image_url: str) -> LandmarkID:
    "Send the image to the vision model and return a LandmarkID."
    # TODO 3.1
    # ----- Write a prompt that:
    #         - Tells the model to be conservative (set landmark='unknown' if not sure).
    #         - Asks for at least 2 evidence strings.
    prompt = ...

    # TODO 3.2
    # ----- Call vision_client.chat.completions.create(...) with image_url AND
    #       extra_body={"format": LandmarkID.model_json_schema()}.
    #       Remember: image content goes inside a list-of-parts under the user message.
    raise NotImplementedError("Implement identify_landmark")  # delete once TODO 3.2 and 3.3 are done

    # TODO 3.3
    # ----- Parse with LandmarkID.model_validate_json(...) and return.
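The "list-of-parts" shape mentioned in TODO 3.2 follows the OpenAI chat format: the user message's `content` is a list holding one text part and one `image_url` part. The sketch below only builds that payload, with a fake data URI so it runs offline; `build_vision_messages` is a hypothetical helper name, and in the lab the URI would come from `encode_image_b64(image_url)`.

```python
import base64

def build_vision_messages(prompt: str, image_data_uri: str) -> list:
    """Return a single-turn message list with one text part and one image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_uri}},
        ],
    }]

# Placeholder bytes instead of a real photo, so no network or image file is needed.
fake_payload = base64.b64encode(b"not-a-real-image").decode()
data_uri = f"data:image/jpeg;base64,{fake_payload}"

messages = build_vision_messages("Identify the landmark in this photo.", data_uri)
print([part["type"] for part in messages[0]["content"]])
```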

Test cell for Exercise 3

We test on three public Wikimedia Commons photographs.

_LANDMARK_TEST_IMAGES = [
    # Helsinki Cathedral
    "https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Helsinki_Cathedral_in_July_2004.jpg/640px-Helsinki_Cathedral_in_July_2004.jpg",
    # Atomium, Brussels
    "https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Atomium_Brussels.jpg/640px-Atomium_Brussels.jpg",
    # A random forest road (should be 'unknown')
    "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Forest_road_in_autumn.jpg/640px-Forest_road_in_autumn.jpg",
]

for url in _LANDMARK_TEST_IMAGES:
    try:
        out = identify_landmark(url)
        print(f"\n{url.split('/')[-1]}")
        print(f"   landmark   = {out.landmark}")
        print(f"   city/ctry  = {out.city or '-'} / {out.country or '-'}")
        print(f"   confidence = {out.confidence:.2f}")
        print(f"   evidence   = {out.evidence}")
    except Exception as e:
        print(f"⚠️  {url} failed: {e}")

Exercise 4 — Build a mini Geo-RAG system (≈ 15 min)

You’ll build the smallest sensible retrieval-augmented-generation pipeline over a tiny corpus of city descriptions, then ask grounded geographic questions.

Why a notebook-scale RAG? Because the architecture is identical to a full one:

question  →  retriever (top-k)  →  prompt with context  →  LLM  →  answer (with citations)

We’ll skip a vector DB and use scikit-learn’s TfidfVectorizer + cosine similarity. (In your real research, swap this for an Ollama embedding model or a proper vector store.)
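To make the retriever step concrete before touching scikit-learn, here is a dependency-free stand-in that ranks a toy corpus by cosine similarity over raw word counts. It is not TF-IDF: there is no inverse-document-frequency weighting and no stop-word removal. `DOCS`, `tokens`, `cosine` and `top_k` are all hypothetical names.

```python
import math
import re
from collections import Counter

# Toy corpus (made up for this sketch).
DOCS = {
    "harbour": "ferry harbour on the gulf with daily ferry crossings",
    "campus": "university campus with research labs and a library",
    "castle": "medieval castle beside the river and the old town",
}

def tokens(text: str) -> list:
    """Lower-case the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, k: int = 2) -> list:
    """Return the k best (score, doc_id) pairs, highest score first."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(text))), doc_id) for doc_id, text in DOCS.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

hits = top_k("which ferry leaves the harbour", k=2)
for score, doc_id in hits:
    print(f"{score:.3f}  {doc_id}")
```

Note that the castle document scores above zero only because it shares the word "the" with the query, which is the kind of noise that stop-word removal and IDF weighting in `TfidfVectorizer` are meant to suppress.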

# A tiny geographic knowledge base — 8 short city profiles.
GEO_CORPUS = [
    {"id": "helsinki",  "title": "Helsinki, Finland",
     "text": "Helsinki is the capital and most populous city of Finland. It sits on the Gulf of Finland in the south of the country, "
             "at roughly 60.17°N, 24.94°E. Helsinki is home to Aalto University (in nearby Espoo), the Finnish parliament, "
             "and the Lutheran Helsinki Cathedral. Its climate is humid continental, with snowy winters and mild summers."},
    {"id": "espoo",     "title": "Espoo, Finland",
     "text": "Espoo is the second-largest city in Finland, immediately west of Helsinki. It hosts Aalto University's Otaniemi campus "
             "and a large technology cluster including Nokia. The city has extensive forest and coastal areas, including Nuuksio National Park."},
    {"id": "tallinn",   "title": "Tallinn, Estonia",
     "text": "Tallinn, the capital of Estonia, lies on the southern shore of the Gulf of Finland, about 80 km south of Helsinki by ferry. "
             "Its medieval Old Town is a UNESCO World Heritage Site. Tallinn is known for a strong digital-government sector and the e-Residency programme."},
    {"id": "ghent",     "title": "Ghent, Belgium",
     "text": "Ghent is a port city in northwest Belgium, in the Flemish Region. It sits at the confluence of the rivers Lys and Scheldt. "
             "Ghent University is a major research institution. Landmarks include Saint Bavo's Cathedral, the Belfry, and the medieval Gravensteen castle."},
    {"id": "brussels",  "title": "Brussels, Belgium",
     "text": "Brussels is the de facto capital of the European Union and the capital of Belgium. The city hosts the European Commission, "
             "the Council of the EU, and the European Parliament's secondary seat. The Atomium, built for Expo 58, is one of its iconic landmarks."},
    {"id": "padova",    "title": "Padua (Padova), Italy",
     "text": "Padua, in northeastern Italy's Veneto region, is home to the University of Padua, founded in 1222. "
             "The Scrovegni Chapel houses Giotto's celebrated fresco cycle. Padua sits on the Bacchiglione river, about 40 km west of Venice."},
    {"id": "tehran",    "title": "Tehran, Iran",
     "text": "Tehran is the capital of Iran and the country's largest city, located on the southern slopes of the Alborz mountain range. "
             "Major landmarks include the Azadi Tower, the Milad Tower, and the Golestan Palace, a UNESCO World Heritage Site."},
    {"id": "seoul",     "title": "Seoul, South Korea",
     "text": "Seoul is the capital and largest metropolis of South Korea. The city is bisected by the Han River. "
             "Landmarks include Gyeongbokgung Palace, Bukhansan National Park, and the N Seoul Tower on Namsan."},
]

print(f"Corpus size: {len(GEO_CORPUS)} documents.")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Pre-build a TF-IDF index over the corpus (title + body text).
_VECTORIZER = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)
_DOC_TEXTS  = [d["title"] + ". " + d["text"] for d in GEO_CORPUS]
_DOC_MATRIX = _VECTORIZER.fit_transform(_DOC_TEXTS)

def retrieve(query: str, k: int = 3):
    """Return the top-k documents most similar to `query`, as list of (score, doc)."""
    # TODO 4.1
    # ----- 1. Transform `query` with _VECTORIZER (use .transform, not fit_transform).
    #       2. Compute cosine_similarity against _DOC_MATRIX. Result is shape (1, N).
    #       3. argsort the scores descending, take top-k indices.
    #       4. Return [(score, GEO_CORPUS[idx]) ...] for those indices.
    raise NotImplementedError("Implement retrieve")

def rag_answer(question: str, k: int = 3) -> str:
    """
    Geo-RAG pipeline:
      1. Retrieve top-k relevant docs.
      2. Build a prompt that puts the docs as numbered context.
      3. Ask the LLM to answer ONLY using the context, citing doc numbers like [1], [2].
    Return the model's full answer string.
    """
    # TODO 4.2
    # ----- 1. Call retrieve(question, k) to get hits.
    hits = ...

    # TODO 4.3
    # ----- 2. Build a context block, e.g.:
    #         [1] Helsinki, Finland — <text>
    #         [2] Espoo, Finland    — <text>
    context = ...

    # TODO 4.4
    # ----- 3. Build a system prompt that says:
    #           - Answer ONLY using the context.
    #           - If the answer isn't in the context, say "I don't know based on the provided documents."
    #           - Cite the document numbers in square brackets after each claim.
    system = ...

    user = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

    # TODO 4.5
    # ----- 4. Call chat(...) with system, user prompt, temperature=0.0, and return the result.
    raise NotImplementedError("Implement rag_answer")
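Once `rag_answer` returns text, a cheap sanity check is to confirm that every `[n]` citation points at a document that was actually in the context. The sketch below is one possible post-hoc check, not part of the required solution; `cited_numbers` and `is_grounded` are hypothetical names, and the refusal string is the one suggested in TODO 4.4.

```python
import re

REFUSAL = "I don't know based on the provided documents."

def cited_numbers(answer: str) -> list:
    """Return the distinct document numbers cited as [n], in ascending order."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})

def is_grounded(answer: str, n_docs: int) -> bool:
    """Accept an explicit refusal, or an answer whose citations all fall in 1..n_docs."""
    if answer.strip() == REFUSAL:
        return True
    cited = cited_numbers(answer)
    return bool(cited) and all(1 <= n <= n_docs for n in cited)

good = "Espoo hosts Aalto University [2]. It lies immediately west of Helsinki [2][1]."
bad = "Antarctica has about 1,000 residents [4]."

print(cited_numbers(good), is_grounded(good, n_docs=3))
print(cited_numbers(bad), is_grounded(bad, n_docs=3))
print(is_grounded(REFUSAL, n_docs=3))
```

This only checks that citations are well-formed; it does not verify that the cited document supports the claim, which would need a separate entailment or human check.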

Test cell for Exercise 4

_RAG_QUESTIONS = [
    "Which Finnish city hosts Aalto University, and what is its relation to Helsinki?",
    "Name two iconic landmarks in Brussels and one in Tehran.",
    "How far is Tallinn from Helsinki, and how do people typically travel between them?",
    "What is the population of Antarctica?",  # not in corpus → should refuse
]

for q in _RAG_QUESTIONS:
    print("=" * 80)
    print("Q:", q)
    print("A:", rag_answer(q))

⭐ Bonus exercise (optional, take-home)

Combine vision + RAG: given a photograph, run Exercise 3 to obtain a candidate landmark, then look up the city in GEO_CORPUS and have the model write a 3-sentence travel paragraph grounded in the corpus, with citations.

Sketch:

def vision_grounded_blurb(image_url):
    landmark = identify_landmark(image_url)        # exercise 3
    place = landmark.city or landmark.country      # both are Optional and may be None
    if landmark.landmark == "unknown" or landmark.confidence < 0.5 or not place:
        return "I cannot confidently identify the location."
    question = (f"Tell me about {place}, "
                f"specifically near {landmark.landmark}.")
    return rag_answer(question)                     # exercise 4

Try it on the three test images from Exercise 3. Where does it fail? Why?

Wrap-up & reflection

Take 2 minutes and discuss in pairs:

  1. Reliability. Which of your four exercises produced the most fragile output? What additional validation would you add?

  2. Cost. Each call to Ollama has latency. Where would you batch, cache, or pre-compute?

  3. Evaluation. How would you evaluate the OSM-tag classifier rigorously? What dataset would you build?

  4. Multimodality. When does the vision model add real value, and when is text alone enough?

Submit your completed notebook + a short (≤ 200-word) reflection on these questions.