LLMs & Vision LLMs for GeoAI — Part 2: Solutions#

Course: GeoAI / Multimodal Geospatial Reasoning
Companion to: 02_geoai_llms_handson_STUDENT.ipynb

This notebook contains the reference solutions for every TODO in the student notebook, with brief comments on why each design choice was made. Hand it out only after the lab.

How to read this notebook#

Every solution cell starts with a short comment block:

# ─── Solution 1.1 — system prompt ────────────────────────────────────
# Why: ...

If a student’s solution differs from the reference but their test cell prints ✅, accept it. There is rarely one correct prompt.

§0 — Setup (identical to student notebook)#

# !pip install --quiet openai pydantic pillow requests numpy scikit-learn
!pip install json_repair

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: json_repair in /projappl/project_2018216/LLM/lib/python3.12/site-packages (0.59.5)

import json_repair
import os, json, base64, math
from io import BytesIO
from typing import List, Optional, Tuple
from pathlib import Path

import requests
from PIL import Image
from openai import OpenAI
from pydantic import BaseModel, Field

OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://g3201.mahti.csc.fi:11434/v1")
TEXT_MODEL      = os.environ.get("OLLAMA_TEXT_MODEL",   "qwen3.5")
VISION_MODEL    = os.environ.get("OLLAMA_VISION_MODEL", "qwen3.5")

client        = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
text_client   = client
vision_client = client


# ---- Reasoning-aware helpers (same as student notebook) ------------------

def _extract_parts(message) -> Tuple[str, str]:
    "Return (content, reasoning) from a message, handling all field aliases."
    content   = (getattr(message, "content", None) or "")
    reasoning = (getattr(message, "reasoning",          None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking",         None)
                 or "")
    return content.strip(), reasoning.strip()

def print_response(resp, show_reasoning: bool = True, max_reasoning_chars: int = 1200):
    "Pretty-print a ChatCompletion, including reasoning if present."
    choice = resp.choices[0]
    content, reasoning = _extract_parts(choice.message)
    finish, usage = choice.finish_reason, resp.usage
    print(f"┌─ model={resp.model}  finish_reason={finish}", end="")
    if usage is not None:
        print(f"  tokens={usage.prompt_tokens}+{usage.completion_tokens}={usage.total_tokens}")
    else:
        print()
    if show_reasoning and reasoning:
        print("│\n│  🧠 REASONING\n│  " + "─" * 50)
        snippet = reasoning if len(reasoning) <= max_reasoning_chars \
                            else reasoning[:max_reasoning_chars] + f"\n… [truncated, +{len(reasoning) - max_reasoning_chars} chars]"
        for line in snippet.splitlines():
            print(f"│  {line}")
    print("│\n│  💬 ANSWER\n│  " + "─" * 50)
    if content:
        for line in content.splitlines():
            print(f"│  {line}")
    elif reasoning:
        print("│  (content was empty — model emitted only reasoning; showing it as the answer)")
    else:
        print("│  (empty response)")
    if finish == "length":
        print("│\n│  ⚠️  Output truncated by max_tokens — increase it (reasoning consumes tokens).")
    print("└" + "─" * 60)

def chat(prompt, system="You are a concise geographic assistant.",
         model=None, temperature=0.0, max_tokens=2048,
         reasoning_effort=None, return_full=False, extra_body=None):
    "Single-turn chat. Returns content (or reasoning if content is empty)."
    eb = dict(extra_body or {})
    if reasoning_effort is not None:
        eb["reasoning_effort"] = reasoning_effort
    resp = client.chat.completions.create(
        model=model or TEXT_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        extra_body=eb,
    )
    if return_full:
        return resp
    content, reasoning = _extract_parts(resp.choices[0].message)
    return content if content else reasoning

def encode_image_b64(path_or_url: str) -> str:
    if path_or_url.startswith("http"):
        r = requests.get(
            path_or_url,
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=30
        )
        r.raise_for_status()
        data = r.content
    else:
        data = Path(path_or_url).read_bytes()

    # Normalise to RGB, cap the longest side at 1024 px, and re-encode as JPEG.
    img = Image.open(BytesIO(data))
    img = img.convert("RGB")
    img.thumbnail((1024, 1024))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)

    return f"data:image/jpeg;base64,{base64.b64encode(buf.getvalue()).decode()}"

print(f"Ollama @ {OLLAMA_BASE_URL}")
print(f"Text model   : {TEXT_MODEL}")
print(f"Vision model : {VISION_MODEL}")
print("Helpers      : chat(), print_response(), encode_image_b64()")

Ollama @ http://g3201.mahti.csc.fi:11434/v1
Text model   : qwen3.5
Vision model : qwen3.5
Helpers      : chat(), print_response(), encode_image_b64()
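
The content/reasoning fallback in _extract_parts() can be checked without a running server. The sketch below repeats the helper and feeds it SimpleNamespace stand-ins. The three message shapes are assumptions about what a server might return, not captured responses.

```python
from types import SimpleNamespace
from typing import Tuple

def _extract_parts(message) -> Tuple[str, str]:
    """Same logic as the helper above: return (content, reasoning)."""
    content = getattr(message, "content", None) or ""
    reasoning = (getattr(message, "reasoning", None)
                 or getattr(message, "reasoning_content", None)
                 or getattr(message, "thinking", None)
                 or "")
    return content.strip(), reasoning.strip()

# Hypothetical stand-ins for message objects (not real API responses).
normal_msg    = SimpleNamespace(content="  Helsinki  ", reasoning=None)
reasoning_msg = SimpleNamespace(content=None,
                                reasoning_content="The user asks about Finland...")
empty_msg     = SimpleNamespace(content="")

parts_normal    = _extract_parts(normal_msg)
parts_reasoning = _extract_parts(reasoning_msg)
parts_empty     = _extract_parts(empty_msg)

for label, parts in [("normal", parts_normal),
                     ("reasoning-only", parts_reasoning),
                     ("empty", parts_empty)]:
    print(f"{label:15s} content={parts[0]!r} reasoning={parts[1]!r}")
```

The reasoning-only case is the one chat() relies on when it returns the reasoning text because content is empty.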

Exercise 1 — Geocoder shim — solution#

Key idea. Use guided JSON so the model must return the four expected fields. The schema itself is a strong implicit instruction; we only need a short system prompt to handle the “I don’t know” case.

class GeoGuess(BaseModel):
    name:    str   = Field(..., description="The canonical place name (English).")
    country: str   = Field(..., description="The country, English name.")
    lat:     float = Field(..., description="Latitude in decimal degrees, WGS84.")
    lon:     float = Field(..., description="Longitude in decimal degrees, WGS84.")

COUNTRY_BBOX = {
    "Finland":  (19.0, 59.5, 32.0, 70.5),
    "Estonia":  (21.5, 57.5, 28.5, 60.0),
    "Belgium":  ( 2.5, 49.5,  6.5, 51.5),
    "Germany":  ( 5.5, 47.0, 15.5, 55.5),
    "Italy":    ( 6.5, 35.0, 19.0, 47.5),
    "Iran":     (44.0, 25.0, 64.0, 40.0),
}

def llm_geocode(query: str) -> Optional[GeoGuess]:
    """LLM-based geocoding shim. Returns GeoGuess or None on failure."""
    # ─── Solution 1.1 — system prompt ─────────────────────────────────
    # Why: explicit refusal instruction. Without it the model will
    # invent coordinates for clearly fictional inputs.
    system = (
        "You geocode place names. Output JSON matching the schema. "
        "If you genuinely do not know the place, set name='unknown', "
        "country='unknown', lat=0.0, lon=0.0."
    )

    # ─── Solution 1.2 — schema-guided call ────────────────────────────
    # Why: extra_body={"format": ...} is the Ollama-specific knob
    # that constrains decoding to valid JSON matching the schema.
    # We use temperature=0 because geocoding should be deterministic.
    try:
        resp = text_client.chat.completions.create(
            model=TEXT_MODEL,
            messages=[
                {"role": "system", "content": system},
                {"role": "user",   "content": f"Geocode this place: {query}"},
            ],
            temperature=0.0,
            max_tokens=200,
            reasoning_effort='none',
            extra_body={"format": GeoGuess.model_json_schema()},
        )
        # ─── Solution 1.3 — parse + sentinel ──────────────────────────
        decoded_object = json_repair.repair_json(resp.choices[0].message.content, return_objects=True)
        # print(decoded_object)
        guess = GeoGuess.model_validate_json(json.dumps(decoded_object))
        # print(guess)
        # Treat the sentinel as a refusal.
        if guess.name.lower() == "unknown":
            return None
        return guess
    except Exception:
        return None

def validate_guess(g: GeoGuess) -> bool:
    """True iff (lat, lon) is plausible for the country."""
    # ─── Solution 1.4 — bbox check ────────────────────────────────────
    # Why: we trust the LLM's *country* label more than its coordinates
    # (the country is a discrete label, the coords are a regression).
    # An unknown country gets a free pass — extend COUNTRY_BBOX in real life.
    bbox = COUNTRY_BBOX.get(g.country)
    if bbox is None:
        return True
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon <= g.lon <= max_lon) and (min_lat <= g.lat <= max_lat)

Test cell for Exercise 1#

def _test_geocode():
    queries = [
        ("Helsinki, Finland",          True),
        ("Tampere, Finland",           True),
        ("Padova, Italy",              True),
        ("Ghent, Belgium",             True),
        ("Definitely-not-a-place-XYZ", False),
    ]
    for q, should_be_valid in queries:
        g = llm_geocode(q)
        print(g)
        ok = (g is not None) and validate_guess(g)
        status = "✅" if ok == should_be_valid else "❌"
        info = f"({g.lat:.3f},{g.lon:.3f}) in {g.country}" if g else "None"
        print(f"  {status}  {q:35s} → {info}")

_test_geocode()

name='Helsinki' country='Finland' lat=60.1699 lon=24.9384
  ✅  Helsinki, Finland                   → (60.170,24.938) in Finland
name='Tampere' country='Finland' lat=61.4981 lon=23.7871
  ✅  Tampere, Finland                    → (61.498,23.787) in Finland
name='Padova' country='Italy' lat=45.4064 lon=11.8771
  ✅  Padova, Italy                       → (45.406,11.877) in Italy
name='Ghent' country='Belgium' lat=51.0543 lon=3.7174
  ✅  Ghent, Belgium                      → (51.054,3.717) in Belgium
None
  ✅  Definitely-not-a-place-XYZ          → None

Discussion points:

The bbox check does not verify accuracy, only plausibility. A guess of (60.0, 25.0) for “Tampere, Finland” passes our test but is roughly 180 km from the real Tampere (61.50, 23.77).
A stronger validator: cross-check with Nominatim and reject if Haversine distance > 50 km.
The “unknown” sentinel pattern is preferable to expecting the model to refuse spontaneously — guided JSON cannot output free-form refusal text.
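
A minimal sketch of the distance-based validator mentioned above. It implements only the Haversine part and takes the reference coordinates as arguments; fetching them from Nominatim is left out. The helper name is_close_enough and the mean Earth radius of 6371 km are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_close_enough(guess_lat, guess_lon, ref_lat, ref_lon, max_km=50.0):
    """Hypothetical helper: accept a guess only if it lies within max_km
    of a trusted reference coordinate (e.g. one returned by a geocoder)."""
    return haversine_km(guess_lat, guess_lon, ref_lat, ref_lon) <= max_km

# The bbox-passing but inaccurate Tampere guess from the discussion above.
d = haversine_km(60.0, 25.0, 61.50, 23.77)
print(f"distance of the rough guess: {d:.0f} km")
print("rough guess accepted:", is_close_enough(60.0, 25.0, 61.50, 23.77))

# The coordinates the model actually returned in the test run.
print("model guess accepted:", is_close_enough(61.4981, 23.7871, 61.50, 23.77))
```

The 50 km threshold comes from the discussion point. A city-level geocoder would probably want a tighter limit, and a country-level one a looser limit.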

Exercise 2 — OSM tag classifier — solution#

Key idea. Few-shot examples shown with their confidence values teach the model the meaning of “confidence” in this context. Without examples, models often saturate at 0.95.

class TagPrediction(BaseModel):
    tag:        str   = Field(..., description="A single OSM tag in key=value form.")
    confidence: float = Field(..., ge=0.0, le=1.0)
    rationale:  str   = Field(..., description="One short sentence justifying this tag.")

class TagPredictionSet(BaseModel):
    predictions: List[TagPrediction]

def classify_osm(description: str) -> TagPredictionSet:
    """Multi-label OSM tag classifier with confidence + rationale."""
    # ─── Solution 2.1 — system prompt ─────────────────────────────────
    # Why: explicit calibration anchors. We tell the model what 0.5 vs
    # 0.95 means; otherwise it tends to bunch all confidences near 0.9.
    system = (
        "You are an OpenStreetMap tagging expert. Given a natural-language "
        "description, predict 1–4 OSM tags. For each tag, include a confidence "
        "in [0,1] and a one-sentence rationale.\n"
        "Confidence calibration:\n"
        "  0.95 = the description names the feature explicitly\n"
        "  0.70 = strong inference from common phrasing\n"
        "  0.40 = plausible but ambiguous\n"
        "  0.10 = wild guess\n"
        "If the description is nonsense, return one prediction with confidence near 0."
    )

    # ─── Solution 2.2 — few-shot block ────────────────────────────────
    # Why: showing the model concrete (description, predictions) pairs
    # is far more effective than describing the format in prose.
    examples = [
        ("a small bakery on the corner",
         [("shop=bakery", 0.95, "The phrase 'bakery' maps directly to shop=bakery.")]),
        ("the main railway station",
         [("railway=station",         0.95, "Explicitly a railway station."),
          ("public_transport=station", 0.85, "Standard companion tag for stations.")]),
        ("a paved cycle path along the river",
         [("highway=cycleway", 0.92, "'Cycle path' is the OSM cycleway."),
          ("surface=paved",    0.88, "Surface is stated explicitly.")]),
        ("something rectangular",
         [("building=yes", 0.20, "Too vague to commit; building is a weak guess.")]),
    ]

    # Format the few-shot block as serialized JSON predictions.
    fewshot_lines = []
    for desc, preds in examples:
        pred_json = json.dumps({
            "predictions": [
                {"tag": t, "confidence": c, "rationale": r} for (t, c, r) in preds
            ]
        }, ensure_ascii=False)
        fewshot_lines.append(f"Description: {desc}\nPredictions: {pred_json}")
    fewshot_str = "\n\n".join(fewshot_lines)

    user = f"{fewshot_str}\n\nDescription: {description}\nPredictions:"

    # ─── Solution 2.3 — guided JSON call ──────────────────────────────
    resp = text_client.chat.completions.create(
        model=TEXT_MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user",   "content": user}],
        temperature=0.0,
        max_tokens=500,
        reasoning_effort='none',
        extra_body={"format": TagPredictionSet.model_json_schema()},
    )
    
    decoded_object = json_repair.repair_json(resp.choices[0].message.content, return_objects=True)
    return TagPredictionSet.model_validate_json(json.dumps(decoded_object))

Test cell for Exercise 2#

def _test_classify_osm():
    cases = [
        "a 24-hour pharmacy",
        "a wooden footbridge over a small stream",
        "an electric vehicle charging station with 4 plugs",
        "an unpaved hiking trail in a national park",
        "a polka-dot teleporter that hums in the rain",
    ]
    for c in cases:
        out = classify_osm(c)
        print(f"\n• {c!r}")
        for p in out.predictions:
            print(f"    {p.tag:35s}  conf={p.confidence:.2f}  — {p.rationale}")

_test_classify_osm()

• 'a 24-hour pharmacy'
    shop=pharmacy                        conf=0.95  — The description explicitly identifies the feature as a pharmacy.
    opening_hours=24/7                   conf=0.95  — The '24-hour' descriptor directly maps to the 24/7 opening hours tag.

• 'a wooden footbridge over a small stream'
    highway=footway                      conf=0.95  — A footbridge is a type of footway in OSM.
    bridge=wooden                        conf=0.95  — The material is explicitly stated as wooden.
    bridge=pedestrian                    conf=0.90  — Footbridges are pedestrian bridges.

• 'an electric vehicle charging station with 4 plugs'
    highway=charging_station             conf=0.95  — Explicitly an electric vehicle charging station.
    charging_station:plugs=4             conf=0.95  — The number of plugs is explicitly stated as 4.

• 'an unpaved hiking trail in a national park'
    highway=footway                      conf=0.95  — A hiking trail is explicitly a footway in OSM.
    surface=unpaved                      conf=0.95  — The surface is explicitly stated as unpaved.
    leisure=hiking                       conf=0.85  — Hiking trails are commonly tagged with leisure=hiking.

• 'a polka-dot teleporter that hums in the rain'
    amenity=teleporter                   conf=0.10  — The term 'teleporter' is not a standard OSM tag and the description contains fictional elements ('polka-dot', 'hums') that make this a wild guess.

Discussion points:

Notice that the nonsense description (“polka-dot teleporter…”) still produces output — but with low confidence. Always threshold on confidence before pushing to a real OSM workflow.
A common student mistake: forgetting to escape the few-shot JSON string (the reference builds it with json.dumps for this reason). Pydantic will then fail to parse the response and raise a noisy error.
For real evaluation you would build a labelled set of (description → gold tags) pairs and report precision/recall per key (shop, highway, …).
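
The thresholding and per-key evaluation described above could be sketched as follows. The function name, the 0.5 threshold, and the gold labels are all assumptions made for illustration. In particular, the gold tags are hand-written examples, not an authoritative OSM mapping. The predictions are copied from the test output above.

```python
from collections import defaultdict

def per_key_precision_recall(gold, predicted, min_confidence=0.5):
    """gold:      {description: set of 'key=value' tags}
    predicted: {description: list of (tag, confidence) pairs}
    Returns {key: (precision, recall)}; a metric is None when undefined."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for desc, gold_tags in gold.items():
        # Drop low-confidence predictions before scoring.
        kept = {tag for tag, conf in predicted.get(desc, []) if conf >= min_confidence}
        for tag in kept:
            key = tag.split("=", 1)[0]
            if tag in gold_tags:
                tp[key] += 1
            else:
                fp[key] += 1
        for tag in gold_tags - kept:
            fn[tag.split("=", 1)[0]] += 1
    scores = {}
    for key in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[key] + fp[key]
        r_den = tp[key] + fn[key]
        scores[key] = (tp[key] / p_den if p_den else None,
                       tp[key] / r_den if r_den else None)
    return scores

# Illustrative gold labels (assumed), paired with predictions from the run above.
gold = {
    "a 24-hour pharmacy": {"amenity=pharmacy", "opening_hours=24/7"},
    "an electric vehicle charging station with 4 plugs": {"amenity=charging_station"},
    "a polka-dot teleporter that hums in the rain": set(),
}
predicted = {
    "a 24-hour pharmacy": [("shop=pharmacy", 0.95), ("opening_hours=24/7", 0.95)],
    "an electric vehicle charging station with 4 plugs": [
        ("highway=charging_station", 0.95), ("charging_station:plugs=4", 0.95)],
    "a polka-dot teleporter that hums in the rain": [("amenity=teleporter", 0.10)],
}

scores = per_key_precision_recall(gold, predicted)
for key, (precision, recall) in scores.items():
    print(f"{key:25s} precision={precision}  recall={recall}")
```

Under these assumed gold labels the threshold removes the teleporter guess, but it cannot catch confidently wrong keys such as shop= or highway=. That is why a labelled set is needed in addition to the confidence filter.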

Exercise 3 — Vision-LLM landmark identifier — solution#

Key idea. Force the model to enumerate visual evidence before naming a landmark. This both improves accuracy and gives you something to inspect when the model is wrong.

class LandmarkID(BaseModel):
    landmark:   str           = Field(..., description="Best-guess landmark name, or 'unknown'.")
    city:       Optional[str] = None
    country:    Optional[str] = None
    confidence: float         = Field(..., ge=0.0, le=1.0)
    evidence:   List[str]     = Field(..., description="Visual cues supporting the guess.")

def identify_landmark(image_url: str) -> LandmarkID:
    "Vision-LLM landmark identification with evidence."
    # ─── Solution 3.1 — conservative prompt ───────────────────────────
    # Why: 'evidence first' nudges the model toward grounding, similar to
    # asking a human guide to point at things before naming them.
    prompt = (
        "Examine the photograph carefully.\n"
        "First, list at least 2 concrete visual features you can see "
        "(architectural style, materials, signage, surrounding landscape, etc.).\n"
        "Then identify the most likely landmark. If you are not confident, "
        "set landmark='unknown' and confidence below 0.3.\n"
        "Return JSON matching the schema."
    )

    # ─── Solution 3.2 — multimodal call ───────────────────────────────
    # Why: image goes in a list-of-parts under the user message — that's
    # the OpenAI multimodal format that Ollama mirrors.
    resp = vision_client.chat.completions.create(
        model=VISION_MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image_b64(image_url)}},
            ],
        }],
        temperature=0.0,
        max_tokens=500,
        extra_body={"format": LandmarkID.model_json_schema()},
    )

    # ─── Solution 3.3 — parse ─────────────────────────────────────────
    decoded_object = json_repair.repair_json(resp.choices[0].message.content, return_objects=True)
    print(decoded_object)
    return LandmarkID.model_validate_json(json.dumps(decoded_object))

Test cell for Exercise 3#

_LANDMARK_TEST_IMAGES = [
    "./helsinki.jpg",
    "./brussels.jpg",
    "./forest.jpg",
]

for url in _LANDMARK_TEST_IMAGES:
    try:
        out = identify_landmark(url)
        print(f"\n→ {url.split('/')[-1]}")
        print(f"   landmark   = {out.landmark}")
        print(f"   city/ctry  = {out.city or '-'} / {out.country or '-'}")
        print(f"   confidence = {out.confidence:.2f}")
        print(f"   evidence   = {out.evidence}")
    except Exception as e:
        print(f"⚠️  {url} failed: {e}")

⚠️  https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Helsinki_Cathedral_in_July_2004.jpg/640px-Helsinki_Cathedral_in_July_2004.jpg failed: 400 Client Error: Use thumbnail steps listed on https://w.wiki/GHai. Please contact noc@wikimedia.org for further information (a765913) for url: https://upload.wikimedia.org/wikipedia/commons/thumb/8/8d/Helsinki_Cathedral_in_July_2004.jpg/640px-Helsinki_Cathedral_in_July_2004.jpg
⚠️  https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Atomium_Brussels.jpg/640px-Atomium_Brussels.jpg failed: 429 Client Error: Use thumbnail steps listed on https://w.wiki/GHai. Please contact noc@wikimedia.org for further information (1d0265d) for url: https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Atomium_Brussels.jpg/640px-Atomium_Brussels.jpg
⚠️  https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Forest_road_in_autumn.jpg/640px-Forest_road_in_autumn.jpg failed: 429 Client Error: Use thumbnail steps listed on https://w.wiki/GHai. Please contact noc@wikimedia.org for further information (1d0265d) for url: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Forest_road_in_autumn.jpg/640px-Forest_road_in_autumn.jpg

Discussion points:

The forest-road image should produce landmark='unknown'. If it doesn’t, the model is ignoring the conservative-prompt instruction — try a stronger phrasing or a smaller max_tokens.
Notice that the evidence list is post-hoc in autoregressive models — the model writes evidence first because we asked it to, but internally the landmark guess can still drive the evidence rather than the other way around. Treat evidence as an interpretability aid, not a proof.
This same prompting pattern (evidence + structured guess) is the backbone of many vision-language navigation evaluators.
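
Because the model may ignore the conservative instruction, the same rules can also be enforced in code after parsing. The sketch below uses a plain dataclass as a stand-in for LandmarkID so that it runs without Pydantic. The function name and the capped confidence value of 0.2 are assumptions; the thresholds of 2 evidence items and 0.3 confidence mirror the prompt.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional

@dataclass(frozen=True)
class LandmarkGuess:
    """Plain stand-in for the LandmarkID model above (same field names)."""
    landmark: str
    confidence: float
    evidence: List[str] = field(default_factory=list)
    city: Optional[str] = None
    country: Optional[str] = None

def enforce_conservative_contract(guess, min_evidence=2, min_confidence=0.3,
                                  capped_confidence=0.2):
    """Downgrade a guess to 'unknown' when it breaks the rules stated in the prompt:
    at least `min_evidence` visual cues and confidence of at least `min_confidence`."""
    if guess.landmark.lower() == "unknown":
        return guess
    too_little_evidence = len(guess.evidence) < min_evidence
    too_unsure = guess.confidence < min_confidence
    if too_little_evidence or too_unsure:
        return replace(guess, landmark="unknown", city=None, country=None,
                       confidence=min(guess.confidence, capped_confidence))
    return guess

# Hypothetical parsed outputs, not real model responses.
well_supported = LandmarkGuess("Atomium", 0.9,
                               ["nine steel spheres", "connecting tubes"],
                               "Brussels", "Belgium")
thin_evidence = LandmarkGuess("Helsinki Cathedral", 0.8, ["white building"])

kept = enforce_conservative_contract(well_supported)
downgraded = enforce_conservative_contract(thin_evidence)
print(kept.landmark, kept.confidence)
print(downgraded.landmark, downgraded.confidence)
```

This only checks the form of the answer. It does not make the evidence any less post-hoc.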

Exercise 4 — Mini Geo-RAG — solution#

Key idea. A complete RAG fits in ~30 lines: retrieve, format, prompt, parse. Everything else is engineering polish.

GEO_CORPUS = [
    {"id": "helsinki",  "title": "Helsinki, Finland",
     "text": "Helsinki is the capital and most populous city of Finland. It sits on the Gulf of Finland in the south of the country, "
             "at roughly 60.17°N, 24.94°E. Helsinki is home to Aalto University (in nearby Espoo), the Finnish parliament, "
             "and the Lutheran Helsinki Cathedral. Its climate is humid continental, with snowy winters and mild summers."},
    {"id": "espoo",     "title": "Espoo, Finland",
     "text": "Espoo is the second-largest city in Finland, immediately west of Helsinki. It hosts Aalto University's Otaniemi campus "
             "and a large technology cluster including Nokia. The city has extensive forest and coastal areas, including Nuuksio National Park."},
    {"id": "tallinn",   "title": "Tallinn, Estonia",
     "text": "Tallinn, the capital of Estonia, lies on the southern shore of the Gulf of Finland, about 80 km south of Helsinki by ferry. "
             "Its medieval Old Town is a UNESCO World Heritage Site. Tallinn is known for a strong digital-government sector and the e-Residency programme."},
    {"id": "ghent",     "title": "Ghent, Belgium",
     "text": "Ghent is a port city in northwest Belgium, in the Flemish Region. It sits at the confluence of the rivers Lys and Scheldt. "
             "Ghent University is a major research institution. Landmarks include Saint Bavo's Cathedral, the Belfry, and the medieval Gravensteen castle."},
    {"id": "brussels",  "title": "Brussels, Belgium",
     "text": "Brussels is the de facto capital of the European Union and the capital of Belgium. The city hosts the European Commission, "
             "the Council of the EU, and the European Parliament's secondary seat. The Atomium, built for Expo 58, is one of its iconic landmarks."},
    {"id": "padova",    "title": "Padua (Padova), Italy",
     "text": "Padua, in northeastern Italy's Veneto region, is home to the University of Padua, founded in 1222. "
             "The Scrovegni Chapel houses Giotto's celebrated fresco cycle. Padua sits on the Bacchiglione river, about 40 km west of Venice."},
    {"id": "tehran",    "title": "Tehran, Iran",
     "text": "Tehran is the capital of Iran and the country's largest city, located on the southern slopes of the Alborz mountain range. "
             "Major landmarks include the Azadi Tower, the Milad Tower, and the Golestan Palace, a UNESCO World Heritage Site."},
    {"id": "seoul",     "title": "Seoul, South Korea",
     "text": "Seoul is the capital and largest metropolis of South Korea. The city is bisected by the Han River. "
             "Landmarks include Gyeongbokgung Palace, Bukhansan National Park, and the N Seoul Tower on Namsan."},
]

print(f"Corpus size: {len(GEO_CORPUS)} documents.")

Corpus size: 8 documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

_VECTORIZER = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)
_DOC_TEXTS  = [d["title"] + ". " + d["text"] for d in GEO_CORPUS]
_DOC_MATRIX = _VECTORIZER.fit_transform(_DOC_TEXTS)

def retrieve(query: str, k: int = 3):
    """Return the top-k documents most similar to `query`, as list of (score, doc)."""
    # ─── Solution 4.1 — TF-IDF retrieval ──────────────────────────────
    # Why: cosine similarity over TF-IDF is the canonical sparse-retrieval
    # baseline. For 8 documents it is overkill but keeps the same shape
    # as a real dense-vector retriever — students will swap _VECTORIZER
    # for an Ollama /v1/embeddings call in their own projects.
    qvec   = _VECTORIZER.transform([query])                      # shape (1, V)
    sims   = cosine_similarity(qvec, _DOC_MATRIX)[0]             # shape (N,)
    top_ix = np.argsort(-sims)[:k]                               # descending
    return [(float(sims[i]), GEO_CORPUS[i]) for i in top_ix]

def rag_answer(question: str, k: int = 3) -> str:
    """Retrieve top-k docs, build a grounded prompt, and answer with citations."""
    # ─── Solution 4.2 — retrieve ──────────────────────────────────────
    hits = retrieve(question, k=k)
    # print(hits)

    # ─── Solution 4.3 — context block ─────────────────────────────────
    # Why: numbered docs make citation easy. Including the score is a
    # nice debugging aid; you can drop it in production.
    context_lines = []
    for i, (score, doc) in enumerate(hits, start=1):
        context_lines.append(f"[{i}] {doc['title']} (score={score:.2f})\n    {doc['text']}")
    context = "\n\n".join(context_lines)

    # ─── Solution 4.4 — strict grounding prompt ───────────────────────
    # Why: "Answer ONLY using the context" + an explicit refusal phrase
    # are the two ingredients that suppress hallucination in small RAG.
    system = (
        "You are a careful research assistant. Answer the user's question "
        "using ONLY the numbered context below. After every factual claim, "
        "cite the document number(s) in square brackets, e.g. [1] or [1,2]. "
        "If the answer is not in the context, reply exactly: "
        "\"I don't know based on the provided documents.\""
    )

    user = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

    # ─── Solution 4.5 — call & return ─────────────────────────────────
    return chat(user, system=system, reasoning_effort='none', temperature=0.0, max_tokens=400)

Test cell for Exercise 4#

_RAG_QUESTIONS = [
    "Which Finnish city hosts Aalto University, and what is its relation to Helsinki?",
    "Name two iconic landmarks in Brussels and one in Tehran.",
    "How far is Tallinn from Helsinki, and how do people typically travel between them?",
    "What is the population of Antarctica?",
    "where is the captial of South Korea?",
    "Where is Helsinki University ?",
    "Where is Helsinki?"
]

for q in _RAG_QUESTIONS:
    print("=" * 80)
    print("Q:", q)
    print("A:", rag_answer(q))

================================================================================
Q: Which Finnish city hosts Aalto University, and what is its relation to Helsinki?
A: Based on the provided documents, **Helsinki** is the city that hosts Aalto University (specifically noted as being "in nearby Espoo" in document [1]), while **Espoo** is the city that hosts Aalto University's Otaniemi campus [2].

Regarding the relation to Helsinki:
*   **Espoo** is the second-largest city in Finland and is located immediately west of Helsinki [2].
*   **Helsinki** is the capital and most populous city of Finland, situated on the Gulf of Finland in the south of the country [1].

Therefore, Aalto University is hosted in both cities: its main presence is associated with Helsinki (with the university located in nearby Espoo) [1], and its Otaniemi campus is specifically in Espoo [2]. Espoo is directly west of Helsinki [2].
================================================================================
Q: Name two iconic landmarks in Brussels and one in Tehran.
A: Two iconic landmarks in Brussels are the Atomium and the European Commission (or the Council of the EU, and the European Parliament's secondary seat) [1]. One iconic landmark in Tehran is the Azadi Tower (or the Milad Tower, and the Golestan Palace) [2].
================================================================================
Q: How far is Tallinn from Helsinki, and how do people typically travel between them?
A: Tallinn is about 80 km south of Helsinki, and people typically travel between them by ferry [1].
================================================================================
Q: What is the population of Antarctica?
A: I don't know based on the provided documents.
================================================================================
Q: where is the captial of South Korea?
A: The capital of South Korea is Seoul [1].
================================================================================
Q: Where is Helsinki University ?
A: I don't know based on the provided documents.
================================================================================
Q: Where is Helsinki?
A: Helsinki is the capital and most populous city of Finland. It sits on the Gulf of Finland in the south of the country, at roughly 60.17°N, 24.94°E [1].

Discussion points:

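Question 1 should retrieve both helsinki and espoo among the top hits, since both documents mention Aalto University; in the run above helsinki is [1] and espoo is [2]. The answer should cite at least [1] and [2]. Note that the model's answer is muddled: it first names Helsinki as the host city even though document [1] itself says the university is in nearby Espoo.
Question 4 should trigger the refusal phrase. If it doesn’t, the model is answering from prior knowledge; rag_answer() already runs at temperature=0.0, so the remaining lever is a stronger system prompt.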
The biggest realistic upgrade: replace TF-IDF with a real embedding model. Ollama serves embeddings on the same endpoint:
```
ollama pull nomic-embed-text
```
then call client.embeddings.create(model="nomic-embed-text", input=[...]) and store the resulting vectors. The rest of retrieve() is unchanged.
Second realistic upgrade: replace numeric citations with full doc IDs ([helsinki]) so cited spans survive corpus reindexing.
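
Both upgrades can be sketched together. The retriever below takes the embedding function as a parameter, so the same code works with a toy embedder (used here so the example runs offline) or with a wrapper around client.embeddings.create. The names retrieve_dense, format_context_with_ids, toy_embed, and the shortened three-document corpus are assumptions made for this illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_dense(query: str, corpus: List[Dict[str, str]],
                   embed: Callable[[List[str]], List[Vector]],
                   k: int = 3) -> List[Tuple[float, Dict[str, str]]]:
    """Same return shape as retrieve(): a list of (score, doc), best first."""
    doc_vecs = embed([d["title"] + ". " + d["text"] for d in corpus])
    q_vec = embed([query])[0]
    scored = [(cosine(q_vec, dv), doc) for dv, doc in zip(doc_vecs, corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def format_context_with_ids(hits) -> str:
    """Label each document with its stable id instead of its rank."""
    return "\n\n".join(f"[{doc['id']}] {doc['title']}\n    {doc['text']}"
                       for _, doc in hits)

# Toy embedder (assumption): keyword counts, only so the sketch runs offline.
# With a live server one would instead wrap something like
#   [item.embedding for item in
#    client.embeddings.create(model="nomic-embed-text", input=texts).data]
KEYWORDS = ["helsinki", "espoo", "tallinn", "ferry", "aalto"]

def toy_embed(texts: List[str]) -> List[Vector]:
    return [[float(t.lower().count(w)) for w in KEYWORDS] for t in texts]

# Shortened stand-in for GEO_CORPUS.
TOY_CORPUS = [
    {"id": "helsinki", "title": "Helsinki, Finland",
     "text": "Helsinki is the capital of Finland."},
    {"id": "espoo", "title": "Espoo, Finland",
     "text": "Espoo hosts Aalto University's Otaniemi campus."},
    {"id": "tallinn", "title": "Tallinn, Estonia",
     "text": "Tallinn is reached from Helsinki by ferry."},
]

hits = retrieve_dense("Where is the Aalto campus?", TOY_CORPUS, toy_embed, k=2)
context = format_context_with_ids(hits)
print(context)
```

If the context uses document ids, the citation instruction in the system prompt would need to ask for ids such as [espoo] rather than numbers.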

⭐ Bonus — Vision + RAG combined — solution sketch#

def vision_grounded_blurb(image_url: str) -> str:
    """Identify a landmark from an image, then write a grounded blurb from GEO_CORPUS."""
    landmark = identify_landmark(image_url)
    if landmark.confidence < 0.5 or landmark.landmark.lower() == "unknown":
        return "I cannot confidently identify the location."

    # rag_answer() retrieves with the question text itself, so the place names
    # (city, country, landmark) must appear in the question to steer retrieval.
    place = ", ".join(p for p in (landmark.city, landmark.country) if p) or landmark.landmark
    return rag_answer(
        f"In 3 sentences, describe {place} "
        f"and the area around {landmark.landmark}."
    )

# Try on Exercise 3's images
for url in _LANDMARK_TEST_IMAGES:
    print("=" * 80)
    print(url.split("/")[-1])
    print(vision_grounded_blurb(url))

Where this typically fails:

Cascade error. If the vision model misidentifies the landmark, the retriever pulls the wrong city, and the LLM writes a fluent but wrong paragraph. Always log landmark.confidence.
Coverage gap. The vision model can identify Brussels’ Atomium but the RAG corpus may have little detail about it — the answer becomes generic.
Confidence calibration. A vision LLM that returns confidence=0.9 for an unknown forest road will skip the refusal branch. In production, add a second gating prompt that asks “is this a famous landmark?” before retrieval.

These are exactly the cascade-failure modes studied in vision-language-navigation evaluation work — error analysis here is publishable.
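
A minimal sketch of a gate that covers two of these failure modes before retrieval: low confidence and a coverage gap. It returns a reason string so the decision can be logged. The function names, the substring-based coverage test, and the two-document corpus are assumptions; a real system would probably need fuzzier matching than a verbatim substring check.

```python
def corpus_covers(landmark_name, corpus):
    """True if the landmark name appears verbatim (case-insensitive) in any document."""
    needle = landmark_name.strip().lower()
    if not needle:
        return False
    return any(needle in (d["title"] + " " + d["text"]).lower() for d in corpus)

def gate(landmark_name, confidence, corpus, min_confidence=0.5):
    """Return (proceed, reason). The reason is meant for logging."""
    if landmark_name.strip().lower() == "unknown" or confidence < min_confidence:
        return False, f"low confidence ({confidence:.2f})"
    if not corpus_covers(landmark_name, corpus):
        return False, "coverage gap: landmark not mentioned in corpus"
    return True, "ok"

# Shortened stand-in for GEO_CORPUS.
TOY_CORPUS = [
    {"id": "brussels", "title": "Brussels, Belgium",
     "text": "The Atomium, built for Expo 58, is one of its iconic landmarks."},
    {"id": "helsinki", "title": "Helsinki, Finland",
     "text": "Helsinki is home to the Lutheran Helsinki Cathedral."},
]

# Hypothetical vision outputs as (landmark, confidence) pairs.
decisions = {
    name: gate(name, conf, TOY_CORPUS)
    for name, conf in [("Atomium", 0.90), ("Manneken Pis", 0.90), ("unknown", 0.20)]
}
for name, (proceed, reason) in decisions.items():
    print(f"{name:15s} proceed={proceed}  reason={reason}")
```

This does not address miscalibrated confidence: a confidently wrong landmark that happens to be in the corpus still passes, which is why the second gating prompt suggested above remains useful.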
