LLM-Assisted RBMT for Image Captioning
in Indigenous Languages

AmericasNLP 2026 Shared Task entry. An agent reads each language's dev captions plus open-web references and writes a Yaduha-compatible Pydantic grammar package. A vision-language model produces an English caption that the grammar deterministically renders into the target language. No training, no fine-tuning, no parallel corpora.


Abstract

We submit to all five language tracks of the AmericasNLP 2026 Shared Task on Cultural Image Captioning — Bribri (bzd), Guaraní (grn), Yucatec Maya (yua), Orizaba Nahuatl (nlv), and Wixárika (hch) — using a single architecture: image + strict-Literal Pydantic schema → VLM emits SentenceList JSON via OpenAI structured outputs → deterministic Python rendering into the target language. The grammar packages are authored end-to-end by an Anthropic Opus 4.7 coding agent that consults the dev split and open-web references; humans never edit the resulting yaduha-{iso} packages.

Our best-per-language configuration averages 16.86 ChrF++ across the four ChrF++-comparable languages (+2.44 vs the organizer baseline Qwen3-VL → NLLB at 14.42), without ever generating target-language tokens directly. The system also covers Yucatec Maya, which the organizer’s NLLB pipeline does not. Lemma fields are typed as Pydantic Literal[...] enums, so the LLM cannot emit out-of-vocabulary (OOV) terms and output is grammatical by construction (insofar as the LLM-generated grammar is correct). A narrowly prompted proper_noun escape hatch carries genuine named entities through verbatim.

How the pipeline works

One LLM call per image. The VLM (gpt-5) sees the image and the Pydantic schema and emits a SentenceList JSON via OpenAI's native structured outputs API. The target string is synthesized by Python from each Sentence's __str__(). Again, the model never sees or generates target-language tokens.

   image
     │  gpt-5 with strict-Literal schema (OpenAI structured outputs)
     ▼
   list[Sentence]
     │  Sentence.__str__() (deterministic Python)
     ▼  
   target caption
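
The flow in the diagram can be sketched end to end without calling any API. This is a minimal illustration, assuming Pydantic v2: the Sentence and SentenceList classes here are toy stand-ins rather than the real yaduha-{iso} classes, the lexicon entries (tok_dog and so on) are invented placeholder tokens rather than words in any of the five languages, and the vlm_json string stands in for what the VLM would return.

```python
import json
from typing import Literal

from pydantic import BaseModel

# Placeholder lexicon: surface forms are invented tokens, not real words.
NOUNS = {"dog": "tok_dog", "woman": "tok_woman"}
VERBS = {"run": "tok_run", "sit": "tok_sit"}


class Sentence(BaseModel):
    """Toy subject-verb clause; both slots are closed vocabularies."""

    subject: Literal["dog", "woman"]
    verb: Literal["run", "sit"]

    def __str__(self) -> str:
        # Deterministic rendering: dictionary lookups only, no model call.
        return f"{NOUNS[self.subject]} {VERBS[self.verb]}"


class SentenceList(BaseModel):
    sentences: list[Sentence]


# JSON Schema that would accompany the image in a structured-outputs request.
schema = SentenceList.model_json_schema()
subject_enum = schema["$defs"]["Sentence"]["properties"]["subject"]["enum"]

# Hypothetical stand-in for the JSON the VLM returns for one image.
vlm_json = json.dumps(
    {"sentences": [{"subject": "dog", "verb": "run"},
                   {"subject": "woman", "verb": "sit"}]}
)

# Parse the structured output, then render each sentence with __str__().
parsed = SentenceList.model_validate_json(vlm_json)
caption = ". ".join(str(s) for s in parsed.sentences)

print(subject_enum)  # the closed lemma list exposed to the VLM
print(caption)
```

The only text that reaches the final caption comes from the lookup tables inside the package, which is why the target string does not depend on the VLM producing target-language tokens.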
    

The yaduha-{iso} packages are authored by a coding agent, not by us. The agent gets the dev captions, web search, and reference implementations (yaduha-hch, yaduha-ovp), and produces a complete Pydantic schema with a vocabulary, sentence types, and morphology. Lemma fields are typed as Literal[...] enums so the LLM is forced to pick from the package's vocabulary. Genuine proper nouns can pass through verbatim via the proper_noun: Optional[str] slot, which is narrowly prompted to discourage abuse.
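
A closed-vocabulary slot with a proper-noun escape hatch might look like the sketch below. It assumes Pydantic v2; the Noun class, the LEXICON placeholder tokens, and the is_accepted helper are hypothetical, and the rule that exactly one of lemma or proper_noun must be set is an assumption made for this example, not something the generated packages are known to enforce.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError, model_validator

# Placeholder lexicon: surface forms are invented tokens, not real words.
LEXICON = {"river": "tok_river", "house": "tok_house"}


class Noun(BaseModel):
    """Toy noun slot: a closed lemma list plus a verbatim proper-noun field."""

    lemma: Optional[Literal["river", "house"]] = None
    proper_noun: Optional[str] = None

    @model_validator(mode="after")
    def exactly_one_slot(self) -> "Noun":
        # Assumption for this sketch: exactly one of the two fields is set.
        if (self.lemma is None) == (self.proper_noun is None):
            raise ValueError("set exactly one of lemma or proper_noun")
        return self

    def __str__(self) -> str:
        # Proper nouns pass through unchanged; lemmas go through the lexicon.
        if self.proper_noun is not None:
            return self.proper_noun
        return LEXICON[self.lemma]


def is_accepted(payload: dict) -> bool:
    """Return True when the payload validates against the Noun schema."""
    try:
        Noun.model_validate(payload)
        return True
    except ValidationError:
        return False


print(str(Noun(lemma="river")))          # rendered from the lexicon
print(str(Noun(proper_noun="Orizaba")))  # carried through verbatim
print(is_accepted({"lemma": "volcano"})) # lemma outside the closed list
```

Because proper_noun is a free string, it is the one place where arbitrary text can enter the caption, which is the reason for keeping its prompt narrow.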

Aggregate results: dev split, ChrF++

Per-language mean ChrF++ on the 50-row dev split. Each cell is the mean of per-row ChrF++ scores. The ★ rows are our configurations (the primary submission config is highlighted). The bottom row is the few-shot direct-prompting baseline.

Configuration mean*

*Mean over the 4 ChrF++-comparable languages (bzd / grn / nlv / hch). The organizer’s NLLB baseline does not cover yua, so we exclude yua from the mean for direct comparison.
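
The aggregation described above (per-row scores averaged per language, then a mean over the four comparable languages) can be sketched as follows. The per-row numbers are made up purely for illustration and are not results from the task; the function names are hypothetical, and the use of the sample standard deviation for the standard error of the mean is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

# Languages covered by both our system and the organizer baseline (yua excluded).
COMPARABLE = ("bzd", "grn", "nlv", "hch")


def mean_and_sem(scores: list[float]) -> tuple[float, float]:
    """Mean of per-row scores and its standard error (sample stdev / sqrt(n))."""
    m = mean(scores)
    sem = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, sem


def comparable_mean(per_language_mean: dict[str, float]) -> float:
    """Unweighted mean over the four ChrF++-comparable languages."""
    return mean(per_language_mean[iso] for iso in COMPARABLE)


# Made-up per-row ChrF++ scores, for illustration only.
toy_rows = {
    "bzd": [10.0, 20.0],
    "grn": [30.0, 10.0],
    "nlv": [15.0, 25.0],
    "hch": [5.0, 15.0],
    "yua": [40.0, 60.0],  # reported per language, but left out of the mean
}

per_language = {iso: mean_and_sem(rows)[0] for iso, rows in toy_rows.items()}
overall = comparable_mean(per_language)
print(per_language)
print(overall)
```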

Translation explorer

Inspect every dev image and how each configuration captioned it. Each column header has a search box (per-column filter) and is click-sortable. Click a row to see the full image alongside the predicted caption, back-translated English, and (for the pipeline methods) the English intermediate the VLM produced.

Columns: Image, Language, Configuration, ID, Gold caption, Predicted, Back-translated, ChrF++

ChrF++ by language (mean ± 1 SEM)

ChrF++ distribution

Final results

Shared-task organizers ranked systems by ChrF++ first, then took the top 5 per language (one per team) to human evaluation. We (team yaduha) won 2 of 5 languages on human evaluation (Bribri and Orizaba Nahuatl) and placed in the top 3 on Yucatec Maya. We did not place in the top 5 by ChrF++ for Guaraní or Wixárika, so our entries for those languages were not human-evaluated. RAN = Resource Abundance Notation (speakers / monolingual / bilingual partners, order of magnitude).

Language RAN ChrF rank ChrF score Human rank Human rating

Test submission per-row predictions

Per-row predictions on the 981-image test split, produced by our submitted configuration: gpt-5 one-step (single VLM call with image + Pydantic schema, structured output) paired with gpt-4o-mini back-translation from the structured intermediate.

Columns: Image, Language, ID, Predicted, Back-translated

Recipe

  1. Generate the language package: uv run americasnlp generate-language --iso bzd runs an Opus 4.7 agent with web search, package read tools, and a validation harness over the 30-row training slice of dev. The agent emits a complete yaduha-bzd Python package (Pydantic Sentence classes + vocabulary).
  2. Caption the dev split: americasnlp evaluate --language bribri --method one-step --vlm gpt-5. The single API call sends the image plus the schema; OpenAI's structured-outputs API holds gpt-5's output to the Literal[...] lemma constraint.
  3. Pick the best config per language from the dev matrix above. We chose the gpt-5 one-step config for all languages for its simplicity and strong performance.
  4. Submit on the test split: americasnlp submit --language bribri --method one-step --vlm gpt-5 --output ....