Abstract
We submit to all five language tracks of the AmericasNLP 2026 Shared
Task on Cultural Image Captioning — Bribri (bzd), Guaraní (grn),
Yucatec Maya (yua), Orizaba Nahuatl (nlv), and Wixárika (hch) —
using a single architecture: image + strict-Literal Pydantic schema
→ VLM emits SentenceList JSON via OpenAI structured
outputs → deterministic Python rendering into the target language.
The grammar packages are authored end-to-end by an Anthropic Opus 4.7
coding agent that consults the dev split and open-web references;
humans never edit the resulting yaduha-{iso} packages.
Our best-per-language configuration averages 16.86 ChrF++
across the four ChrF++-comparable languages (+2.44
vs the organizer baseline Qwen3-VL → NLLB at 14.42), without
ever generating target-language tokens directly. The system also
covers Yucatec Maya, which the organizer’s NLLB pipeline does
not. Because lemma fields are typed as Pydantic Literal[...]
enums, the LLM cannot emit out-of-vocabulary (OOV) lemmas, so output is
grammatical by construction (insofar as the LLM-generated grammar is correct).
A narrowly-prompted proper_noun escape hatch carries
genuine named entities through verbatim.
How the pipeline works
One LLM call per image. The VLM (gpt-5) sees the image and the
Pydantic schema and emits a SentenceList JSON via
OpenAI's native structured outputs API. The target string is
synthesized by Python from each Sentence's
__str__(). Again, the model never sees or generates
target-language tokens.
image
│ gpt-5 with strict-Literal schema (OpenAI structured outputs)
▼
list[Sentence]
│ Sentence.__str__() (deterministic Python)
▼
target caption
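As a minimal sketch of the idea in the diagram above, the snippet below renders a structured sentence into a string with plain Python. It uses standard-library dataclasses in place of Pydantic to stay dependency-free, and every name in it (Noun, Verb, SURFACE, render, the subject-object-verb order, the placeholder surface forms) is a hypothetical stand-in rather than anything taken from a yaduha-{iso} package.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

# Hypothetical closed vocabularies (English glosses used as lemma keys).
Noun = Literal["dog", "woman", "house"]
Verb = Literal["see", "build"]

# Placeholder surface forms, invented for illustration only;
# they are NOT real words in any of the task languages.
SURFACE = {
    "dog": "<dog>", "woman": "<woman>", "house": "<house>",
    "see": "<see>", "build": "<build>",
}


@dataclass(frozen=True)
class Sentence:
    subject: Noun
    verb: Verb
    obj: Optional[Noun] = None

    def __post_init__(self) -> None:
        # Pydantic would reject out-of-vocabulary lemmas on its own;
        # with dataclasses we check the Literal members by hand.
        if self.subject not in get_args(Noun):
            raise ValueError(f"unknown noun lemma: {self.subject!r}")
        if self.verb not in get_args(Verb):
            raise ValueError(f"unknown verb lemma: {self.verb!r}")
        if self.obj is not None and self.obj not in get_args(Noun):
            raise ValueError(f"unknown noun lemma: {self.obj!r}")

    def __str__(self) -> str:
        # Toy subject-object-verb order; a real package encodes its own morphology.
        parts = [SURFACE[self.subject]]
        if self.obj is not None:
            parts.append(SURFACE[self.obj])
        parts.append(SURFACE[self.verb])
        return " ".join(parts)


def render(sentences: list[Sentence]) -> str:
    """Join the deterministic rendering of each sentence into one caption."""
    return " ".join(f"{s}." for s in sentences)
```

Because the string is assembled only from table lookups on validated lemmas, the model's output never contributes raw target-language text in this sketch.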
The yaduha-{iso} packages are authored by a coding agent
— not by us. The agent gets the dev captions, web search, and a
reference implementation (yaduha-hch, yaduha-ovp),
and produces a complete Pydantic schema with a vocabulary, sentence
types, and morphology. Lemma fields are typed as
Literal[...] enums so the LLM is forced to pick from the
package's vocabulary. Genuine proper nouns can pass through verbatim
via the proper_noun: Optional[str] slot — narrowly
prompted to discourage abuse.
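To illustrate how a Literal-typed lemma field and the proper_noun slot might look from the model's side, the sketch below builds a JSON Schema fragment by hand and checks a candidate object against it. The field name lemma, the noun-phrase shape, and the rule that exactly one of the two slots is filled are assumptions made for illustration; the actual packages derive their schema through Pydantic and may differ.

```python
from typing import Literal, get_args

Noun = Literal["dog", "woman", "house"]  # hypothetical vocabulary


def noun_phrase_schema() -> dict:
    """JSON Schema for a noun slot: a closed lemma enum plus a free-text escape hatch."""
    return {
        "type": "object",
        "properties": {
            # Literal members become an enum, so only listed lemmas are valid.
            "lemma": {"enum": [*get_args(Noun), None]},
            # Free text, intended only for genuine named entities.
            "proper_noun": {"type": ["string", "null"]},
        },
        "required": ["lemma", "proper_noun"],
        "additionalProperties": False,
    }


def check_noun_phrase(obj: dict) -> str:
    """Return the proper noun verbatim, or the validated lemma key.

    Raises ValueError if obj does not satisfy the schema above.
    """
    schema = noun_phrase_schema()
    if set(obj) != set(schema["required"]):
        raise ValueError("unexpected or missing keys")
    lemma, proper = obj["lemma"], obj["proper_noun"]
    if lemma not in schema["properties"]["lemma"]["enum"]:
        raise ValueError(f"out-of-vocabulary lemma: {lemma!r}")
    if proper is not None and not isinstance(proper, str):
        raise ValueError("proper_noun must be a string or null")
    # Assumed rule: exactly one of the two slots is filled.
    if (lemma is None) == (proper is None):
        raise ValueError("set exactly one of lemma / proper_noun")
    return proper if proper is not None else lemma
```

The escape hatch is the only place where free text can enter, which is why the prompt around it has to be narrow.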
Aggregate results: dev split, ChrF++
Per-language mean ChrF++ on the 50-row dev split. Each cell is the mean of per-row ChrF++ scores. The ★ rows are our configurations (the primary submission config is highlighted). The bottom row is the few-shot direct-prompting baseline.
| Configuration | mean* |
|---|---|
*Mean over the 4 ChrF++-comparable languages (bzd / grn / nlv / hch). The organizer’s NLLB baseline does not cover yua, so we exclude yua from the mean for direct comparison.
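The averaging convention can be sketched as follows: average the per-row ChrF++ scores within each language, then average the language means while leaving out yua. The per-row numbers below are made-up placeholders, not results from this work; only the 16.86 and 14.42 figures come from the abstract.

```python
from statistics import fmean

# Languages covered by both our system and the NLLB baseline (yua excluded).
COMPARABLE = ("bzd", "grn", "nlv", "hch")


def language_means(rows: dict[str, list[float]]) -> dict[str, float]:
    """Mean of per-row ChrF++ scores for each language."""
    return {lang: fmean(scores) for lang, scores in rows.items()}


def comparable_mean(lang_means: dict[str, float]) -> float:
    """Unweighted mean over the comparable languages only."""
    return fmean([lang_means[lang] for lang in COMPARABLE])


# Placeholder per-row scores, invented for illustration.
toy_rows = {
    "bzd": [10.0, 20.0],
    "grn": [12.0, 18.0],
    "nlv": [14.0, 16.0],
    "hch": [11.0, 13.0],
    "yua": [40.0, 60.0],  # ignored by comparable_mean
}
toy_mean = comparable_mean(language_means(toy_rows))

# Gap between the two reported means from the abstract.
delta = round(16.86 - 14.42, 2)
```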
Translation explorer
The explorer lists every dev image and how each configuration captioned it: the gold caption, the predicted caption, its English back-translation, the per-row ChrF++ score, and (for the pipeline methods) the English intermediate the VLM produced.
| Image | Language | Configuration | ID | Gold caption | Predicted | Back-translated | ChrF++ |
|---|---|---|---|---|---|---|---|
[Figure: ChrF++ by language, mean ± 1 SEM]
[Figure: ChrF++ distribution]
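For reference, the ± 1 SEM error bars named above can be computed as the standard deviation of the per-row scores divided by the square root of the row count. Whether the plots use the sample (n-1) or the population form is not stated here, so the n-1 choice below is an assumption, and the scores are placeholders.

```python
from math import sqrt
from statistics import fmean, stdev


def mean_and_sem(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-row scores."""
    if len(scores) < 2:
        raise ValueError("need at least two rows to estimate the SEM")
    # stdev() is the sample standard deviation (n-1 denominator).
    return fmean(scores), stdev(scores) / sqrt(len(scores))


# Placeholder per-row ChrF++ scores, invented for illustration.
toy_scores = [10.0, 14.0, 18.0, 22.0]
toy_mean, toy_sem = mean_and_sem(toy_scores)
```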
Final results
Shared-task organizers first ranked systems by ChrF++, then took
the top 5 per language (one per team) to human evaluation. We
(team yaduha) won 2 of 5 languages on human evaluation
(Bribri, Nahuatl) and placed in the top 3 on Maya. We did not place in the
top 5 by ChrF++ for Guaraní or Wixárika, so those submissions were not human-evaluated.
RAN = Resource Abundance
Notation (speakers / monolingual / bilingual partners,
order of magnitude).
| Language | RAN | ChrF rank | ChrF score | Human rank | Human rating |
|---|---|---|---|---|---|
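The selection rule described in this section (rank by ChrF++, keep each team's best system, send the top five per language to human evaluation) could be sketched like this. Team names and scores are invented placeholders, and tie-breaking or other details of the organizers' actual procedure are not specified here.

```python
from typing import NamedTuple


class Submission(NamedTuple):
    team: str
    system: str
    chrf_pp: float  # ChrF++ on the test split for one language


def select_for_human_eval(subs: list[Submission], k: int = 5) -> list[Submission]:
    """Keep each team's best-scoring system, then return the top k by ChrF++."""
    best_per_team: dict[str, Submission] = {}
    for sub in subs:
        current = best_per_team.get(sub.team)
        if current is None or sub.chrf_pp > current.chrf_pp:
            best_per_team[sub.team] = sub
    ranked = sorted(best_per_team.values(), key=lambda s: s.chrf_pp, reverse=True)
    return ranked[:k]


# Placeholder submissions for a single language, invented for illustration.
toy_subs = [
    Submission("A", "s1", 20.0), Submission("A", "s2", 25.0),
    Submission("B", "s1", 22.0), Submission("C", "s1", 18.0),
    Submission("D", "s1", 30.0), Submission("E", "s1", 10.0),
    Submission("F", "s1", 5.0),
]
selected = select_for_human_eval(toy_subs)
```

Under this rule a team outside the top five by ChrF++ never reaches human evaluation, regardless of how its captions would have been rated.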
Test submission per-row predictions
Per-row predictions on the 981-image test split, produced by our
submitted configuration: gpt-5 one-step
(single VLM call with image + Pydantic schema, structured output)
paired with gpt-4o-mini back-translation from the
structured intermediate.
| Image | Language | ID | Predicted | Back-translated |
|---|---|---|---|---|
Recipe
- Generate the language package:
  uv run americasnlp generate-language --iso bzd
  runs an Opus 4.7 agent with web search, package read tools, and a validation harness over the 30-row training slice of dev. The agent emits a complete yaduha-bzd Python package (Pydantic Sentence classes + vocabulary).
- Caption the dev split:
  americasnlp evaluate --language bribri --method one-step --vlm gpt-5
  The single API call sends the image plus the schema; gpt-5's structured-outputs API enforces the Literal[...] lemma constraint at validation time.
- Pick the best config per language from the dev matrix above. We chose the gpt-5 one-step config for all languages for its simplicity and strong performance.
- Submit on the test split:
  americasnlp submit --language bribri --method one-step --vlm gpt-5 --output ...