Dotting Test

Turkish glyph benchmark for image models.

A glyph-level benchmark for Turkish text in AI-generated images. Human labels are the ground truth; AI judge labels are auxiliary and serve only for scanning the corpus at scale.

The Hugging Face Dataset is the canonical artifact. This Space for fge-auto/dotting-benchmark is a lightweight browser for the leaderboard and example images.

8,400 generation rows

8,396 OCR/VLM rows

1,055 human labels

Status: 8,396 successful images · 4 generation errors

AI-Estimated Model Ranking

Gemini 3.5 Flash labels cover the full corpus and are useful for scanning trends. Final claims should use the human-labeled subset.

Model	Images	Correct (AI-est.)	Dotted on legible dotless targets	Legible
GPT Image 2	210	97.1%	1.8%	100.0%
Nano Banana 2	210	93.3%	5.4%	100.0%
Nano Banana Pro	210	86.7%	6.0%	100.0%
GPT Image 1.5	210	83.8%	10.7%	100.0%
GPT Image 1 Mini	207	82.6%	7.9%	100.0%
GPT Image 1	210	82.4%	10.1%	100.0%
Grok Imagine Image Quality	210	78.1%	17.9%	100.0%
Ideogram 4.0	210	75.2%	13.9%	98.6%
FLUX.2 [flex]	210	65.7%	25.0%	100.0%
Krea 2 Medium	210	63.3%	19.2%	99.5%
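
The percentages in the table can be turned back into approximate image counts, which makes it easier to see how small the gaps between neighbouring models are. The sketch below parses the tab-separated rows exactly as shown above and multiplies each "Correct (AI-est.)" rate by the "Images" column. It assumes that "Images" is the denominator of that rate; the function names are hypothetical, and the resulting counts are reconstructions from rounded percentages, not values published with the benchmark.

```python
import csv
import io

# Rows copied verbatim from the leaderboard table above (tab-separated).
TABLE = (
    "Model\tImages\tCorrect (AI-est.)\tDotted on legible dotless targets\tLegible\n"
    "GPT Image 2\t210\t97.1%\t1.8%\t100.0%\n"
    "Nano Banana 2\t210\t93.3%\t5.4%\t100.0%\n"
    "Nano Banana Pro\t210\t86.7%\t6.0%\t100.0%\n"
    "GPT Image 1.5\t210\t83.8%\t10.7%\t100.0%\n"
    "GPT Image 1 Mini\t207\t82.6%\t7.9%\t100.0%\n"
    "GPT Image 1\t210\t82.4%\t10.1%\t100.0%\n"
    "Grok Imagine Image Quality\t210\t78.1%\t17.9%\t100.0%\n"
    "Ideogram 4.0\t210\t75.2%\t13.9%\t98.6%\n"
    "FLUX.2 [flex]\t210\t65.7%\t25.0%\t100.0%\n"
    "Krea 2 Medium\t210\t63.3%\t19.2%\t99.5%\n"
)


def parse_pct(text: str) -> float:
    """Convert a string such as '97.1%' to the fraction 0.971."""
    return float(text.rstrip("%")) / 100.0


def implied_correct_counts(table_text: str) -> dict[str, int]:
    """Estimate correct-image counts per model.

    Assumption: 'Correct (AI-est.)' is computed over the 'Images' column.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {
        row["Model"]: round(int(row["Images"]) * parse_pct(row["Correct (AI-est.)"]))
        for row in reader
    }


counts = implied_correct_counts(TABLE)
for model, n_correct in counts.items():
    print(f"{model}: ~{n_correct} images judged correct")
```

Under that assumption, GPT Image 1 Mini and GPT Image 1 differ by only a couple of images, which is one reason final claims should rest on the human-labeled subset rather than on the AI-estimated ranking.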