GPT vs Claude vs Gemini vs DeepSeek for App Localization

I tested GPT-4o, Claude Sonnet, Gemini 2.5 Pro, and DeepSeek on the same app string catalog across 8 languages.

May 14, 2026 · 11 min read

The cheapest way to choose an LLM for localization is to read benchmarks. The honest way is to translate your own catalog with four of them and compare.

I did the latter. Same 200-key catalog from Cube. Same eight target languages: German, French, Spanish, Japanese, Korean, Russian, Portuguese (BR), and Arabic. Same prompt, same glossary, same temperature. Four runs: GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro, and DeepSeek V3.

This post is the result. Cost numbers, quality observations, speed and reliability notes, and what I actually use day-to-day for my own apps.

The setup

To make the comparison fair, the variables were locked:

  • 200 keys drawn from the Cube source catalog — a mix of UI labels, button copy, error messages, and onboarding text. Average string length around 6 words.
  • 8 languages chosen for coverage: 4 European, 3 East Asian, and Arabic for RTL.
  • Identical system prompt describing the app, tone (indie-developer, technical, concise), and a small glossary of do-not-translate terms (Cube, Xcode, .xcstrings, App Store Connect).
  • Identical user prompt format: JSON in, JSON out, one batch per language.
  • Temperature 0.2 across all four (low enough for consistency, high enough to avoid weird artifacts in non-English output).

Each run produced 1,600 translated cells. I scored each cell on three things: format preservation (did %@, %lld, and markdown survive?), glossary compliance (did do-not-translate terms stay in source language?), and human-readable quality (rated by a native speaker for the language).

For quality I had native reviewers for German, Spanish, Japanese, Russian, and Portuguese. For French, Korean, and Arabic I relied on my own competence + back-translation spot checks. So treat those three with a grain of salt.

Cost

The first surprise: the cost spread is bigger than the quality spread.

ModelInputOutputTotal per run
GPT-4o$2.50/M$10/M$0.18
GPT-4o-mini$0.15/M$0.60/M$0.013
Claude Sonnet 4.6$3/M$15/M$0.24
Gemini 2.5 Pro$1.25/M$10/M$0.15
DeepSeek V3$0.27/M$1.10/M$0.024

Numbers are May 2026 list prices and will drift. The shape of the gap is what matters: DeepSeek and GPT-4o-mini are 10× cheaper than the flagship models for translation work.

For a small app catalog (500-1000 keys) the absolute cost is irrelevant either way — you’re spending pennies. For a large app catalog (10k+ keys) translated frequently as the app evolves, the ratio starts to matter. At 50k cells/month with GPT-4o-mini you’re looking at $4-5; with Claude you’re at $60.

I’ll come back to the question of “is the expensive model worth 10× more” in the quality section. Spoiler: usually no.

Quality, per language

Here’s where it gets interesting.

German

All four models produce shippable German. Differences are in register: Claude defaults to formal (Sie), GPT and Gemini split, DeepSeek defaults to informal (du). For an indie consumer app I want informal — Claude needed an explicit “use informal du” instruction.

Best: Gemini 2.5 Pro — most natural phrasing for short button labels. Avoid for German: none. All four are fine with the right prompt.

French

Similar to German. Tu vs vous — Claude and Gemini default to vous (formal); GPT and DeepSeek to tu. Same fix: explicit instruction.

Best: Claude Sonnet — handles the gender agreement on past participles more reliably (“Tu as supprimé” vs “Tu as supprimée” when subject is feminine). Avoid: DeepSeek occasionally truncated phrasing in a way that lost meaning. Recoverable with a review pass.

Spanish

All four solid. The interesting failure mode: GPT-4o-mini sometimes translated brand names despite the glossary. “Cube” became “Cubo” in one out of 200 strings. The flagship GPT-4o didn’t make that mistake. DeepSeek didn’t either.

Best: GPT-4o or Claude — both reliably preserve glossary terms. Avoid for production without a review: GPT-4o-mini.

Japanese

This is where the spread widens. Japanese has multiple speech levels (です/ます polite, plain だ form, humble, honorific), and the choice changes the feel of the app dramatically.

Best by a clear margin: Claude Sonnet 4.6. Most natural-feeling polite form, didn’t over-formalize. Gemini 2.5 Pro was a close second. GPT-4o was correct but stiffer. DeepSeek V3 was the weakest — recognizable as machine-translated to my Japanese reviewer.

Korean

Similar pattern to Japanese. Speech level (해요체 vs 합니다체 vs casual) is the lever.

Best: Claude, again. GPT-4o good. Gemini good with caveats — occasionally over-honorific. DeepSeek weakest, though usable.

Russian

Russian is plural-heavy (six forms in the strict sense, three commonly), and case-heavy (every noun changes by case). All four models got plurals right when the catalog used proper ICU plural forms — but only the catalog format guarantees that.

Best: GPT-4o — most idiomatic. Claude second. Gemini and DeepSeek both correct but a bit literal.

Portuguese (BR)

Brazilian Portuguese is its own thing — different from European Portuguese. The instruction “Brazilian Portuguese, not European” matters.

Best: tied between GPT-4o and Claude. Gemini sometimes slipped into European phrasing despite the instruction. DeepSeek good but with some Continental vocabulary.

Arabic

Right-to-left, plus diglossia (Modern Standard Arabic vs colloquial varieties). For an app, MSA is the right choice — and all four models default to MSA.

Best: Claude Sonnet — most natural MSA, didn’t over-formalize. GPT-4o good. Gemini good. DeepSeek weakest — occasionally produced unnatural phrasings according to my back-translation review.

Format preservation

This is the unsexy one that matters more than people think. When you have "%lld photos imported" and the model returns "%lid фото импортировано" (note the %lld%lid), you’ve shipped a crash.

I ran a regex pass over all 1,600 outputs per run looking for: format specifiers (%@, %lld, %d, %lf), placeholder syntax ({name}, :variable), and markdown (**bold**, _italic_, backticks).

Results:

ModelFormat errors / 1600
GPT-4o2
Claude Sonnet 4.61
Gemini 2.5 Pro4
DeepSeek V311
GPT-4o-mini8

DeepSeek and GPT-4o-mini are the noticeable outliers. Both are usable, but both demand an automated post-translation validator that flags broken specifiers. Most translation tools (Cube included) run this check automatically.

Speed

Single-batch, 200 keys × 1 language:

ModelWall-clock per language
GPT-4o-mini~6s
DeepSeek V3~10s
GPT-4o~15s
Gemini 2.5 Pro~18s
Claude Sonnet 4.6~22s

For 30 languages running in parallel, this is moot — the slowest one bounds the total wall-clock. For sequential single-language updates while iterating, the 4× spread between GPT-4o-mini and Claude is real.

Reliability

Across the four full runs, here’s the failure tally — anything from rate-limit errors to malformed JSON output that required a retry.

ModelRetries / 8 languages
GPT-4o0
Claude Sonnet 4.60
Gemini 2.5 Pro1 (rate-limit)
DeepSeek V32 (malformed JSON, retried successfully)
GPT-4o-mini1 (truncated output on a long language)

Real production usage at higher volume would surface different patterns. For 8-language one-shot translation: all four are fine.

When to pick which

The honest summary based on what I’ve observed:

Default for indie apps with mixed Asian + European languages: Claude Sonnet 4.6. Best quality on Japanese and Korean by a meaningful margin, very solid on everything else, near-zero format errors. Most expensive of the four — but the absolute cost is still pennies for typical catalogs. The quality difference shows up in the review pass: less to fix.

Default if you don’t ship to East Asia: GPT-4o. Slightly cheaper than Claude, comparable on European languages, less consistent on Japanese/Korean.

Pick Gemini 2.5 Pro when: you want long context (it handles large catalogs in a single batch better than the others) or you’re already in the Google Cloud ecosystem with credits to burn. Quality competitive with GPT-4o.

Pick DeepSeek V3 when: cost is the binding constraint, you have a strong human review process downstream, and you’re translating mostly European languages. Avoid for East Asian production work.

Pick GPT-4o-mini when: you’re doing iterative drafting and want fast cheap turnaround, and you’ll re-translate the final pass with a flagship model. Or for languages where “good enough” is the bar and a native review will fix the rest.

What I actually do

For Cube itself I use Claude Sonnet 4.6 for the initial translation pass on all 30 locales, then Claude Opus 4 for the review pass on the 5 languages that drive the most App Store impressions. The Opus pass catches the subtle register issues I’d miss otherwise.

The bill for translating Cube’s current catalog (~2,500 keys × 30 locales) is around $0.60 for the initial pass and another $0.30 for the review pass on five locales. Less than a coffee, and I sleep fine with the result.

That’s the bottom-line shift: model choice matters in nuance, but the cost spread doesn’t matter at indie-app scale. Pick the model that produces the best quality for your highest-value languages and stop optimizing for fractions of a cent.

If you’re using Cube you can pick the model per project and switch on the fly — different apps, different requirements, no lock-in. If you’re using something else, the framework above applies regardless of tool.

What I’d love to see next: a model trained specifically on app localization data, with native handling of .xcstrings plurals and device variants. The general-purpose models are already strong; a specialist would push the quality bar another step.