GPT vs Claude vs Gemini vs DeepSeek for App Localization
I tested GPT-4o, Claude Sonnet, Gemini 2.5 Pro, and DeepSeek on the same app string catalog across 8 languages.
The cheapest way to choose an LLM for localization is to read benchmarks. The honest way is to translate your own catalog with four of them and compare.
I did the latter. Same 200-key catalog from Cube. Same eight target languages: German, French, Spanish, Japanese, Korean, Russian, Portuguese (BR), and Arabic. Same prompt, same glossary, same temperature. Four runs: GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro, and DeepSeek V3.
This post is the result. Cost numbers, quality observations, speed and reliability notes, and what I actually use day-to-day for my own apps.
The setup
To make the comparison fair, the variables were locked:
- 200 keys drawn from the Cube source catalog — a mix of UI labels, button copy, error messages, and onboarding text. Average string length around 6 words.
- 8 languages chosen for coverage: 4 European, 3 East Asian, and Arabic for RTL.
- Identical system prompt describing the app, tone (indie-developer, technical, concise), and a small glossary of do-not-translate terms (
Cube,Xcode,.xcstrings,App Store Connect). - Identical user prompt format: JSON in, JSON out, one batch per language.
- Temperature 0.2 across all four (low enough for consistency, high enough to avoid weird artifacts in non-English output).
Each run produced 1,600 translated cells. I scored each cell on three things: format preservation (did %@, %lld, and markdown survive?), glossary compliance (did do-not-translate terms stay in source language?), and human-readable quality (rated by a native speaker for the language).
For quality I had native reviewers for German, Spanish, Japanese, Russian, and Portuguese. For French, Korean, and Arabic I relied on my own competence + back-translation spot checks. So treat those three with a grain of salt.
Cost
The first surprise: the cost spread is bigger than the quality spread.
| Model | Input | Output | Total per run |
|---|---|---|---|
| GPT-4o | $2.50/M | $10/M | $0.18 |
| GPT-4o-mini | $0.15/M | $0.60/M | $0.013 |
| Claude Sonnet 4.6 | $3/M | $15/M | $0.24 |
| Gemini 2.5 Pro | $1.25/M | $10/M | $0.15 |
| DeepSeek V3 | $0.27/M | $1.10/M | $0.024 |
Numbers are May 2026 list prices and will drift. The shape of the gap is what matters: DeepSeek and GPT-4o-mini are 10× cheaper than the flagship models for translation work.
For a small app catalog (500-1000 keys) the absolute cost is irrelevant either way — you’re spending pennies. For a large app catalog (10k+ keys) translated frequently as the app evolves, the ratio starts to matter. At 50k cells/month with GPT-4o-mini you’re looking at $4-5; with Claude you’re at $60.
I’ll come back to the question of “is the expensive model worth 10× more” in the quality section. Spoiler: usually no.
Quality, per language
Here’s where it gets interesting.
German
All four models produce shippable German. Differences are in register: Claude defaults to formal (Sie), GPT and Gemini split, DeepSeek defaults to informal (du). For an indie consumer app I want informal — Claude needed an explicit “use informal du” instruction.
Best: Gemini 2.5 Pro — most natural phrasing for short button labels. Avoid for German: none. All four are fine with the right prompt.
French
Similar to German. Tu vs vous — Claude and Gemini default to vous (formal); GPT and DeepSeek to tu. Same fix: explicit instruction.
Best: Claude Sonnet — handles the gender agreement on past participles more reliably (“Tu as supprimé” vs “Tu as supprimée” when subject is feminine). Avoid: DeepSeek occasionally truncated phrasing in a way that lost meaning. Recoverable with a review pass.
Spanish
All four solid. The interesting failure mode: GPT-4o-mini sometimes translated brand names despite the glossary. “Cube” became “Cubo” in one out of 200 strings. The flagship GPT-4o didn’t make that mistake. DeepSeek didn’t either.
Best: GPT-4o or Claude — both reliably preserve glossary terms. Avoid for production without a review: GPT-4o-mini.
Japanese
This is where the spread widens. Japanese has multiple speech levels (です/ます polite, plain だ form, humble, honorific), and the choice changes the feel of the app dramatically.
Best by a clear margin: Claude Sonnet 4.6. Most natural-feeling polite form, didn’t over-formalize. Gemini 2.5 Pro was a close second. GPT-4o was correct but stiffer. DeepSeek V3 was the weakest — recognizable as machine-translated to my Japanese reviewer.
Korean
Similar pattern to Japanese. Speech level (해요체 vs 합니다체 vs casual) is the lever.
Best: Claude, again. GPT-4o good. Gemini good with caveats — occasionally over-honorific. DeepSeek weakest, though usable.
Russian
Russian is plural-heavy (six forms in the strict sense, three commonly), and case-heavy (every noun changes by case). All four models got plurals right when the catalog used proper ICU plural forms — but only the catalog format guarantees that.
Best: GPT-4o — most idiomatic. Claude second. Gemini and DeepSeek both correct but a bit literal.
Portuguese (BR)
Brazilian Portuguese is its own thing — different from European Portuguese. The instruction “Brazilian Portuguese, not European” matters.
Best: tied between GPT-4o and Claude. Gemini sometimes slipped into European phrasing despite the instruction. DeepSeek good but with some Continental vocabulary.
Arabic
Right-to-left, plus diglossia (Modern Standard Arabic vs colloquial varieties). For an app, MSA is the right choice — and all four models default to MSA.
Best: Claude Sonnet — most natural MSA, didn’t over-formalize. GPT-4o good. Gemini good. DeepSeek weakest — occasionally produced unnatural phrasings according to my back-translation review.
Format preservation
This is the unsexy one that matters more than people think. When you have "%lld photos imported" and the model returns "%lid фото импортировано" (note the %lld → %lid), you’ve shipped a crash.
I ran a regex pass over all 1,600 outputs per run looking for: format specifiers (%@, %lld, %d, %lf), placeholder syntax ({name}, :variable), and markdown (**bold**, _italic_, backticks).
Results:
| Model | Format errors / 1600 |
|---|---|
| GPT-4o | 2 |
| Claude Sonnet 4.6 | 1 |
| Gemini 2.5 Pro | 4 |
| DeepSeek V3 | 11 |
| GPT-4o-mini | 8 |
DeepSeek and GPT-4o-mini are the noticeable outliers. Both are usable, but both demand an automated post-translation validator that flags broken specifiers. Most translation tools (Cube included) run this check automatically.
Speed
Single-batch, 200 keys × 1 language:
| Model | Wall-clock per language |
|---|---|
| GPT-4o-mini | ~6s |
| DeepSeek V3 | ~10s |
| GPT-4o | ~15s |
| Gemini 2.5 Pro | ~18s |
| Claude Sonnet 4.6 | ~22s |
For 30 languages running in parallel, this is moot — the slowest one bounds the total wall-clock. For sequential single-language updates while iterating, the 4× spread between GPT-4o-mini and Claude is real.
Reliability
Across the four full runs, here’s the failure tally — anything from rate-limit errors to malformed JSON output that required a retry.
| Model | Retries / 8 languages |
|---|---|
| GPT-4o | 0 |
| Claude Sonnet 4.6 | 0 |
| Gemini 2.5 Pro | 1 (rate-limit) |
| DeepSeek V3 | 2 (malformed JSON, retried successfully) |
| GPT-4o-mini | 1 (truncated output on a long language) |
Real production usage at higher volume would surface different patterns. For 8-language one-shot translation: all four are fine.
When to pick which
The honest summary based on what I’ve observed:
Default for indie apps with mixed Asian + European languages: Claude Sonnet 4.6. Best quality on Japanese and Korean by a meaningful margin, very solid on everything else, near-zero format errors. Most expensive of the four — but the absolute cost is still pennies for typical catalogs. The quality difference shows up in the review pass: less to fix.
Default if you don’t ship to East Asia: GPT-4o. Slightly cheaper than Claude, comparable on European languages, less consistent on Japanese/Korean.
Pick Gemini 2.5 Pro when: you want long context (it handles large catalogs in a single batch better than the others) or you’re already in the Google Cloud ecosystem with credits to burn. Quality competitive with GPT-4o.
Pick DeepSeek V3 when: cost is the binding constraint, you have a strong human review process downstream, and you’re translating mostly European languages. Avoid for East Asian production work.
Pick GPT-4o-mini when: you’re doing iterative drafting and want fast cheap turnaround, and you’ll re-translate the final pass with a flagship model. Or for languages where “good enough” is the bar and a native review will fix the rest.
What I actually do
For Cube itself I use Claude Sonnet 4.6 for the initial translation pass on all 30 locales, then Claude Opus 4 for the review pass on the 5 languages that drive the most App Store impressions. The Opus pass catches the subtle register issues I’d miss otherwise.
The bill for translating Cube’s current catalog (~2,500 keys × 30 locales) is around $0.60 for the initial pass and another $0.30 for the review pass on five locales. Less than a coffee, and I sleep fine with the result.
That’s the bottom-line shift: model choice matters in nuance, but the cost spread doesn’t matter at indie-app scale. Pick the model that produces the best quality for your highest-value languages and stop optimizing for fractions of a cent.
If you’re using Cube you can pick the model per project and switch on the fly — different apps, different requirements, no lock-in. If you’re using something else, the framework above applies regardless of tool.
What I’d love to see next: a model trained specifically on app localization data, with native handling of .xcstrings plurals and device variants. The general-purpose models are already strong; a specialist would push the quality bar another step.