GPT vs Claude vs Gemini vs DeepSeek for App Localization

The cheapest way to choose an LLM for localization is to read benchmarks. The honest way is to translate your own catalog with four of them and compare.

I did the latter. Same 200-key catalog from Cube. Same eight target languages: German, French, Spanish, Japanese, Korean, Russian, Portuguese (BR), and Arabic. Same prompt, same glossary, same temperature. Four runs: GPT-4o, Claude Sonnet 4.6, Gemini 2.5 Pro, and DeepSeek V3.

This post is the result. Cost numbers, quality observations, speed and reliability notes, and what I actually use day-to-day for my own apps.

The setup

To make the comparison fair, the variables were locked:

200 keys drawn from the Cube source catalog — a mix of UI labels, button copy, error messages, and onboarding text. Average string length around 6 words.
8 languages chosen for coverage: 4 European, 3 East Asian, and Arabic for RTL.
Identical system prompt describing the app, tone (indie-developer, technical, concise), and a small glossary of do-not-translate terms (Cube, Xcode, .xcstrings, App Store Connect).
Identical user prompt format: JSON in, JSON out, one batch per language.
Temperature 0.2 across all four (low enough for consistency, high enough to avoid weird artifacts in non-English output).

Each run produced 1,600 translated cells. I scored each cell on three things: format preservation (did %@, %lld, and markdown survive?), glossary compliance (did do-not-translate terms stay in source language?), and human-readable quality (rated by a native speaker for the language).

For quality I had native reviewers for German, Spanish, Japanese, Russian, and Portuguese. For French, Korean, and Arabic I relied on my own competence + back-translation spot checks. So treat those three with a grain of salt.

Cost

The first surprise: the cost spread is bigger than the quality spread.

Model	Input	Output	Total per run
GPT-4o	$2.50/M	$10/M	$0.18
GPT-4o-mini	$0.15/M	$0.60/M	$0.013
Claude Sonnet 4.6	$3/M	$15/M	$0.24
Gemini 2.5 Pro	$1.25/M	$10/M	$0.15
DeepSeek V3	$0.27/M	$1.10/M	$0.024

Numbers are May 2026 list prices and will drift. The shape of the gap is what matters: DeepSeek and GPT-4o-mini are 10× cheaper than the flagship models for translation work.

For a small app catalog (500-1000 keys) the absolute cost is irrelevant either way — you’re spending pennies. For a large app catalog (10k+ keys) translated frequently as the app evolves, the ratio starts to matter. At 50k cells/month with GPT-4o-mini you’re looking at $4-5; with Claude you’re at $60.

I’ll come back to the question of “is the expensive model worth 10× more” in the quality section. Spoiler: usually no.

Quality, per language

Here’s where it gets interesting.

German

All four models produce shippable German. Differences are in register: Claude defaults to formal (Sie), GPT and Gemini split, DeepSeek defaults to informal (du). For an indie consumer app I want informal — Claude needed an explicit “use informal du” instruction.

Best: Gemini 2.5 Pro — most natural phrasing for short button labels. Avoid for German: none. All four are fine with the right prompt.

French

Similar to German. Tu vs vous — Claude and Gemini default to vous (formal); GPT and DeepSeek to tu. Same fix: explicit instruction.

Best: Claude Sonnet — handles the gender agreement on past participles more reliably (“Tu as supprimé” vs “Tu as supprimée” when subject is feminine). Avoid: DeepSeek occasionally truncated phrasing in a way that lost meaning. Recoverable with a review pass.

Spanish

All four solid. The interesting failure mode: GPT-4o-mini sometimes translated brand names despite the glossary. “Cube” became “Cubo” in one out of 200 strings. The flagship GPT-4o didn’t make that mistake. DeepSeek didn’t either.

Best: GPT-4o or Claude — both reliably preserve glossary terms. Avoid for production without a review: GPT-4o-mini.

Japanese

This is where the spread widens. Japanese has multiple speech levels (です/ます polite, plain だ form, humble, honorific), and the choice changes the feel of the app dramatically.

Best by a clear margin: Claude Sonnet 4.6. Most natural-feeling polite form, didn’t over-formalize. Gemini 2.5 Pro was a close second. GPT-4o was correct but stiffer. DeepSeek V3 was the weakest — recognizable as machine-translated to my Japanese reviewer.

Korean

Similar pattern to Japanese. Speech level (해요체 vs 합니다체 vs casual) is the lever.

Best: Claude, again. GPT-4o good. Gemini good with caveats — occasionally over-honorific. DeepSeek weakest, though usable.

Russian

Russian is plural-heavy (six forms in the strict sense, three commonly), and case-heavy (every noun changes by case). All four models got plurals right when the catalog used proper ICU plural forms — but only the catalog format guarantees that.

Best: GPT-4o — most idiomatic. Claude second. Gemini and DeepSeek both correct but a bit literal.

Portuguese (BR)

Brazilian Portuguese is its own thing — different from European Portuguese. The instruction “Brazilian Portuguese, not European” matters.

Best: tied between GPT-4o and Claude. Gemini sometimes slipped into European phrasing despite the instruction. DeepSeek good but with some Continental vocabulary.

Arabic

Right-to-left, plus diglossia (Modern Standard Arabic vs colloquial varieties). For an app, MSA is the right choice — and all four models default to MSA.

Best: Claude Sonnet — most natural MSA, didn’t over-formalize. GPT-4o good. Gemini good. DeepSeek weakest — occasionally produced unnatural phrasings according to my back-translation review.

Format preservation

This is the unsexy one that matters more than people think. When you have "%lld photos imported" and the model returns "%lid фото импортировано" (note the %lld → %lid), you’ve shipped a crash.

I ran a regex pass over all 1,600 outputs per run looking for: format specifiers (%@, %lld, %d, %lf), placeholder syntax ({name}, :variable), and markdown (**bold**, _italic_, backticks).

Results:

Model	Format errors / 1600
GPT-4o	2
Claude Sonnet 4.6	1
Gemini 2.5 Pro	4
DeepSeek V3	11
GPT-4o-mini	8

DeepSeek and GPT-4o-mini are the noticeable outliers. Both are usable, but both demand an automated post-translation validator that flags broken specifiers. Most translation tools (Cube included) run this check automatically.

Speed

Single-batch, 200 keys × 1 language:

Model	Wall-clock per language
GPT-4o-mini	~6s
DeepSeek V3	~10s
GPT-4o	~15s
Gemini 2.5 Pro	~18s
Claude Sonnet 4.6	~22s

For 30 languages running in parallel, this is moot — the slowest one bounds the total wall-clock. For sequential single-language updates while iterating, the 4× spread between GPT-4o-mini and Claude is real.

Reliability

Across the four full runs, here’s the failure tally — anything from rate-limit errors to malformed JSON output that required a retry.

Model	Retries / 8 languages
GPT-4o	0
Claude Sonnet 4.6	0
Gemini 2.5 Pro	1 (rate-limit)
DeepSeek V3	2 (malformed JSON, retried successfully)
GPT-4o-mini	1 (truncated output on a long language)

Real production usage at higher volume would surface different patterns. For 8-language one-shot translation: all four are fine.

When to pick which

The honest summary based on what I’ve observed:

Default for indie apps with mixed Asian + European languages: Claude Sonnet 4.6. Best quality on Japanese and Korean by a meaningful margin, very solid on everything else, near-zero format errors. Most expensive of the four — but the absolute cost is still pennies for typical catalogs. The quality difference shows up in the review pass: less to fix.

Default if you don’t ship to East Asia: GPT-4o. Slightly cheaper than Claude, comparable on European languages, less consistent on Japanese/Korean.

Pick Gemini 2.5 Pro when: you want long context (it handles large catalogs in a single batch better than the others) or you’re already in the Google Cloud ecosystem with credits to burn. Quality competitive with GPT-4o.

Pick DeepSeek V3 when: cost is the binding constraint, you have a strong human review process downstream, and you’re translating mostly European languages. Avoid for East Asian production work.

Pick GPT-4o-mini when: you’re doing iterative drafting and want fast cheap turnaround, and you’ll re-translate the final pass with a flagship model. Or for languages where “good enough” is the bar and a native review will fix the rest.

What I actually do

For Cube itself I use Claude Sonnet 4.6 for the initial translation pass on all 30 locales, then Claude Opus 4 for the review pass on the 5 languages that drive the most App Store impressions. The Opus pass catches the subtle register issues I’d miss otherwise.

The bill for translating Cube’s current catalog (~2,500 keys × 30 locales) is around $0.60 for the initial pass and another $0.30 for the review pass on five locales. Less than a coffee, and I sleep fine with the result.

That’s the bottom-line shift: model choice matters in nuance, but the cost spread doesn’t matter at indie-app scale. Pick the model that produces the best quality for your highest-value languages and stop optimizing for fractions of a cent.

If you’re using Cube you can pick the model per project and switch on the fly — different apps, different requirements, no lock-in. If you’re using something else, the framework above applies regardless of tool.

What I’d love to see next: a model trained specifically on app localization data, with native handling of .xcstrings plurals and device variants. The general-purpose models are already strong; a specialist would push the quality bar another step.