Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese
Building a phone agent on a multimodal LLM: dropping faster-whisper and letting Gemma 4 hear the caller directly. A response-time and reply-accuracy benchmark across English, French, and Mandarin.
Jiyao Weng · 9 min read · 19 hours ago

[Figure: A telephony system using Gemma 4:12B]

My voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once, so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Medium.