
How to run a local coding agent with Gemma 4 and Pi


Set up Gemma 4 in LM Studio and connect it to Pi as the terminal agent

Original article by Patloeber

2026 Apr 27 · 7 minutes · #llm

I've been playing around with running coding agents fully locally. The setup I landed on is: LM Studio + Pi agent + Gemma 4 26B A4B (Q4_K_M). Gemma 4 runs in LM Studio, connected to Pi as the terminal agent. It works surprisingly well, and this post walks through how to set it up.

Here's what we'll cover:

1. Install LM Studio
2. Download Gemma 4
3. Start a local server
4. Configure context size
5. Install Pi
6. Connect Pi to your local model
7. Add skills and extensions

## 1) Install LM Studio

You need something to serve the model locally. I'm using LM Studio here: a desktop app that handles model downloads and quantization, and exposes a local OpenAI-compatible API server. Download it from lmstudio.ai (macOS, Windows, Linux).

Ollama and llama-server (part of llama.cpp) work just as well if you prefer a CLI-first workflow. All three expose an OpenAI-compatible endpoint, so Pi doesn't care which one you use. The rest of this guide uses LM Studio, but the Pi configuration works with any of them; just swap out the server configuration.

## 2) Download Gemma 4

Gemma 4 is Google's latest open-weight model family, released under the Apache 2.0 license. Compared to earlier Gemma versions, it's a real step change for coding and agentic use cases: it now has native function calling, system prompt support, and thinking modes, which makes it a genuinely good model for local coding agents.

The family includes four sizes:

| Model | Architecture | Context length |
|---|---|---|
| Gemma 4 E2B | Dense | 128K tokens |
| Gemma 4 E4B | Dense | 128K tokens |
| Gemma 4 26B A4B | Mixture of Experts (MoE) | 256K tokens |
| Gemma 4 31B | Dense | 256K tokens |

My recommendation: go with the 26B A4B. It's a Mixture-of-Experts model, which means it has 26B total parameters but only activates 4B per token. In practice, you get the quality of a much larger model with inference speeds closer to a small one. It handles text, image understanding, function calling, and thinking modes, which is exactly what you want for a coding agent.

That said, the E4B is surprisingly capable for its size. If you're short on VRAM, it's worth trying, but it does need more guidance and more specific prompts to get good results.

To download it, open LM Studio, search for gemma-4-26b-a4b, and download a quantized GGUF version (e.g., Q4_K_M). Choose the quantization based on your available VRAM:

| Quantization | Download size | Quality |
|---|---|---|
| Q4_K_M | 18 GB | Good balance |
| Q6_K | 24 GB | Higher quality |
| Q8_0 | 28 GB | Near-original |

Note: even though the model only activates 4B parameters per token, all 26B parameters must be loaded into memory for fast routing. That's why VRAM requirements are closer to a dense 26B model.

If you're on a Mac, you can also check out the MLX versions of Gemma 4. MLX is natively optimized for Apple Silicon and can be faster than the GGUF format on M-series chips.

## 3) Start the server in LM Studio

Once the model is downloaded:

1. Go to the Developer tab in LM Studio
2. Select your downloaded Gemma 4 model
3. Click Start Server

The server runs at http://localhost:1234 by default and exposes an OpenAI-compatible API. You can verify it's running:

```bash
curl http://localhost:1234/v1/models
```

## 4) Configure context size and GPU offload

Before you start working, check the context size and GPU offload settings under Model Settings in the Developer tab. Context size directly impacts VRAM usage. The model supports up to 256K tokens, but you probably don't need all of that for…
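With the server from step 3 running, a full chat completion confirms the model itself responds, not just the endpoint. Here's a minimal sketch against the standard OpenAI-compatible chat route; the model ID gemma-4-26b-a4b is an assumption, so check the /v1/models output above for the exact ID LM Studio reports on your machine.

```bash
# Ask the local server for a completion via the OpenAI-compatible
# chat endpoint (LM Studio's default port). The model ID below is an
# assumption; use whatever /v1/models returned.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write a one-line bash command that counts files in the current directory."}
    ],
    "temperature": 0.2
  }'
```

A JSON response with a choices array means the serving side is done; any OpenAI-compatible client, Pi included, can now talk to it.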
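Native function calling is what makes Gemma 4 viable as an agent, so it's worth probing before handing the model to Pi. This sketch uses the standard OpenAI tools parameter; the read_file tool is a made-up example for this test, not something Pi defines.

```bash
# Probe function calling: offer the model one hypothetical tool and
# check whether the response contains a tool_calls entry.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [
      {"role": "user", "content": "What is in src/main.py?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from the local project",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }]
  }'
```

If the response carries a tool_calls entry requesting read_file with {"path": "src/main.py"}, function calling works; if the model instead invents file contents as plain text, the quantization or chat template may be mishandling tools.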
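And since step 1 notes that Ollama or llama-server can stand in for LM Studio, here's roughly what that swap looks like with llama.cpp's built-in server. The GGUF filename is an assumption based on the quantization table above; the flags themselves (-m, -c, --port) are standard llama-server options.

```bash
# Serve the same quantized GGUF with llama.cpp's llama-server instead of
# LM Studio. It exposes the same OpenAI-compatible /v1 routes, so the
# curl checks above work unchanged.
# -c caps the context window in tokens; --port matches LM Studio's default.
llama-server -m ./gemma-4-26b-a4b-Q4_K_M.gguf -c 32768 --port 1234
```

Pointing Pi, or the curl commands above, at http://localhost:1234 then works exactly the same way.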

This excerpt is published under fair use for community discussion. Read the full article at Patloeber.

