Your dev loop is dragging. Long API waits, flaky networks, code you’d rather not send to a third-party server. If speed, privacy, and offline reliability matter to your workflow, it’s worth a fresh look: the local-model landscape has changed significantly in the past twelve months, and the options available today are genuinely impressive.
This isn’t a list of models that “might work.” These are models with real benchmark numbers, known hardware requirements, and active community support. Here’s what each one actually buys you.
gpt-oss-120b: OpenAI’s First Real Open-Weight Release
OpenAI released gpt-oss-120b and gpt-oss-20b in August 2025 under the Apache 2.0 license, the first capable open-weight models OpenAI has shipped since GPT-2 in 2019. The 120B model uses a Mixture-of-Experts (MoE) architecture that activates only 5.1B of its 117B total parameters per token, which is why it fits on a single 80GB H100 GPU. Benchmarks show it matching or exceeding o4-mini on competition coding, general problem solving, and tool use.
For local use, realistically you need an H100 or equivalent. It runs via Ollama (ollama pull gpt-oss:120b), vLLM, and llama.cpp. If your hardware can handle it, this is currently one of the strongest open-weight reasoning models available for coding tasks.
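A quick sanity check on why a 117B-parameter model fits in 80GB: the weight footprint scales with total parameters and bits per weight, not with the 5.1B active per token (sparsity cuts compute, not storage). A back-of-the-envelope sketch, assuming roughly 4.25 bits per weight as the effective rate for a 4-bit format with scaling metadata (the exact figure varies by format):

```python
def quantized_weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-file size in GB: total parameters times bits per weight.

    Ignores KV cache and runtime buffers, which add several more GB on top.
    """
    bytes_total = total_params_b * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# gpt-oss-120b: 117B total parameters at an assumed ~4.25 bits/weight
print(round(quantized_weight_gb(117, 4.25), 1))  # ~62 GB, under an H100's 80GB
```

The same arithmetic explains why the dense-equivalent FP16 footprint (~234GB) is out of reach for any single GPU.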
Best for: Production-grade local inference, agentic workflows, complex multi-step coding tasks where reasoning quality matters.
gpt-oss-20b: The Consumer-Friendly Sibling
gpt-oss-20b activates 3.6B parameters per token and runs on systems with as little as 16GB of memory, which puts it in reach of most modern developer machines. On benchmarks, it matches or exceeds o3-mini on MMLU, GPQA, and AIME 2025, a remarkable result for a model this size.
Pull it with ollama pull gpt-oss:20b and you have an OpenAI-quality reasoning model running fully offline. The 20B version is the practical choice for anyone who wants the gpt-oss architecture without needing data center hardware.
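Once pulled, Ollama serves the model locally on port 11434, including an OpenAI-compatible chat endpoint. A minimal request sketch; the URL and port are Ollama's defaults, and the actual network call is left commented so the snippet stands alone without a running daemon:

```python
import json

def chat_request(model: str, prompt: str,
                 host: str = "http://localhost:11434") -> tuple[str, str]:
    """Build the URL and JSON body for Ollama's OpenAI-compatible chat endpoint."""
    url = f"{host}/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return url, json.dumps(body)

url, body = chat_request("gpt-oss:20b", "Explain this stack trace: ...")
print(url)  # http://localhost:11434/v1/chat/completions

# To actually send it (requires Ollama running locally):
#   import urllib.request
#   req = urllib.request.Request(url, body.encode(),
#                                {"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

Because the endpoint mimics the OpenAI API shape, existing tooling that takes a base URL can usually be pointed at it unchanged.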
Best for: Developers on 16–24GB systems who want strong reasoning without cloud dependencies.
Kimi-Dev-72B: The Bug-Fixing Specialist
Released by Moonshot AI in June 2025, Kimi-Dev-72B was built specifically for software engineering tasks rather than general assistance. It achieves 60.4% on SWE-bench Verified, a new state of the art among open-source models at the time of its release. The training methodology is what sets it apart: the model was trained on roughly 150 billion tokens of GitHub issues and pull-request commits, and during RL training it earned rewards only when entire test suites passed in isolated Docker environments. In other words, it learned to produce working code, not just plausible-looking code.
The hardware requirement is significant. At Q4_K_M quantization, Kimi-Dev-72B weighs around 47GB, so you’re looking at multi-GPU setups or very high-end single GPUs for comfortable inference. The Q2_K version brings this down to about 30GB at the cost of some quality.
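You can sanity-check whether a given quantization fits your memory by working backwards from file size to effective bits per weight. Using the sizes quoted above for Kimi-Dev-72B (the K-quant formats store some tensors at higher precision, so the effective rate sits above the nominal bit count):

```python
def effective_bits_per_weight(file_gb: float, params_b: float) -> float:
    """Infer the average bits stored per weight from a quantized file size."""
    return file_gb * 1e9 * 8 / (params_b * 1e9)

# Sizes quoted for Kimi-Dev-72B at two quantization levels
print(round(effective_bits_per_weight(47, 72), 2))  # ~5.22 bits at Q4_K_M
print(round(effective_bits_per_weight(30, 72), 2))  # ~3.33 bits at Q2_K
```

Run the inverse calculation against your available VRAM (minus a few GB for KV cache) to see which quantization level is realistic before downloading 47GB of weights.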
Best for: Developers who spend most of their time debugging existing codebases or writing test suites rather than greenfield development.
GLM-4.6: The Long-Context Powerhouse
GLM-4.6, released by Zhipu AI in September 2025, has 355 billion total parameters with 32 billion active, and a 200K-token context window. That context window is the headline feature: for developers working across large monorepos or multi-file refactors, 200K tokens means you can feed the model considerably more of your actual codebase than most alternatives allow.
On LiveCodeBench v6 it ranks first among open models at 82.8%, and on AIME 2025 it reaches 93.9% accuracy. The tradeoff is hardware: at this parameter count you need a multi-GPU setup for local inference, which puts it out of reach for most individual developers without a workstation. It’s better suited to teams running dedicated inference servers.
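To make the 200K figure concrete, here is a rough lines-of-code budget. The tokens-per-line figure is an assumption (around 12 tokens per line is a common ballpark for source code; it varies by language and style), and some of the window should be reserved for the model's reply:

```python
def approx_loc_budget(context_tokens: int, tokens_per_line: float = 12.0,
                      reply_reserve: int = 8_000) -> int:
    """Very rough lines of code that fit in a context window.

    tokens_per_line is an assumed ballpark; reply_reserve holds back
    tokens for the model's own output.
    """
    return int((context_tokens - reply_reserve) / tokens_per_line)

print(approx_loc_budget(200_000))  # ~16,000 lines of code
```

On that rough estimate, a 200K window holds a mid-sized service's entire source tree, which is what makes whole-repo refactoring prompts feasible.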
Best for: Large codebase work, combined coding and reasoning tasks, teams with the infrastructure to run it properly.
Qwen3-Coder-30B: The Best Value for Mid-Range Hardware
Qwen3-Coder-30B is a Mixture-of-Experts model with 30.5 billion total parameters but only 3.3 billion active per token, which dramatically reduces the hardware burden compared to a dense 30B model. An RTX 3090 or 4090 handles it well, and Apple Silicon Macs with 24GB+ of unified memory are competitive options too.
The model supports a native 256K context window, which is exceptional at this hardware tier. Community benchmarks show around 22 tokens per second generation speed running Q8 quantization on a dual-channel DDR5-6000 system with no discrete GPU: a CPU-only result that would have been unthinkable for a 30B-class model a year ago. If you want strong coding performance without spending on high-end GPU hardware, this is currently the strongest option in this range.
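That CPU-only number is plausible from first principles: token generation is memory-bandwidth bound, and an MoE model only streams its active weights per token, not all 30.5B parameters. A rough ceiling estimate, assuming dual-channel DDR5-6000 at ~96 GB/s theoretical bandwidth and ~1 byte per weight at Q8 (both figures are assumptions for illustration):

```python
def decode_tokens_per_sec(active_params_b: float, bytes_per_weight: float,
                          bandwidth_gbs: float) -> float:
    """Upper bound on decode speed if each token streams the active weights once."""
    gb_per_token = active_params_b * bytes_per_weight
    return bandwidth_gbs / gb_per_token

# Qwen3-Coder-30B: 3.3B active params, Q8 (~1 byte/weight),
# dual-channel DDR5-6000 at an assumed ~96 GB/s
print(round(decode_tokens_per_sec(3.3, 1.0, 96), 1))  # ~29.1 tok/s ceiling
```

The observed 22 tokens per second is about 75% of that theoretical ceiling, which is consistent with real-world memory efficiency; a dense 30B model would stream roughly nine times the data per token and land in the low single digits.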
Best for: Developers on RTX 3090/4090 or Apple Silicon who need a capable everyday coding assistant with real context depth.
DeepSeek Coder V2 Lite: Efficient Inference, Solid Results
DeepSeek Coder V2 Lite uses the same MoE approach, activating a subset of parameters per token to keep inference fast on consumer hardware. It’s a well-tested model with broad community support across Ollama, LM Studio, and llama.cpp, and the DeepSeek ecosystem has shown strong multilingual coding support across Python, Go, Rust, TypeScript, and others.
It doesn’t match the benchmark numbers of the newer models above, but it’s arguably the most battle-tested option on this list and the easiest to get running correctly on the first try. For developers who want something reliable out of the box rather than the latest SOTA, it remains a solid pick.
Best for: Balanced coding and logic tasks on mid-range hardware, developers who prioritize stability over maximum benchmark performance.
aiXcoder-7B: Lightweight Autocomplete
aiXcoder-7B was trained specifically for code completion and structure understanding, and it shows. At 7B parameters it fits comfortably on most modern laptops and runs at usable speeds even without a discrete GPU. It isn’t built for complex multi-file tasks or deep reasoning, but for fast autocomplete during active development it performs well above its weight class.
Best for: Constrained hardware, developers who primarily want fast inline suggestions rather than conversational coding assistance.
Choosing Based on Your Hardware

The honest answer is that hardware is still the primary constraint. If you have an H100 or equivalent, gpt-oss-120b is the clear choice. If you’re on 16–24GB of consumer VRAM, gpt-oss-20b or Qwen3-Coder-30B give you the best capability-to-resource ratio right now. If you’re specifically debugging existing code, Kimi-Dev-72B’s SWE-bench numbers are hard to argue with if you have the memory for it.
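The decision tree above can be sketched as a small helper. This is purely illustrative: the thresholds mirror the figures quoted in this article (80GB for gpt-oss-120b, ~47GB for Kimi-Dev-72B at Q4_K_M, 16GB minimum for gpt-oss-20b), not a universal rule:

```python
def pick_model(memory_gb: int, debugging_focus: bool = False) -> str:
    """Map available GPU/unified memory to the recommendations above."""
    if memory_gb >= 80:
        return "gpt-oss-120b"
    if debugging_focus and memory_gb >= 48:
        return "Kimi-Dev-72B"  # Q4_K_M weighs ~47GB
    if memory_gb >= 16:
        return "gpt-oss-20b or Qwen3-Coder-30B"
    return "aiXcoder-7B"

print(pick_model(24))  # gpt-oss-20b or Qwen3-Coder-30B
```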
Unlike API-hosted models, local inference means no rate limits and full control over latency, cost, and which code stays on your machine. For teams handling proprietary codebases, that last point tends to make the hardware investment straightforward to justify.
References
gpt-oss (120B and 20B)
- OpenAI: openai.com/index/introducing-gpt-oss
- Hugging Face: huggingface.co/openai/gpt-oss-120b
- Official GitHub: github.com/openai/gpt-oss
- Paper (arXiv): arxiv.org/abs/2508.10925
Kimi-Dev-72B
- Hugging Face: huggingface.co/moonshotai/Kimi-Dev-72B
GLM-4.6
- Hugging Face (THUDM): huggingface.co/THUDM/GLM-4-6
Qwen3-Coder
- Hugging Face: huggingface.co/Qwen/Qwen3-Coder
DeepSeek Coder V2 Lite
- Hugging Face: huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct