antirez ships DS4: a native DeepSeek V4 Flash inference engine for Apple Silicon
TL;DR
Salvatore Sanfilippo released DS4, a C and Metal inference engine that runs DeepSeek V4 Flash locally on Apple Silicon Macs with up to 1M token context.
What changed
Salvatore Sanfilippo (creator of Redis) published DS4, a small native inference engine written in C and Metal that runs DeepSeek V4 Flash on Apple Silicon. It is deliberately narrow: not a generic GGUF runner, not a wrapper, just one model on one hardware class. The 2-bit quantization fits on a 128GB MacBook, generating 26 t/s on an M3 Max and 36 t/s on an M3 Ultra, and a disk-based KV cache lets 1M-token sessions persist across restarts. OpenAI- and Anthropic-compatible HTTP APIs let Claude Code, opencode, and Pi plug in directly.
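For a sense of what "compatible APIs" means in practice, here is a sketch of the two wire formats an agent harness would send. The port, endpoint paths, and model identifier are assumptions for illustration, not DS4's documented interface; the request shapes themselves follow the public OpenAI and Anthropic API conventions.

```python
import json

# Hypothetical local DS4 server; the port is an assumption.
BASE_URL = "http://localhost:8080"
OPENAI_PATH = "/v1/chat/completions"  # OpenAI-style chat endpoint
ANTHROPIC_PATH = "/v1/messages"       # Anthropic-style messages endpoint

# OpenAI chat-completions body: a role-tagged message list.
openai_body = {
    "model": "deepseek-v4-flash",     # assumed model identifier
    "messages": [{"role": "user", "content": "Refactor this function."}],
}

# Anthropic messages body: same message list, but max_tokens is required
# and the system prompt is a top-level field, not a message role.
anthropic_body = {
    "model": "deepseek-v4-flash",
    "max_tokens": 1024,
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Refactor this function."}],
}

print(json.dumps(openai_body, indent=2))
print(json.dumps(anthropic_body, indent=2))
```

Because both formats are plain JSON over HTTP, an agent that already speaks either one only needs its base URL repointed at the local server.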
Why it matters
For developers who want a frontier-grade open model on a local machine, the gap has been runtime quality: generic runners give up performance and feature parity to stay model-agnostic. DS4 trades generality for a tuned path on the exact hardware most developers already own, and exposes it through familiar agent APIs. It is also a vote of confidence that single-model engines are worth building.
What to watch for
Two things to track. First, whether antirez maintains DS4 the way he maintained Redis, or treats it as an experiment (the AGENT.md notes the project leaned heavily on GPT-5.5 for implementation). Second, the persistent KV cache: if it holds up across long agent sessions, it changes what a local coding agent can carry between days of work.
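To make the persistence idea concrete, here is a toy illustration of a disk-backed KV cache, not DS4's actual format or code: entries appended during a session survive a process restart because they live in a session file on disk (the filename, JSON-lines layout, and class are all invented for this sketch).

```python
import json
import os
import tempfile

class DiskKVCache:
    """Toy disk-backed KV cache: persist per-token entries across restarts."""

    def __init__(self, path):
        self.path = path
        self.entries = []
        # On startup, reload any entries a previous process wrote.
        if os.path.exists(path):
            with open(path) as f:
                self.entries = [json.loads(line) for line in f]

    def append(self, token_id, kv):
        # Append in memory and to disk so state survives a crash or restart.
        entry = {"token": token_id, "kv": kv}
        self.entries.append(entry)
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

cache_file = os.path.join(tempfile.mkdtemp(), "session.kv")
c1 = DiskKVCache(cache_file)
c1.append(42, [0.1, 0.2])

# Simulate a restart: a fresh instance reloads the session from disk.
c2 = DiskKVCache(cache_file)
print(len(c2.entries))  # 1
```

A real engine would store binary attention tensors rather than JSON, but the property that matters for long agent sessions is the same: the cache outlives the process.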
Who this matters for
- For developers running local models on Apple Silicon: 26-36 t/s on M3 hardware, with a disk-persisted KV cache, makes DeepSeek V4 Flash usable as a daily-driver coding model without cloud round-trips for the first time.
- For agent builders: OpenAI- and Anthropic-compatible HTTP APIs mean Claude Code, opencode, Pi, and any other agent harness that speaks those wire formats plug in unchanged. No proxy layer needed.
- For anyone watching the open-source AI tooling space: a Redis-creator-grade contributor working on a single-model engine is a credibility signal that single-purpose tuned runtimes are about to matter.
Harsh’s take
Watching antirez ship a hand-written Metal inference engine for one specific model on one specific hardware class feels like a quiet thesis. The generic GGUF runner era is hitting performance and feature parity ceilings; tuned single-purpose engines that exploit hardware specifics may be the next two years of serious local-AI work.
The fact that the README credits GPT-5.5 for most of the implementation is the part nobody is going to want to dwell on, but it is what 2026 looks like. A Redis-creator-grade contributor publishing a model runtime that admits LLM collaboration as a first-class engineering input changes the credibility math for everyone shipping AI-built infrastructure.
by Harsh Desai
More AI news
- Feature: Higgsfield Launches Supercomputer for Creative Pipelines
Higgsfield released Supercomputer, which runs entire creative pipelines from one chat agent.
- Pricing: Claude subscriptions get separate budgets for programmatic use, billed at full API prices
Starting June 15, Anthropic splits programmatic Claude usage from subscription quotas into separate $20-$200 monthly credits by plan. SDK and third-party requests bill at full API rates.
- Feature: Article Details Agent Harness Components: Filesystems, Sandboxes, Memory
Agent harnesses turn AI models into autonomous work engines. The article covers the core components, including filesystems, sandboxes, and memory.