
Forget the cloud. The future of powerful AI is landing squarely on your desk, and with DeepSeek V4 Flash it runs blazingly fast on your Mac. Salvatore Sanfilippo (antirez), the architect behind Redis, has delivered ds4.c, a remarkably specialized inference engine designed exclusively for the DeepSeek V4 Flash model and, crucially, for Apple Silicon’s Metal GPU. This isn’t just another llama.cpp clone; it’s a laser-focused piece of engineering that democratizes on-device AI.
At its heart, ds4.c is a native C inference engine built around DeepSeek V4 Flash. It bypasses general-purpose frameworks, opting instead for a Metal graph executor. This means direct, low-level access to the GPU for model loading, prompt rendering, and managing the KV state – the memory that holds the LLM’s understanding of your conversation. The star of the show is DeepSeek V4 Flash itself, with its jaw-dropping 1 million token context window.
While running a full 1 million tokens requires Herculean RAM (around 81GB for 2-bit quantized weights plus 26GB for the indexer alone), ds4 makes this ambitious context feasible on high-end Macs. For many, a practical limit of 100-300k tokens on 128GB RAM machines is still monumental. What’s truly innovative is its “incredibly compressed” KV cache. This isn’t just about fitting more context; it enables on-disk persistence, meaning your LLM’s memory can survive across sessions, a game-changer for complex, multi-turn interactions.
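To see why the compressed, disk-backed KV cache matters, a rough back-of-envelope budget helps. This sketch uses only the figures quoted above (81GB of 2-bit weights, 26GB for the indexer); the OS reserve is an illustrative assumption, and the indexer figure is for the full 1M-token case, so treat the result as an upper bound on headroom.

```python
# Back-of-envelope RAM budget for a long-context ds4 session.
# Weight and indexer figures are from the article; the OS reserve
# is an illustrative assumption, not a measured value.

WEIGHTS_GB = 81.0   # 2-bit quantized DeepSeek V4 Flash weights
INDEXER_GB = 26.0   # indexer memory at the full 1M-token context

def kv_headroom_gb(total_ram_gb: float, os_reserve_gb: float = 8.0) -> float:
    """RAM left for the KV cache after weights, indexer, and OS."""
    return total_ram_gb - WEIGHTS_GB - INDEXER_GB - os_reserve_gb

# Even a 128 GB Mac has only ~13 GB of headroom for KV state --
# which is exactly why compressing the cache and spilling it to
# disk is the feature that makes long contexts practical here.
print(f"{kv_headroom_gb(128):.0f} GB headroom on a 128 GB machine")
```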
ds4-server

The ds4-server component is where the magic truly crystallizes for developers. This Metal-only server offers an OpenAI-compatible API, making integration a breeze. Imagine feeding an entire codebase or a massive documentation dump into an LLM that remembers every single word.
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
This command alone hints at the capabilities: setting a 100,000 token context, specifying a disk directory for the KV cache, and allocating 8GB of disk space. The server, while lacking batching (it serializes requests), is meticulously optimized for single-session, in-memory KV state. This is precisely what agentic workflows crave. Think of your AI assistant not just processing a single query but weaving together information from vast prior interactions, maintaining a rich, persistent understanding. Experimental speculative decoding (--mtp) points to further performance tuning for even faster, more responsive inference.
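Because the server speaks the OpenAI chat-completions protocol, any HTTP client will do. Here is a minimal stdlib sketch; the host, port, and model name are assumptions, so check what your ds4-server instance actually reports before using them.

```python
import json
import urllib.request

# Minimal client for an OpenAI-compatible endpoint such as ds4-server.
# BASE_URL and the model name are assumptions -- adjust to your setup.
BASE_URL = "http://127.0.0.1:8000/v1"

def build_chat_request(prompt: str, model: str = "deepseek-v4-flash") -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Since the server serializes requests rather than batching them, keep one session (and its KV state) per server instance.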
For those building AI agents, tools like opencode can seamlessly connect to ds4-server by simply pointing to its baseURL. This direct on-device execution means faster response times, enhanced privacy, and freedom from network latency – all while harnessing the “thinking mode” capabilities of DeepSeek V4 Flash.
ds4

Let’s be clear: ds4.c is not trying to be all things to all people. Its strength lies in its extreme specialization. It only supports DeepSeek V4 Flash. It only targets Apple Silicon and its Metal framework. This isn’t a limitation; it’s a design philosophy. By carving out such a specific niche, antirez has achieved a level of optimization that generic frameworks often struggle to match.
This is the future of local LLMs: bespoke engines tailored to specific hardware and models, unlocking capabilities previously confined to massive cloud clusters. While alternatives like llama.cpp offer broader model support and cross-platform compatibility, ds4 offers an unparalleled, deep dive into what’s possible with DeepSeek V4 Flash on Apple hardware. If you’re an AI developer craving privacy, speed, and the power of massive context windows on your personal machine, ds4 is not just an option – it’s becoming a necessity.