What Is Voxtral Mini 4B Realtime?
Mistral AI just dropped something that changes the game for local speech recognition: Voxtral Mini 4B Realtime, a 4-billion parameter speech-to-text model that delivers accuracy rivaling offline transcription systems — with under 500 milliseconds of latency. Released under the Apache 2.0 license in February 2026, it's one of the first open-source realtime ASR models that actually competes with commercial APIs.
But here's where it gets wild: within days of release, developers built implementations that run this model entirely in your browser via Rust and WebAssembly, and on bare CPU in pure C with zero dependencies. No cloud. No API keys. No data leaving your machine.
If you've been following the open-source AI revolution, Voxtral Mini 4B is exactly the kind of model that proves smaller, focused models can beat bloated general-purpose ones at specific tasks.
Voxtral Mini 4B: The Technical Specs
According to Mistral's official model card on Hugging Face, Voxtral Mini 4B Realtime consists of two architectural components:
| Component | Details |
|---|---|
| Language Model | ~3.4B parameters, 26 layers, 3072 dim, GQA 32Q/8KV |
| Audio Encoder | ~0.6B parameters, 32 layers, 1280 dim, causal attention, sliding window 750 |
| Total Parameters | ~4B (BF16 weights, ~9 GB full precision) |
| Quantized Size | ~2.5 GB (Q4 GGUF) |
| License | Apache 2.0 (commercial use allowed) |
| Languages | 13 languages, including English, German, French, Spanish, Chinese, Japanese, Korean, Arabic, and Hindi |
| Latency | Configurable 80ms to 2.4s delay (480ms recommended sweet spot) |
| Throughput | Exceeds 12.5 tokens/second on minimal hardware |
The architecture is natively streaming: the audio encoder was trained from scratch with causal attention, and both the encoder and LLM backbone use sliding window attention. This means theoretically unlimited audio length — Mistral's default vLLM configuration supports roughly 3 hours of continuous recording.
How the Model Works: From Audio to Text
The inference pipeline follows a clear path, as documented in the TrevorS Rust implementation:
- Audio Input — 16kHz mono audio is fed in
- Mel Spectrogram — Converted to a [B, 128, T] mel spectrogram
- Causal Encoder — 32 layers with 1280 dimensions and sliding window attention (window size 750)
- Conv 4x Downsample — Reshape to [B, T/16, 5120]
- Adapter — Projects to [B, T/16, 3072] to match decoder dimensions
- Autoregressive Decoder — 26 layers, 3072 dim, grouped-query attention (32 queries, 8 key-value heads)
- Token Output — Token IDs decoded to text
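To make the shape bookkeeping concrete, here is a small standalone sketch that walks a batch of mel frames through those stages. It assumes the 4x convolutional downsample is followed by a reshape that stacks four consecutive 1280-dim frames, which is one consistent reading of the published [B, T/16, 5120] shape (5120 = 4 × 1280); the frame count is a made-up example value.

```rust
// Shape walk-through for the encoder-to-decoder path described above.
// Assumption (not from the model card): the 4x conv downsample is followed by
// a reshape that stacks 4 consecutive 1280-dim frames, which is one consistent
// reading of the published [B, T/16, 5120] shape (5120 = 4 * 1280, 16 = 4 * 4).

const MEL_BINS: usize = 128; // mel spectrogram channels
const ENC_DIM: usize = 1280; // audio encoder hidden size
const DEC_DIM: usize = 3072; // decoder (LLM) hidden size

fn main() {
    let b = 1; // batch size
    let t = 1600; // number of mel frames in this example (hypothetical value)

    println!("mel spectrogram: [{b}, {MEL_BINS}, {t}]");

    // Causal encoder keeps the time axis and works in 1280 dims.
    println!("encoder output:  [{b}, {t}, {ENC_DIM}]");

    // 4x convolutional downsample in time.
    let t_conv = t / 4;
    println!("after conv 4x:   [{b}, {t_conv}, {ENC_DIM}]");

    // Reshape: stack 4 neighbouring frames into one vector (assumption above).
    let t_out = t_conv / 4; // = t / 16
    let stacked = ENC_DIM * 4; // = 5120
    println!("after reshape:   [{b}, {t_out}, {stacked}]");

    // Adapter projects into the decoder's embedding space.
    println!("after adapter:   [{b}, {t_out}, {DEC_DIM}]");
}
```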
What makes this remarkable is the causal attention in the encoder. Traditional speech models like Whisper use bidirectional attention — they need the full audio clip before transcribing. Voxtral's causal encoder processes audio left-to-right, enabling true realtime streaming.
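To see what "causal with a sliding window" means in practice, here is a toy predicate for which key positions a query may attend to, using the encoder's window size of 750. This is purely illustrative and not the actual Burn or vLLM attention kernels:

```rust
// Toy illustration of a causal sliding-window attention mask: query position i
// may attend to key position j only if j <= i (causal, no peeking at future
// audio) and i - j < WINDOW (sliding window, bounded lookback). A Whisper-style
// bidirectional encoder allows all (i, j) pairs, which is why it needs the
// complete clip before it can transcribe.

const WINDOW: usize = 750; // encoder sliding-window size from the model card

fn may_attend(i: usize, j: usize) -> bool {
    j <= i && i - j < WINDOW
}

fn main() {
    assert!(may_attend(1000, 1000)); // a position always sees itself
    assert!(may_attend(1000, 400));  // 600 steps back: inside the window
    assert!(!may_attend(1000, 100)); // 900 steps back: outside the window
    assert!(!may_attend(400, 1000)); // never attend to the future (causal)
    println!("causal sliding-window mask checks passed");
}
```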
Running Voxtral in Your Browser: The Rust/WASM Implementation
Developer TrevorS built voxtral-mini-realtime-rs, a pure Rust implementation using the Burn ML framework. It hit 169 points on Hacker News within a day of posting. The Q4 GGUF quantized version (2.5 GB) runs entirely client-side in a browser tab via WebAssembly and WebGPU.
You can try it live on HuggingFace Spaces — just open the page, let it download the model shards, and start talking into your microphone. All processing happens locally.
Five Hard Constraints Solved for Browser Deployment
Running a 4B parameter model in a browser tab is not trivial. TrevorS had to solve five specific engineering challenges:
| Constraint | Solution |
|---|---|
| 2 GB allocation limit | ShardedCursor reads across multiple Vec buffers |
| 4 GB address space | Two-phase loading: parse weights, drop reader, then finalize |
| 1.5 GiB embedding table | Q4 embeddings on GPU + CPU-side row lookups |
| No sync GPU readback | All tensor reads use async into_data_async().await |
| 256 workgroup invocation limit | Patched cubecl-wgpu to cap reduce kernel workgroups |
The GGUF file is split into 512 MB shards to stay under the browser's ArrayBuffer limit. Custom WGSL shaders handle fused dequantization and matrix multiplication on the GPU. This is serious systems engineering — not a toy demo.
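For intuition, here is a rough sketch of the sharded-reading idea: several separately allocated buffers, each safely under the allocation limit, presented to the GGUF parser as one contiguous byte stream. The names and details are hypothetical and not the project's actual ShardedCursor:

```rust
use std::io::{self, Read};

/// Sketch of the idea behind sharded reading: present several separately
/// allocated Vec<u8> buffers (each kept under the ~2 GB WASM allocation limit)
/// as a single contiguous byte stream for the GGUF parser. Hypothetical code,
/// not the actual ShardedCursor from voxtral-mini-realtime-rs.
struct ShardedReader {
    shards: Vec<Vec<u8>>,
    shard_idx: usize,
    offset: usize,
}

impl ShardedReader {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        Self { shards, shard_idx: 0, offset: 0 }
    }
}

impl Read for ShardedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        while self.shard_idx < self.shards.len() {
            let shard = &self.shards[self.shard_idx];
            if self.offset < shard.len() {
                // Copy as much as fits from the current shard.
                let n = buf.len().min(shard.len() - self.offset);
                buf[..n].copy_from_slice(&shard[self.offset..self.offset + n]);
                self.offset += n;
                return Ok(n);
            }
            // Current shard exhausted: move on to the next one.
            self.shard_idx += 1;
            self.offset = 0;
        }
        Ok(0) // all shards consumed
    }
}

fn main() -> io::Result<()> {
    // Two tiny "shards" standing in for 512 MB GGUF chunks.
    let mut reader = ShardedReader::new(vec![b"GGUF".to_vec(), b" shards".to_vec()]);
    let mut out = String::new();
    reader.read_to_string(&mut out)?;
    assert_eq!(out, "GGUF shards");
    Ok(())
}
```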
Two Inference Paths
| Feature | F32 (Native) | Q4 GGUF (Native + Browser) |
|---|---|---|
| Weights | SafeTensors (~9 GB) | GGUF Q4_0 (~2.5 GB) |
| Linear ops | Burn tensor matmul | Custom WGSL shader (fused dequant + matmul) |
| Embeddings | f32 tensor (1.5 GiB) | Q4 on GPU (216 MB) + CPU bytes for lookups |
| Browser support | No | Yes (WASM + WebGPU) |
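For readers unfamiliar with Q4_0: the standard GGUF/ggml layout packs 32 weights into an 18-byte block, a 16-bit float scale followed by 16 bytes of packed 4-bit values, each decoded as (nibble - 8) * scale. The browser path performs this fused with the matmul inside a custom WGSL shader; the plain CPU sketch below only shows the arithmetic:

```rust
// CPU sketch of dequantizing one GGUF Q4_0 block. A block covers 32 weights:
// a 16-bit float scale `d` followed by 16 bytes holding 32 packed 4-bit
// values; each weight is (nibble - 8) * d. Illustration only; the browser
// path fuses this with the matmul in a WGSL shader.

fn f16_to_f32(h: u16) -> f32 {
    // Minimal half-float decode covering normal numbers and zero, which is
    // enough for typical quantization scales (subnormals treated as zero,
    // inf/NaN not handled).
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    if exp == 0 {
        return f32::from_bits(sign << 31);
    }
    f32::from_bits((sign << 31) | ((exp + 112) << 23) | (frac << 13))
}

/// Dequantize one 18-byte Q4_0 block into 32 f32 weights.
fn dequant_q4_0(block: &[u8; 18]) -> [f32; 32] {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let qs = &block[2..18];
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        // Low nibble holds element j, high nibble holds element j + 16
        // (the layout used by ggml-style Q4_0).
        out[j] = ((qs[j] & 0x0f) as f32 - 8.0) * d;
        out[j + 16] = ((qs[j] >> 4) as f32 - 8.0) * d;
    }
    out
}

fn main() {
    // A block with scale 1.0 (0x3C00 as f16) and every nibble set to 8
    // dequantizes to 32 zeros.
    let mut block = [0x88u8; 18];
    block[0] = 0x00;
    block[1] = 0x3c;
    assert!(dequant_q4_0(&block).iter().all(|w| *w == 0.0));
    println!("Q4_0 block dequantized");
}
```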
Pure C on CPU: Antirez's voxtral.c
Salvatore Sanfilippo (antirez), the creator of Redis, built voxtral.c — a pure C implementation of the full Voxtral inference pipeline with zero external dependencies beyond the C standard library. It trended on Hacker News with 91 points.
Antirez's motivation was blunt: while Mistral released great open weights, limiting inference to vLLM without a reference implementation "limits the model's actual reach." So he built one from scratch.
Key Features of voxtral.c
- Zero dependencies — Pure C, no Python runtime, no CUDA toolkit, no vLLM
- Metal GPU acceleration — Fused GPU operations on Apple Silicon, with BLAS fallback for Linux
- Memory-mapped weights — BF16 weights mmap'd directly from safetensors, near-instant loading
- Live microphone input — `--from-mic` captures and transcribes in real time (macOS)
- Streaming output — Tokens printed to stdout as generated, word by word
- Chunked encoder — Processes audio in overlapping chunks, bounding memory regardless of input length
- Rolling KV cache — Automatically compacted at 8192 positions, enabling unlimited-length audio
- Stdin piping — Pipe any format via ffmpeg (see the sketch after this list): `ffmpeg -i podcast.mp3 -f s16le -ar 16000 -ac 1 - | ./voxtral -d voxtral-model --stdin`
- Streaming C API — `vox_stream_t` lets you feed audio incrementally and receive tokens
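The ffmpeg pipe above delivers raw signed 16-bit little-endian PCM at 16 kHz mono. As a rough illustration of what the `--stdin` path has to consume, here is a minimal sketch that converts such a byte stream into floating-point samples; voxtral.c implements the equivalent in C, and the code below is not taken from it:

```rust
use std::io::{self, Read};

// Reads raw s16le PCM (16 kHz, mono) from stdin and converts it to f32
// samples in [-1.0, 1.0], the form most speech frontends expect before
// computing a mel spectrogram. Illustrative only.
fn main() -> io::Result<()> {
    let mut bytes = Vec::new();
    io::stdin().read_to_end(&mut bytes)?;

    // Two bytes per sample, little-endian, one channel.
    let samples: Vec<f32> = bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect();

    let seconds = samples.len() as f32 / 16_000.0;
    println!("read {} samples (~{seconds:.1} s of audio)", samples.len());
    Ok(())
}
```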
The build is dead simple:
make mps # Apple Silicon (fastest)
# or: make blas # Intel Mac / Linux with OpenBLAS
./download_model.sh
./voxtral -d voxtral-model -i audio.wav
On a 60-second audio clip, batch mode takes roughly 2.9 seconds for the encoder on Apple Silicon. The processing interval flag (-I) lets you control the latency-efficiency tradeoff for streaming use cases.
Voxtral vs. Whisper: Why This Matters
OpenAI's Whisper has been the go-to open-source speech model for years. But Whisper is an offline model — it needs the complete audio before transcribing. Voxtral Mini 4B changes the equation:
| Feature | Whisper | Voxtral Mini 4B Realtime |
|---|---|---|
| Mode | Offline (needs full audio) | Realtime streaming |
| Encoder attention | Bidirectional | Causal (streaming-native) |
| Latency | Full clip processing time | Configurable 80ms–2.4s |
| License | MIT | Apache 2.0 |
| Browser inference | whisper.cpp via WASM (offline) | Full realtime via WASM + WebGPU |
| Languages | 99+ | 13 |
At 480ms delay, Voxtral matches the accuracy of leading offline open-source models on the FLEURS benchmark. On English specifically, the Word Error Rate (WER) is 4.90% at 480ms delay — competitive with Whisper's best offline results.
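If you are new to the metric: word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the length of the reference. A quick sketch with a hypothetical error tally:

```rust
// Word error rate: edit operations (substitutions + deletions + insertions)
// needed to turn the transcript into the reference, divided by the number of
// words in the reference. A WER of 4.90% therefore means roughly 5 word
// errors per 100 reference words.
fn wer(substitutions: usize, deletions: usize, insertions: usize, ref_words: usize) -> f64 {
    (substitutions + deletions + insertions) as f64 / ref_words as f64
}

fn main() {
    // Hypothetical tally over a 1,000-word reference transcript.
    let score = wer(30, 12, 7, 1000);
    println!("WER = {:.2}%", score * 100.0); // prints 4.90%
}
```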
Benchmark Results: FLEURS WER by Language
From Mistral's official benchmarks on the FLEURS dataset:
| Model | Delay | AVG WER | English WER |
|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 3.32% |
| Voxtral Mini 4B Realtime | 480ms | 8.72% | 4.90% |
| Voxtral Mini 4B Realtime | 160ms | 12.60% | 6.46% |
The tradeoff is clear: lower latency means higher error rates. At 480ms, you get near-offline quality with realtime capability. For most voice assistant and live subtitling applications, that's more than good enough.
Privacy and On-Device Deployment
This is where Voxtral Mini 4B gets really interesting for businesses and privacy-conscious users. Because the model runs entirely locally:
- No audio data leaves your device — critical for healthcare, legal, and financial applications
- No API costs — transcribe unlimited audio for free after the initial model download
- No internet required — works completely offline once loaded
- No vendor lock-in — Apache 2.0 license means full commercial freedom
At Serenities AI, we've been tracking the Mistral ecosystem closely. Voxtral Mini 4B represents exactly the kind of model that makes AI practical: small enough to deploy anywhere, accurate enough to trust, and open enough to build on.
How to Get Started
Option 1: Try It in Your Browser (Easiest)
- Visit the HuggingFace Spaces demo
- Wait for the ~2.5 GB model to download (one time)
- Click record and start talking
- Everything runs locally in your browser — check your network tab if you don't believe it
Option 2: Native Rust CLI
# Download model weights (~9 GB)
uv run --with huggingface_hub hf download mistralai/Voxtral-Mini-4B-Realtime-2602 --local-dir models/voxtral
# Transcribe an audio file
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
--audio audio.wav --model models/voxtral
Option 3: Pure C (Antirez's voxtral.c)
make mps # or: make blas
./download_model.sh
./voxtral -d voxtral-model -i audio.wav
# Live microphone
./voxtral -d voxtral-model --from-mic
If you're into AI-assisted development, our Claude Code tips and tricks guide covers how tools like Claude Code can help you integrate speech models into your own projects faster.
What This Means for the Future of Speech AI
Voxtral Mini 4B is a proof point for three converging trends:
- Models are shrinking to fit edge devices — 4B parameters running in a browser tab was unthinkable two years ago
- WebGPU is becoming a real ML deployment target — not just for demos, but for production-quality inference
- The "pure C" movement is alive — following llama.cpp and whisper.cpp, voxtral.c proves that stripping dependencies to zero makes AI accessible everywhere
We're entering an era where speech recognition is no longer a cloud service — it's a capability that ships with your application. The combination of Mistral's model quality, Rust's systems-level performance, and WebGPU's browser reach means private, realtime speech-to-text is now a solved problem for anyone willing to integrate it.
Frequently Asked Questions
What is Voxtral Mini 4B Realtime?
Voxtral Mini 4B Realtime is a 4-billion parameter speech-to-text model from Mistral AI. It's designed for realtime streaming transcription with configurable latency from 80ms to 2.4s. It supports 13 languages and is released under the Apache 2.0 license, making it free for commercial use.
Can Voxtral Mini 4B really run in a web browser?
Yes. The Q4 quantized version (2.5 GB) runs entirely client-side in a browser using WebAssembly and WebGPU. Developer TrevorS built a pure Rust implementation that solved five major technical constraints including WASM's 2 GB allocation limit and 4 GB address space. You can try it live at the HuggingFace Spaces demo.
How does Voxtral compare to Whisper for speech recognition?
Whisper is an offline model requiring the complete audio before transcription. Voxtral Mini 4B is a natively streaming model with causal attention, enabling realtime transcription. At a 480ms delay, Voxtral achieves 4.90% English WER on FLEURS — competitive with offline models. Whisper supports more languages (99+ vs. 13), but Voxtral is purpose-built for realtime use.
What hardware do I need to run Voxtral Mini 4B locally?
For the browser version, you need a WebGPU-capable browser (Chrome 113+ or Edge) and enough RAM to hold the 2.5 GB quantized model. For the pure C version (voxtral.c), Apple Silicon Macs get the best performance via Metal GPU acceleration. Linux systems can use OpenBLAS. The model downloads are 2.5 GB (quantized) or ~9 GB (full precision).
Is Voxtral Mini 4B free to use commercially?
Yes. Voxtral Mini 4B Realtime is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. Both the TrevorS Rust implementation and antirez's voxtral.c are also open source (Apache 2.0 and public domain/MIT respectively).