What Is Voxtral Mini 4B Realtime?
Mistral AI just dropped something that changes the game for local speech recognition: Voxtral Mini 4B Realtime, a 4-billion parameter speech-to-text model that delivers accuracy rivaling offline transcription systems — with under 500 milliseconds of latency. Released under the Apache 2.0 license in February 2026, it's one of the first open-source realtime ASR models that actually competes with commercial APIs.
But here's where it gets wild: within days of release, developers built implementations that run this model entirely in your browser via Rust and WebAssembly, and on bare CPU in pure C with zero dependencies. No cloud. No API keys. No data leaving your machine.
If you've been following the open-source AI revolution, Voxtral Mini 4B is exactly the kind of model that proves smaller, focused models can beat bloated general-purpose ones at specific tasks.
Voxtral Mini 4B: The Technical Specs
According to Mistral's official model card on Hugging Face, Voxtral Mini 4B Realtime consists of two architectural components:
| Component | Details |
|---|---|
| Language Model | ~3.4B parameters, 26 layers, 3072 dim, GQA 32Q/8KV |
| Audio Encoder | ~0.6B parameters, 32 layers, 1280 dim, causal attention, sliding window 750 |
| Total Parameters | ~4B (BF16 weights, ~9 GB full precision) |
| Quantized Size | ~2.5 GB (Q4 GGUF) |
| License | Apache 2.0 (commercial use allowed) |
| Languages | 13 languages, including English, German, French, Spanish, Chinese, Japanese, Korean, Arabic, and Hindi |
| Latency | Configurable 80ms to 2.4s delay (480ms recommended sweet spot) |
| Throughput | Exceeds 12.5 tokens/second on minimal hardware |
The architecture is natively streaming: the audio encoder was trained from scratch with causal attention, and both the encoder and LLM backbone use sliding window attention. This means theoretically unlimited audio length — Mistral's default vLLM configuration supports roughly 3 hours of continuous recording.
How the Model Works: From Audio to Text
The inference pipeline follows a clear path, as documented in the TrevorS Rust implementation:
- Audio Input — 16kHz mono audio is fed in
- Mel Spectrogram — Converted to a [B, 128, T] mel spectrogram
- Causal Encoder — 32 layers with 1280 dimensions and sliding window attention (window size 750)
- Conv 4x Downsample — Reshape to [B, T/16, 5120]
- Adapter — Projects to [B, T/16, 3072] to match decoder dimensions
- Autoregressive Decoder — 26 layers, 3072 dim, grouped-query attention (32 queries, 8 key-value heads)
- Token Output — Token IDs decoded to text
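To make the shape bookkeeping concrete, here is a small standalone sketch that walks a batch of mel frames through those stages. It assumes the 4x convolutional downsample is followed by a reshape that stacks four consecutive 1280-dim frames, which is one consistent reading of the published [B, T/16, 5120] shape (5120 = 4 × 1280); the frame count is a made-up example value.

```rust
// Shape walk-through for the encoder-to-decoder path described above.
// Assumption (not from the model card): the 4x conv downsample is followed by
// a reshape that stacks 4 consecutive 1280-dim frames, which is one consistent
// reading of the published [B, T/16, 5120] shape (5120 = 4 * 1280, 16 = 4 * 4).

const MEL_BINS: usize = 128; // mel spectrogram channels
const ENC_DIM: usize = 1280; // audio encoder hidden size
const DEC_DIM: usize = 3072; // decoder (LLM) hidden size

fn main() {
    let b = 1; // batch size
    let t = 1600; // number of mel frames in this example (hypothetical value)

    println!("mel spectrogram: [{b}, {MEL_BINS}, {t}]");

    // Causal encoder keeps the time axis and works in 1280 dims.
    println!("encoder output:  [{b}, {t}, {ENC_DIM}]");

    // 4x convolutional downsample in time.
    let t_conv = t / 4;
    println!("after conv 4x:   [{b}, {t_conv}, {ENC_DIM}]");

    // Reshape: stack 4 neighbouring frames into one vector (assumption above).
    let t_out = t_conv / 4; // = t / 16
    let stacked = ENC_DIM * 4; // = 5120
    println!("after reshape:   [{b}, {t_out}, {stacked}]");

    // Adapter projects into the decoder's embedding space.
    println!("after adapter:   [{b}, {t_out}, {DEC_DIM}]");
}
```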
What makes this remarkable is the causal attention in the encoder. Traditional speech models like Whisper use bidirectional attention — they need the full audio clip before transcribing. Voxtral's causal encoder processes audio left-to-right, enabling true realtime streaming.
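To see what "causal with a sliding window" means in practice, here is a toy predicate for which key positions a query may attend to, using the encoder's window size of 750. This is purely illustrative and not the actual Burn or vLLM attention kernels:

```rust
// Toy illustration of a causal sliding-window attention mask: query position i
// may attend to key position j only if j <= i (causal, no peeking at future
// audio) and i - j < WINDOW (sliding window, bounded lookback). A Whisper-style
// bidirectional encoder allows all (i, j) pairs, which is why it needs the
// complete clip before it can transcribe.

const WINDOW: usize = 750; // encoder sliding-window size from the model card

fn may_attend(i: usize, j: usize) -> bool {
    j <= i && i - j < WINDOW
}

fn main() {
    assert!(may_attend(1000, 1000)); // a position always sees itself
    assert!(may_attend(1000, 400));  // 600 steps back: inside the window
    assert!(!may_attend(1000, 100)); // 900 steps back: outside the window
    assert!(!may_attend(400, 1000)); // never attend to the future (causal)
    println!("causal sliding-window mask checks passed");
}
```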
Running Voxtral in Your Browser: The Rust/WASM Implementation
Developer TrevorS built voxtral-mini-realtime-rs, a pure Rust implementation using the Burn ML framework. It hit 169 points on Hacker News within a day of posting. The Q4 GGUF quantized version (2.5 GB) runs entirely client-side in a browser tab via WebAssembly and WebGPU.
You can try it live on HuggingFace Spaces — just open the page, let it download the model shards, and start talking into your microphone. All processing happens locally.
Five Hard Constraints Solved for Browser Deployment
Running a 4B parameter model in a browser tab is not trivial. TrevorS had to solve five specific engineering challenges:
| Constraint | Solution |
|---|---|
| 2 GB allocation limit | ShardedCursor reads across multiple Vec buffers |
| 4 GB address space | Two-phase loading: parse weights, drop reader, then finalize |
| 1.5 GiB embedding table | Q4 embeddings on GPU + CPU-side row lookups |
| No sync GPU readback | All tensor reads use async into_data_async().await |
| 256 workgroup invocation limit | Patched cubecl-wgpu to cap reduce kernel workgroups |
The GGUF file is split into 512 MB shards to stay under the browser's ArrayBuffer limit. Custom WGSL shaders handle fused dequantization and matrix multiplication on the GPU. This is serious systems engineering — not a toy demo.
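For intuition, here is a rough sketch of the sharded-reading idea: several separately allocated buffers, each safely under the allocation limit, presented to the GGUF parser as one contiguous byte stream. The names and details are hypothetical and not the project's actual ShardedCursor:

```rust
use std::io::{self, Read};

/// Sketch of the idea behind sharded reading: present several separately
/// allocated Vec<u8> buffers (each kept under the ~2 GB WASM allocation limit)
/// as a single contiguous byte stream for the GGUF parser. Hypothetical code,
/// not the actual ShardedCursor from voxtral-mini-realtime-rs.
struct ShardedReader {
    shards: Vec<Vec<u8>>,
    shard_idx: usize,
    offset: usize,
}

impl ShardedReader {
    fn new(shards: Vec<Vec<u8>>) -> Self {
        Self { shards, shard_idx: 0, offset: 0 }
    }
}

impl Read for ShardedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        while self.shard_idx < self.shards.len() {
            let shard = &self.shards[self.shard_idx];
            if self.offset < shard.len() {
                // Copy as much as fits from the current shard.
                let n = buf.len().min(shard.len() - self.offset);
                buf[..n].copy_from_slice(&shard[self.offset..self.offset + n]);
                self.offset += n;
                return Ok(n);
            }
            // Current shard exhausted: move on to the next one.
            self.shard_idx += 1;
            self.offset = 0;
        }
        Ok(0) // all shards consumed
    }
}

fn main() -> io::Result<()> {
    // Two tiny "shards" standing in for 512 MB GGUF chunks.
    let mut reader = ShardedReader::new(vec![b"GGUF".to_vec(), b" shards".to_vec()]);
    let mut out = String::new();
    reader.read_to_string(&mut out)?;
    assert_eq!(out, "GGUF shards");
    Ok(())
}
```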
Two Inference Paths
| Feature | F32 (Native) | Q4 GGUF (Native + Browser) |
|---|---|---|
| Weights | SafeTensors (~9 GB) | GGUF Q4_0 (~2.5 GB) |
| Linear ops | Burn tensor matmul | Custom WGSL shader (fused dequant + matmul) |
| Embeddings | f32 tensor (1.5 GiB) | Q4 on GPU (216 MB) + CPU bytes for lookups |
| Browser support | No | Yes (WASM + WebGPU) |
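For readers unfamiliar with Q4_0: the standard GGUF/ggml layout packs 32 weights into an 18-byte block, a 16-bit float scale followed by 16 bytes of packed 4-bit values, each decoded as (nibble - 8) * scale. The browser path performs this fused with the matmul inside a custom WGSL shader; the plain CPU sketch below only shows the arithmetic:

```rust
// CPU sketch of dequantizing one GGUF Q4_0 block. A block covers 32 weights:
// a 16-bit float scale `d` followed by 16 bytes holding 32 packed 4-bit
// values; each weight is (nibble - 8) * d. Illustration only; the browser
// path fuses this with the matmul in a WGSL shader.

fn f16_to_f32(h: u16) -> f32 {
    // Minimal half-float decode covering normal numbers and zero, which is
    // enough for typical quantization scales (subnormals treated as zero,
    // inf/NaN not handled).
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    if exp == 0 {
        return f32::from_bits(sign << 31);
    }
    f32::from_bits((sign << 31) | ((exp + 112) << 23) | (frac << 13))
}

/// Dequantize one 18-byte Q4_0 block into 32 f32 weights.
fn dequant_q4_0(block: &[u8; 18]) -> [f32; 32] {
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let qs = &block[2..18];
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        // Low nibble holds element j, high nibble holds element j + 16
        // (the layout used by ggml-style Q4_0).
        out[j] = ((qs[j] & 0x0f) as f32 - 8.0) * d;
        out[j + 16] = ((qs[j] >> 4) as f32 - 8.0) * d;
    }
    out
}

fn main() {
    // A block with scale 1.0 (0x3C00 as f16) and every nibble set to 8
    // dequantizes to 32 zeros.
    let mut block = [0x88u8; 18];
    block[0] = 0x00;
    block[1] = 0x3c;
    assert!(dequant_q4_0(&block).iter().all(|w| *w == 0.0));
    println!("Q4_0 block dequantized");
}
```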
Pure C on CPU: Antirez's voxtral.c
Salvatore Sanfilippo (antirez), the creator of Redis, built voxtral.c — a pure C implementation of the full Voxtral inference pipeline with zero external dependencies beyond the C standard library. It trended on Hacker News with 91 points.
Antirez's motivation was blunt: while Mistral released great open weights, limiting inference to vLLM without a reference implementation "limits the model's actual reach." So he built one from scratch.
Key Features of voxtral.c
- Zero dependencies — Pure C, no Python runtime, no CUDA toolkit, no vLLM
- Metal GPU acceleration — Fused GPU operations on Apple Silicon, with BLAS fallback for Linux
- Memory-mapped weights — BF16 weights mmap'd directly from safetensors, near-instant loading
- Live microphone input — `--from-mic` captures and transcribes in real time (macOS)
- Streaming output — Tokens printed to stdout as generated, word by word
- Chunked encoder — Processes audio in overlapping chunks, bounding memory regardless of input length
- Rolling KV cache — Automatically compacted at 8192 positions, enabling unlimited-length audio
- Stdin piping — Pipe any format via ffmpeg (see the sketch after this list): `ffmpeg -i podcast.mp3 -f s16le -ar 16000 -ac 1 - | ./voxtral -d voxtral-model --stdin`
- Streaming C API — `vox_stream_t` lets you feed audio incrementally and receive tokens
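The ffmpeg pipe above delivers raw signed 16-bit little-endian PCM at 16 kHz mono. As a rough illustration of what the `--stdin` path has to consume, here is a minimal sketch that converts such a byte stream into floating-point samples; voxtral.c implements the equivalent in C, and the code below is not taken from it:

```rust
use std::io::{self, Read};

// Reads raw s16le PCM (16 kHz, mono) from stdin and converts it to f32
// samples in [-1.0, 1.0], the form most speech frontends expect before
// computing a mel spectrogram. Illustrative only.
fn main() -> io::Result<()> {
    let mut bytes = Vec::new();
    io::stdin().read_to_end(&mut bytes)?;

    // Two bytes per sample, little-endian, one channel.
    let samples: Vec<f32> = bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect();

    let seconds = samples.len() as f32 / 16_000.0;
    println!("read {} samples (~{seconds:.1} s of audio)", samples.len());
    Ok(())
}
```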
The build is dead simple:
make mps # Apple Silicon (fastest)
# or: make blas # Intel Mac / Linux with OpenBLAS
./download_model.sh
./voxtral -d voxtral-model -i audio.wav
On a 60-second audio clip, batch mode takes roughly 2.9 seconds for the encoder on Apple Silicon. The processing interval flag (-I) lets you control the latency-efficiency tradeoff for streaming use cases.
Voxtral vs. Whisper: Why This Matters
OpenAI's Whisper has been the go-to open-source speech model for years. But Whisper is an offline model — it needs the complete audio before transcribing. Voxtral Mini 4B changes the equation:
| Feature | Whisper | Voxtral Mini 4B Realtime |
|---|---|---|
| Mode | Offline (needs full audio) | Realtime streaming |
| Encoder attention | Bidirectional | Causal (streaming-native) |
| Latency | Full clip processing time | Configurable 80ms–2.4s |
| License | MIT | Apache 2.0 |
| Browser inference | whisper.cpp via WASM (offline) | Full realtime via WASM + WebGPU |
| Languages | 99+ | 13 |
At 480ms delay, Voxtral matches the accuracy of leading offline open-source models on the FLEURS benchmark. On English specifically, the Word Error Rate (WER) is 4.90% at 480ms delay — competitive with Whisper's best offline results.
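If you are new to the metric: word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the length of the reference. A quick sketch with a hypothetical error tally:

```rust
// Word error rate: edit operations (substitutions + deletions + insertions)
// needed to turn the transcript into the reference, divided by the number of
// words in the reference. A WER of 4.90% therefore means roughly 5 word
// errors per 100 reference words.
fn wer(substitutions: usize, deletions: usize, insertions: usize, ref_words: usize) -> f64 {
    (substitutions + deletions + insertions) as f64 / ref_words as f64
}

fn main() {
    // Hypothetical tally over a 1,000-word reference transcript.
    let score = wer(30, 12, 7, 1000);
    println!("WER = {:.2}%", score * 100.0); // prints 4.90%
}
```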
Benchmark Results: FLEURS WER by Language
From Mistral's official benchmarks on the FLEURS dataset:
| Model | Delay | AVG WER | English WER |
|---|---|---|---|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 3.32% |
| Voxtral Mini 4B Realtime | 480ms | 8.72% | 4.90% |
| Voxtral Mini 4B Realtime | 160ms | 12.60% | 6.46% |
The tradeoff is clear: lower latency means higher error rates. At 480ms, you get near-offline quality with realtime capability. For most voice assistant and live subtitling applications, that's more than good enough.
Privacy and On-Device Deployment
This is where Voxtral Mini 4B gets really interesting for businesses and privacy-conscious users. Because the model runs entirely locally:
- No audio data leaves your device — critical for healthcare, legal, and financial applications
- No API costs — transcribe unlimited audio for free after the initial model download
- No internet required — works completely offline once loaded
- No vendor lock-in — Apache 2.0 license means full commercial freedom
At Serenities AI, we've been tracking the Mistral ecosystem closely. Voxtral Mini 4B represents exactly the kind of model that makes AI practical: small enough to deploy anywhere, accurate enough to trust, and open enough to build on.
How to Get Started
Option 1: Try It in Your Browser (Easiest)
- Visit the HuggingFace Spaces demo
- Wait for the ~2.5 GB model to download (one time)
- Click record and start talking
- Everything runs locally in your browser — check your network tab if you don't believe it
Option 2: Native Rust CLI
# Download model weights (~9 GB)
uv run --with huggingface_hub hf download mistralai/Voxtral-Mini-4B-Realtime-2602 --local-dir models/voxtral
# Transcribe an audio file
cargo run --release --features "wgpu,cli,hub" --bin voxtral-transcribe -- \
--audio audio.wav --model models/voxtral
Option 3: Pure C (Antirez's voxtral.c)
make mps # or: make blas
./download_model.sh
./voxtral -d voxtral-model -i audio.wav
# Live microphone
./voxtral -d voxtral-model --from-mic
If you're into AI-assisted development, our Claude Code tips and tricks guide covers how tools like Claude Code can help you integrate speech models into your own projects faster.
What This Means for the Future of Speech AI
Voxtral Mini 4B is a proof point for three converging trends:
- Models are shrinking to fit edge devices — 4B parameters running in a browser tab was unthinkable two years ago
- WebGPU is becoming a real ML deployment target — not just for demos, but for production-quality inference
- The "pure C" movement is alive — following llama.cpp and whisper.cpp, voxtral.c proves that stripping dependencies to zero makes AI accessible everywhere
We're entering an era where speech recognition is no longer a cloud service — it's a capability that ships with your application. The combination of Mistral's model quality, Rust's systems-level performance, and WebGPU's browser reach means private, realtime speech-to-text is now a solved problem for anyone willing to integrate it.
Frequently Asked Questions
What is Voxtral Mini 4B Realtime?
Voxtral Mini 4B Realtime is a 4-billion parameter speech-to-text model from Mistral AI. It's designed for realtime streaming transcription with configurable latency from 80ms to 2.4s. It supports 13 languages and is released under the Apache 2.0 license, making it free for commercial use.
Can Voxtral Mini 4B really run in a web browser?
Yes. The Q4 quantized version (2.5 GB) runs entirely client-side in a browser using WebAssembly and WebGPU. Developer TrevorS built a pure Rust implementation that solved five major technical constraints including WASM's 2 GB allocation limit and 4 GB address space. You can try it live at the HuggingFace Spaces demo.
How does Voxtral compare to Whisper for speech recognition?
Whisper is an offline model requiring the complete audio before transcription. Voxtral Mini 4B is a natively streaming model with causal attention, enabling realtime transcription. At a 480ms delay, Voxtral achieves 4.90% English WER on FLEURS — competitive with offline models. Whisper supports more languages (99+ vs. 13), but Voxtral is purpose-built for realtime use.
What hardware do I need to run Voxtral Mini 4B locally?
For the browser version, you need a WebGPU-capable browser (Chrome 113+ or Edge) and enough RAM to hold the 2.5 GB quantized model. For the pure C version (voxtral.c), Apple Silicon Macs get the best performance via Metal GPU acceleration. Linux systems can use OpenBLAS. The model downloads are 2.5 GB (quantized) or ~9 GB (full precision).
Is Voxtral Mini 4B free to use commercially?
Yes. Voxtral Mini 4B Realtime is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. Both the TrevorS Rust implementation and antirez's voxtral.c are also open source (Apache 2.0 and public domain/MIT respectively).