VoiceNative sits in your menu bar. It captures audio, transcribes locally on Apple's Neural Engine, and pastes the result into whatever app you're in. Audio never leaves your Mac.
First launch downloads the model (~950 MB) and compiles it for the Neural Engine. One-time cost. Recording limit is 5 minutes — the same default macOS dictation uses. A keepalive micro-inference runs every 2 minutes to prevent the Neural Engine from unloading the model.
AppState coordinates everything. No service knows about any other. Illegal state transitions are unrepresentable.
Every action checks state before executing. A model reload can't interrupt recording. An error during processing can't clobber an active session. The state machine prevents these at the type level.
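The idea of making illegal transitions unrepresentable can be sketched with a state enum whose transition rules live in one place. The states and allowed edges below are illustrative, not VoiceNative's exact set:

```swift
// Sketch: a state enum with a single transition-validation point.
// State names and legal edges are assumptions for illustration.
enum AppState {
    case idle
    case recording
    case processing
    case error(String)

    // Only these transitions are legal; everything else is rejected,
    // so a reload or error can never clobber an active session.
    func canTransition(to next: AppState) -> Bool {
        switch (self, next) {
        case (.idle, .recording),        // hotkey pressed
             (.recording, .processing),  // stop pressed
             (.recording, .idle),        // Escape cancels
             (.processing, .idle),       // text pasted
             (.processing, .error),      // transcription failed
             (.error, .idle):            // error dismissed
            return true
        default:
            return false
        }
    }
}
```

Every action first asks `canTransition(to:)`, so "model reload during recording" simply has no legal path.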
Capture runs on AVAudioEngine's real-time thread. The ring buffer is lock-free. Resampling and normalization run on the transcription task's thread, not in the audio callback. This separation prevents audio glitches during transcription.
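The producer/consumer split can be sketched as a minimal ring buffer: the audio callback only appends samples, and the transcription task copies a snapshot out later. This simplified version omits the atomics a genuinely lock-free buffer needs for cross-thread safety:

```swift
// Simplified ring-buffer sketch: append on the audio thread, snapshot on
// the transcription thread. Real cross-thread use needs atomic indices;
// this version only illustrates the data flow.
struct RingBuffer {
    private var storage: [Float]
    private(set) var writeIndex = 0  // total samples ever written

    init(capacity: Int) {
        storage = [Float](repeating: 0, count: capacity)
    }

    // Audio-callback side: copy samples in, no allocation, no locks.
    mutating func append(_ samples: [Float]) {
        for s in samples {
            storage[writeIndex % storage.count] = s
            writeIndex += 1
        }
    }

    // Transcription side: copy out the most recent n samples.
    func snapshot(last n: Int) -> [Float] {
        let count = min(n, min(writeIndex, storage.count))
        var out = [Float]()
        out.reserveCapacity(count)
        for i in (writeIndex - count)..<writeIndex {
            out.append(storage[i % storage.count])
        }
        return out
    }
}
```

Resampling and normalization happen on the snapshot copy, never inside `append`.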
Two paths exist for running Whisper on macOS. We tried both.
The first path runs on the GPU via Metal, using the GGML format. It loads instantly — no compilation step — and is widely used. But it competes for GPU time with everything else: your windows, scrolling, animations, any other app using Metal.
The second path runs on the Apple Neural Engine (ANE) — a dedicated inference accelerator on Apple Silicon — using the CoreML format. It does not compete for GPU or CPU time, which is ideal for a background menu bar app. But CoreML compiles the model for your specific hardware on first load.
For a menu bar utility that runs while you do other things, the Neural Engine is the right call. It's an entirely separate chip on your SoC dedicated to ML inference. Your GPU stays free for rendering, your CPU stays free for whatever you're working on.
We use openai_whisper-large-v3_turbo (~955 MB quantized). We tested smaller models — they hallucinated more on short utterances. We tested unquantized — memory increased without a clear accuracy gain for our use case. Large-v3-turbo hit the sweet spot: fast enough for real-time, accurate enough for dictation, small enough to fit comfortably in memory.
CoreML compiles the model for your specific hardware the first time it loads. This takes 60–90 seconds. The compiled model is cached so subsequent loads take ~5 seconds. The difference between a broken app and a fast app was a single directory path.
WhisperKit stores compiled models at:
CoreML's cache is keyed by the exact compute unit configuration. Change which chip runs which part of the model, and CoreML discards the entire cache. Another 90-second recompilation.
We discovered this the hard way. We tried a hybrid setup — GPU for the encoder, ANE for the decoder. CoreML recompiled the full graph. We tried splitting the model across units. Same result. The cache is all-or-nothing for a given configuration.
We committed to a single configuration and never changed it:
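A pinned configuration might look like the sketch below. The parameter names follow WhisperKit's `ModelComputeOptions`; treat the exact initializer and unit assignments as assumptions, not VoiceNative's verbatim settings:

```swift
import WhisperKit
import CoreML

// Sketch: pin every pipeline stage to one compute-unit configuration so
// the CoreML cache key never changes. Exact parameters are assumptions.
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndNeuralEngine,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)
```

The point is the invariant, not the values: once shipped, the configuration never changes, so the 90-second compile happens exactly once per machine.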
macOS silently unloads CoreML models from the Neural Engine after a period of inactivity. There is no notification, no callback, no documentation on the exact timeout. We measured it at roughly 2 minutes.
When the model is unloaded, the next transcription takes 30+ seconds as CoreML reloads and re-initializes the pipeline. For a voice app, that means the first thing you say after a break takes forever.
Every 120 seconds, we transcribe 0.5 seconds of silence (8,000 zero-value Float32 samples). Invisible to the user. Costs negligible compute. Keeps the entire CoreML/ANE pipeline warm.
```swift
/// Keep CoreML/ANE pipeline warm with a silent micro-inference.
/// macOS unloads CoreML models from the Neural Engine after ~2min
/// of inactivity. This prevents the 30+s reload penalty.
func prewarm() async {
    guard let wk = whisperKit,
          isModelLoaded, !isTranscribing, !isLoading else { return }
    let silence = [Float](repeating: 0, count: 8000)
    let options = DecodingOptions(
        task: .transcribe, language: "en",
        withoutTimestamps: true, suppressBlank: true
    )
    _ = try? await wk.transcribe(
        audioArray: silence, decodeOptions: options
    )
}
```
The same problem occurs when the Mac wakes from sleep — the Neural Engine is cold. We listen for NSWorkspace.didWakeNotification, wait 3 seconds for the system to stabilize, and run the same prewarm. Users never notice. The pipeline is warm before they reach for the keyboard.
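The wake-time path can be sketched as below, assuming a `Transcriber` type that exposes the `prewarm()` function shown earlier; the observer wiring is illustrative:

```swift
import AppKit

// Sketch: re-warm the pipeline after wake. `Transcriber` is a stand-in
// for whatever owns the prewarm() function above.
func observeWake(transcriber: Transcriber) {
    NSWorkspace.shared.notificationCenter.addObserver(
        forName: NSWorkspace.didWakeNotification,
        object: nil, queue: .main
    ) { _ in
        Task {
            // Let the system stabilize before touching the Neural Engine.
            try? await Task.sleep(nanoseconds: 3_000_000_000)
            await transcriber.prewarm()
        }
    }
}
```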
Whisper processes a maximum of 30 seconds per inference pass. A 2-minute recording needs 4 passes. At a real-time factor of 0.3, that's 36 seconds of waiting after you press stop. Unacceptable for a dictation tool.
Instead of waiting until the user stops, we transcribe chunks in the background while recording continues. A background task polls every 5 seconds. When accumulated audio exceeds 30 seconds of samples (480,000 at 16kHz), it snapshots that chunk, resamples it, and sends it to WhisperKit.
When the user finally stops, only the remaining tail — whatever accumulated since the last chunk — needs transcription. Completed chunk texts are concatenated with the tail result.
```
Sequential (transcribe only after stop):
  Wait = ceil(duration / 30) × 30 × RTF
  2 min recording: ceil(120/30) × 30 × 0.3 = 36s wait after stop

Streaming (chunks transcribed during recording):
  Wait = tail × RTF, where tail = duration mod 30 ∈ [0, 30)
  2 min recording: tail of 0–30s × 0.3 = 0–9s wait after stop
```
When recording stops, the pipeline task is cancelled and awaited. If there are completed chunks, only the remaining audio past the last chunk's end index is transcribed (minimum 1 second). The final text is all chunk texts joined with a space, plus the tail. If there are no completed chunks (short recordings), the full buffer is transcribed in one pass.
Transcription has a dynamic timeout: max(30s, audioDuration × 2). If WhisperKit hangs or takes too long, the task is cancelled and a timeout error is raised. This prevents the app from appearing frozen on edge-case audio.
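The stop-time bookkeeping can be sketched as pure functions. The names are ours, and we read "minimum 1 second" as "skip tails shorter than one second" — that interpretation is an assumption:

```swift
import Foundation

let sampleRate = 16_000

// Tail = everything past the last completed chunk's end index.
// Assumption: tails under 1 second are skipped rather than padded.
func tailRange(totalSamples: Int, lastChunkEnd: Int) -> Range<Int>? {
    let remaining = totalSamples - lastChunkEnd
    guard remaining >= sampleRate else { return nil }
    return lastChunkEnd..<totalSamples
}

// Dynamic timeout: max(30s, 2× audio duration), per the text.
func transcriptionTimeout(audioSeconds: Double) -> Double {
    max(30, audioSeconds * 2)
}

// Final text: completed chunk texts joined with the tail result.
func joinTranscripts(chunks: [String], tail: String?) -> String {
    (chunks + [tail].compactMap { $0 })
        .joined(separator: " ")
        .trimmingCharacters(in: .whitespaces)
}
```

For a short recording with no completed chunks, `chunks` is empty and the whole buffer is the "tail", so the same join code covers both cases.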
Whisper expects 16kHz mono Float32. macOS microphones typically run at 48kHz. We had two options: capture directly at 16kHz using CoreAudio AUHAL (no resampling needed), or capture at the mic's native rate and resample on demand.
We chose native capture with AVAudioEngine. The reason: AVAudioEngine handles device switching gracefully and provides simpler buffer management. CoreAudio AUHAL at 16kHz would skip the resampling step but requires manual device handling.
Resampling from 48kHz to 16kHz happens on demand — not in the audio callback. The tap callback only appends raw samples to the ring buffer. When it's time to transcribe, prepareChunk() copies samples from the buffer, resamples via AVAudioConverter, and normalizes.
If the mic has multiple channels, we downmix to mono first by averaging all channels. The output capacity includes a 1,024-sample safety margin to prevent buffer underflow during conversion.
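The downmix step can be sketched as a pure function. We assume interleaved samples here for brevity; the real code reads `AVAudioPCMBuffer`'s per-channel pointers, which may be non-interleaved:

```swift
// Sketch: average all channels into mono. Assumes an interleaved layout
// [ch0, ch1, ch0, ch1, …]; the actual buffer layout depends on the device.
func downmixToMono(interleaved: [Float], channels: Int) -> [Float] {
    guard channels > 1 else { return interleaved }
    let frames = interleaved.count / channels
    var mono = [Float](repeating: 0, count: frames)
    for frame in 0..<frames {
        var sum: Float = 0
        for ch in 0..<channels {
            sum += interleaved[frame * channels + ch]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```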
After resampling, we peak-normalize using vDSP from Apple's Accelerate framework. Without normalization, quiet microphones produce low-amplitude samples. Whisper interprets low-amplitude audio as silence — and hallucinates.
```swift
import Accelerate

/// Peak normalization via Accelerate (vDSP).
func normalize(_ samples: [Float]) -> [Float] {
    let count = samples.count
    var peak: Float = 0
    vDSP_maxmgv(samples, 1, &peak, vDSP_Length(count))
    guard peak > 0 else { return samples }
    // Target peak: 0.95 (headroom). Max gain: 20× (safety limit).
    var scale = min(0.95 / peak, 20.0)
    // Skip if already normalized (within 1% of target).
    if scale > 0.99 && scale < 1.01 { return samples }
    var result = [Float](repeating: 0, count: count)
    vDSP_vsmul(samples, 1, &scale, &result, 1, vDSP_Length(count))
    return result
}
```
The target peak is 0.95 (leaving headroom). The gain is capped at 20× to prevent amplifying noise floors into hallucination triggers. If the audio is already within 1% of the target, we skip normalization entirely.
macOS gives you two ways to detect global keystrokes. We tried both.
| Approach | CGEventTap | NSEvent monitors |
|---|---|---|
| Level | Core Graphics | AppKit |
| Permission | Input Monitoring | Accessibility |
| Self-window bug | Silently fails when app's own windows have focus | Works via local monitor |
| Registration | Manual event tap + run loop source | One-liner with block callback |
CGEventTap requires Input Monitoring permission and has a subtle bug: when the app's own popover or settings window has focus, the global tap stops receiving events. You need a separate local monitor for that case anyway.
NSEvent gives you both: globalMonitorForEvents for keystrokes in other apps, localMonitorForEvents for keystrokes in your own windows. Same API, same callbacks. We use both.
Right Shift (keycode 60) is a modifier, not a regular key. It doesn't fire .keyDown / .keyUp. It fires .flagsChanged. You check the event's modifier flags to determine whether the key went down or up. Toggle state is maintained manually: first press starts recording, second press stops.
Escape (keycode 53) is the cancel key. Left Shift (keycode 56) is ignored to avoid conflicts with typing.
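The dual-monitor setup described above can be sketched as follows. Key codes come from the text (Right Shift = 60, Escape = 53); the `toggle`/`cancel` callbacks and class shape are illustrative:

```swift
import AppKit

// Sketch of the NSEvent dual-monitor hotkey setup. Callback wiring is
// an assumption; only the key codes and event types follow the text.
final class HotkeyListener {
    private var isRecording = false

    func start(toggle: @escaping (Bool) -> Void,
               cancel: @escaping () -> Void) {
        let handle: (NSEvent) -> Void = { [weak self] event in
            guard let self else { return }
            switch event.type {
            case .flagsChanged where event.keyCode == 60:
                // Right Shift never fires .keyDown; the shift flag tells
                // us press vs. release. Toggle only on press.
                if event.modifierFlags.contains(.shift) {
                    self.isRecording.toggle()
                    toggle(self.isRecording)
                }
            case .keyDown where event.keyCode == 53:
                cancel()  // Escape
            default:
                break
            }
        }
        // Other apps focused → global monitor; our own windows → local.
        NSEvent.addGlobalMonitorForEvents(matching: [.flagsChanged, .keyDown]) { handle($0) }
        NSEvent.addLocalMonitorForEvents(matching: [.flagsChanged, .keyDown]) { handle($0); return $0 }
    }
}
```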
Whisper was trained on YouTube transcripts. When it hears silence or near-silence, it doesn't output nothing — it outputs what it learned from YouTube: "Thank you for watching", "Please like and subscribe", "See you in the next video". Sometimes it outputs repeated n-grams. Sometimes it switches to Chinese or Cyrillic for no reason.
This isn't a bug — it's the model doing exactly what it was trained to do. The training data is full of these phrases, and silence provides no signal to override them.
We built a post-transcription filter with three checks. If any check triggers, the transcription is discarded entirely.
Exact match (case-insensitive, trimmed) against known hallucination phrases: "thank you for watching", "thanks for watching", "subscribe to my channel", "please like and subscribe", "see you in the next video", "don't forget to subscribe", "hit the bell", "leave a comment", "share this video", "thanks for listening", "goodbye", "bye bye", "you". Also matches with trailing period.
For transcriptions with 6+ words: check n-grams from size 1 to 5. If all n-grams of any size are identical and there are 3+ repetitions, it's a hallucination. Catches "the the the the the" and "thank you thank you thank you".
For English transcription: if fewer than half the characters are ASCII, discard. Catches CJK, Cyrillic, and other script hallucinations that Whisper sometimes produces when it loses confidence.
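The three checks compose into one pure filter. This sketch abbreviates the phrase list and uses our own function names; the thresholds (6+ words, 3+ repetitions, n-grams of size 1–5, under 50% ASCII) follow the text:

```swift
import Foundation

// Abbreviated from the full phrase list in the text.
let knownHallucinations: Set<String> = [
    "thank you for watching", "thanks for watching", "you",
    "please like and subscribe", "see you in the next video",
]

func isHallucination(_ text: String) -> Bool {
    let trimmed = text
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .lowercased()

    // Check 1: exact phrase match, also with a trailing period.
    let dePeriodized = trimmed.hasSuffix(".")
        ? String(trimmed.dropLast()) : trimmed
    if knownHallucinations.contains(dePeriodized) { return true }

    // Check 2: repeated n-grams, for transcriptions of 6+ words.
    let words = trimmed.split(separator: " ").map(String.init)
    if words.count >= 6 {
        for n in 1...5 {
            let grams = stride(from: 0, to: words.count - n + 1, by: n).map {
                words[$0..<($0 + n)].joined(separator: " ")
            }
            if grams.count >= 3, Set(grams).count == 1 { return true }
        }
    }

    // Check 3: for English output, mostly non-ASCII characters signal a
    // script hallucination (CJK, Cyrillic, …).
    let asciiCount = trimmed.unicodeScalars.filter { $0.isASCII }.count
    if !trimmed.isEmpty, asciiCount * 2 < trimmed.unicodeScalars.count {
        return true
    }
    return false
}
```

If any check fires, the whole transcription is discarded, matching the all-or-nothing behavior described above.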
We experimented with feeding a technical dictionary as prompt tokens to bias Whisper toward domain-specific vocabulary. The result: more hallucinations, not fewer. Prompt tokens add latency and can bias the model toward generating text even when input is silence. We removed them and moved vocabulary handling to post-processing.
After transcription, the text needs to appear in whatever app the user was in. We save the frontmost application reference when recording starts (NSWorkspace.shared.frontmostApplication), write the transcribed text to NSPasteboard, and simulate Cmd+V.
Why paste instead of typing each character? Speed and reliability. Simulating individual keystrokes fails with special characters, non-ASCII text, and apps that have custom key handling. A single paste operation works everywhere that supports Cmd+V — which is everything.
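The pasteboard-plus-synthetic-keystroke step can be sketched as below. `kVK_ANSI_V` comes from Carbon's HIToolbox; re-activating the saved frontmost app before posting the events is elided here:

```swift
import AppKit
import Carbon.HIToolbox  // kVK_ANSI_V

// Sketch: write text to the pasteboard, then synthesize Cmd+V.
// Requires the Accessibility permission discussed below.
func paste(_ text: String) {
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    keyDown?.flags = .maskCommand
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```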
Accessibility permission is required for simulating the paste keystroke. This is the same permission needed for the global hotkey — one grant covers both.
VAD provides automatic silence-based stop for hold-to-talk mode. It uses energy-based detection with a 100ms analysis window. On launch, a 0.5-second calibration period measures ambient noise to set the baseline.
After at least 3 seconds of recording (to avoid premature stops), if silence exceeds the timeout (default 3 seconds), VAD triggers stop. The minimum recording threshold prevents the app from stopping before the user has had a chance to speak.
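The VAD decision logic reduces to a couple of pure functions. The RMS formula and the stop condition follow the text; how "silence" is judged against the calibrated baseline is an assumption:

```swift
// Sketch of the energy-based VAD decision. Window size and thresholds
// follow the text; the baseline comparison is an assumption.
struct VAD {
    let baseline: Float            // ambient RMS from 0.5s calibration
    let minRecordingSeconds = 3.0  // don't stop before the user speaks
    let silenceTimeoutSeconds = 3.0

    // RMS energy of one ~100ms analysis window.
    func rms(_ window: [Float]) -> Float {
        guard !window.isEmpty else { return 0 }
        let sumSquares = window.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumSquares / Float(window.count)).squareRoot()
    }

    // Auto-stop once we've recorded long enough AND the energy has
    // stayed near the baseline for the full silence timeout.
    func shouldStop(elapsed: Double, silentFor: Double) -> Bool {
        elapsed >= minRecordingSeconds && silentFor >= silenceTimeoutSeconds
    }
}
```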
Every value that ships with VoiceNative. No hidden configuration.
| Parameter | Value | Why |
|---|---|---|
| Target sample rate | 16,000 Hz | Whisper's expected input |
| Channels | 1 (mono) | Whisper is mono-only |
| Tap buffer size | 4,096 frames | Balance between latency and callback frequency |
| Max recording | 300s (5 min) | Same as macOS dictation default |
| Min duration for transcription | 0.3s | Below this, nothing useful to transcribe |
| Hotkey | Right Shift (60) | Rarely used alone; no conflict with typing |
| Cancel key | Escape (53) | Standard macOS cancel idiom |
| Keepalive interval | 120s | Just under ANE's unload timeout |
| Prewarm samples | 8,000 (0.5s) | Minimum to trigger pipeline; negligible cost |
| Wake delay | 3s | System stabilization after sleep |
| Streaming chunk | 30s at native rate | Whisper's max input window |
| Streaming poll | 5s | Balance between latency and CPU usage |
| Transcription timeout | max(30s, 2× audio) | Dynamic; prevents hangs on edge cases |
| Normalization peak | 0.95 | Headroom to prevent clipping |
| Max normalization gain | 20× | Safety limit; prevents noise amplification |
| VAD window | 100ms | Energy analysis resolution |
| VAD calibration | 0.5s | Ambient noise baseline measurement |
| Silence timeout | 3.0s | Hold-to-talk auto-stop threshold |
| Min recording before VAD | 3.0s | Prevents premature stop |
| Icon feedback | 1.5s | Duration of menu bar icon state feedback |
Requires macOS 15 (Sequoia) and Xcode 16+.
```
git clone https://github.com/triggeredcode/voicenative.git
cd voicenative
make dmg
```
| Permission | Why |
|---|---|
| Microphone | Audio capture via AVAudioEngine |
| Accessibility | Global hotkey detection (NSEvent) + simulated Cmd+V paste |
Both permissions are requested on first launch. Grant them in System Settings → Privacy & Security.