
VoiceNative

Local voice-to-text for macOS.

MIT License macOS 15+ Apple Silicon 6 MB
How It Works

Press. Speak. Done.

Right Shift → Speak → Right Shift → Text at cursor

VoiceNative sits in your menu bar. It captures audio, transcribes locally on Apple's Neural Engine, and pastes the result into whatever app you're in. Audio never leaves your Mac.

What to Expect

The numbers.

~90s — First launch
~5s — Cached load
0.3 — Real-time factor
150 MB — Memory (idle)

First launch downloads the model (~950 MB) and compiles it for the Neural Engine. One-time cost. Recording limit is 5 minutes — the same default macOS dictation uses. A keepalive micro-inference runs every 2 minutes to prevent the Neural Engine from unloading the model.

Architecture

One state machine.
Six services.

AppState coordinates everything. No service knows about any other. Illegal state transitions are unrepresentable.

idle → loading → ready → listening → processing → ready

Every action checks state before executing. A model reload can't interrupt recording. An error during processing can't clobber an active session. The state machine prevents these at the type level.
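As an illustrative sketch (type and case names are hypothetical, not the app's actual identifiers), the guard amounts to a transition table over an enum — anything not listed is rejected:

```swift
import Foundation

// Sketch: legal transitions enumerated in one place.
enum RecorderState: Equatable {
    case idle, loading, ready, listening, processing
}

struct StateMachine {
    private(set) var state: RecorderState = .idle

    // Only listed transitions succeed; everything else is a no-op.
    @discardableResult
    mutating func transition(to next: RecorderState) -> Bool {
        switch (state, next) {
        case (.idle, .loading),          // start model load
             (.loading, .ready),         // model compiled and loaded
             (.ready, .listening),       // hotkey pressed
             (.listening, .processing),  // hotkey released
             (.processing, .ready):      // transcription delivered
            state = next
            return true
        default:
            return false                 // e.g. reload during recording
        }
    }
}
```

With this shape, "a model reload can't interrupt recording" is just the absence of a (.listening, .loading) case.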

AppState
Central coordinator. Owns all services. Drives state transitions. Manages recording timer, streaming pipeline, and keepalive.
AudioCapture
AVAudioEngine at 48kHz native rate. Lock-free ring buffer. On-demand resample to 16kHz + vDSP normalize.
Transcription
WhisperKit + CoreML on Neural Engine. Hallucination filter. Prewarm keepalive. Configurable timeout.
Hotkey
NSEvent global + local monitors. Toggle state tracking. Right Shift (keycode 60) via .flagsChanged.
TextInjection
NSPasteboard write + simulated Cmd+V into the frontmost app. Preserves source app reference.
VAD
Energy-based voice activity detection. 100ms window. 0.5s calibration. 3s silence timeout for hold-to-talk.
Permissions
Microphone + Accessibility status. Checked on launch. Required for capture and global paste.

Data flow

Mic 48kHz → Ring buffer → Resample 16kHz → vDSP normalize → WhisperKit (ANE) → Hallucination filter → Cmd+V

Capture runs on AVAudioEngine's real-time thread. The ring buffer is lock-free. Resampling and normalization run on the transcription task's thread, not in the audio callback. This separation prevents audio glitches during transcription.

Learning 1

GPU or Neural Engine?
We chose the Neural Engine.

Two paths exist for running Whisper on macOS. We tried both.

whisper.cpp + Metal

Runs on the GPU via Metal. GGML format. Loads instantly — no compilation step. Widely used. But it competes for GPU time with everything else: your windows, scrolling, animations, any other app using Metal.

WhisperKit + CoreML

Runs on the Apple Neural Engine (ANE) — a dedicated inference accelerator on Apple Silicon. CoreML format. Does not compete for GPU or CPU time. Ideal for a background menu bar app. But CoreML compiles the model for your specific hardware on first load.

For a menu bar utility that runs while you do other things, the Neural Engine is the right call. It's a dedicated inference block on your SoC, separate from the CPU and GPU cores. Your GPU stays free for rendering, your CPU stays free for whatever you're working on.

Model selection

We use openai_whisper-large-v3_turbo (~955 MB quantized). We tested smaller models — they hallucinated more on short utterances. We tested unquantized — memory increased without a clear accuracy gain for our use case. Large-v3-turbo hit the sweet spot: fast enough for real-time, accurate enough for dictation, small enough to fit comfortably in memory.

Learning 2

First load: 97 seconds.
We got it to 5.

The cache path problem

CoreML compiles the model for your specific hardware the first time it loads. This takes 60–90 seconds. The compiled model is cached so subsequent loads take ~5 seconds. The difference between a broken app and a fast app was a single directory path.

WhisperKit stores compiled models at:

Cache location:
~/Documents/huggingface/models/argmaxinc/whisperkit-coreml/{model}

When WhisperKit finds compiled artifacts here, it skips the compile step entirely. First launch downloads and compiles. Every launch after reads from this cache.

The cache invalidation problem

CoreML's cache is keyed by the exact compute unit configuration. Change which chip runs which part of the model, and CoreML discards the entire cache. Another 90-second recompilation.

We discovered this the hard way. We tried a hybrid setup — GPU for the encoder, ANE for the decoder. CoreML recompiled the full graph. We tried splitting the model across units. Same result. The cache is all-or-nothing for a given configuration.

We committed to a single configuration and never changed it:

The configuration that stuck:

melCompute: .cpuAndGPU
audioEncoderCompute: .cpuAndNeuralEngine
textDecoderCompute: .cpuAndNeuralEngine
prefillCompute: .cpuOnly

Mel spectrogram on GPU — fast, no ANE compilation needed. Encoder + decoder on Neural Engine — sustained inference without touching GPU. Prefill on CPU — avoids contention with the ANE during decode.

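In code, that pinned configuration looks roughly like the following — a sketch assuming WhisperKit's ModelComputeOptions and WhisperKitConfig initializers, not the app's exact source:

```swift
import WhisperKit
import CoreML

// Sketch: pin the compute units once and never change them.
// Any change here invalidates CoreML's compiled-model cache
// and triggers another ~90s recompilation.
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,                   // spectrogram on GPU
    audioEncoderCompute: .cpuAndNeuralEngine, // encoder on ANE
    textDecoderCompute: .cpuAndNeuralEngine,  // decoder on ANE
    prefillCompute: .cpuOnly                  // avoid ANE contention
)
let config = WhisperKitConfig(
    model: "openai_whisper-large-v3_turbo",
    computeOptions: computeOptions
)
let whisperKit = try await WhisperKit(config)
```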
Learning 3

The model forgets you
after 2 minutes.

macOS silently unloads CoreML models from the Neural Engine after a period of inactivity. There is no notification, no callback, no documentation on the exact timeout. We measured it at roughly 2 minutes.

When the model is unloaded, the next transcription takes 30+ seconds as CoreML reloads and re-initializes the pipeline. For a voice app, that means the first thing you say after a break takes forever.

The fix: silent micro-inference

Every 120 seconds, we transcribe 0.5 seconds of silence (8,000 zero-value Float32 samples). Invisible to the user. Costs negligible compute. Keeps the entire CoreML/ANE pipeline warm.

/// Keep CoreML/ANE pipeline warm with a silent micro-inference.
/// macOS unloads CoreML models from the Neural Engine after ~2min
/// of inactivity. This prevents the 30+s reload penalty.
func prewarm() async {
    guard let wk = whisperKit,
          isModelLoaded, !isTranscribing, !isLoading else { return }

    let silence = [Float](repeating: 0, count: 8000)
    let options = DecodingOptions(
        task: .transcribe, language: "en",
        withoutTimestamps: true, suppressBlank: true
    )
    _ = try? await wk.transcribe(
        audioArray: silence, decodeOptions: options
    )
}

Wake from sleep

The same problem occurs when the Mac wakes from sleep — the Neural Engine is cold. We listen for NSWorkspace.didWakeNotification, wait 3 seconds for the system to stabilize, and run the same prewarm. Users never notice. The pipeline is warm before they reach for the keyboard.
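A sketch of the wake handler, assuming a transcriber service exposing the prewarm() micro-inference above (names are illustrative):

```swift
import AppKit

// Sketch: re-warm the ANE pipeline after the Mac wakes from sleep.
// NSWorkspace posts didWakeNotification on its own notification center.
NSWorkspace.shared.notificationCenter.addObserver(
    forName: NSWorkspace.didWakeNotification,
    object: nil,
    queue: .main
) { _ in
    Task {
        // Give audio/ML subsystems ~3s to stabilize after wake.
        try? await Task.sleep(nanoseconds: 3_000_000_000)
        await transcriber.prewarm()   // `transcriber` is assumed
    }
}
```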

Learning 4

A 2-minute recording
finishes in seconds.

The problem with Whisper

Whisper processes a maximum of 30 seconds per inference pass. A 2-minute recording needs 4 passes. At a real-time factor of 0.3, that's 36 seconds of waiting after you press stop. Unacceptable for a dictation tool.

The streaming solution

Instead of waiting until the user stops, we transcribe chunks in the background while recording continues. A background task polls every 5 seconds. When accumulated audio exceeds 30 seconds of samples (480,000 at 16kHz), it snapshots that chunk, resamples it, and sends it to WhisperKit.

When the user finally stops, only the remaining tail — whatever accumulated since the last chunk — needs transcription. Completed chunk texts are concatenated with the tail result.

Without streaming

Wait = ceil(duration / 30) × 30 × RTF

2 min recording:
ceil(120/30) × 30 × 0.3 = 36s wait after stop

With streaming

Wait = (duration mod 30) × RTF

2 min recording:
(120 mod 30) × 0.3 = 0–9s wait after stop

The actual transcription speed didn't change. We moved the work from "after you stop" to "while you're still talking."
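Under the stated RTF of 0.3, the two formulas can be checked directly (function names are illustrative):

```swift
import Foundation

// Without streaming: every 30s pass runs after the user stops.
func waitWithoutStreaming(duration: Double, rtf: Double = 0.3) -> Double {
    ceil(duration / 30) * 30 * rtf
}

// With streaming: only the tail past the last full chunk remains.
func waitWithStreaming(duration: Double, rtf: Double = 0.3) -> Double {
    duration.truncatingRemainder(dividingBy: 30) * rtf
}
```

For a 120-second recording the first gives 36s; the second gives 0s, since the recording ends exactly on a chunk boundary.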

Pipeline internals

Streaming constants:

chunk size: 30s at native rate (nativeSampleRate × 30)
poll interval: 5 seconds
minimum chunk: 5 seconds (shorter chunks skipped)
chunking strategy: .vad (voice activity detection splitting)
overlap: none — raw sample index advances past each chunk

No overlap between chunks. The index advances cleanly. On stop, the final transcription calculates the offset using the sample rate ratio: processedOffset = lastRawEnd × (16000 / nativeSampleRate).

Final assembly

When recording stops, the pipeline task is cancelled and awaited. If there are completed chunks, only the remaining audio past the last chunk's end index is transcribed (minimum 1 second). The final text is all chunk texts joined with a space, plus the tail. If there are no completed chunks (short recordings), the full buffer is transcribed in one pass.

Timeout protection

Transcription has a dynamic timeout: max(30s, audioDuration × 2). If WhisperKit hangs or takes too long, the task is cancelled and a timeout error is raised. This prevents the app from appearing frozen on edge-case audio.
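The timeout rule is simple enough to state as a function (name hypothetical):

```swift
import Foundation

// Dynamic timeout: never below the 30s floor, and scales with audio
// length so long recordings aren't cancelled prematurely.
func transcriptionTimeout(audioDuration: Double) -> Double {
    max(30, audioDuration * 2)
}
```

A 10-second clip gets the 30s floor; a 2-minute recording gets 240s before the task is cancelled.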

Learning 5

The audio pipeline
is the foundation.

Capture: native rate, not target rate

Whisper expects 16kHz mono Float32. macOS microphones typically run at 48kHz. We had two options: capture directly at 16kHz using CoreAudio AUHAL (no resampling needed), or capture at the mic's native rate and resample on demand.

We chose native capture with AVAudioEngine. The reason: AVAudioEngine handles device switching gracefully and provides simpler buffer management. CoreAudio AUHAL at 16kHz would skip the resampling step but requires manual device handling.

Capture configuration:

engine: AVAudioEngine
input: engine.inputNode
format: native output format (typically 48kHz)
tap buffer: 4,096 frames
storage: [Float] ring buffer with NSLock
pre-allocated capacity: nativeSampleRate × 300 (5 min)

Resampling: AVAudioConverter, not manual

Resampling from 48kHz to 16kHz happens on demand — not in the audio callback. The tap callback only appends raw samples to the ring buffer. When it's time to transcribe, prepareChunk() copies samples from the buffer, resamples via AVAudioConverter, and normalizes.

If the mic has multiple channels, we downmix to mono first by averaging all channels. The output capacity includes a 1,024-sample safety margin to prevent buffer underflow during conversion.
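A minimal sketch of the downmix step — the per-frame averaging described above, assuming de-interleaved channel buffers (names are hypothetical):

```swift
import Foundation

// Downmix N channels to mono by averaging all channels per frame.
// Input: one [Float] per channel, all the same length.
func downmixToMono(_ channels: [[Float]]) -> [Float] {
    guard let frames = channels.first?.count, frames > 0 else { return [] }
    let scale = 1 / Float(channels.count)
    var mono = [Float](repeating: 0, count: frames)
    for channel in channels {
        for i in 0..<frames {
            mono[i] += channel[i] * scale
        }
    }
    return mono
}
```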

Normalization: the difference between speech and hallucination

After resampling, we peak-normalize using vDSP from Apple's Accelerate framework. Without normalization, quiet microphones produce low-amplitude samples. Whisper interprets low-amplitude audio as silence — and hallucinates.

/// Peak normalization via Accelerate (vDSP)
var peak: Float = 0
vDSP_maxmgv(samples, 1, &peak, vDSP_Length(count))

if peak > 0 {
    // Target peak: 0.95 (headroom). Max gain: 20× (safety limit).
    var scale = min(0.95 / peak, 20.0)

    // Skip if already normalized (within 1% of target)
    if scale > 0.99 && scale < 1.01 { return samples }

    vDSP_vsmul(samples, 1, &scale, &result, 1,
               vDSP_Length(count))
}

The target peak is 0.95 (leaving headroom). The gain is capped at 20× to prevent amplifying noise floors into hallucination triggers. If the audio is already within 1% of the target, we skip normalization entirely.

Learning 6

Global hotkeys are
harder than they look.

NSEvent monitors, not CGEventTap

macOS gives you two ways to detect global keystrokes. We tried both.

CGEventTap vs. NSEvent monitors:

Level: Core Graphics vs. AppKit
Permission: Input Monitoring vs. Accessibility
Self-window bug: CGEventTap silently fails when the app's own windows have focus; NSEvent works via a local monitor
Registration: CGEventTap needs a manual event tap + run loop source; NSEvent is a one-liner with a block callback

CGEventTap requires Input Monitoring permission and has a subtle bug: when the app's own popover or settings window has focus, the global tap stops receiving events. You need a separate local monitor for that case anyway.

NSEvent gives you both: globalMonitorForEvents for keystrokes in other apps, localMonitorForEvents for keystrokes in your own windows. Same API, same callbacks. We use both.

Right Shift is not a key press

Right Shift (keycode 60) is a modifier, not a regular key. It doesn't fire .keyDown / .keyUp. It fires .flagsChanged. You check the event's modifier flags to determine whether the key went down or up. Toggle state is maintained manually: first press starts recording, second press stops.

Escape (keycode 53) is the cancel key. Left Shift (keycode 56) is ignored to avoid conflicts with typing.

Learning 7

Whisper hallucinates.
A lot.

The problem

Whisper was trained on YouTube transcripts. When it hears silence or near-silence, it doesn't output nothing — it outputs what it learned from YouTube: "Thank you for watching", "Please like and subscribe", "See you in the next video". Sometimes it outputs repeated n-grams. Sometimes it switches to Chinese or Cyrillic for no reason.

This isn't a bug — it's the model doing exactly what it was trained to do. The training data is full of these phrases, and silence provides no signal to override them.

The filter: three layers

We built a post-transcription filter with three checks. If any check triggers, the transcription is discarded entirely.

1. Pattern matching

Exact match (case-insensitive, trimmed) against known hallucination phrases: "thank you for watching", "thanks for watching", "subscribe to my channel", "please like and subscribe", "see you in the next video", "don't forget to subscribe", "hit the bell", "leave a comment", "share this video", "thanks for listening", "goodbye", "bye bye", "you". Also matches with trailing period.

2. N-gram repetition

For transcriptions with 6+ words: check n-grams from size 1 to 5. If all n-grams of any size are identical and there are 3+ repetitions, it's a hallucination. Catches "the the the the the" and "thank you thank you thank you".
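A sketch of that check under the stated rules (6+ words, n-gram sizes 1–5, 3+ identical repetitions); the exact implementation may differ:

```swift
import Foundation

// Hallucination check: split into consecutive non-overlapping n-grams;
// if every n-gram of some size is identical and there are at least 3
// of them, the transcription is a repetition loop.
func isRepetitionHallucination(_ text: String) -> Bool {
    let words = text.lowercased().split(separator: " ").map(String.init)
    guard words.count >= 6 else { return false }
    for n in 1...5 where words.count % n == 0 && words.count / n >= 3 {
        let grams = stride(from: 0, to: words.count, by: n).map {
            words[$0..<$0 + n].joined(separator: " ")
        }
        if Set(grams).count == 1 { return true }
    }
    return false
}
```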

3. Character set validation

For English transcription: if fewer than half the characters are ASCII, discard. Catches CJK, Cyrillic, and other script hallucinations that Whisper sometimes produces when it loses confidence.
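The character-set check reduces to an ASCII ratio over the non-space characters (sketch; threshold per the text):

```swift
import Foundation

// For English output: discard when fewer than half the non-space
// characters are ASCII — catches CJK/Cyrillic script drift.
func isScriptHallucination(_ text: String) -> Bool {
    let chars = text.unicodeScalars.filter { $0 != " " }
    guard !chars.isEmpty else { return false }
    let asciiCount = chars.filter { $0.isASCII }.count
    return asciiCount * 2 < chars.count
}
```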

What we tried and abandoned: prompt tokens

We experimented with feeding a technical dictionary as prompt tokens to bias Whisper toward domain-specific vocabulary. The result: more hallucinations, not fewer. Prompt tokens add latency and can bias the model toward generating text even when input is silence. We removed them and moved vocabulary handling to post-processing.

Learning 8

Paste, don't type.

Text injection via pasteboard

After transcription, the text needs to appear in whatever app the user was in. We save the frontmost application reference when recording starts (NSWorkspace.shared.frontmostApplication), write the transcribed text to NSPasteboard, and simulate Cmd+V.

Why paste instead of typing each character? Speed and reliability. Simulating individual keystrokes fails with special characters, non-ASCII text, and apps that have custom key handling. A single paste operation works everywhere that supports Cmd+V — which is everything.

Accessibility permission is required for simulating the paste keystroke. This is the same permission needed for the global hotkey — one grant covers both.
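A sketch of the injection path (function name and structure assumed; virtual key code 9 is 'v' on ANSI layouts):

```swift
import AppKit

// Sketch: write to the pasteboard, then synthesize Cmd+V.
// Requires the Accessibility permission for the posted key events.
func injectText(_ text: String) {
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey: CGKeyCode = 9   // 'v' on ANSI keyboards
    let keyDown = CGEvent(keyboardEventSource: source,
                          virtualKey: vKey, keyDown: true)
    keyDown?.flags = .maskCommand
    let keyUp = CGEvent(keyboardEventSource: source,
                        virtualKey: vKey, keyDown: false)
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```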

Voice Activity Detection

VAD provides automatic silence-based stop for hold-to-talk mode. It uses energy-based detection with a 100ms analysis window. On launch, a 0.5-second calibration period measures ambient noise to set the baseline.

After at least 3 seconds of recording (to avoid premature stops), if silence exceeds the timeout (default 3 seconds), VAD triggers stop. The minimum recording threshold prevents the app from stopping before the user has had a chance to speak.
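The per-window decision can be sketched as an RMS comparison against the calibrated noise floor (the margin factor is an assumed parameter, not a documented value):

```swift
import Foundation

// Energy-based VAD sketch: one 100ms window is 1,600 samples at 16kHz.
// Speech = window RMS exceeds the calibrated noise floor by a margin.
func isSpeech(window: [Float], noiseFloor: Float,
              margin: Float = 2.0) -> Bool {
    guard !window.isEmpty else { return false }
    let meanSquare = window.reduce(0) { $0 + $1 * $1 } / Float(window.count)
    return meanSquare.squareRoot() > noiseFloor * margin
}
```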

Reference

System defaults.

Every value that ships with VoiceNative. No hidden configuration.

Target sample rate: 16,000 Hz — Whisper's expected input
Channels: 1 (mono) — Whisper is mono-only
Tap buffer size: 4,096 frames — Balance between latency and callback frequency
Max recording: 300s (5 min) — Same as macOS dictation default
Min duration for transcription: 0.3s — Below this, nothing useful to transcribe
Hotkey: Right Shift (60) — Rarely used alone; no conflict with typing
Cancel key: Escape (53) — Standard macOS cancel idiom
Keepalive interval: 120s — Just under ANE's unload timeout
Prewarm samples: 8,000 (0.5s) — Minimum to trigger pipeline; negligible cost
Wake delay: 3s — System stabilization after sleep
Streaming chunk: 30s at native rate — Whisper's max input window
Streaming poll: 5s — Balance between latency and CPU usage
Transcription timeout: max(30s, 2× audio) — Dynamic; prevents hangs on edge cases
Normalization peak: 0.95 — Headroom to prevent clipping
Max normalization gain: 20× — Safety limit; prevents noise amplification
VAD window: 100ms — Energy analysis resolution
VAD calibration: 0.5s — Ambient noise baseline measurement
Silence timeout: 3.0s — Hold-to-talk auto-stop threshold
Min recording before VAD: 3.0s — Prevents premature stop
Icon feedback: 1.5s — Duration of menu bar icon state feedback
Get Started

Install.

From DMG

  1. Download from Releases
  2. Open the DMG, drag VoiceNative to Applications
  3. Grant Microphone and Accessibility permissions when prompted
  4. Wait ~90 seconds on first launch for model download + Neural Engine compilation

Build from source

Requires macOS 15 (Sequoia) and Xcode 16+.

git clone https://github.com/triggeredcode/voicenative.git
cd voicenative
make dmg

Permissions

Microphone: audio capture via AVAudioEngine
Accessibility: global hotkey detection (NSEvent) + simulated Cmd+V paste

Both permissions are requested on first launch. Grant them in System Settings → Privacy & Security.