Your best prompts stay in your head. Not because you lack ideas. Because typing them all out, every single time, takes too long.
Most people speak at roughly 130 words per minute and type at about 40. Every time you open a text box, you are already deciding what to leave out. The context gets cut. The detail stays in your head. The AI gets a shorter version of what you actually meant.
Voicebox Captures is built to close that gap. Hold a keyboard chord in any app, speak naturally, release. A local Whisper model transcribes your voice on your hardware. A local Qwen3 model cleans the result before it lands in your active app.
No cloud. No subscription. No audio leaving your machine.
⚡ Direct answer: Voicebox Captures is the system-wide dictation feature inside Voicebox, a free, open-source AI voice studio (MIT license, nearly 30K GitHub stars as of June 2026). Hold a chord anywhere on your computer, speak, release. Whisper transcribes your audio locally, a bundled Qwen3 LLM refines the output, and the polished text auto-pastes into your active text field. Available for macOS, Windows, and Linux. No internet connection required during use.
Voicebox is a local-first AI voice studio built by Jamie Pine, the developer behind Spacedrive. It started as a free, open-source alternative to ElevenLabs: voice cloning, text-to-speech, and seven TTS engines, all running entirely on your own hardware.
The Captures and Dictation feature, introduced in version 0.5.0, is the input side of that loop.
While TTS and voice cloning handle what Voicebox speaks out, Captures handles what you speak in. You talk, and the app turns that into clean, refined text that appears in whichever app you already had open.
Hold your push-to-talk chord anywhere on your system. A small floating pill appears over your active app.
The pill walks through three states: Recording (with a live waveform and timer), then Transcribing while Whisper processes your audio, then Refining while the Qwen3 LLM cleans up the transcript.
When refinement finishes, the pill disappears and the text pastes into the field you had focused. Focus is locked at the moment you press the chord, so the paste lands in the right place even if your eyes drifted to a different window while you were speaking.
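The pill's behavior can be pictured as a small state machine. The sketch below is only an illustration of the sequence described above, not Voicebox's actual code: the `PillState` names mirror the labels shown on the pill, while `run_pill`, `NEXT_STATE`, and the `DONE` state are hypothetical names made up for this example.

```python
from enum import Enum


class PillState(Enum):
    RECORDING = "Recording"
    TRANSCRIBING = "Transcribing"
    REFINING = "Refining"
    DONE = "Done"  # hypothetical terminal state: pill hidden, text pasted


# Each visible state has exactly one successor; DONE is terminal.
NEXT_STATE = {
    PillState.RECORDING: PillState.TRANSCRIBING,
    PillState.TRANSCRIBING: PillState.REFINING,
    PillState.REFINING: PillState.DONE,
}


def run_pill(focused_field: str) -> list:
    """Walk through the pill states and report where the paste lands.

    `focused_field` is read once, at chord press, and never updated,
    which models the "focus is locked" behavior.
    """
    paste_target = focused_field  # locked at the moment the chord is pressed
    log = []
    state = PillState.RECORDING
    while state is not PillState.DONE:
        log.append(state.value)
        state = NEXT_STATE[state]
    log.append(f"paste into {paste_target}")
    return log


print(run_pill("email draft"))
```

The point of the design is that the paste target is decided before you start speaking, so nothing you click or glance at during the recording changes where the text ends up.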
Captures tab backup: Every dictation session is saved in the Captures tab with the original audio paired alongside the transcript. If auto-paste fails for any reason, open the tab and copy the text from there. Every capture can also be re-transcribed or re-refined later with different models and settings, without losing the original recording.
The result is a workflow that feels invisible. You speak. Clean text appears. No copy-pasting, no app-switching, no correcting a raw transcript full of filler words and half-sentences.
Voicebox does not make you a better writer. It removes the part where your typing speed decides how much of your thinking actually reaches the text field.
Most people do not run out of ideas. They run out of patience for typing them.
Think about the last time you opened Claude or ChatGPT. You knew exactly what you wanted to say. You started typing. By sentence three, you were already rewriting sentence one. The prompt you sent was shorter, simpler, and less useful than what you actually meant.
That is not a writing problem. That is a speed limit.
A good AI prompt needs context. It needs the audience, the format, the tone, the constraints, the goal, and what to avoid. Typing all of that from scratch, every single time, adds friction to every interaction. Most people quietly cut the detail because it takes too long.
Voice removes that friction at the source.
When you speak, a full context-rich prompt comes out in 20 to 30 seconds. All the detail that usually gets dropped because typing it felt like too much effort? It comes out naturally when you are just talking.
This matters beyond AI prompting. Voice dictation makes a real difference for a lot of daily writing:
The refinement model is what makes this practical. Spoken language is messy. People repeat themselves, correct themselves mid-sentence, and trail off without finishing the thought.
A raw transcript of natural speech is almost never paste-ready. Voicebox cleans the output before it reaches the text field. That is what separates this from a basic transcription tool.
Setup is beginner-friendly, but you need two things before dictation works: a Whisper transcription model and a Qwen3 refinement model. Both are downloaded inside the app after you install it.
1. (~2 minutes) Go to voicebox.sh and download the app. It ships as a DMG for macOS, an MSI for Windows, and a pre-built binary for Linux. No account needed. Run the installer and open Voicebox.
2. (~1 minute) Inside Voicebox, go to Settings and find the Captures section. This is where you configure your transcription model, your refinement model, and your global keyboard chord shortcut.
3. (~3 minutes, download time varies) In Settings → Captures → Transcription, choose your Whisper model and download it. For most users, Whisper Turbo is the right starting point. It is faster than the large models, handles most accents and technical terms well, and is more stable on difficult recordings than the Base model.
4. (~2 minutes, download time varies) In Settings → Captures → Refinement, choose your Qwen3 model and download it. Start with Qwen3 0.6B (around 400 MB) to test without heavy memory use. You can switch to a larger model later if the cleanup quality is not sufficient for your use case.
5. (~2 minutes) In Settings → Captures → Dictation, review or customize your push-to-talk shortcut. On macOS the default is Right ⌘ + Right ⌥. On Windows and Linux the default is Right Ctrl + Right Shift. Open any text field in any app, hold your chord, say a short sentence, and release. The refined text should paste in automatically.
macOS: Accessibility permission required for auto-paste. If dictation runs but nothing pastes into your text field on macOS, check System Settings → Privacy and Security → Accessibility and enable Voicebox. Without this permission, transcripts still save to the Captures tab but automatic paste into other apps will not work. On Windows and Linux, this permission step is not required.
The model you pick affects speed, accuracy, and how much memory Voicebox uses while running. The right choice depends on your hardware and what you plan to dictate most often.
Voicebox uses OpenAI’s Whisper for transcription. On Apple Silicon it runs through MLX, which is roughly 8x faster than the standard PyTorch path. On Windows and Linux it uses CUDA, ROCm, DirectML, or CPU depending on your GPU setup.
According to the official Voicebox documentation, the Base model can hallucinate on difficult inputs, producing repeated phrases in a loop. Turbo handles these situations more reliably. Large gives the highest accuracy but is the slowest option.
For most users, start with Whisper Turbo. It balances speed and accuracy well enough for emails, AI prompts, and general dictation. Move to Large only if you regularly dictate dense technical content where accuracy errors are costly.
The refinement step is what turns raw transcription into paste-ready text. Voicebox ships three bundled Qwen3 sizes and you choose one in Settings → Captures → Refinement.
| Model | Size | Speed | Best For | Watch Out For |
|---|---|---|---|---|
| Qwen3 0.6B | ~400 MB | Very fast | Casual dictation, short prompts, emails, notes | May miss nuanced self-corrections in long or complex speech |
| Qwen3 1.7B | ~1.1 GB | Fast | Technical content, code identifiers, longer structured prompts | Requires more RAM; slower first load than the 0.6B |
| Qwen3 4B | ~2.5 GB | Slowest | Full quality, complex language, dense professional content | Noticeably heavier on memory; not necessary for most everyday use |
The practical rule: start with 0.6B. Test it with a real email or a real AI prompt from your actual daily workflow, not a short demo sentence. If technical terms are getting mangled or self-corrections are not being caught, move to 1.7B.
One LLM, two uses: The Qwen3 model you download for refinement is the same model used by Voicebox’s voice personality feature. You are not downloading two separate LLMs. If you already use voice personalities in Voicebox, the refinement model is already on your machine.
Once Voicebox is set up, dictation becomes one action: hold the chord, speak, release.
The two chord modes give you flexibility. Push-to-talk stops recording the moment you release the shortcut, which works well for short bursts and single prompts. Toggle-to-talk keeps recording until you tap the chord again, which is better for longer narration or when you want your hands free.
You can also upgrade from push-to-talk to toggle mid-hold. Just tap Space while holding the chord. The recording continues without any gap in the audio. You do not have to decide which mode you want before you start speaking.
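If the interaction between the two modes is hard to picture, here is a minimal sketch of the push-to-talk hold and the mid-hold upgrade as an event loop. The event names (`chord_down`, `chord_up`, `space`) and the function `is_recording_after` are hypothetical, chosen for illustration; this models the behavior described above, not how Voicebox implements it.

```python
def is_recording_after(events: list) -> bool:
    """Replay key events and report whether recording is still active.

    Hypothetical event names:
      "chord_down" - the dictation chord is pressed
      "chord_up"   - the dictation chord is released
      "space"      - Space is tapped while the chord is held
    """
    recording = False
    toggled = False  # True once a hold has been upgraded to toggle mode
    for event in events:
        if event == "chord_down":
            if not recording:
                recording, toggled = True, False  # start a push-to-talk hold
            elif toggled:
                recording = False  # tapping the chord again ends toggle mode
        elif event == "space" and recording:
            toggled = True  # upgrade mid-hold; recording is never interrupted
        elif event == "chord_up" and recording and not toggled:
            recording = False  # plain push-to-talk stops on release
    return recording


# Push-to-talk: releasing the chord stops the recording.
print(is_recording_after(["chord_down", "chord_up"]))
# Upgraded to toggle: releasing the chord no longer stops it.
print(is_recording_after(["chord_down", "space", "chord_up"]))
```

Notice that the upgrade only flips a flag. The recording itself is never stopped and restarted, which is why there is no gap in the audio.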
Voicebox auto-paste works across your entire system. Apps where it works reliably include:
Your clipboard is preserved throughout. Voicebox saves whatever you had copied before the paste and restores it immediately after. Nothing you copied before dictating gets overwritten.
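The save-and-restore pattern is simple enough to sketch. This example uses a stand-in `FakeClipboard` class and a `paste` callback instead of a real system clipboard, so every name here is an assumption made for illustration rather than anything taken from Voicebox itself.

```python
class FakeClipboard:
    """In-memory stand-in for the system clipboard (illustration only)."""

    def __init__(self, content: str = "") -> None:
        self.content = content


def paste_preserving_clipboard(clipboard: FakeClipboard, text: str, paste) -> None:
    """Paste `text` through the clipboard, then put the old content back.

    `paste` is a hypothetical callback standing in for the paste keystroke
    sent to the focused app; it receives whatever is on the clipboard.
    """
    saved = clipboard.content  # remember what the user had copied
    try:
        clipboard.content = text
        paste(clipboard.content)
    finally:
        clipboard.content = saved  # restore even if the paste fails


clipboard = FakeClipboard("a link you copied earlier")
field = []  # stands in for the focused text field
paste_preserving_clipboard(clipboard, "dictated text", field.append)
print(field, clipboard.content)
```

Restoring inside `finally` is the design choice that matters here: your earlier copy survives even when the paste step itself goes wrong.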
Shortcut customization: You can change both chords in Settings → Captures → Dictation. Voicebox records whether each modifier key is the left or right variant, so you can bind dictation to just Right Option without affecting Left Option shortcuts. Changes take effect immediately without restarting the app.
Here is a practical example. Instead of typing a long AI prompt character by character, dictate it in one breath:
Hold your chord and speak into Claude or ChatGPT: "Help me write a reply to a client email. The client asked for an update on their project timeline. The project is on schedule but we need one more week for final quality checks. Keep the tone professional but warm. Three short paragraphs maximum. Do not use bullet points."
Speaking that takes about 20 seconds. Typing it usually takes 90. The output that arrives in the text field is clean enough to send directly or with a quick edit.
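You can check the arithmetic yourself. The sketch below counts the words in the example prompt and applies the 130 and 40 words-per-minute figures used in this article. Those rates are rough averages, and the function name `seconds_needed` is made up for this example. The typing number is raw keystroke time only, so pauses and corrections push the real figure higher.

```python
PROMPT = (
    "Help me write a reply to a client email. The client asked for an update "
    "on their project timeline. The project is on schedule but we need one "
    "more week for final quality checks. Keep the tone professional but warm. "
    "Three short paragraphs maximum. Do not use bullet points."
)


def seconds_needed(word_count: int, words_per_minute: float) -> float:
    """Time to produce `word_count` words at a steady rate, in seconds."""
    return word_count * 60 / words_per_minute


words = len(PROMPT.split())
print(f"{words} words")
print(f"speaking at 130 wpm: {seconds_needed(words, 130):.1f} s")
print(f"typing at 40 wpm:    {seconds_needed(words, 40):.1f} s")
```

Swap in your own prompt and your own measured typing speed to see how large the gap is for you.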
Refinement is the step that separates Voicebox from a basic speech-to-text app. After Whisper produces the raw transcript, the Qwen3 LLM runs a cleanup pass.
The goal is not to rewrite what you said. It is to remove the verbal clutter nobody means to say out loud.
Voicebox also strips Whisper’s hallucination loops before the LLM ever sees the transcript. Whisper can sometimes echo a phrase dozens of times in a row, a known quirk sometimes called the “thanks for watching” loop.
Voicebox collapses these repetitions at a threshold of six identical tokens in a row, so the refinement model never amplifies them back into the output. Even if Whisper has a bad moment on a difficult recording, the final text stays clean.
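To make the idea concrete, here is a minimal sketch of collapsing repeated tokens at a threshold of six. It is an illustration under stated assumptions, not Voicebox's actual filter: the function name `collapse_loops` is hypothetical, tokens are treated as whole words, and a collapsed run is assumed to keep a single copy. How Voicebox handles multi-word phrases is not covered here.

```python
from itertools import groupby


def collapse_loops(tokens: list, threshold: int = 6) -> list:
    """Collapse any run of `threshold` or more identical tokens to one copy.

    Shorter runs are left untouched, so ordinary repetition such as
    "no no" survives.
    """
    cleaned = []
    for token, group in groupby(tokens):
        run_length = len(list(group))
        keep = 1 if run_length >= threshold else run_length
        cleaned.extend([token] * keep)
    return cleaned


raw = ["send", "it"] + ["thanks"] * 8 + ["today"]
print(collapse_loops(raw))
```

Running the filter before refinement means the LLM only ever sees the cleaned token list, so it has nothing to echo back.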
Re-refine any capture later: Each capture saves three refinement flags (smart_cleanup, self_correction, and preserve_technical). You can re-run refinement on the raw transcript at any time with different settings, without losing the original recording. You are never locked into a single result.
You can also use voice to dictate structured prompts for technical work:
Dictate this before asking Claude to review code: "Before suggesting changes, check the current documentation for this library. I want you to confirm the method I am using is still valid, not deprecated, and matches the current API. If anything has changed since your training, tell me what changed first, then suggest the fix."
Voicebox Captures has real strengths. It also has real limits worth knowing before you build it into your daily workflow.
The Whisper and Qwen3 models need to load into memory the first time you use them after launching Voicebox. This initial load can take several seconds or more depending on model size and your hardware.
After the first load, the experience becomes much faster. Expect a slower startup every time you open the app fresh, not every time you dictate.
Even the smallest Qwen3 model (0.6B, around 400 MB) adds to your active memory alongside the Whisper model. On a machine with 8 GB of RAM, test with the smallest models before committing.
Larger Whisper models in particular can claim a significant portion of available memory. If the app feels slow, drop down a model size before changing anything else.
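A quick back-of-the-envelope check can help before you download a larger model. The sketch below adds the approximate Qwen3 download sizes from the table above to a Whisper size that you supply yourself. Treat the result as a rough lower bound: download size is not the same as runtime memory use, and the function names and the 1500 MB Whisper figure are placeholders for illustration, not numbers from Voicebox.

```python
# Approximate download sizes from the Qwen3 table above, in MB.
QWEN3_DOWNLOAD_MB = {"0.6B": 400, "1.7B": 1100, "4B": 2500}


def model_footprint_mb(qwen3_size: str, whisper_mb: float) -> float:
    """Combined download size of both models, as a rough lower bound."""
    return QWEN3_DOWNLOAD_MB[qwen3_size] + whisper_mb


def share_of_ram(qwen3_size: str, whisper_mb: float, total_ram_gb: float) -> float:
    """Fraction of total RAM the two models would claim at minimum."""
    return model_footprint_mb(qwen3_size, whisper_mb) / (total_ram_gb * 1024)


# Placeholder: replace 1500 with the size shown for your Whisper model.
for size in QWEN3_DOWNLOAD_MB:
    share = share_of_ram(size, whisper_mb=1500, total_ram_gb=8)
    print(f"Qwen3 {size}: at least {share:.0%} of 8 GB")
```

If the share already looks uncomfortable on paper, it will feel worse in practice, since your browser and other apps need memory too.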
On Apple Silicon, Whisper runs through MLX at roughly 8x the speed of CPU processing. Windows and Linux machines with a CUDA-capable GPU also perform well.
If you are on a CPU-only setup on any platform, expect noticeably slower transcription times, especially with larger Whisper models. The tool still works. It just takes longer per recording.
For recordings under about five seconds, Whisper’s automatic language detection can be unreliable. If you are dictating very short phrases and noticing language errors, set a default language in Settings → Captures → Transcription → Language.
Voicebox saves every audio capture locally alongside the transcript. This is useful for reviewing past dictations, but it adds up over time.
If you dictate heavily, clear old captures periodically so your disk does not quietly fill up in the background.
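If you want a number before you start clearing things out, a small report script can help. The sketch below lists files older than a given age in a folder and totals their size. It never deletes anything. The folder path is a placeholder: where Voicebox stores captures on your system is not covered here, so check the app before pointing this at anything, and do the actual cleanup from the Captures tab.

```python
import time
from pathlib import Path
from typing import Optional


def old_captures(folder: Path, max_age_days: int, now: Optional[float] = None):
    """Return (files older than `max_age_days`, their combined size in bytes).

    Report only: nothing is modified or removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    old = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
    total_bytes = sum(p.stat().st_size for p in old)
    return old, total_bytes


if __name__ == "__main__":
    # Placeholder path, not Voicebox's real storage location.
    folder = Path("path/to/your/captures")
    if folder.is_dir():
        files, total = old_captures(folder, max_age_days=30)
        print(f"{len(files)} files older than 30 days, {total / 1e6:.1f} MB")
```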
Duplicate paste behavior: In some setups, the transcribed text may paste twice into the same field. This is a known edge case rather than a systematic issue. If it happens, update Voicebox to the latest version from voicebox.sh and check the GitHub issues page for any active fix. The developers use that tracker actively.
The best way to start with Voicebox is not to replace all your typing on day one.
Build it into your workflow in three stages. Only move to the next stage after the previous one works cleanly with your voice and your hardware.
Stage 1: Open a plain text editor. Hold your chord, say one clear sentence, release. Check that the text appears and looks correct.
If nothing pastes on macOS, check the Accessibility permission in System Settings. If the transcript appears in the Captures tab but does not paste into the text field, that is the issue to fix first.
Stage 2: Hold the chord and dictate a short paragraph that includes a product name, a number, and at least one technical term or abbreviation. Check whether the refinement preserved those details correctly.
If technical terms are getting mangled, switch from Qwen3 0.6B to 1.7B. The larger model handles technical vocabulary more reliably.
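Checking by eye works, but if you repeat this test across models, a tiny helper makes the comparison less subjective. This is a sketch with made-up names (`missing_terms` and the sample sentences are hypothetical): paste the refined text into it along with the exact terms you spoke, and it reports which ones did not survive. Matching is case-sensitive on purpose, since a changed capital letter in a product name or abbreviation counts as mangled.

```python
def missing_terms(pasted_text: str, expected_terms: list) -> list:
    """Return the expected terms that do not appear verbatim in the text."""
    return [term for term in expected_terms if term not in pasted_text]


expected = ["Voicebox", "0.5.0", "MLX"]

good = "Please test Voicebox 0.5.0 with the MLX backend."
bad = "Please test voice box 0.5.0 with the ML X backend."

print(missing_terms(good, expected))
print(missing_terms(bad, expected))
```

An empty list means every term came through intact. Anything left in the list is a reason to try the next model size up.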
Stage 3: Open your preferred AI tool. Dictate a full, structured prompt with context, goal, format, tone, and at least one constraint. Release, then read the pasted result before sending.
This is the test that tells you whether the workflow actually saves you time with your real content.
Here is a prompt structure that works well when spoken:
Dictate this as your Stage 3 test into any AI tool: "I want you to write a beginner-friendly explanation of [topic]. Assume the reader has never heard of it. Include one real-world example. Keep each paragraph to two sentences maximum. Use simple words throughout. End with one practical tip the reader can use today."
That is about 20 seconds of speech. Once that test passes, you can start replacing typed prompts with dictated ones. Pick one workflow to start with. Build the habit there first, not everywhere at once.
The clearest sign that voice dictation is working is when you finish speaking a prompt and realize you said more than you would have typed. That is when the bottleneck is gone.
For readers who want to go deeper into AI workflow setup, the Claude Setup Guide covers how to configure Claude for daily use, and the Best AI Research Tools guide covers tools that pair well with faster input workflows like this one.
Voicebox Captures is worth installing if you regularly write AI prompts, send emails, or draft content during your workday. The core workflow is solid: hold the chord, speak, release, get refined text in your active app.
The Qwen3 refinement cleans up spoken language well. On Apple Silicon, the speed feels close to native. On Windows and Linux with a capable GPU, it is also strong. The fact that everything runs locally, with no account and no cloud upload, is a real practical advantage for anyone who dictates anything sensitive.
Where to be careful: if your machine has limited RAM, start with the smallest models and test before committing. CPU-only setups will be slower, especially with larger Whisper models. And if you have never used push-to-talk dictation before, give yourself a few days to adjust to the rhythm. The first session often feels awkward. The third session usually does not.
Use it if: you write detailed AI prompts, long emails, or regular documents and have noticed that typing slows your ideas down. Available now on macOS, Windows, and Linux.
Wait if: you have under 8 GB of RAM and have not tested the hardware limits first, or your workflow is mostly short responses where typing is already fast enough.
The tool is free. Setup takes under ten minutes. Try the three test stages before deciding. That is all the commitment it asks for upfront.
Run these three tests before building dictation into your daily workflow: a short sentence, a technical paragraph, and a full AI prompt. All three passing means Voicebox is working for your voice and your hardware.