extract — Transcription + Speaker Diarization¶
Transcribes audio with word-level timestamps and labels each word with the speaker who said it. Powered by WhisperX: Whisper transcription → wav2vec2 forced alignment → pyannote speaker diarization.
Supported audio formats: .wav, .mp3, .ogg, .flac, .m4a, .webm
CLI¶
Options¶
| Flag | Default | Description |
|---|---|---|
| `--hf-token TOKEN` | (required) | HuggingFace access token |
| `--model SIZE` | `large-v3` | Whisper model size |
| `--device` | `auto` | `auto`, `cpu`, or `cuda` |
| `--compute-type` | `float16` | `float16` (GPU), `float32` (CPU), `int8` |
| `--num-speakers N` | — | Exact speaker count (best accuracy when known) |
| `--min-speakers N` | `1` | Minimum expected speakers |
| `--max-speakers N` | — | Maximum expected speakers |
| `--batch-size N` | `16` | Transcription batch size (reduce if out of memory) |
| `--language CODE` | — | Language code, e.g. `en`, `fr` (auto-detected if omitted) |
| `--verbose` | — | Enable verbose logging |
Examples¶
```bash
# Basic (CPU)
speech-mine extract interview.mp3 output.csv \
  --hf-token YOUR_TOKEN \
  --compute-type float32
```

```bash
# 2-person interview with known speaker count
speech-mine extract interview.wav output.csv \
  --hf-token YOUR_TOKEN \
  --num-speakers 2 \
  --compute-type float32
```

```bash
# GPU with best accuracy model
speech-mine extract meeting.wav output.csv \
  --hf-token YOUR_TOKEN \
  --model large-v3 \
  --device cuda \
  --compute-type float16 \
  --num-speakers 4
```

```bash
# Speaker range when count is unknown
speech-mine extract conference.wav output.csv \
  --hf-token YOUR_TOKEN \
  --min-speakers 2 \
  --max-speakers 8 \
  --compute-type float32
```

```bash
# Known language — skips auto-detection, slightly faster
speech-mine extract interview.wav output.csv \
  --hf-token YOUR_TOKEN \
  --language en \
  --compute-type float32
```
Warning

Always use `--compute-type float32` when running on CPU. The default (`float16`) requires a GPU and will raise an error on CPU.
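In scripts, the safe compute type can be derived from the resolved device. A minimal sketch (the `pick_compute_type` helper is illustrative, not part of speech-mine):

```python
def pick_compute_type(device: str) -> str:
    """Return a compute type that will not fail on the given device."""
    # float16 kernels are only available on CUDA GPUs; anything else
    # (cpu, or an unresolved "auto") should fall back to float32.
    return "float16" if device == "cuda" else "float32"

print(pick_compute_type("cpu"))   # float32
print(pick_compute_type("cuda"))  # float16
```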
Library¶
```python
from speech_mine.diarizer.processor import SpeechDiarizationProcessor

processor = SpeechDiarizationProcessor(
    hf_token="YOUR_TOKEN",
    num_speakers=2,
    whisper_model_size="large-v3",
    language="en",  # optional
    batch_size=16,  # reduce if OOM
)

# Full pipeline in one call
processor.process_audio_file("interview.mp3", "output.csv")
```
Individual pipeline steps¶
```python
# Step 1: Transcribe
audio, result = processor.transcribe_audio("interview.mp3")

# Step 2: Forced alignment (word-level timestamps via wav2vec2)
result = processor.align(audio, result)

# Step 3: Speaker diarization + word assignment
result = processor.diarize(audio, result)

# Step 4: Save to CSV
processor.save_to_csv(result, "output.csv", {"audio_file": "interview.mp3", ...})
```
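The one-call `process_audio_file` method shown above can also be looped over a directory of recordings. A hypothetical batch driver (the `process_directory` helper is illustrative, not part of speech-mine; it accepts any object with that method):

```python
from pathlib import Path

# Extensions the extract command supports, per the docs above.
AUDIO_EXTS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".webm"}

def process_directory(processor, audio_dir: str, out_dir: str) -> list:
    """Run the full pipeline on every supported file, one CSV per input."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(audio_dir).iterdir()):
        if path.suffix.lower() in AUDIO_EXTS:
            csv_path = out / (path.stem + ".csv")
            processor.process_audio_file(str(path), str(csv_path))
            written.append(csv_path)
    return written
```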
Output¶
Two files are written:
- `output.csv` — segment and word-level transcript data (see example)
- `output_metadata.json` — language, duration, speaker list, processing info (see example)
See Output Format for full column/field reference.
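As a quick sanity check on the CSV, you can tally transcribed words per speaker. A sketch only — the `speaker` and `word` column names here are assumptions; consult the Output Format reference for the actual schema:

```python
import csv
from collections import defaultdict

def words_per_speaker(csv_path: str, speaker_col: str = "speaker",
                      word_col: str = "word") -> dict:
    """Count non-empty word rows per speaker label (assumed column names)."""
    counts = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get(word_col):  # skip segment-only / empty rows
                counts[row[speaker_col]] += 1
    return dict(counts)
```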
Speaker Count Tips¶
Specifying `--num-speakers` when you know the exact count improves diarization accuracy.
| Parameter | When to use |
|---|---|
| `--num-speakers N` | You know exactly how many people speak |
| `--min-speakers N` | You know there are at least N speakers |
| `--max-speakers N` | You want to cap false speaker detection |
See Model Options for Whisper model and compute type guidance.