# extract — Transcription + Speaker Diarization

Transcribes audio and labels each segment with the speaker who said it. Uses faster-whisper for transcription and pyannote for speaker diarization.

Supported audio formats: `.wav`, `.mp3`, `.ogg`, `.flac`
## CLI

### Options

| Flag | Default | Description |
|---|---|---|
| `--hf-token TOKEN` | (required) | HuggingFace access token |
| `--model SIZE` | `large-v3` | Whisper model size |
| `--device` | `auto` | `auto`, `cpu`, or `cuda` |
| `--compute-type` | `float16` | `float16` (GPU), `float32` (CPU), `int8` |
| `--num-speakers N` | — | Exact speaker count (best accuracy when known) |
| `--min-speakers N` | `1` | Minimum expected speakers |
| `--max-speakers N` | — | Maximum expected speakers |
| `--verbose` | — | Enable verbose logging |
### Examples

```bash
# Basic (CPU)
speech-mine extract interview.mp3 output.csv \
    --hf-token YOUR_TOKEN \
    --compute-type float32
```

```bash
# 2-person interview with known speaker count
speech-mine extract interview.wav output.csv \
    --hf-token YOUR_TOKEN \
    --num-speakers 2 \
    --compute-type float32
```

```bash
# GPU with best accuracy model
speech-mine extract meeting.wav output.csv \
    --hf-token YOUR_TOKEN \
    --model large-v3 \
    --device cuda \
    --compute-type float16 \
    --num-speakers 4
```

```bash
# Speaker range when count is unknown
speech-mine extract conference.wav output.csv \
    --hf-token YOUR_TOKEN \
    --min-speakers 2 \
    --max-speakers 8 \
    --compute-type float32
```
> **Warning:** Always use `--compute-type float32` when running on CPU. The default (`float16`) requires a GPU and will raise an error on CPU.
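If you script around the CLI, one way to pick the compute type automatically is a sketch like the following. It uses the presence of the `nvidia-smi` binary as a heuristic for a usable CUDA GPU; this helper is ours, not part of speech-mine:

```python
import shutil

# Heuristic: only default to float16 when NVIDIA tooling is visible.
# This mirrors the warning above; it is not part of the speech-mine CLI.
compute_type = "float16" if shutil.which("nvidia-smi") else "float32"
print(compute_type)
```

The resulting value can then be passed as `--compute-type` when invoking `speech-mine extract` from a script.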
## Library

```python
from speech_mine.diarizer.processor import SpeechDiarizationProcessor

processor = SpeechDiarizationProcessor(
    hf_token="YOUR_TOKEN",
    num_speakers=2,
    whisper_model_size="large-v3",
)

# Full pipeline in one call
processor.process_audio_file("interview.mp3", "output.csv")
```
### Individual pipeline steps

```python
# Step 1: Transcribe
segments, info = processor.transcribe_audio("interview.mp3")

# Step 2: Diarize
diarization = processor.perform_speaker_diarization("interview.mp3")

# Step 3: Align
aligned = processor.align_transcription_with_speakers(segments, diarization)

# Step 4: Save
processor.save_to_csv(aligned, "output.csv", info)
```
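Chained together, the four steps do the same work as the single `process_audio_file` call. A minimal sketch (the `run_pipeline` helper name is ours, not part of the library):

```python
def run_pipeline(processor, audio_path: str, csv_path: str) -> None:
    """Run transcribe -> diarize -> align -> save, mirroring the steps above."""
    segments, info = processor.transcribe_audio(audio_path)
    diarization = processor.perform_speaker_diarization(audio_path)
    aligned = processor.align_transcription_with_speakers(segments, diarization)
    processor.save_to_csv(aligned, csv_path, info)
```

Running the steps individually like this is useful when you want to inspect or post-process intermediate results (for example, the raw diarization) before saving.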
## Output

Two files are written:

- `output.csv` — segment- and word-level transcript data (see example)
- `output_metadata.json` — language, duration, speaker list, processing info (see example)

See Output Format for full column/field reference.
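To inspect the metadata sidecar from Python, something like this works; exact field names are documented on the Output Format page, so none are hard-coded here:

```python
import json
from pathlib import Path

# Sketch: load the metadata sidecar if an extract run has produced it.
meta_path = Path("output_metadata.json")
meta = json.loads(meta_path.read_text(encoding="utf-8")) if meta_path.exists() else {}
print(sorted(meta))  # top-level field names (language, duration, speakers, ...)
```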
## Speaker Count Tips

Specifying `--num-speakers` when you know the exact count improves diarization accuracy.

| Parameter | When to use |
|---|---|
| `--num-speakers N` | You know exactly how many people speak |
| `--min-speakers N` | You know there are at least N speakers |
| `--max-speakers N` | You want to cap false speaker detection |
See Model Options for Whisper model and compute-type guidance.