extract — Transcription + Speaker Diarization¶

extract diagram

Transcribes audio with word-level timestamps and labels each word with the speaker who said it. Powered by WhisperX: Whisper transcription → wav2vec2 forced alignment → pyannote speaker diarization.

Supported audio formats: .wav, .mp3, .ogg, .flac, .m4a, .webm

CLI¶

speech-mine extract <audio> <output.csv> --hf-token TOKEN [options]

Options¶

Flag	Default	Description
`--hf-token TOKEN`	(required)	HuggingFace access token
`--model SIZE`	`large-v3`	Whisper model size
`--device`	`auto`	`auto`, `cpu`, or `cuda`
`--compute-type`	`float16`	`float16` (GPU), `float32` (CPU), `int8`
`--num-speakers N`	—	Exact speaker count (best accuracy when known)
`--min-speakers N`	`1`	Minimum expected speakers
`--max-speakers N`	—	Maximum expected speakers
`--batch-size N`	`16`	Transcription batch size (reduce if out of memory)
`--language CODE`	—	Language code e.g. `en`, `fr` (auto-detected if omitted)
`--verbose`	—	Enable verbose logging

Examples¶

# Basic (CPU)
speech-mine extract interview.mp3 output.csv \
  --hf-token YOUR_TOKEN \
  --compute-type float32

# 2-person interview with known speaker count
speech-mine extract interview.wav output.csv \
  --hf-token YOUR_TOKEN \
  --num-speakers 2 \
  --compute-type float32

# GPU with best accuracy model
speech-mine extract meeting.wav output.csv \
  --hf-token YOUR_TOKEN \
  --model large-v3 \
  --device cuda \
  --compute-type float16 \
  --num-speakers 4

# Speaker range when count is unknown
speech-mine extract conference.wav output.csv \
  --hf-token YOUR_TOKEN \
  --min-speakers 2 \
  --max-speakers 8 \
  --compute-type float32

# Known language — skips auto-detection, slightly faster
speech-mine extract interview.wav output.csv \
  --hf-token YOUR_TOKEN \
  --language en \
  --compute-type float32

Warning

Always use --compute-type float32 when running on CPU. The default (float16) requires a GPU and will raise an error on CPU.

Library¶

from speech_mine.diarizer.processor import SpeechDiarizationProcessor

processor = SpeechDiarizationProcessor(
    hf_token="YOUR_TOKEN",
    num_speakers=2,
    whisper_model_size="large-v3",
    language="en",   # optional
    batch_size=16,   # reduce if OOM
)

# Full pipeline in one call
processor.process_audio_file("interview.mp3", "output.csv")

Individual pipeline steps¶

# Step 1: Transcribe
audio, result = processor.transcribe_audio("interview.mp3")

# Step 2: Forced alignment (word-level timestamps via wav2vec2)
result = processor.align(audio, result)

# Step 3: Speaker diarization + word assignment
result = processor.diarize(audio, result)

# Step 4: Save to CSV
processor.save_to_csv(result, "output.csv", {"audio_file": "interview.mp3", ...})

Output¶

Two files are written:

output.csv — segment and word-level transcript data (see example)
output_metadata.json — language, duration, speaker list, processing info (see example)

See Output Format for full column/field reference.

Speaker Count Tips¶

Specifying --num-speakers when you know the exact count improves diarization accuracy.

Parameter	When to use
`--num-speakers N`	You know exactly how many people speak
`--min-speakers N`	You know there are at least N speakers
`--max-speakers N`	You want to cap false speaker detection

See Model Options for Whisper model and compute type guidance.