
extract — Transcription + Speaker Diarization

[Figure: extract diagram]

Transcribes audio and labels each segment with the speaker who said it. Uses faster-whisper for transcription and pyannote for speaker diarization.

Supported audio formats: .wav, .mp3, .ogg, .flac
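When collecting files for a batch run, it can help to filter by the extensions listed above before invoking the tool. The sketch below is a minimal illustration; the helper names are hypothetical, and treating extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed as supported on this page.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac"}


def is_supported_audio(path):
    """Return True if the file extension is one of the supported formats.

    Lowercasing the suffix is an assumption; the page does not say
    whether upper-case extensions are accepted.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths):
    """Keep only the paths whose extension is supported, preserving order."""
    return [p for p in paths if is_supported_audio(p)]
```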

CLI

speech-mine extract <audio> <output.csv> --hf-token TOKEN [options]

Options

Flag                Default      Description
--hf-token TOKEN    (required)   HuggingFace access token
--model SIZE        large-v3     Whisper model size
--device            auto         auto, cpu, or cuda
--compute-type      float16      float16 (GPU), float32 (CPU), int8
--num-speakers N                 Exact speaker count (best accuracy when known)
--min-speakers N    1            Minimum expected speakers
--max-speakers N                 Maximum expected speakers
--verbose                        Enable verbose logging

Examples

# Basic (CPU)
speech-mine extract interview.mp3 output.csv \
  --hf-token YOUR_TOKEN \
  --compute-type float32

# 2-person interview with known speaker count
speech-mine extract interview.wav output.csv \
  --hf-token YOUR_TOKEN \
  --num-speakers 2 \
  --compute-type float32

# GPU with best accuracy model
speech-mine extract meeting.wav output.csv \
  --hf-token YOUR_TOKEN \
  --model large-v3 \
  --device cuda \
  --compute-type float16 \
  --num-speakers 4

# Speaker range when count is unknown
speech-mine extract conference.wav output.csv \
  --hf-token YOUR_TOKEN \
  --min-speakers 2 \
  --max-speakers 8 \
  --compute-type float32

Warning

Always use --compute-type float32 when running on CPU. The default (float16) requires a GPU and will raise an error on CPU.
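One way to avoid this error in scripts is to pick the compute type from the device before building the command. The sketch below is only an illustration: the function names are hypothetical, and the device check assumes PyTorch may be importable in the environment (it falls back to CPU when it is not).

```python
def pick_compute_type(device):
    """Map a device name to the compute type recommended on this page."""
    if device == "cuda":
        return "float16"  # the default, which requires a GPU
    if device == "cpu":
        return "float32"  # float16 raises an error on CPU
    raise ValueError(f"expected 'cpu' or 'cuda', got {device!r}")


def detect_device():
    """Return 'cuda' if a GPU is visible to PyTorch, else 'cpu'.

    Assumption: PyTorch is used for detection. If it is not installed,
    this sketch conservatively reports 'cpu'.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
compute_type = pick_compute_type(device)
```

The resulting values can then be passed as --device and --compute-type.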

Library

from speech_mine.diarizer.processor import SpeechDiarizationProcessor

processor = SpeechDiarizationProcessor(
    hf_token="YOUR_TOKEN",
    num_speakers=2,
    whisper_model_size="large-v3",
)

# Full pipeline in one call
processor.process_audio_file("interview.mp3", "output.csv")

Individual pipeline steps

# Step 1: Transcribe
segments, info = processor.transcribe_audio("interview.mp3")

# Step 2: Diarize
diarization = processor.perform_speaker_diarization("interview.mp3")

# Step 3: Align
aligned = processor.align_transcription_with_speakers(segments, diarization)

# Step 4: Save
processor.save_to_csv(aligned, "output.csv", info)
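To give an intuition for what the align step does, the sketch below labels each transcribed segment with the speaker turn that overlaps it the most in time. This is a common approach shown on plain tuples, not necessarily how speech-mine implements align_transcription_with_speakers; the function names and data shapes here are hypothetical.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the time interval shared by [a_start, a_end] and [b_start, b_end]."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text) tuples (illustrative shape)
    turns:    list of (start, end, speaker) tuples (illustrative shape)
    Returns a list of (start, end, speaker, text); speaker is None when
    no turn overlaps the segment.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap_seconds(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled


segments = [(0.0, 2.0, "Hello"), (2.0, 5.0, "Hi there")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
```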

Output

Two files are written:

  • output.csv — segment and word-level transcript data (see example)
  • output_metadata.json — language, duration, speaker list, processing info (see example)

See Output Format for full column/field reference.
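If a script needs to locate the metadata file for a given CSV, the naming shown above (output.csv alongside output_metadata.json) can be reproduced as follows. This is a sketch that assumes the same stem-plus-suffix pattern holds for other output names; the helper name is hypothetical.

```python
from pathlib import Path


def metadata_path_for(csv_path):
    """Derive the metadata JSON path that sits next to a CSV output.

    Assumption: the pattern "<stem>_metadata.json" seen for output.csv
    applies to any CSV name.
    """
    csv_path = Path(csv_path)
    return csv_path.with_name(f"{csv_path.stem}_metadata.json")
```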

Speaker Count Tips

Specifying --num-speakers when you know the exact count improves diarization accuracy.

Parameter           When to use
--num-speakers N    You know exactly how many people speak
--min-speakers N    You know there are at least N speakers
--max-speakers N    You want to cap false speaker detection
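The guidance above can be folded into a small helper that builds the speaker-related flags for a command line. This is a sketch with a hypothetical function name. Preferring the exact count over a range when both are given is a design choice made here, not documented CLI behavior.

```python
def speaker_args(num_speakers=None, min_speakers=None, max_speakers=None):
    """Build the speaker-count flags for speech-mine extract.

    An exact count takes priority (a choice made in this sketch);
    otherwise whichever range bounds were given are passed through.
    """
    if num_speakers is not None:
        return ["--num-speakers", str(num_speakers)]
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")
    args = []
    if min_speakers is not None:
        args += ["--min-speakers", str(min_speakers)]
    if max_speakers is not None:
        args += ["--max-speakers", str(max_speakers)]
    return args
```

For example, speaker_args(min_speakers=2, max_speakers=8) yields the flags used in the conference example above.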

See Model Options for whisper model and compute type guidance.