search — Fuzzy Transcript Search
Searches a transcript CSV for words or phrases using fuzzy matching. Returns ranked matches with timestamps, speaker context, and similarity scores.
CLI
Options
| Flag | Default | Description |
|---|---|---|
| --similarity-range MIN MAX | 0.0 1.0 | Filter results by similarity score |
| --top-k N | 10 | Maximum results to return |
| --output-type | utterance | utterance or timestamp |
| --save-path FILE | — | Save results as JSON (default: stdout) |
| --pretty | — | Display results in formatted, colored output |
Examples
# Search and print JSON to stdout
speech-mine search "childhood abuse" output.csv
# Formatted colored output
speech-mine search "childhood abuse" output.csv --pretty
# High-confidence matches, save to file
speech-mine search "radio career" output.csv output_metadata.json \
    --similarity-range 0.8 1.0 \
    --top-k 5 \
    --output-type timestamp \
    --save-path results.json
Output types
utterance (default) — Returns the utterance number, word positions within the utterance, and full segment context:
{
  "similarity_score": 0.923,
  "utterance_number": 14,
  "matched_text": "radio career",
  "time_span": { "start": 42.1, "end": 43.0, "duration": 0.9 },
  "context": {
    "speaker": "SPEAKER_00",
    "full_segment_text": "Sure, I started out in radio..."
  }
}
timestamp — Returns a time window with per-word start/end times:
{
  "similarity_score": 0.923,
  "matched_text": "radio career",
  "time_window": { "start_time": 42.1, "end_time": 43.0, "duration": 0.9 },
  "word_details": [
    { "word": "radio", "start": 42.1, "end": 42.5, "confidence": 0.98 },
    { "word": "career", "start": 42.6, "end": 43.0, "confidence": 0.95 }
  ]
}
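As a rough illustration of consuming this output, the sketch below parses one timestamp-type result and prints a one-line summary. It relies only on the field names shown in the example above. The helpers `format_clock` and `summarize` are hypothetical and not part of speech-mine, and the overall layout of a saved results file (single object versus a list of objects) is not specified here, so the sketch works on a single result object.

```python
import json

# One result in the "timestamp" shape shown above.
result_json = """
{
  "similarity_score": 0.923,
  "matched_text": "radio career",
  "time_window": { "start_time": 42.1, "end_time": 43.0, "duration": 0.9 },
  "word_details": [
    { "word": "radio", "start": 42.1, "end": 42.5, "confidence": 0.98 },
    { "word": "career", "start": 42.6, "end": 43.0, "confidence": 0.95 }
  ]
}
"""


def format_clock(seconds):
    """Render seconds as MM:SS.s (hypothetical helper, not part of speech-mine)."""
    # Round first so that values such as 59.96 roll over to 01:00.0.
    minutes, secs = divmod(round(seconds, 1), 60)
    return f"{int(minutes):02d}:{secs:04.1f}"


def summarize(result):
    """Return a one-line summary of a timestamp-type result (hypothetical helper)."""
    window = result["time_window"]
    confidences = [w["confidence"] for w in result["word_details"]]
    mean_conf = sum(confidences) / len(confidences)
    return (
        f"{format_clock(window['start_time'])}-{format_clock(window['end_time'])} "
        f"{result['matched_text']!r} "
        f"(score {result['similarity_score']:.2f}, "
        f"mean word confidence {mean_conf:.3f})"
    )


print(summarize(json.loads(result_json)))
```

The per-word confidence values make it possible to flag matches whose score is high but whose underlying transcription is uncertain.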
Library
from speech_mine.access import TranscriptionAccessTool
from speech_mine.fuzz import speech_fuzzy_match

# Load transcript
tool = TranscriptionAccessTool()
tool.load_from_files("output.csv", "output_metadata.json")

# Search
matches = speech_fuzzy_match(
    word_list=tool.words,
    query="radio career",
    similarity_range=(0.8, 1.0),
    top_k=5,
)

# matches: list of (start_index, end_index, similarity_score)
for start_idx, end_idx, score in matches:
    words = tool.words[start_idx:end_idx + 1]
    print(f"{score:.2f}: {' '.join(w.word for w in words)}")
How matching works
The fuzzy matcher uses rapidfuzz to compare your query against sliding windows of words in the transcript. Window sizes of query_length - 1, query_length, and query_length + 1 words are all tested. Overlapping matches are deduplicated, keeping the highest-scoring one.
Similarity scores range from 0.0 (no match) to 1.0 (exact match).
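The procedure described above can be sketched roughly as follows. This is not the speech-mine implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for rapidfuzz so that it runs without extra dependencies, and the function name `sliding_window_matches` is hypothetical. Lowercasing the text, filtering by range before deduplication, and the greedy highest-score-first deduplication are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def sliding_window_matches(words, query, similarity_range=(0.0, 1.0), top_k=10):
    """Return (start_index, end_index, score) tuples, best first.

    Illustrative only: difflib stands in for rapidfuzz, and the details
    of normalization and deduplication are assumptions.
    """
    query_norm = " ".join(query.lower().split())
    n = len(query_norm.split())
    lo, hi = similarity_range
    candidates = []
    # Try windows one word shorter than, equal to, and one word longer than the query.
    for size in (n - 1, n, n + 1):
        if size < 1 or size > len(words):
            continue
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size]).lower()
            # ratio() is already in the 0.0 to 1.0 range.
            score = SequenceMatcher(None, query_norm, window).ratio()
            if lo <= score <= hi:
                candidates.append((start, start + size - 1, score))
    # Greedy deduplication: visit the best scores first and drop any
    # candidate whose word span overlaps one that was already kept.
    candidates.sort(key=lambda c: c[2], reverse=True)
    kept = []
    for start, end, score in candidates:
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return kept[:top_k]


words = "sure i started out in radio my radio career began in 1985".split()
for start, end, score in sliding_window_matches(words, "radio career", (0.8, 1.0), top_k=3):
    print(f"{score:.2f}: {' '.join(words[start:end + 1])}")
```

Testing the neighboring window sizes lets a query match text where a word was split, merged, or dropped by the transcriber, while deduplication keeps a single best span instead of several near-identical ones.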