Toolsmith: HTK ↔ CSLU Label File Converter — How to Build It
Goal
Build a small, reliable command-line tool that converts label/transcription files between HTK and CSLU formats (both directions), handles common edge cases, and supports batch processing.
Formats (assumed)
- HTK label: plain text with one “start end label” entry per line; times are integers in units of 100 ns.
- CSLU label: typically “label start end” or a similar variant; common CSLU formats include HTK-like ones and ones with integer frame indices.
Note: There are multiple variants in the wild; choose one variant and document it.
Design decisions (reasonable defaults)
- Time unit: treat HTK times as 100ns ticks (the HTK standard) and convert to/from seconds internally. Allow a CLI flag to override (e.g., --units [100ns|ms|s|frames]).
- Input parsing: robust whitespace handling, ignore comment lines starting with #, skip empty lines.
- Overlap/adjacency: optionally merge adjacent segments with identical labels (--merge-identical) and optionally clip/resolve overlaps (--resolve-overlaps [trim|split|error]).
- Output options: set precision for times, choose zero-based vs one-based frame indexing, preserve label casing or force lower/upper.
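The time-unit decision above can be sketched as a pair of helpers that convert to and from seconds. This is a minimal sketch, not a definitive implementation: the names `to_seconds` and `from_seconds` and the default frame rate of 100 frames per second are assumptions made for illustration.

```python
# Ticks per second for each supported time unit (hypothetical table).
UNITS_PER_SECOND = {"100ns": 10_000_000, "ms": 1_000, "s": 1}


def to_seconds(value, units="100ns", frame_rate=100.0):
    """Convert a time in the given units to seconds.

    frame_rate (frames per second) is only used when units == "frames";
    the default of 100.0 is an assumption, not part of either format.
    """
    if units == "frames":
        return value / frame_rate
    return value / UNITS_PER_SECOND[units]


def from_seconds(seconds, units="100ns", frame_rate=100.0):
    """Convert seconds back to the given units, rounding integer units."""
    if units == "frames":
        return round(seconds * frame_rate)
    if units == "s":
        return seconds
    return round(seconds * UNITS_PER_SECOND[units])
```

Keeping seconds as the single internal representation means each format only needs one conversion in and one conversion out.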
Features
- Bidirectional conversion: htk2cslu, cslu2htk, or autodetect by file pattern/first token.
- Batch processing for directories, with pattern matching and recursion.
- Dry-run mode to show changes without writing.
- Unit tests for parsing and formatting edge cases.
- Logging with levels: info/warn/error.
- Small, dependency-light implementation (Python 3.10+, stdlib only).
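Batch processing and dry-run mode fit together naturally if the tool first builds a list of (source, destination) pairs and only then decides whether to write anything. The sketch below shows one way to do that with pathlib; the function name `plan_batch`, its parameters, and the default `.cslu` suffix are hypothetical.

```python
from pathlib import Path


def plan_batch(in_dir, out_dir, pattern="*.lab", recursive=False, new_suffix=".cslu"):
    """Return sorted (source, destination) path pairs without touching the disk.

    A dry run can simply print this plan; a real run iterates over it.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    matches = in_dir.rglob(pattern) if recursive else in_dir.glob(pattern)
    plan = []
    for src in sorted(matches):
        if src.is_file():
            # Mirror the input directory layout under out_dir.
            dst = (out_dir / src.relative_to(in_dir)).with_suffix(new_suffix)
            plan.append((src, dst))
    return plan
```

In dry-run mode the tool would log each pair and stop; otherwise it would create `dst.parent` and write the converted file.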
Example command-line interface
- convert --in file.lab --out file.cslu --from htk --to cslu --units 100ns
- convert *.lab --out-dir converted/ --merge-identical --resolve-overlaps trim
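The example invocations above could be wired up with argparse roughly as follows. This is a sketch under assumptions: the `dest` names (`in_path`, `src_format`, and so on) are invented here because `in` and `from` are Python keywords and cannot be used directly as attribute names.

```python
import argparse


def build_parser():
    """Build a parser covering the flags used in the example commands."""
    p = argparse.ArgumentParser(
        prog="convert",
        description="Convert label files between HTK and CSLU formats.",
    )
    p.add_argument("inputs", nargs="*", help="input files for batch mode")
    # 'in' and 'from' are Python keywords, so map them to other attribute names.
    p.add_argument("--in", dest="in_path", help="single input file")
    p.add_argument("--out", dest="out_path", help="single output file")
    p.add_argument("--out-dir", dest="out_dir", help="output directory for batch mode")
    p.add_argument("--from", dest="src_format", choices=["htk", "cslu"])
    p.add_argument("--to", dest="dst_format", choices=["htk", "cslu"])
    p.add_argument("--units", choices=["100ns", "ms", "s", "frames"], default="100ns")
    p.add_argument("--merge-identical", action="store_true")
    p.add_argument("--resolve-overlaps", choices=["trim", "split", "error"])
    return p
```

Calling `build_parser().parse_args()` in the script's entry point then yields a namespace such as `args.in_path` and `args.merge_identical`.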
Implementation sketch (Python)
- Use argparse for CLI, pathlib for paths, logging for logs.
- Parser functions: parse_htk_line(line, units) -> (start_s, end_s, label); parse_cslu_line(…)
- Formatter functions: format_htk(start_s, end_s, label, units) -> line; format_cslu(…)
- Processing pipeline: read file -> parse all lines -> optional merge/resolve -> write formatted lines.
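The optional merge/resolve stage of the pipeline can be sketched as two small functions over a list of segments. The `Segment` type and both function names are hypothetical, and the meaning of "trim" used here (cut the earlier segment short at the start of the next one) is one possible interpretation, not something either format prescribes.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def merge_identical(segments, tol=1e-9):
    """Merge consecutive segments that share a label and touch end-to-start."""
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if prev and prev.label == seg.label and abs(prev.end - seg.start) <= tol:
            merged[-1] = prev._replace(end=seg.end)
        else:
            merged.append(seg)
    return merged


def trim_overlaps(segments):
    """Resolve overlaps by trimming the earlier segment's end (assumed policy)."""
    result = []
    for seg in sorted(segments, key=lambda s: s.start):
        if result and seg.start < result[-1].end:
            result[-1] = result[-1]._replace(end=seg.start)
        result.append(seg)
    return result
```

Note that trimming can leave a zero-length segment when two segments start at the same time, so running validation afterwards is still worthwhile.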
Example parsing rules (concrete)
- HTK line regex: r'^\s*(\d+)\s+(\d+)\s+(.+?)\s*$' -> start and end as ints -> seconds = int / 1e7 for 100ns units, or scaled according to the units flag.
- CSLU line parsing: accept either “label start end” or “start end label”; autodetect by token types (if the first token is non-numeric → label-first).
- If a CSLU variant carries frame indices instead of times, treat start/end as ints and convert them using a frame-rate flag.
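The parsing rules above might look like this in code. It is a sketch under assumptions: `parse_htk_line` takes a `ticks_per_second` argument in place of the units flag, and `parse_cslu_line` returns times exactly as written in the file, leaving unit conversion to a later step.

```python
import re

HTK_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(.+?)\s*$")
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def parse_htk_line(line, ticks_per_second=10_000_000):
    """Parse 'start end label' with integer times; return times in seconds."""
    m = HTK_LINE.match(line)
    if m is None:
        raise ValueError(f"not an HTK label line: {line!r}")
    start, end, label = m.groups()
    return int(start) / ticks_per_second, int(end) / ticks_per_second, label


def parse_cslu_line(line):
    """Parse 'label start end' or 'start end label', autodetected by first token.

    Times are returned as floats in whatever unit the file uses.
    """
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError(f"expected at least 3 tokens: {line!r}")
    if NUMBER.fullmatch(tokens[0]):
        # Time-first variant: remaining tokens form the label.
        start, end, label_tokens = tokens[0], tokens[1], tokens[2:]
    else:
        # Label-first variant: the last two tokens are the times.
        label_tokens, start, end = tokens[:-2], tokens[-2], tokens[-1]
    return float(start), float(end), " ".join(label_tokens)
```

One limitation of this autodetection: a label-first line whose label is itself purely numeric would be read as time-first, so an explicit format flag remains useful.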
Edge cases & recommendations
- Validate timing: each segment's start must precede its end, and segments should appear in ascending time order; report or fix violations.
- Handle unknown/multiple whitespace-separated label tokens (join remaining tokens as label).
- Preserve UTF-8 encoding; handle BOM.
- Provide example input/output pairs in the repo README.
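Two of these recommendations are easy to show in code: reading UTF-8 input with or without a BOM, and checking segment timing. The helper names below are hypothetical. The `utf-8-sig` codec is part of the Python standard library and strips a leading BOM if one is present.

```python
def read_label_lines(path):
    """Yield non-empty, non-comment lines from a UTF-8 file, tolerating a BOM."""
    with open(path, encoding="utf-8-sig") as fh:
        for raw in fh:
            line = raw.strip()
            if line and not line.startswith("#"):
                yield line


def validate_segments(segments):
    """Return a list of human-readable problems; an empty list means valid.

    Each segment is a (start, end, label) tuple with times in one unit.
    """
    problems = []
    prev_start = None
    for i, (start, end, label) in enumerate(segments):
        if not start < end:
            problems.append(f"segment {i} ({label!r}): start {start} is not before end {end}")
        if prev_start is not None and start < prev_start:
            problems.append(f"segment {i} ({label!r}): starts before the previous segment")
        prev_start = start
    return problems
```

Returning a list of problems rather than raising lets the caller decide whether to log warnings, attempt a fix, or abort.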
Testing & validation
- Include unit tests for:
  - parsing HTK and both CSLU variants
  - conversion round-trip (htk -> cslu -> htk)
  - overlap resolution strategies
- Include a small test corpus and expected outputs.
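A round-trip test can be sketched with the standard unittest module. The two converters below are deliberately simplified stand-ins, and they assume a label-first CSLU variant with times in milliseconds; that variant is an assumption chosen for this example only.

```python
import unittest

TICKS_PER_MS = 10_000  # 100ns ticks per millisecond


def htk_to_cslu(line):
    """'start end label' in 100ns ticks -> 'label start end' in ms (assumed variant)."""
    start, end, label = line.strip().split(maxsplit=2)
    return f"{label} {int(start) / TICKS_PER_MS} {int(end) / TICKS_PER_MS}"


def cslu_to_htk(line):
    """Inverse of htk_to_cslu; the label may contain spaces."""
    label, start, end = line.strip().rsplit(maxsplit=2)
    return f"{round(float(start) * TICKS_PER_MS)} {round(float(end) * TICKS_PER_MS)} {label}"


class RoundTripTest(unittest.TestCase):
    def test_htk_cslu_htk(self):
        lines = [
            "0 2500000 sil",
            "2500000 7300000 ah",
            "7300000 9000000 two words",
        ]
        for line in lines:
            with self.subTest(line=line):
                self.assertEqual(cslu_to_htk(htk_to_cslu(line)), line)
```

Saved as a test module, this runs with `python -m unittest`. The same pattern extends to the small test corpus by looping over input files and their expected outputs.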
Deliverables checklist
- CLI tool script (convert.py)
- README with format assumptions and examples
- Unit tests
- Example files and CI config for linting/tests