🎙️ Audio Data Quality Toolkit for TTS/ASR Training Pipelines
Detect clipping, silence, noisy samples, duplicate clips, transcript mismatch, speaker imbalance, and synthetic-data artifacts in speech datasets.
Designed for TTS, ASR, voice-cloning, and synthetic speech evaluation workflows.
Lint your audio datasets before training. The toolkit runs training-readiness checks for TTS, ASR, and voice-cloning pipelines; duplicate detection, speaker balance, and ASR-based transcript alignment are on the roadmap. No GPU required: all checks run on CPU with numpy/scipy/librosa.
Unlike perceptual scoring tools such as NISQA, PESQ, or UTMOS, which answer "how good does this sound?", this toolkit answers "is this file ready for training?" by catching the data-engineering issues that silently degrade model quality.
Upload one audio clip and inspect training-readiness quality signals.
Upload multiple clips to generate a dataset-level QA report for TTS, ASR, voice-cloning, or synthetic speech pipelines.
Evaluate generated speech samples for clipping, silence, noise, duration anomalies, and transcript consistency.
What this tool checks
- Clipping: waveform peaks too close to maximum amplitude
- Silence: long leading, trailing, or internal silent regions
- Noise: low signal quality, background hum, hiss, or abnormal energy profile
- Transcript mismatch: audio duration inconsistent with the expected text length
- Speaker imbalance: a few speakers dominating the dataset (roadmap / metadata-dependent)
- Duplicates: repeated or near-identical clips (roadmap / fingerprinting-dependent)
- Synthetic artifacts: robotic, metallic, repeated, or degraded generated speech patterns
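As a rough illustration of the clipping and silence checks listed above, the sketch below uses only numpy. It is not the toolkit's implementation: the function names `clipping_ratio` and `edge_silence_seconds`, the 0.999 amplitude threshold, the 20 ms frame size, and the -40 dB silence floor are all assumptions chosen for the example, and the audio is assumed to be mono float samples scaled to [-1.0, 1.0].

```python
import numpy as np


def clipping_ratio(samples: np.ndarray, threshold: float = 0.999) -> float:
    """Fraction of samples whose absolute value is at or above `threshold`.

    Hypothetical helper: assumes float audio scaled to [-1.0, 1.0].
    """
    if samples.size == 0:
        return 0.0
    return float(np.mean(np.abs(samples) >= threshold))


def edge_silence_seconds(
    samples: np.ndarray,
    sample_rate: int,
    frame_ms: float = 20.0,
    silence_db: float = -40.0,
) -> tuple[float, float]:
    """Return (leading, trailing) silence in seconds from per-frame RMS energy.

    Hypothetical helper: a frame counts as silent when its RMS level is at or
    below `silence_db` (dB relative to full scale). A fully silent clip is
    reported as all leading silence.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = samples.size // frame_len
    if n_frames == 0:
        return 0.0, 0.0

    # Split the clip into non-overlapping frames and compute RMS per frame.
    frames = samples[: n_frames * frame_len].astype(np.float64)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    rms_db = 20.0 * np.log10(np.maximum(rms, 1e-10))

    frame_s = frame_len / sample_rate
    active = np.flatnonzero(rms_db > silence_db)
    if active.size == 0:
        return float(n_frames * frame_s), 0.0

    leading = float(active[0] * frame_s)
    trailing = float((n_frames - 1 - active[-1]) * frame_s)
    return leading, trailing
```

A pass/fail rule could then compare these numbers against limits that suit the dataset, for example rejecting clips where more than a small fraction of samples are clipped or where the leading silence exceeds a chosen duration.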
Why this matters
Data quality directly affects TTS/ASR model stability, pronunciation, speaker consistency, alignment, and long-form generation quality. This Space is designed as a practical QA dashboard for speech datasets used in training and evaluating voice AI systems.
Current checks
| # | Check | What It Catches | GPU? |
|---|---|---|---|
| 1 | SNR Estimation | Background noise, hum, hiss | No |
| 2 | Clipping Detection | Consecutive samples at max amplitude | No |
| 3 | Silence Analysis | Excessive leading, trailing, or internal silence | No |
| 4 | Sample Rate Validation | Non-standard or unexpected rates | No |
| 5 | Duration Bounds | Too short or too long clips | No |
| 6 | Loudness (LUFS) | Audio far from target loudness | No |
| 7 | Metallic Artifacts | Robotic/metallic TTS artifacts | No |
| 8 | Repetition Detection | Word/phrase loops via autocorrelation | No |
| 9 | Channel Issues | Stereo, silent channels, phase inversion | No |
| 10 | Upsampling Detection | Fake sample rates, e.g. 8 kHz upsampled to 22 kHz | No |
| 11 | Transcript Ratio | Misaligned transcripts using chars-per-second | No |
| 12 | Duplicate Detection | Near-duplicate files via fingerprinting | No |
| 13 | Transcript Alignment | Audio vs text mismatch with optional ASR | Optional |
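To make the upsampling check in the table more concrete, here is one possible approach, sketched with numpy only. A file that was recorded at 8 kHz and later resampled to a higher rate carries almost no energy above 4 kHz, so the sketch measures the share of spectral energy above the assumed original Nyquist frequency. The function names, the `assumed_source_rate` parameter, and the `max_ratio` threshold are assumptions made for this example and are not taken from the toolkit.

```python
import numpy as np


def high_band_energy_ratio(
    samples: np.ndarray, sample_rate: int, cutoff_hz: float
) -> float:
    """Share of spectral energy at or above `cutoff_hz`.

    Returns NaN for an all-zero (or empty) signal, where the ratio is undefined.
    """
    if samples.size == 0:
        return float("nan")
    power = np.abs(np.fft.rfft(samples.astype(np.float64))) ** 2
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)
    total = power.sum()
    if total <= 0.0:
        return float("nan")
    return float(power[freqs >= cutoff_hz].sum() / total)


def looks_upsampled(
    samples: np.ndarray,
    sample_rate: int,
    assumed_source_rate: int = 8000,  # hypothetical: the suspected original rate
    max_ratio: float = 1e-4,          # hypothetical threshold, tune per dataset
) -> bool:
    """Flag a clip whose content stops at the Nyquist of a lower sample rate."""
    cutoff_hz = assumed_source_rate / 2.0
    if cutoff_hz >= sample_rate / 2.0:
        # The file's own Nyquist is not above the cutoff, so nothing to test.
        return False
    ratio = high_band_energy_ratio(samples, sample_rate, cutoff_hz)
    # NaN (silent clip) compares as False, so silent clips are not flagged here.
    return ratio < max_ratio
```

Natural speech often has little energy in the upper band, so a threshold like this needs calibration on known-good recordings, and a flagged file is better treated as a candidate for review than as a confirmed fake sample rate.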
How is this different from NISQA / PESQ / DataSpeech?
| Tool | What it does | GPU | Output |
|---|---|---|---|
| NISQA | Perceptual MOS score | Yes | Quality score |
| PESQ | Reference-based quality score | No | Quality score |
| DataSpeech | Annotate datasets for Parler-TTS training | Yes | Natural-language descriptions |
| This toolkit | Pass/fail lint for training readiness | No | Report + clean manifest |
DataSpeech answers: "describe this audio's characteristics for TTS conditioning."
This toolkit answers: "should I include this file in my training set at all?"
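The include-or-exclude decision can be pictured as a filter that turns per-file check results into a clean manifest plus a list of rejections. The sketch below does this for a single check, the chars-per-second transcript ratio. The `Clip` dataclass, the function names, and the 5 to 25 characters-per-second bounds are hypothetical and chosen only for illustration; they do not describe the toolkit's actual API or defaults.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical manifest entry: one audio file and its transcript."""
    path: str
    duration_s: float
    transcript: str


def chars_per_second(clip: Clip) -> float:
    """Transcript characters per second of audio (infinite for empty audio)."""
    if clip.duration_s <= 0.0:
        return float("inf")
    return len(clip.transcript.strip()) / clip.duration_s


def split_manifest(
    clips: list[Clip],
    min_cps: float = 5.0,   # assumed lower bound, language-dependent
    max_cps: float = 25.0,  # assumed upper bound, language-dependent
) -> tuple[list[Clip], list[tuple[Clip, str]]]:
    """Split clips into (kept, rejected), attaching a reason to each rejection."""
    kept: list[Clip] = []
    rejected: list[tuple[Clip, str]] = []
    for clip in clips:
        cps = chars_per_second(clip)
        if min_cps <= cps <= max_cps:
            kept.append(clip)
        else:
            reason = f"chars/sec {cps:.1f} outside [{min_cps}, {max_cps}]"
            rejected.append((clip, reason))
    return kept, rejected
```

Keeping the rejection reason next to each dropped file makes the report auditable: a reviewer can see which rule removed a clip and adjust the bounds if a whole speaker or language is being filtered out.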
Install: `pip install audio-data-quality-toolkit`
Python API: `from audio_qa import check_file, check_directory, audit_hf_dataset`