Audio Transcription


We are building a custom Automated Speech Recognition (ASR) model using machine learning to transcribe audio recordings for applications such as medical and legal transcription.


Workflow:

1. Audio recording
2. Conversion to waveform
3. Input into the ASR model
4. Text output from the ASR model
5. Quality check and editing by a human transcriptionist
6. Final transcription
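The workflow above can be sketched end to end in code. All function names and return values below are hypothetical placeholders standing in for the real recording, model, and review steps:

```python
# Sketch of the transcription workflow; every function here is a
# hypothetical placeholder, not a real API.

def load_waveform(path):
    """Step 2: convert an audio recording into a waveform (placeholder)."""
    return [0.0, 0.1, -0.1]  # dummy samples

def asr_transcribe(waveform):
    """Steps 3-4: run the ASR model on a waveform (placeholder)."""
    return "patient presents with mild fever"

def human_review(draft):
    """Step 5: quality check and editing by a human transcriptionist (placeholder)."""
    return draft.capitalize() + "."

def transcription_pipeline(path):
    """Steps 1-6 chained together to produce the final transcription."""
    waveform = load_waveform(path)
    draft = asr_transcribe(waveform)
    return human_review(draft)

print(transcription_pipeline("visit_001.wav"))
```

The key design point is that the model's raw output is always routed through a human transcriptionist before it becomes the final deliverable.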

Expected impact: a 3x to 4x increase in transcription productivity, enabling much lower operational costs and a competitive edge over rivals.

Automated Speech Recognition (ASR) – Modeling Approach

Labelled raw voice data is split into three sets:

  • Training data (e.g. 80% split) – the ASR model is built using the training data
  • Cross-validation data (e.g. 10% split) – the ASR model is optimized using the cross-validation data
  • Test data (e.g. 10% split) – the ASR model's accuracy is evaluated on the held-out test data
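A minimal sketch of this 80/10/10 split, using only Python's standard library (the fixed seed and fraction values are illustrative assumptions):

```python
import random

def split_data(samples, train_frac=0.8, cv_frac=0.1, seed=0):
    """Shuffle labelled samples and split them into
    training / cross-validation / test sets."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_cv = int(n * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # remainder goes to the test set
    return train, cv, test

samples = list(range(100))          # stand-in for labelled utterances
train, cv, test = split_data(samples)
print(len(train), len(cv), len(test))  # 80 10 10
```

Keeping the test set untouched until the final evaluation is what makes the reported accuracy an honest estimate of real-world performance.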

Raw Voice Data (Model Input)

Random noise may be added to the training data to make the model more robust
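This augmentation step can be sketched as follows; the Gaussian noise level is an illustrative assumption, not a tuned value:

```python
import random

def add_noise(waveform, noise_std=0.005, seed=0):
    """Data augmentation: add small Gaussian noise to each sample so the
    acoustic model becomes more robust to varied recording conditions.
    noise_std is an illustrative placeholder value."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in waveform]

clean = [0.0, 0.5, -0.5, 0.25]      # stand-in waveform samples
noisy = add_noise(clean)
```

In practice the augmented copies are added alongside (not instead of) the clean training examples.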

Acoustic Model

(E.g. a deep bidirectional LSTM RNN trained using Connectionist Temporal Classification)

Feature Extraction

From speech frames (e.g. Mel-Frequency Cepstral Coefficients)
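A simplified sketch of computing MFCCs for a single speech frame, assuming NumPy is available; the frame length, filter count, and coefficient count are common illustrative defaults, not the project's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale over the FFT bins."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            if c > l:
                fb[i - 1, k] = (k - l) / (c - l)   # rising slope
        for k in range(c, r):
            if r > c:
                fb[i - 1, k] = (r - k) / (r - c)   # falling slope
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one frame: power spectrum -> mel filterbank -> log -> DCT-II."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_e

sr = 16000
t = np.arange(400) / sr               # one 25 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 440 * t)   # 440 Hz test tone
print(mfcc_frame(frame, sr).shape)    # (13,)
```

Each frame of raw audio is thus reduced to a short vector of coefficients, which is what the acoustic model actually consumes.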

Linguistic Model

Transcribed Text
(Model Output)
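The linguistic model's role can be illustrated with a toy bigram language model that rescores competing hypotheses from the acoustic model. The counts below are made-up illustrative values, not real corpus statistics:

```python
import math

# Toy bigram language model: picks the linguistically more plausible
# hypothesis among acoustically similar candidates.
# All counts are invented for illustration.
BIGRAM_COUNTS = {
    ("<s>", "the"): 50, ("the", "patient"): 30, ("patient", "recovered"): 10,
    ("<s>", "a"): 20, ("the", "patience"): 1,
}
UNIGRAM_COUNTS = {"<s>": 100, "the": 60, "patient": 30,
                  "patience": 2, "a": 20, "recovered": 10}
VOCAB = len(UNIGRAM_COUNTS)

def bigram_log_prob(sentence):
    """Add-one smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        num = BIGRAM_COUNTS.get((prev, cur), 0) + 1
        den = UNIGRAM_COUNTS.get(prev, 0) + VOCAB
        logp += math.log(num / den)
    return logp

# Two hypotheses that sound nearly identical to the acoustic model:
hypotheses = ["the patient recovered", "the patience recovered"]
best = max(hypotheses, key=bigram_log_prob)
print(best)  # the patient recovered
```

This is why the linguistic model matters for medical and legal transcription: it resolves homophone-like confusions the acoustic model alone cannot.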

Automated Speech Recognition (ASR) – Acoustic Modeling Candidates

  • Bidirectional RNN/LSTM
      – One candidate architecture has 4 bidirectional layers, each containing 320 memory cells
      – Trained with CTC at the frame level
  • Sequence-to-sequence networks
      – Mostly used in machine translation; input features are mapped to a fixed-size vector and then decoded
      – Candidate model has 3 BLSTM layers with 256 nodes in each direction
      – Decoder has 2 LSTM layers with 512 nodes
      – Trained using asynchronous stochastic gradient descent