Running the Baseline Recognizer

The software that processes lecture audio into a textual transcript consists of a series of scripts that marshal input files and parameters to a speech recognition engine.  Because the engine is data driven, its code seldom changes; improvements in performance and accuracy are achieved by refining the data it uses to perform its tasks.

There are two steps to produce the transcript.  The first creates an audio file in the correct format for speech recognition.  The second processes that audio file into the transcript.


The scripts require a specific folder hierarchy.  All paths are rooted here:


Unless a fully qualified path is specified, assume this is “home” for the paths that occur below.  For example, when I refer to the lectures folder without explicitly specifying its full path, I am referring to this folder:


The scripts require that each video is found in a child folder named for its lecture ordinal.  The parent is a folder named for the course these videos belong to.  The child folder is referred to as the “lecture” folder, and the parent folder as the “course” folder.

For example, in the path


The course is OCW-18.01-f07 and the lecture is L01. The video file itself is in the L01 folder.

So, in the lectures folder, create a folder for the course and, inside it, a folder for the lecture.  Drop the video file into the lecture folder.
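The folder setup described above can be sketched as a small shell helper. This is only an illustration: make_lecture_dir is a hypothetical function, not one of the project's scripts, and the "home" path is passed in as an argument rather than assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original scripts): create the
# course/lecture folder pair under <home>/lectures and print its path.
make_lecture_dir() {
    local home="$1" course="$2" lecture="$3"
    local dir="$home/lectures/$course/$lecture"
    # -p creates the course folder too if it does not exist yet
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example usage (SM_HOME is a placeholder for the "home" folder):
#   dir=$(make_lecture_dir "$SM_HOME" OCW-18.01-f07 L01)
#   cp OCW-18.01-f07-L01.mp4 "$dir/"
```

The terminal output later in this section suggests that the video file is expected to be named after the course and lecture (OCW-18.01-f07-L01.mp4), which is why the usage comment copies a file with that name.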

Create the Audio file

The audio file that is the input for speech recognition must be a .WAV file (16-bit PCM, 16 kHz, mono, which gives a bit rate of 256 kbit/s).  The script that creates this file is:


and is located in the scripts folder.

The script takes three parameters:

  • course folder name
  • lecture folder name
  • format of the video file, expressed as its file extension
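To show how the three parameters fit together, here is a rough sketch of the conversion they describe. The contents of create_wavefile.cmd are not shown in this document, so the function below is an assumption: wavefile_command is a hypothetical name, and the file naming (course-lecture.extension) and the ffmpeg settings (16 kHz, mono, pcm_s16le) are inferred from the terminal output of a successful run shown later in this section. It only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: the real create_wavefile.cmd is not reproduced here.
# Prints an ffmpeg command that would convert the lecture video into a
# 16 kHz, mono, 16-bit PCM WAV file next to it.
wavefile_command() {
    local lectures_dir="$1" course="$2" lecture="$3" ext="$4"
    # Input and output share the base name <course>-<lecture>
    local base="$lectures_dir/$course/$lecture/$course-$lecture"
    # -vn drops the video stream; -ar sets the sample rate; -ac the channels
    printf 'ffmpeg -i %s.%s -vn -acodec pcm_s16le -ar 16000 -ac 1 %s.wav\n' \
        "$base" "$ext" "$base"
}

wavefile_command /usr/users/mckinney/sm/lectures OCW-18.01-f07 L01 mp4
```

Printing the command instead of executing it keeps the sketch safe to try on a machine without ffmpeg or the lecture files. Note that it does not quote paths, so it assumes folder names without spaces.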

For example, to run the script from the lectures folder, execute this command:

../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4

Note that the path to the command, as well as the course and lecture folder names, are given relative to the folder where the command is executed.  The following is the terminal output of a successful run of the audio file creation script.

pwilkins@sls:/usr/users/mckinney/sm/lectures$ ../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4
*** Creating wave file /usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav OCW-18.01-f07-L01
FFmpeg version r11872+debian_0.svn20080206-18+lenny1, Copyright (c) 2000-2008 Fabrice Bellard, et al.
configuration: --enable-gpl --enable-libfaad --enable-pp --enable-swscaler --enable-x11grab --prefix=/usr --enable-libgsm --enable-libtheora --enable-libvorbis --enable-pthreads --disable-strip --enable-libdc1394 --enable-shared --disable-static
libavutil version: 49.6.0
libavcodec version: 51.50.0
libavformat version: 52.7.0
libavdevice version: 52.0.0
built on Jan 25 2010 18:27:39, gcc: 4.3.2

Seems stream 1 codec frame rate differs from container frame rate: 30000.00 (30000/1) -> 14.99 (15000/1001)
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.mp4':
Duration: 00:51:32.2, start: 0.000000, bitrate: 303 kb/s
Stream #0.0(und): Audio: mpeg4aac, 44100 Hz, stereo
Stream #0.1(und): Video: mpeg4, yuv420p, 480x360 [PAR 1:1 DAR 4:3], 14.99 tb(r)
Stream #0.2(und): Data: mp4s / 0x7334706D
Stream #0.3(und): Data: mp4s / 0x7334706D
Output #0, wav, to '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav':
Stream #0.0(und): Audio: pcm_s16le, 16000 Hz, mono, 256 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Press [q] to stop encoding
size= 96633kB time=3092.2 bitrate= 256.0kbits/s
video:0kB audio:96633kB global headers:0kB muxing overhead 0.000044%
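The numbers in this output can be sanity-checked with a little arithmetic. The sketch below recomputes the bit rate from the output stream description (16000 Hz, 16-bit samples, mono) and estimates the file size from the reported duration of 00:51:32.2, that is, 3092.2 seconds. The function names are hypothetical, and the size estimate assumes that ffmpeg reports kB as units of 1024 bytes.

```shell
#!/usr/bin/env bash
# Bit rate of uncompressed PCM audio, in bits per second:
# sample rate (Hz) x bits per sample x number of channels.
pcm_bitrate() {
    echo $(( $1 * $2 * $3 ))
}

# Approximate audio size in kB (assumed here to mean 1024 bytes).
# The duration is given in tenths of a second because bash arithmetic
# is integer only; the result is truncated, not rounded.
pcm_size_kb() {
    local bitrate="$1" tenths="$2"
    echo $(( bitrate / 8 * tenths / 10 / 1024 ))
}

pcm_bitrate 16000 16 1      # prints 256000
pcm_size_kb 256000 30922    # prints 96631
```

The bit rate matches the 256 kb/s shown for the output stream, and the estimated size is within a couple of kB of the 96633kB reported, which is consistent with the duration being rounded to a tenth of a second.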

Process the Audio file into a Transcript

Unless otherwise specified, the Spoken Media Website by the MIT Office of Digital Learning, Strategic Education Initiatives is licensed under a Creative Commons Attribution 4.0 International License.