Converting .sbv to .trans/continuous text

As a step in comparing the output from YouTube’s Autocaptioning, we needed to transform its .sbv file into something we could use in our comparison tests (a .trans file). That meant stripping the hours out of the timecode, dropping the end time, and putting each segment on a single line.

Update: It turns out we needed a continuous text file, so the steps below have been updated accordingly.


We needed to convert:

0:00:01.699,0:00:06.450
okay guys were going to get through this first
lecture for some reason there was a major

0:00:06.450,0:00:08.130
scheduling of

0:00:08.130,0:00:12.590
screw up so they have our their schedule classes
in our to overflow room so we're going to

Source: UC Berkeley, Biology 1B, Spring 2009, Lecture 1

to:

00:01.699 okay guys were going to get through this first lecture for some reason there was a major
00:06.450 scheduling of
00:08.130 screw up so they have our their schedule classes in our to overflow room so we're going to

I used the grep features of BBEdit’s search and replace, though I’d guess this can be done directly in grep on the command line.

  • To remove the end time, search for ,........... (a comma followed by 11 wildcard dots, one per character of the end time) and replace with , (a comma).
  • To remove the line breaks between segments, search for \r\r (two new lines) and replace with \r (single new line).
  • To put the timecode and text on one line, search for ,\r (comma and new line) and replace with a single space.
  • This is the step that’s different depending on whether you want a .trans file or continuous text. For a .trans file you just need to strip the leading ^0:. To strip the leading hour from the timecode, search for \r^0: (line break to beginning of line and 0:) and replace with xxxx (where xxxx can be anything; it’s just a temporary placeholder, and it needs to be changed if that text appears in the transcript).
  • To remove the line breaks in a multi-line segment, search for \r and replace with a single space.
  • To remove the xxxx placeholder and put every segment on its own line, search for xxxx and replace with \r.
  • Edit the first line to remove the 0: by hand.
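
The steps above can also be sketched as a short Python script rather than a series of search-and-replace passes. This is only a sketch under a few assumptions: the function name sbv_to_trans is made up for illustration, segments are assumed to be separated by blank lines as in the sample, and the hours field is dropped without checking that it is 0, so a recording longer than an hour would lose information.

```python
import re

# Matches an .sbv timecode line such as "0:00:01.699,0:00:06.450" and
# captures the start time without its hours field (e.g. "00:01.699").
# Assumption: dropping the hours is safe because they are always 0.
TIMECODE = re.compile(r"^\d+:(\d{2}:\d{2}\.\d{3}),\d+:\d{2}:\d{2}\.\d{3}$")


def sbv_to_trans(sbv_text):
    """Return one 'MM:SS.mmm text' line per .sbv segment (hypothetical helper)."""
    # Normalise old Mac (\r) and Windows (\r\n) line endings to \n.
    normalised = sbv_text.replace("\r\n", "\n").replace("\r", "\n").strip()
    out = []
    # Segments are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if not lines:
            continue
        match = TIMECODE.match(lines[0])
        if match is None:
            continue  # not a caption segment; skip it
        start = match.group(1)
        text = " ".join(lines[1:])  # join a multi-line segment with spaces
        out.append(f"{start} {text}".rstrip())
    return "\n".join(out)
```

Because the start time is captured directly from each timecode line, no temporary placeholder is needed and the first line does not have to be fixed by hand.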

The continuous text version will look like:

okay guys were going to get through this first lecture for some reason there was a major scheduling of screw up so they have our their schedule classes in our to overflow room so we're going to
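
For the continuous text version, a minimal Python sketch is to discard every timecode line and blank line and join what remains with single spaces. The function name sbv_to_continuous_text is hypothetical, and the pattern assumes timecode lines always look like the ones in the sample above.

```python
import re

# A full .sbv timecode line, e.g. "0:00:01.699,0:00:06.450".
# Assumption: every timecode line follows this exact shape.
TIMECODE_LINE = re.compile(r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}$")


def sbv_to_continuous_text(sbv_text):
    """Return the caption text of an .sbv file as one line (hypothetical helper)."""
    pieces = []
    # splitlines() copes with \n, \r, and \r\n line endings.
    for line in sbv_text.splitlines():
        line = line.strip()
        if not line or TIMECODE_LINE.match(line):
            continue  # drop blank separators and timecode lines
        pieces.append(line)
    return " ".join(pieces)
```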

Unless otherwise specified, the Spoken Media Website by the MIT Office of Digital Learning, Strategic Education Initiatives is licensed under a Creative Commons Attribution 4.0 International License.