YouTube – SpokenMedia

Converting .sbv to .trans/continuous text

Brandon Muramatsu — Sat, 24 Jul 2010 18:08:59 +0000

As a step in comparing the output from YouTube’s auto-captioning, we needed to transform its .sbv file into something we can use in our comparison tests (a .trans file). That meant stripping the hours out of the timecode, dropping the end time, and bringing each segment onto a single line.

Update: It turns out we needed a continuous text file, so the steps below have been updated accordingly.

We needed to convert:

0:00:01.699,0:00:06.450
okay guys were going to get through this first lecture for some reason there was a major

0:00:06.450,0:00:08.130
scheduling of

0:00:08.130,0:00:12.590
screw up so they have our their schedule classes in our to overflow room so we're going to

Source: UC Berkeley, Biology 1B, Spring 2009, Lecture 1

to:

00:01.699 okay guys were going to get through this first lecture for some reason there was a major
00:06.450 scheduling of
00:08.130 screw up so they have our their schedule classes in our to overflow room so we're going to

I used the grep features of BBEdit’s search and replace, though I’d guess this can be done directly in grep on the command line.

To remove the end time, search for ,........... (a comma followed by eleven dots, each matching any one character) and replace with , (a single comma).
To remove the blank lines between segments, search for \r\r (two new lines) and replace with \r (a single new line).
To put the timecode and text on one line, search for ,\r (comma and new line) and replace with a single space.
This is the step that differs depending on whether you want a .trans file or continuous text. For a .trans file you just need to strip the leading ^0:. To strip the leading hour from the timecode, search for \r^0: (a line break, then the beginning of a line, then 0:) and replace with xxxx (where xxxx can be anything; it’s just a temporary placeholder, and it needs to be something else if xxxx appears in the transcript).
To remove the line breaks in a multi-line segment, search for \r and replace with a space.
To remove the xxxx placeholder and put every segment on its own line, search for xxxx and replace with \r.
Edit the first line by hand to remove the leading 0:.

The continuous text version will look like:

okay guys were going to get through this first lecture for some reason there was a major scheduling of screw up so they have our their schedule classes in our to overflow room so we're going to
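The same conversion can also be scripted rather than done by hand in an editor. The following is a minimal Python sketch, not the process we actually used. It assumes the usual .sbv layout (a timecode line, one or more lines of text, then a blank line) and that every segment starts within the first hour, so the hour can simply be dropped. The function names parse_sbv, to_trans and to_continuous are made up for this illustration.

```python
import re

# Matches an .sbv timecode line such as "0:00:01.699,0:00:06.450".
# Group 2 is the start time without the hour (mm:ss.mmm).
TIMECODE = re.compile(r"^(\d+):(\d{2}:\d{2}\.\d{3}),\d+:\d{2}:\d{2}\.\d{3}$")


def parse_sbv(sbv_text):
    """Return a list of (start, text) pairs, one per caption segment."""
    # Normalize line endings so "\r\n" and bare "\r" files also work.
    text = sbv_text.replace("\r\n", "\n").replace("\r", "\n")
    segments = []
    # Segments are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        match = TIMECODE.match(lines[0])
        if match is None:
            continue  # skip anything that does not start with a timecode
        start = match.group(2)          # hour and end time are dropped
        caption = " ".join(lines[1:])   # join multi-line captions
        segments.append((start, caption))
    return segments


def to_trans(sbv_text):
    """One segment per line: start time, a space, then the text."""
    return "\n".join(f"{start} {caption}" for start, caption in parse_sbv(sbv_text))


def to_continuous(sbv_text):
    """All caption text joined into a single line, timecodes removed."""
    return " ".join(caption for _, caption in parse_sbv(sbv_text))
```

Calling to_trans on the contents of an .sbv file gives the one-segment-per-line form, and to_continuous gives the continuous text form.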

Caption File Formats

Brandon Muramatsu — Mon, 19 Jul 2010 17:23:45 +0000

There’s been some discussion on the Matterhorn list recently about caption file formats, and I thought it might be useful to describe what we’re doing with file formats for SpokenMedia.

SpokenMedia uses two file formats: our original .wrd files output from the recognition process, and Timed Text Markup Language (TTML). We also need to handle two other caption file formats: .srt and .sbv.

There is a nice discussion of the YouTube format at “SBV file format for Youtube Subtitles and Captions”, along with a link to a web-based tool to convert .srt files to .sbv files.

We’ll cover our implementation of TTML in a separate post.

.wrd: A “time-aligned word transcription” file that is the output format of SpokenMedia’s speech recognizer. This file displays the start time and end time in milliseconds along with the corresponding recognized word. (More Info)

Format:

startTime endTime word

Example:
666 812 i'm
812 1052 walter
1052 1782 lewin
1782 1912 i
1912 2017 will
2017 2192 be
2192 2337 your
2337 2817 lecturer
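Reading a .wrd file takes only a few lines of code. This is a rough Python sketch, assuming one "startTime endTime word" entry per line with whitespace between the fields. Word, parse_wrd and transcript are hypothetical names for this illustration, not part of SpokenMedia’s tools.

```python
from typing import NamedTuple


class Word(NamedTuple):
    start_ms: int   # start time in milliseconds
    end_ms: int     # end time in milliseconds
    text: str       # the recognized word


def parse_wrd(wrd_text):
    """Parse .wrd content into a list of Word entries."""
    words = []
    for line in wrd_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, text = parts
        words.append(Word(int(start), int(end), text))
    return words


def transcript(words):
    """Join the recognized words into plain text."""
    return " ".join(word.text for word in words)
```

With the example above, parse_wrd returns eight entries that run from 666 ms to 2817 ms.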

.srt: SubRip’s caption file format. This file displays the start time and end time in hh:mm:ss,milliseconds separated by a “-->”, along with a corresponding caption number and phrase. (Note the use of commas to separate seconds from milliseconds.) Each caption phrase is separated from the next by a single blank line. (More Info)

Format:

Caption Number
hh:mm:ss,mmm --> hh:mm:ss,mmm
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

1
00:00:00,766 --> 00:00:02,033
I'm Walter Lewin.

2
00:00:02,033 --> 00:00:04,766
I will be your lecturer this term.

.sbv: Google/YouTube’s caption file format. This format is similar to .srt but differs notably in syntax (the use of periods and commas as separators). Both formats also support identification of the speaker and other cues like laughter, applause, etc., though each does so in a slightly different way.

According to Google (More Info):

We currently support a simple caption format that is compatible with the formats known as SubViewer (*.SUB) and SubRip (*.SRT). Although you can upload your captions/subtitles in any format, only supported formats will be displayed properly on the playback page.

Here’s what a (*.SBV) caption file might look like:

0:00:03.490,0:00:07.430
>> FISHER: All right. So, let's begin. This session is: Going Social

0:00:07.430,0:00:11.600
with the YouTube APIs. I am Jeff Fisher,

0:00:11.600,0:00:14.009
and this is Johann Hartmann, we're presenting today.

0:00:14.009,0:00:15.889
[pause]

Here are also some common captioning practices that help readability:

Descriptions inside square brackets like [music] or [laughter] can help people with hearing disabilities to understand what is happening in your video.

You can also add tags like >> at the beginning of a new line to identify speakers or change of speaker.

The format can be described by looking at YouTube examples.

Format:

H:MM:SS.000,H:MM:SS.000
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

0:00:00.766,0:00:02.033
I'm Walter Lewin.

0:00:02.033,0:00:04.766
I will be your lecturer this term.
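Because the two formats are so close, converting .srt to .sbv is mostly a matter of dropping the caption number and rewriting the timecode line. This is a rough Python sketch of that idea, not the web-based tool mentioned earlier. It assumes well-formed .srt input with blank lines between captions, and it ignores anything that follows the end time on the timing line. The names srt_time_to_sbv and srt_to_sbv are made up for this illustration.

```python
import re

# An .srt timestamp: hours, minutes, seconds, then a comma and milliseconds.
SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_sbv(stamp):
    """Convert hh:mm:ss,mmm to the H:MM:SS.mmm form used in .sbv files."""
    match = SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an .srt timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return f"{int(hours)}:{minutes}:{seconds}.{millis}"


def srt_to_sbv(srt_text):
    """Convert .srt content to .sbv content."""
    text = srt_text.replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    # Captions are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.split("\n")
        # Drop the caption number, which .sbv does not use.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines or "-->" not in lines[0]:
            continue  # no timing line, so nothing to convert
        start, end = lines[0].split("-->", 1)
        end = end.split()[0]  # ignore anything after the end time
        timing = f"{srt_time_to_sbv(start)},{srt_time_to_sbv(end)}"
        cues.append("\n".join([timing] + lines[1:]))
    return "\n\n".join(cues) + "\n"
```

Speaker tags and cues like [laughter] are passed through unchanged. Since the two formats mark them differently, they may still need cleaning up by hand.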

YouTube Auto-Captions

Brandon Muramatsu — Tue, 20 Apr 2010 14:30:36 +0000

YouTube announced in early March that they would be extending their pilot program to enable auto-captioning for all channels.

The highlights…YouTube has announced that they’re doing this to improve accessibility, and…

Captions will initially only be available for English videos.

Auto-captions require clearly spoken audio.

Auto-captions aren’t perfect; the owner will need to check that they’re accurate.

Auto-captions will be available for all channels.

We think this is great. If YouTube can automatically caption files at scale and with high accuracy, that’s a great step forward for all videos, and definitely for the lecture videos that we’ve been interested in for the SpokenMedia project.

Though, as with SpokenMedia’s approach that builds on Jim Glass’ Spoken Language Systems research, they still have a ways to go on accuracy.

At this early date though, we can still see some significant advantages to our approach:

You don’t have to host your videos through YouTube to use the service SpokenMedia is developing. (YouTube locks the videos you upload into their player and service.)
SpokenMedia will provide a time-aligned transcript file that you can download and use in other applications. (YouTube allows the channel publishers to download a transcript, edit it, and then reupload it for time code alignment. However, they don’t allow the public at large to download the transcript.)
SpokenMedia will provide an editor to improve the accuracy of the transcripts.
SpokenMedia will enable you to use the transcripts in other applications like search, and will let you start playing a segment within a video. (Though I’m pretty sure YouTube will be using transcripts to help users find videos, and I personally think that’s the real driver behind auto-captions: search and keyword advertising. And if you know how to do it, you can construct a URL to link to a particular timepoint in a YouTube-hosted video.)

In any event, if you’ve watched the recent slidecasts of the last couple SpokenMedia presentations, you’ll see that we’ve included the impact of Auto-Captions on SpokenMedia.