Brandon Muramatsu – SpokenMedia: Transcripts for Lecture Video
https://spokenmedia.mit.edu

How Google Translate Works
https://spokenmedia.mit.edu/2010/08/how-google-translate-works/
Thu, 12 Aug 2010

Google posted a high-level overview of how Google Translate works.

Source: Google

An interesting hack from Yahoo! Openhack India
https://spokenmedia.mit.edu/2010/07/an-interesting-hack-from-yahoo-openhack-india/
Wed, 28 Jul 2010

Sound familiar?

Automatic, real-time closed captioning/translation for Flickr videos.

How?
We captured the audio stream coming out of the speaker and fed it as input to the mic. We used the Microsoft Speech API and Julius to convert the speech to text, and a GreaseMonkey script to sync the video with the transcription server (our local box) and display the transcribed text on the video. Before displaying the actual text on the video, we translate it based on the user's choice and then show it on the video. (We used Google's Translate API for this.)

Srithar, B. (2010). Yahoo! Openhack India 2010 – FlicksubZ. Retrieved July 28, 2010 from Srithar's Blog: http://babusri.blogspot.com/2010/07/yahoo-openhack-india-2010-flicksubz.html

Check out the whole post.

Converting .sbv to .trans/continuous text
https://spokenmedia.mit.edu/2010/07/converting-sbv-to-trans/
Sat, 24 Jul 2010

As a step in comparing the output from YouTube's Autocaptioning, we need to transform their .sbv file into something we can use in our comparison tests (a .trans file). We needed to strip the hours out of the timecode, drop the end time, and bring everything to a single line.

Update: It turns out we needed a continuous text file, so these steps have been updated accordingly.


We needed to convert:

0:00:01.699,0:00:06.450
okay guys were going to get through this first
lecture for some reason there was a major

0:00:06.450,0:00:08.130
scheduling of

0:00:08.130,0:00:12.590
screw up so they have our their schedule classes
in our to overflow room so we're going to

Source: UC Berkeley, Biology 1B, Spring 2009, Lecture 1

to:

00:01.699 okay guys were going to get through this first lecture for some reason there was a major
00:06.450 scheduling of
00:08.130 screw up so they have our their schedule classes in our to overflow room so we're going to

I used the grep features of BBEdit's search and replace, though I'd guess this could also be done with sed or similar tools directly on the command line; a scripted version of the whole conversion is sketched after the steps below.

  • To remove the end time, search ,........... and replace with , (this strips off the end time but keeps the comma).
  • To remove the line breaks between segments, search \r\r (two new lines) and replace with \r (a single new line).
  • To put the timecode and text on one line, search ,\r (comma and new line) and replace with a single space.
  • This is the step that differs between a .trans file and continuous text. For a .trans file you just need to strip the leading hour (0:). To strip the leading hour from the timecode, search \r0: (a line break followed by 0: at the start of the next line) and replace with xxxx (where xxxx can be anything; it's just a temporary placeholder, and it might need to be different if that text appears in the transcript).
  • To remove the line breaks within a multi-line segment, search \r and replace with a single space.
  • To remove the xxxx placeholder and put every segment on its own line, search xxxx and replace with \r.
  • Edit the first line by hand to remove the leading 0: (it has no preceding line break, so the search above misses it).
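
If you'd rather script the conversion than do it in an editor, here's a minimal Python sketch of the same steps (hypothetical script and function names; it assumes a well-formed .sbv with blank lines between segments):

import re
import sys

def parse_sbv(path):
    # Yield (start_time, text) pairs from a .sbv file; the end time is dropped.
    with open(path) as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.strip().splitlines()
        if not lines:
            continue
        start, _end = lines[0].split(",", 1)
        text = " ".join(lines[1:])  # join multi-line segments with a space
        yield start, text

def to_trans(segments):
    # One "MM:SS.mmm text" line per segment, with the leading hour stripped.
    return "\n".join(re.sub(r"^\d+:", "", start) + " " + text for start, text in segments)

def to_continuous(segments):
    # All of the segment text on a single line, with no timecodes.
    return " ".join(text for _start, text in segments)

if __name__ == "__main__":
    segments = list(parse_sbv(sys.argv[1]))
    print(to_trans(segments))  # or print(to_continuous(segments))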

The continuous text version will look like:

okay guys were going to get through this first lecture for some reason there was a major scheduling of screw up so they have our their schedule classes in our to overflow room so we're going to

Caption File Formats
https://spokenmedia.mit.edu/2010/07/caption-file-formats/
Mon, 19 Jul 2010

There's been some discussion on the Matterhorn list recently about caption file formats, and I thought it might be useful to describe what we're doing with file formats for SpokenMedia.

SpokenMedia uses two file formats: our original .wrd files output from the recognition process, and Timed Text Markup Language (TTML). We also need to handle two other caption file formats, .srt and .sbv.

There is a nice discussion of the YouTube format at SBV file format for Youtube Subtitles and Captions, along with a link to a web-based tool to convert .srt files to .sbv files (a rough sketch of such a conversion appears at the end of this post).

We’ll cover our implementation of TTML in a separate post.

.wrd: A "time-aligned word transcription" file, the output format of SpokenMedia's speech recognizer. This file lists the start time and end time in milliseconds along with the corresponding recognized word. (More Info)

Format:

startTime endTime word

Example:

666 812 i'm
812 1052 walter
1052 1782 lewin
1782 1912 i
1912 2017 will
2017 2192 be
2192 2337 your
2337 2817 lecturer
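
As an illustration, a .wrd file like the one above can be read with a few lines of Python (a hypothetical helper, not part of our pipeline):

def read_wrd(path):
    # Parse "startTime endTime word" lines, with times in milliseconds.
    words = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, end, word = parts
            words.append((int(start), int(end), word))
    return words

# read_wrd("lecture.wrd")[0] -> (666, 812, "i'm")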

.srt: SubRip's caption file format. This file displays the start time and end time as hh:mm:ss,milliseconds separated by a "-->", along with a corresponding caption number and phrase. (Note the use of commas to separate seconds from milliseconds.) Each caption is separated by a blank line. (More Info)

Format:

Caption Number
hh:mm:ss,mmm --> hh:mm:ss,mmm
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

1
0:00:00,766 --> 0:00:02,033
I'm Walter Lewin.

2
0:00:02,033 --> 0:00:04,766
I will be your lecturer
this term.
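
Since .wrd times are plain milliseconds, converting one to an .srt-style timestamp is mostly a formatting exercise; here's a small sketch (the helper name is ours):

def ms_to_srt(ms):
    # Format a millisecond offset as hh:mm:ss,mmm for .srt.
    hours, ms = divmod(ms, 3600000)
    minutes, ms = divmod(ms, 60000)
    seconds, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, ms)

# ms_to_srt(2033) -> "00:00:02,033"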

.sbv: Google/YouTube's caption file format. This file format is similar to .srt but has some notable differences in syntax; for example, .sbv uses a period before the milliseconds and a comma between the start and end times. Additionally, both formats support identifying the speaker and marking other cues like laughter and applause, though each does so in a slightly different way.

According to Google (More Info):

We currently support a simple caption format that is compatible with the formats known as SubViewer (*.SUB) and SubRip (*.SRT). Although you can upload your captions/subtitles in any format, only supported formats will be displayed properly on the playback page.

Here’s what a (*.SBV) caption file might look like:

0:00:03.490,0:00:07.430
>> FISHER: All right. So, let's begin.
This session is: Going Social

0:00:07.430,0:00:11.600
with the YouTube APIs. I am
Jeff Fisher,
0:00:11.600,0:00:14.009
and this is Johann Hartmann,
we're presenting today.
0:00:14.009,0:00:15.889
[pause]

Here are also some common captioning practices that help readability:

  • Descriptions inside square brackets like [music] or [laughter] can help people with hearing disabilities to understand what is happening in your video.
  • You can also add tags like >> at the beginning of a new line to identify a speaker or a change of speaker.

The format can be described by looking at YouTube's examples.

Format:

H:MM:SS.mmm,H:MM:SS.mmm
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

0:00:00.766,0:00:02.033
I'm Walter Lewin.

0:00:02.033,0:00:04.766
I will be your lecturer
this term.
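
Given the two formats, converting .srt to .sbv (as the web-based tool mentioned earlier does) mostly means dropping the caption numbers and swapping the time syntax. Here's a rough Python sketch, not that tool's code, assuming well-formed .srt input:

import re
import sys

def srt_to_sbv(srt_text):
    out = []
    # Each .srt cue: caption number, "start --> end" line, then the text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        start, end = [t.strip() for t in lines[1].split("-->")]
        # .srt uses a comma before the milliseconds; .sbv uses a period,
        # and a comma separates the start and end times.
        out.append(start.replace(",", ".") + "," + end.replace(",", ".") + "\n" + "\n".join(lines[2:]))
    return "\n\n".join(out) + "\n"

if __name__ == "__main__":
    print(srt_to_sbv(open(sys.argv[1]).read()))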

SpokenMedia at T4E 2010 Conference
https://spokenmedia.mit.edu/2010/07/spokenmedia-at-t4e-2010-conference/
Wed, 14 Jul 2010

Brandon Muramatsu presented on SpokenMedia at the Technology for Education 2010 Conference in Mumbai, India on July 1, 2010.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, July 1). Implementing SpokenMedia for the Indian Institute for Human Settlements. Presentation at the Technology for Education 2010 Conference, Mumbai, India. Retrieved July 14, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/implementing-spokenmedia-for-the-indian-institute-for-human-settlements

Towards cross-video search
https://spokenmedia.mit.edu/2010/06/towards-cross-video-search/
Mon, 21 Jun 2010

Here's a workflow diagram I put together to demonstrate how we're approaching the problem of searching over the transcripts of multiple videos and ultimately returning search results that maintain time-alignment for playback.

Preparing Transcripts for Search Across Multiple Videos
Source: Brandon Muramatsu

You'll notice I included using OCR on lecture slides to help in search and retrieval. This is not an area we're currently focusing on, but we have been asked about it. A number of researchers and developers have looked at this area; if/when we include it, we'd work with folks like Matterhorn (or perhaps others) to integrate the solutions that they've implemented.
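
To make the core idea concrete, here's a toy sketch of indexing time-aligned transcripts from several videos so that a search hit can jump playback to the right moment (illustrative only, not our implementation; it assumes .wrd-style (start_ms, end_ms, word) tuples as input):

from collections import defaultdict

def build_index(transcripts):
    # transcripts: dict mapping video_id -> list of (start_ms, end_ms, word) tuples.
    index = defaultdict(list)
    for video_id, words in transcripts.items():
        for start, _end, word in words:
            index[word.lower()].append((video_id, start))
    return index

def search(index, term):
    # Return (video_id, start_ms) hits so the player can seek to the word.
    return index.get(term.lower(), [])

# search(build_index({"bio1b-lec01": [(666, 812, "i'm"), (812, 1052, "walter")]}), "walter")
# -> [("bio1b-lec01", 812)]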

Making Progress
https://spokenmedia.mit.edu/2010/06/making-progress/
Thu, 17 Jun 2010

In the last month or two we've made some good progress with getting additional parts of the SpokenMedia workflow into a working state.

Here’s a workflow diagram showing what we can do with SpokenMedia today.

SpokenMedia Workflow, June 2010
Source: Brandon Muramatsu

(The bright yellow indicates features working in the last two months, the gray indicates features we’ve had working since December 2009, and the light yellow indicates features on which we’ve just started working.)

To recap, since December 2009, we’ve been able to:

  • Rip audio from video files and prepare it for the speech recognizer.
  • Process the audio through the speech recognizer locally within the SpokenMedia project using domain and acoustic models.
  • Present output transcript files (.WRD) through the SpokenMedia player.

Recently, we’ve added the ability to:

  • Create domain models (or augment existing domain models) from input files.
  • Create unsupervised acoustic models from input audio files. (Typically about 10 hours of audio from the same speaker are required to create a "good" acoustic model, at least for Americans speaking English. We're still not sure how well this capability will let us handle Indian-English speakers.)
  • Use a selected domain or acoustic model from a pre-existing set, in addition to creating a new one.
  • Process audio through an "upgraded" speech recognizer, using the custom domain and acoustic models (though this recognition is currently performed on Jim Glass' research cluster).

We still have a ways to go: we need to better understand the potential accuracy of our approach. The critical blocker now is a means to compare a known accurate transcript with the output of the speech recognizer (it's a matter of transforming existing transcripts into time-aligned ones of the right format). And then there are the two challenges of automating the software and getting it running on OEIT servers (we've reverted to using Jim Glass' research cluster to get some of the other pieces up and running).
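
For the comparison itself, something like a standard word error rate would do the job; here's a minimal sketch (plain word-level edit distance, not our eventual tooling):

def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words, divided by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("i'm walter lewin", "i am walter lewin") -> 2 edits / 3 words = 0.67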

Video from OCWC Global Presentation
https://spokenmedia.mit.edu/2010/05/video-from-ocwc-global-presentation/
Wed, 26 May 2010

The OpenCourseWare Consortium has posted the video of our talk during OCWC Global 2010 in Hanoi, Vietnam.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, May 5). Opening Up IIHS Video with SpokenMedia. Presentation at OCWC Global 2010, Hanoi, Vietnam. Retrieved May 6, 2010 from Vimeo Web site: http://vimeo.com/11969270

PageLayout as a step towards Rich Media Notebooks
https://spokenmedia.mit.edu/2010/05/pagelayout-as-a-step-towards-rich-media-notebooks/
Tue, 11 May 2010

During a meeting with our collaborators at ICAP, of the Universite de Lyon 1 in France, Fabien Bizot demonstrated the PageLayout Flash/AIR app that he was working on for Spiral Connect.

When we launched the SpokenMedia project, we knew that we wanted to ultimately focus on how learners and educators use video accompanied by transcripts. Over the last year, we've focused on the automatic lecture transcription technology developed in the Spoken Lecture project, as a means to enable the notion of a rich media notebook we had been discussing.

Fabien Bizot's work with PageLayout may be a first step toward a user interface that learners and educators might use to interact with video linked with transcripts.

PageLayout: Towards a Rich Media Notebook
Source: Brandon Muramatsu / Fabien Bizot, PageLayout

SpokenMedia at OCW Consortium Global 2010 Conference
https://spokenmedia.mit.edu/2010/05/spokenmedia-at-ocw-consortium-global-2010-conference/
Fri, 07 May 2010

Brandon Muramatsu presented on SpokenMedia at the OCW Consortium Global 2010 Conference in Hanoi, Vietnam on May 7, 2010.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, May 5). Opening Up IIHS Video with SpokenMedia. Presentation at OCWC Global 2010, Hanoi, Vietnam. Retrieved May 6, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/opening-up-iihs-video-with-spokenmedia
