SpokenMedia: Transcripts for Lecture Video

Customizing the Google Search Appliance for Greenfield OCW
https://spokenmedia.mit.edu/2010/08/customizing-the-google-search-appliance-for-greenfield-ocw/ (Tue, 24 Aug 2010)

MIT OpenCourseWare (OCW) uses MIT's Google Search Appliance (GSA) to search its content, and MIT supports customization of GSA results through XSL transformation. This post describes how we plan to use GSA to search lecture transcripts and return results containing the lecture videos in which the search terms appear. Because OCW publishes static content, it has no built-in search engine; search is provided by the GSA, which indexes all OCW content and offers users a familiar search paradigm. For searching lecture transcripts, though, the default behavior is unsatisfactory: a search for words that appear in a transcript returns not just the transcript pages but every occurrence of those words across the OCW site. In addition, the pages that currently contain lecture transcripts provide no way to see the portion of the lecture video where the words occur. Changing the way GSA returns its results is one way to remedy this without changing the static nature of the OCW content.

The target use case, then, is searching for words that occur in lecture transcripts, with search results that conveniently let a user see and hear the lecture at the point where the word occurs. We can make this happen by integrating a number of independent processes:

  • Obtain or generate a transcript for the lecture video.
  • Format the transcript so that it is an appropriate target for GSA search.
  • Modify the default Google XSL transformation so that it returns our custom results page.
  • Modify a video player so that it plays the lecture from near the point that the search term occurs.

Generate a Transcript for a Lecture Video

Generating transcripts from lecture videos is described here.

Format a Transcript as a GSA Search Target

The speech recognition process produces a variety of transcript files: some are just text, some are text plus time codes. None of the generated files includes metadata describing the transcript it contains. For the transcripts to serve as search targets, they must carry additional data that the XSL can use to produce the search results page. At a minimum, each search target must include the following (a hypothetical example follows the list):

  • path to lecture video file
  • speaker
  • order of lecture within its course
  • video title
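
As an illustration only (the meta names and path here are hypothetical placeholders, not an OCW or SpokenMedia convention), a transcript search-target page might carry that metadata in its head like this:

<!-- Hypothetical metadata block for a transcript search-target page -->
<meta name="sm-video-src" content="/courses/physics/8-01/video-lectures/lecture-01.mp4" />
<meta name="sm-speaker" content="Walter Lewin" />
<meta name="sm-lecture-order" content="1" />
<meta name="sm-video-title" content="Lecture 1" />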

Customize Google Search XSL

Google returns its search results as a structured document that contains no presentation markup.  Each page that contains a search input control also specifies an XSL transformation to present the search results.  This is an example of the markup that appears in the page:

<input name="proxystylesheet" type="hidden" value="/oeit/OcwWeb/search/google-ocw.xsl" />

The default transformation creates the familiar Google search results page, but this transformation may be customized. That is our strategy: our custom transformation will have GSA search the transcripts and return the hits in a page that allows viewing the lectures.

Edit Video Lecture Pages

GSA indexes pages by following links. Since we will add a page to the site for each lecture transcript, those new pages need to be linked from existing course pages so that GSA can crawl and index them.

Global Changes to the Google Search Textbox

Google provides an XSL stylesheet for customizing the search results page.  To make your customization take effect, substitute your own custom XSL stylesheet for Google's default and change the stylesheet reference on every page that contains a search box.

The following command lines make these changes.  (A Unix-like OS is assumed.)  Each command follows the same pattern: the find utility builds a list of candidate files, grep filters that list down to the files that actually contain the text to change, and perl performs a global substitution in place on the filtered list.
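
Before running any of them, it is worth a dry run that only lists the files the pipeline would touch; for example, for the first replacement below (files-to-change.txt is just a scratch file name):

# Preview the files that contain the search box markup, without modifying anything
find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | tee files-to-change.txt | wc -l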

The first changes the stylesheet reference from the default to your customized XSL.  Note that the argument to grep simply identifies the files that contain a search box.  (The argument could just as easily have been 'google-ocw.xsl', which would have been simpler and just as effective.)

find .  -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/google-ocw.xsl/google-greenfield.xsl/g'

[Update: the replacement above is incorrect; it fails to update enough of the URL. Here's a fix if you already ran that replace script.]
find . -name 'index.htm' | xargs grep -l 'http://ocw.mit.edu/search/google-greenfield.xsl' | xargs perl -p -i.bak -w -e 's|http://ocw.mit.edu/search/google-greenfield.xsl|http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl|g'

The next command changes the name of the site to search.  You may not have to change this value; in our case, we created our own Google search "collection" that contained our copy of the OCW site.

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/input type="hidden" value="ocw" name="site"/input type="hidden" value="oeit-ocw" name="site"/g'

The last command specifies the URL of our Google search engine.  The reason for this change is that the courses in our OCW instance were collected using the "download course materials" zip files that many OCW courses provide; since these are intended for download, their search feature is disabled.

find . -name '*.htm*' | xargs grep -l 'form method="get" action="(../){1,3}common/search/AdvancedSearch.htm"' | xargs  perl -p -i.bak -w -e 's|form method="get" action="(../){1,3}common/search/AdvancedSearch.htm"|form method="get" action="http://search.mit.edu/search"|g'


Finally, this command replaces the relative reference to the default stylesheet with the fully qualified URL of the customized one:

find . -name '*.htm' | xargs perl -p -i.bak -w -e 's|<input type="hidden" name="proxystylesheet" value="/oeit/OcwWeb/search/google-ocw.xsl" />|<input type="hidden" name="proxystylesheet" value="http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl" />|g'

How Google Translate Works
https://spokenmedia.mit.edu/2010/08/how-google-translate-works/ (Thu, 12 Aug 2010)

Google posted a high-level overview of how Google Translate works.

Source: Google
Running the Baseline Recognizer
https://spokenmedia.mit.edu/2010/08/running-the-baseline-recognizer/ (Sat, 07 Aug 2010)

The software that turns lecture audio into a textual transcript consists of a series of scripts that marshal input files and parameters for a speech recognition engine.  Interestingly, since the engine is data driven, its code seldom changes; improvements in performance and accuracy are achieved by refining the data it uses to perform its tasks.

There are two steps to produce the transcript.  The first creates an audio file in the correct format for speech recognition.  The second processes that audio file into the transcript.

Setup

The scripts require a specific folder hierarchy.  All paths are rooted here:

/usr/users/mckinney/sm

Unless a fully qualified path is specified, assume this is "home" for the paths that occur below.  For example, when I refer to the lectures folder without explicitly specifying its full path, I am referring to this folder:

/usr/users/mckinney/sm/lectures

The scripts require that each video live in a child folder named for its lecture ordinal, inside a parent folder named for the course.  The child folder is referred to as the "lecture" folder, and the parent folder as the "course" folder.

For example, consider the path

lectures/OCW-18.01-f07/L01

The course is OCW-18.01-f07 and the lecture is L01. The video file itself is in the L01 folder.

So, in the lectures folder, create a folder for the course and, within it, a folder for the lecture, then drop the video file into the lecture folder.
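
For the example course above, the setup amounts to something like this (the source location of the .mp4 is a placeholder; use wherever your video actually lives):

# Create the course and lecture folders, then drop the lecture video in
cd /usr/users/mckinney/sm/lectures
mkdir -p OCW-18.01-f07/L01
cp /path/to/OCW-18.01-f07-L01.mp4 OCW-18.01-f07/L01/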

Create the Audio file

The audio file that is the input for speech recognition must be a .WAV file (16-bit PCM, 16 kHz, mono, for a bit rate of 256 kbits/sec).  The script that creates this file is:

create_wavefile.cmd

and is located in the scripts folder.

The script takes three parameters:

  • course folder name
  • lecture folder name
  • format of the video file, expressed as its file extension

For example, to run the script from the lectures folder, execute this command:

../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4

Note that the path to the command, as well as the course and lecture parameters, is given relative to the folder where the command is executed.  The following is the terminal output of a successful run of the audio file creation script:

pwilkins@sls:/usr/users/mckinney/sm/lectures$ ../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4
*** Creating wave file /usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav OCW-18.01-f07-L01
FFmpeg version r11872+debian_0.svn20080206-18+lenny1, Copyright (c) 2000-2008 Fabrice Bellard, et al.
configuration: --enable-gpl --enable-libfaad --enable-pp --enable-swscaler --enable-x11grab --prefix=/usr --enable-libgsm --enable-libtheora --enable-libvorbis --enable-pthreads --disable-strip --enable-libdc1394 --enable-shared --disable-static
libavutil version: 49.6.0
libavcodec version: 51.50.0
libavformat version: 52.7.0
libavdevice version: 52.0.0
built on Jan 25 2010 18:27:39, gcc: 4.3.2

Seems stream 1 codec frame rate differs from container frame rate: 30000.00 (30000/1) -> 14.99 (15000/1001)
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.mp4':
Duration: 00:51:32.2, start: 0.000000, bitrate: 303 kb/s
Stream #0.0(und): Audio: mpeg4aac, 44100 Hz, stereo
Stream #0.1(und): Video: mpeg4, yuv420p, 480x360 [PAR 1:1 DAR 4:3], 14.99 tb(r)
Stream #0.2(und): Data: mp4s / 0x7334706D
Stream #0.3(und): Data: mp4s / 0x7334706D
Output #0, wav, to '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav':
Stream #0.0(und): Audio: pcm_s16le, 16000 Hz, mono, 256 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Press [q] to stop encoding
size= 96633kB time=3092.2 bitrate= 256.0kbits/s
video:0kB audio:96633kB global headers:0kB muxing overhead 0.000044%
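
For reference, the conversion the script performs appears to boil down to an ffmpeg call roughly like the one below (a sketch based on the output above, not the actual contents of create_wavefile.cmd); 16,000 samples/sec at 16 bits, mono, works out to the 256 kbit/s rate mentioned earlier.

# Sketch: extract mono, 16-bit PCM audio at 16 kHz from the lecture video (no video stream)
ffmpeg -i OCW-18.01-f07/L01/OCW-18.01-f07-L01.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav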

Process the Audio file into a Transcript

An interesting hack from Yahoo! Openhack India
https://spokenmedia.mit.edu/2010/07/an-interesting-hack-from-yahoo-openhack-india/ (Wed, 28 Jul 2010)

Sound familiar?

Automatic, Real-time close captioning/translation for flickr videos.

How?
We captured the audio stream that comes out to speaker and gave as input to mic. Used Microsoft Speech API and Julius to convert the speech to text. Used a GreaseMonkey script to sync with transcription server(our local box) and video and displayed the transcribed text on the video. Before displaying the actual text on the video, based on the user’s choice we translate the text and show it on video. (We used Google’s Translate API for this).

Srithar, B. (2010). Yahoo! Openhack India 2010- FlicksubZ. Retrieved on July 28, 2010 from Srithar’s Blog Website: http://babusri.blogspot.com/2010/07/yahoo-openhack-india-2010-flicksubz.html

Check out the whole post.

Converting .sbv to .trans/continuous text
https://spokenmedia.mit.edu/2010/07/converting-sbv-to-trans/ (Sat, 24 Jul 2010)

As a step in comparing the output from YouTube's Autocaptioning, we need to transform its .sbv file into something we can use in our comparison tests (a .trans file). We needed to strip the hours out of the timecode, drop the end time, and bring each segment onto a single line.

Update: It turns out we needed a continuous text file. So these have been updated accordingly.


We needed to convert:

0:00:01.699,0:00:06.450
okay guys were going to get through this first
lecture for some reason there was a major

0:00:06.450,0:00:08.130
scheduling of

0:00:08.130,0:00:12.590
screw up so they have our their schedule classes
in our to overflow room so we're going to

Source: UC Berkeley, Biology 1B, Spring 2009, Lecture 1

to:

00:01.699 okay guys were going to get through this first lecture for some reason there was a major
00:06.450 scheduling of
00:08.130 screw up so they have our their schedule classes in our to overflow room so we're going to

I used the grep features of BBEdit’s search and replace, though I’d guess this can be done directly in grep on the command line.

  • To remove the end time, search for ,........... and replace with , (this strips off the end time but keeps the comma).
  • To remove the line breaks between segments, search for \r\r (two new lines) and replace with \r (a single new line).
  • To put the timecode and its text on one line, search for ,\r (a comma and a new line) and replace with a single space.
  • This is the step that differs depending on whether you want a .trans file or continuous text. For a .trans file you just need to strip the leading 0:. To strip the leading hour from the timecode, search for \r0: (a line break followed by 0: at the beginning of the next line) and replace with xxxx (where xxxx can be anything; it's just a temporary placeholder, and it might need to be different if that text appears in the transcript).
  • To remove the line breaks within a multi-line segment, search for \r and replace with a single space.
  • To remove the xxxx placeholder and put every segment on its own line, search for xxxx and replace with \r.
  • Edit the first line by hand to remove its leading 0:.

The continuous text version will look like:

okay guys were going to get through this first lecture for some reason there was a major scheduling of screw up so they have our their schedule classes in our to overflow room so we're going to
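
For what it's worth, the same conversion can indeed be done on the command line. Here is a sketch with awk (it assumes the .sbv layout shown above; lecture.sbv and lecture.trans are placeholder file names):

# For each segment: keep the start time, drop the hour and the end time,
# and join the caption lines onto one line.
awk '
  /^[0-9]+:[0-9][0-9]:[0-9][0-9]\.[0-9]+,/ {   # timecode line
    split($0, t, ",")                          # t[1] is the start time
    sub(/^[0-9]+:/, "", t[1])                  # strip the leading hour
    printf "%s%s", (NR > 1 ? "\n" : ""), t[1]
    next
  }
  /^$/ { next }                                # skip blank lines between segments
  { printf " %s", $0 }                         # caption text, joined by spaces
  END { print "" }
' lecture.sbv > lecture.trans

# For the continuous-text version, drop the timecodes and blank lines entirely:
awk '!/^[0-9]+:[0-9][0-9]:[0-9][0-9]\./ && !/^$/ { printf "%s ", $0 } END { print "" }' lecture.sbv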

Caption File Formats
https://spokenmedia.mit.edu/2010/07/caption-file-formats/ (Mon, 19 Jul 2010)

There's been some discussion on the Matterhorn list recently about caption file formats, and I thought it might be useful to describe what we're doing with file formats for SpokenMedia.

SpokenMedia uses two file formats: our original .wrd files output from the recognition process, and Timed Text Markup Language (TTML). We also need to handle two other caption file formats, .srt and .sbv.

There is a nice discussion of the YouTube format at SBV file format for Youtube Subtitles and Captions and a link to a web-based tool to convert .srt files to .sbv files.

We’ll cover our implementation of TTML in a separate post.

.wrd: A "time-aligned word transcription" file, the output format of SpokenMedia's speech recognizer. Each line gives the start time and end time in milliseconds along with the corresponding recognized word. (More Info)

Format:

startTime endTime word

Example:

666 812 i'm
812 1052 walter
1052 1782 lewin
1782 1912 i
1912 2017 will
2017 2192 be
2192 2337 your
2337 2817 lecturer

.srt: SubRip's caption file format. This file gives the start time and end time as hh:mm:ss,milliseconds, separated by "-->", along with a corresponding caption number and phrase. (Note the use of a comma to separate seconds from milliseconds.) Caption entries are separated by a blank line. (More Info)

Format:

Caption Number
hh:mm:ss,mmm --> hh:mm:ss,mmm
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

1
0:00:00,766 --> 0:00:02,033
I'm Walter Lewin.

2
0:00:02,033 --> 0:00:04,766
I will be your lecturer
this term.

.sbv: Google/YouTube's caption file format. This format is similar to .srt but contains some notable differences in syntax: a period rather than a comma before the milliseconds, and a comma (rather than "-->") separating the start and end times. Additionally, both formats support identification of the speaker and other cues like laughter, applause, etc., though each does so in a slightly different way.

According to Google (More Info):

We currently support a simple caption format that is compatible with the formats known as SubViewer (*.SUB) and SubRip (*.SRT). Although you can upload your captions/subtitles in any format, only supported formats will be displayed properly on the playback page.

Here’s what a (*.SBV) caption file might look like:

0:00:03.490,0:00:07.430
>> FISHER: All right. So, let's begin.
This session is: Going Social

0:00:07.430,0:00:11.600
with the YouTube APIs. I am
Jeff Fisher,
0:00:11.600,0:00:14.009
and this is Johann Hartmann,
we're presenting today.
0:00:14.009,0:00:15.889
[pause]

Here are also some common captioning practices that help readability:

  • Descriptions inside square brackets like [music] or [laughter] can help people with hearing disabilities to understand what is happening in your video.
  • You can also add tags like >> at the beginning of a new line to identify speakers or change of speaker.

The format can be described by looking at YouTube examples.

Format:

H:MM:SS.000,H:MM:SS.000
Text of Sub Title (one or more lines, including punctuation and optionally sound effects)
Blank Line

Example:

0:00:00.766,0:00:02.033
I'm Walter Lewin.

0:00:02.033,0:00:04.766
I will be your lecturer
this term.
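
To make the relationship between these formats concrete, here is a sketch that converts the per-word millisecond timestamps of a .wrd file into .srt-style cues (one cue per word rather than per phrase, which is enough to show how the notations line up; lecture.wrd is a placeholder file name):

# Convert "startMs endMs word" lines into .srt-style cues (one cue per word)
awk '
  function hms(ms,    h, m, s, t) {
    h = int(ms / 3600000)
    m = int((ms % 3600000) / 60000)
    s = (ms % 60000) / 1000
    t = sprintf("%d:%02d:%06.3f", h, m, s)
    sub(/\./, ",", t)                    # .srt uses a comma before the milliseconds
    return t
  }
  { printf "%d\n%s --> %s\n%s\n\n", NR, hms($1), hms($2), $3 }
' lecture.wrd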

SpokenMedia at T4E 2010 Conference
https://spokenmedia.mit.edu/2010/07/spokenmedia-at-t4e-2010-conference/ (Wed, 14 Jul 2010)

Brandon Muramatsu presented on SpokenMedia at the Technology for Education 2010 Conference in Mumbai, India on July 1, 2010.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, July 1). Implementing SpokenMedia for the Indian Institute for Human Settlements. Presentation at Technology for Education Conference: Mumbai, India. July 1, 2010. Retrieved July 14, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/implementing-spokenmedia-for-the-indian-institute-for-human-settlements
Towards cross-video search
https://spokenmedia.mit.edu/2010/06/towards-cross-video-search/ (Mon, 21 Jun 2010)

Here's a workflow diagram I put together to demonstrate how we're approaching the problem of searching over the transcripts of multiple videos and ultimately returning search results that maintain time-alignment for playback.

Preparing Transcripts for Search Across Multiple Videos (Source: Brandon Muramatsu)

You'll notice I included using OCR on lecture slides to help in search and retrieval. This is not an area we're currently focusing on, but we have been asked about it. A number of researchers and developers have looked at this area; if and when we include it, we'd work with folks like Matterhorn (or perhaps others) to integrate the solutions that they've implemented.

Making Progress
https://spokenmedia.mit.edu/2010/06/making-progress/ (Thu, 17 Jun 2010)

In the last month or two we've made good progress getting additional parts of the SpokenMedia workflow into a working state.

Here’s a workflow diagram showing what we can do with SpokenMedia today.

SpokenMedia Workflow, June 2010 (Source: Brandon Muramatsu)

(The bright yellow indicates features working in the last two months, the gray indicates features we’ve had working since December 2009, and the light yellow indicates features on which we’ve just started working.)

To recap, since December 2009, we’ve been able to:

  • Rip audio from video files and prepare it for the speech recognizer.
  • Process the audio through the speech recognizer locally within the SpokenMedia project using domain and acoustic models.
  • Present output transcript files (.WRD) through the SpokenMedia player.

Recently, we’ve added the ability to:

  • Create domain models (or augment existing domain models) from input files.
  • Create unsupervised acoustic models from input audio files. (Typically 10 hours of audio by the same speaker are required to create a "good" acoustic model, certainly for Americans speaking English. We're still not sure how well this capability will allow us to handle Indian-English speakers.)
  • Use a selected domain or acoustic model from a pre-existing set, in addition to creating a new one.
  • Process audio through an "upgraded" speech recognizer, using the custom domain and acoustic models (though this recognition is still being performed on Jim Glass's research cluster).

We still have a ways to go, and we still need to better understand the potential accuracy of our approach. The critical blocker now is a means to compare a known accurate transcript with the output of the speech recognizer (it's a matter of transforming existing transcripts into time-aligned ones of the right format). And then there are the two challenges of automating the software and getting it running on OEIT servers (we've reverted to using Jim Glass's research cluster to get some of the other pieces up and running).

Using Lucene/Solr for Transcript Search
https://spokenmedia.mit.edu/2010/06/using-lucenesolr-for-transcript-search/ (Wed, 16 Jun 2010)

Overview

In anything but a trivial implementation, searching lecture transcripts presents challenges not found in other search targets.  Chief among them is that each transcript word requires its own metadata (start and stop times).  Solr, a web application that derives its search muscle from Apache Lucene, has a query interface that is both rich and flexible; it doesn't hurt that it's also very fast.  Properly configured, it provides an able platform for searching lecture transcripts.  Although Solr is the server, the search itself is performed by Lucene, so much of the discussion below addresses Lucene specifically.  Integration with the server will be discussed in a subsequent post.

Objective

We want to implement an automated workflow that takes a file containing all the words spoken in a lecture, along with their start and stop times, and persists them into a repository that will allow us to:

  • Search all transcripts for a word, phrase, or keyword, with faceted searches, word stemming, result ranking, and spelling correction.
  • Have the query result include metadata that lets us show a video clip, mapping each word to the place in the video where it is uttered.
  • Allow a transcript editing application to modify the content of the word file, as well as the time codes, in real time.
  • Dependably maintain the mapping between words and their time codes.

Technique

The WRD file contains the transcript for a single lecture, one word per line.  Preceding each word on its line are its start and stop times in milliseconds.  We call this format a WRD file[1], and you can see a snippet of it below.

To use Lucene/Solr, the first task is loading the data.  Lucene reads the data and creates an index.

The first transformation of the workflow is to convert the WRD file into a format that can easily be POSTed into Solr.  (Solr allows uploading documents by passing a URL to the Solr server with an appended query string specifying the upload document.)  I'm currently performing the transformation from WRD to the required XML format manually in an editor.  I'm taking a series of lines like this:

...
6183 6288 in
6288 6868 physics
7186 7342 we
7342 8013 explore
9091 9181 the
9181 9461 very
9461 9956 small
10741 10862 to
10862 10946 the
10946 11226 very
11226 11686 large
...

to this:

...
<field name="trans_word">In</field>
<field name="trans_word">physics</field>
<field name="trans_word">we</field>
<field name="trans_word">explore</field>
<field name="trans_word">the</field>
<field name="trans_word">very</field>
<field name="trans_word">small</field>
<field name="trans_word">to</field>
<field name="trans_word">the</field>
<field name="trans_word">very</field>
<field name="trans_word">large</field>
...
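
Automating that step really is trivial; here is a sketch with awk, assuming the three-column WRD layout above (it emits only the <field> lines, with no XML escaping and no surrounding Solr document envelope):

# Emit one Solr <field> element per transcript word (third column of the WRD file)
awk '{ printf "<field name=\"trans_word\">%s</field>\n", $3 }' lecture.wrd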

There are also libraries available for a variety of programming languages that could do the same job.  What you will have noticed, though, is that the uploaded content doesn't include the time codes.  Snap!

So it seems that it is easy to use Lucene/Solr to perform a full-featured search of transcripts, and easy to use a database to search for individual transcript words and retrieve their timecodes, but there isn't an out-of-the-box tool that lets me integrate these two requirements.

Ideally, we want to store the word together with its time codes while still permitting Solr to work its search magic.

There are a couple of ways to perform this integration as an additional step, which isn't my preference.[2]  Once the transcript is uploaded, I can use a test tool to return character offsets for words in transcripts.  That's the raw data I need to work with individual words and still have useful full-text searching; now I've got to figure out what to do with it.

Here's what the data looks like for the word "yes" in Lewin's first lecture:
...
<lst name="yes">
<int name="tf">3</int>
<lst name="offsets">
<int name="start">6402</int>
<int name="end">6405</int>
<int name="start">20045</int>
<int name="end">20048</int>
<int name="start">22858</int>
<int name="end">22861</int>
</lst>
<int name="df">1</int>
</lst>
...

"tf" is the term frequency.
"offsets" are the start and end positions, in characters.

Lucene Payloads

Lucene has an advanced feature that stores additional metadata with each indexed token (in our case, a transcript word).  The payload feature is usually used to influence the ranking of search results, but it can be used for other purposes as well.[3]  What recommends it for our use is that it stores the data in the index, so it is readily available as part of the search result.  The preparation of the transcript would also change.  Using the WRD file example from the Technique section above, we would produce this file:

" in|6183 physics|6288 we|7186 explore|7342 the|9091 very|9181 small|9461 to|10741 the|10862 very|10946 large|11226 "

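Producing that payload-ready line from the same WRD file is another one-liner (a sketch: each token becomes word|startTime, which a delimited-payload token filter on the indexing side can then split apart):

# Turn "startMs endMs word" lines into a single "word|startMs" payload line
awk '{ printf "%s|%s ", $3, $1 } END { print "" }' lecture.wrd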

[1] The .WRD extension is an artifact of previous development and has no relationship to desktop software applications that may use this same file extension.

[2] The reason I don't favor integrating these two as an extra step is that I would need to maintain referential integrity between two independent datastores.  (Remember that we also have the requirement of editing both the transcript words and their timecodes.)

[3] Update: Lucene has a feature called payloads which will allow me to store the timecodes with the words.  Here is a link to a blog post that explains the technique.
