Customizing the Google Search Appliance for Greenfield OCW
https://spokenmedia.mit.edu/2010/08/customizing-the-google-search-appliance-for-greenfield-ocw/
Tue, 24 Aug 2010 14:58:16 +0000

MIT's OpenCourseWare (OCW) publishes static content, so it doesn't incorporate an integral search engine. Instead, search is provided through MIT's Google Search Appliance (GSA), which indexes all OCW content and presents users with a familiar search paradigm; MIT supports customization of GSA results through XSL transformation. This post describes how we plan to use GSA to search lecture transcripts and return results containing the lecture videos in which the search terms appear. For searching lecture transcripts, the default GSA behavior is unsatisfactory: a search for words that appear in a lecture transcript returns not just the transcript pages containing those words, but every occurrence of the words anywhere on the OCW site. In addition, the pages that currently contain lecture transcripts provide no way to see the portion of the lecture video where the words occur. Changing the way GSA returns its results is one way to remedy this situation without changing the static nature of the OCW content.

The target use case, then, is searching for words that occur in lecture transcripts, with the search results conveniently allowing a user to see and hear the lecture at the point where the word occurs. We can make this happen by integrating a number of independent processes:

  • Obtain or generate a transcript for the lecture video.
  • Format the transcript so that it is an appropriate target for GSA search.
  • Modify the default Google XSL transformation so that it returns our custom results page.
  • Modify a video player so that it plays the lecture from near the point that the search term occurs.

Generate a Transcript for a Lecture Video

Generating transcripts from lecture videos is described in a separate post (see "Running the Baseline Recognizer," below).

Format a Transcript as a GSA Search Target

The speech recognition process produces a variety of transcript files.  Some are just text; some are text plus time codes.  None of the generated files includes metadata describing the transcript it contains.  For the transcripts to serve as search targets, they must include additional data that the XSL can use to produce the search results page.  At a minimum, the search target must include the following metadata (a sketch of one way to embed it follows the list):

  • path to lecture video file
  • speaker
  • order of lecture within its course
  • video title
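One straightforward way to carry this metadata in a static page is with HTML meta tags, which GSA can index and return with each hit. Below is a minimal sketch of a transcript page's head; the tag names (video, speaker, lecture-order, video-title) and values are hypothetical placeholders for illustration, not an established OCW convention:

<head>
  <title>Lecture 1</title>
  <!-- hypothetical metadata consumed by the custom search-results XSL -->
  <meta name="video" content="/courses/OCW-18.01-f07/L01/lecture.mp4" />
  <meta name="speaker" content="Lecturer Name" />
  <meta name="lecture-order" content="1" />
  <meta name="video-title" content="Lecture 1" />
</head>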

Customize Google Search XSL

Google returns its search results as a structured document that contains no presentation markup.  Each page that contains a search input control also specifies an XSL transformation to present the search results.  This is an example of the markup that appears in the page:

<input name="proxystylesheet" type="hidden" value="/oeit/OcwWeb/search/google-ocw.xsl" />

The default transformation creates the familiar Google search results page, but the transformation may be customized. This is our strategy: our custom transformation will use GSA to search the transcripts and return the hits in a page that allows viewing the lectures.
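GSA delivers raw results as XML: a GSP document with one R element per hit, carrying the URL (U), title (T), snippet (S), and any requested meta tags (MT elements). Here is a minimal sketch of a custom transformation along those lines, reusing the hypothetical video meta tag from the earlier example; our production stylesheet is considerably more involved:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/GSP">
    <html><body>
      <xsl:for-each select="RES/R">
        <div class="hit">
          <a href="{U}"><xsl:value-of select="T"/></a>
          <p><xsl:value-of select="S" disable-output-escaping="yes"/></p>
          <!-- link each hit to its lecture video via the page's meta tag -->
          <xsl:for-each select="MT[@N='video']">
            <a href="{@V}">Play this lecture</a>
          </xsl:for-each>
        </div>
      </xsl:for-each>
    </body></html>
  </xsl:template>
</xsl:stylesheet>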

Edit Video Lecture Pages

GSA indexes pages by following links.  Since we will add a page to the site for each lecture transcript, those transcript pages must be linked from the existing course pages so that GSA will discover and index them.

Global Changes to the Google Search Textbox

Google provides an XSL stylesheet for customizing the search results page.  To make your customization take effect, substitute your own custom XSL page for Google's default stylesheet, and change the reference to the stylesheet on every page that contains a search box.

The following command lines make these changes.  (A Unix-like OS is assumed.)  Each command follows a similar pattern: it first uses the find utility to build a list of candidate files, pipes that list to grep to keep only the files containing the text to change, and finally passes the filtered list to perl, which performs a global substitution.

The first changes the XSL file from the default to your customized XSL.  Note that the argument to grep simply identifies the files that contain a search box.  (The argument could as easily have been 'google-ocw.xsl', which would have been simpler and just as effective.)

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/google-ocw.xsl/google-greenfield.xsl/g'

[Update: the replacement above is incorrect; it fails to update enough of the URL. Here's a fix if you already ran that replace script.]
find . -name 'index.htm' | xargs grep -l 'http://ocw.mit.edu/search/google-greenfield.xsl' | xargs perl -p -i.bak -w -e 's|http://ocw.mit.edu/search/google-greenfield.xsl|http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl|g'

The next command changes the name of the site to search.  You may not need to change this value; in our case, we created our own Google search "collection" containing our copy of the OCW site.

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/input type="hidden" value="ocw" name="site"/input type="hidden" value="oeit-ocw" name="site"/g'

The last command specifies the URL of our Google search engine.  This change is needed because the courses in our OCW instance were collected using the "download course materials" zip that many OCW courses provide; since those packages are intended for offline use, their search feature is disabled.

find . -name '*.htm*' | xargs grep -l 'form method="get" action="(../){1,3}common/search/AdvancedSearch.htm"' | xargs  perl -p -i.bak -w -e 's|form method="get" action="(../){1,3}common/search/AdvancedSearch.htm"|form method="get" action="http://search.mit.edu/search"|g'


Finally, this command points the proxystylesheet value on each page at the customized stylesheet on our server:

find . -name '*.htm' | xargs perl -p -i.bak -w -e 's|<input type="hidden" name="proxystylesheet" value="/oeit/OcwWeb/search/google-ocw.xsl" />|<input type="hidden" name="proxystylesheet" value="http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl" />|g'

Running the Baseline Recognizer
https://spokenmedia.mit.edu/2010/08/running-the-baseline-recognizer/
Sat, 07 Aug 2010 01:59:27 +0000

The software that processes lecture audio into a textual transcript consists of a series of scripts that marshal input files and parameters for a speech recognition engine.  Interestingly, since the engine is data driven, its code seldom changes; improvements in performance and accuracy are achieved by refining the data it uses to perform its tasks.

There are two steps to producing the transcript.  The first creates an audio file in the correct format for speech recognition; the second processes that audio file into the transcript.

Setup

The scripts require a specific folder hierarchy.  All paths are rooted here:

/usr/users/mckinney/sm

Unless a fully qualified path is specified, assume this is "home" for the paths that occur below.  For example, when I refer to the lectures folder without explicitly specifying its full path, I am referring to this folder:

/usr/users/mckinney/sm/lectures

The scripts require that each video be placed in a child folder named for its lecture ordinal, inside a parent folder named for the course the videos belong to.  The child folder is referred to as the "lecture" folder; the parent is the "course" folder.

For example, in the path

lectures/OCW-18.01-f07/L01

The course is OCW-18.01-f07 and the lecture is L01. The video file itself is in the L01 folder.

So, in the lectures folder, create a folder for the course and, within it, a folder for the lecture; drop the video file into the lecture folder.
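The resulting layout looks like this, using the video file naming pattern that appears in the script output below:

lectures/
  OCW-18.01-f07/                <- course folder
    L01/                        <- lecture folder
      OCW-18.01-f07-L01.mp4     <- source video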

Create the Audio file

The audio file that is the input for speech recognition must be a .WAV file (16-bit PCM at a bit rate of 256 kbit/s).  The script that creates this file is:

create_wavefile.cmd

and is located in the scripts folder.

The script takes three parameters:

  • course folder name
  • lecture folder name
  • format of the video file, expressed as its file extension

For example, to run the script from the lectures folder, execute this command:

../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4

Note that the script path, as well as the two folder-name parameters, is given relative to the folder where the command is executed.  The following is the terminal output of a successful run of the audio file creation script.

pwilkins@sls:/usr/users/mckinney/sm/lectures$ ../scripts/create_wavefile.cmd OCW-18.01-f07 L01 mp4
*** Creating wave file /usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav OCW-18.01-f07-L01
FFmpeg version r11872+debian_0.svn20080206-18+lenny1, Copyright (c) 2000-2008 Fabrice Bellard, et al.
configuration: --enable-gpl --enable-libfaad --enable-pp --enable-swscaler --enable-x11grab --prefix=/usr --enable-libgsm --enable-libtheora --enable-libvorbis --enable-pthreads --disable-strip --enable-libdc1394 --enable-shared --disable-static
libavutil version: 49.6.0
libavcodec version: 51.50.0
libavformat version: 52.7.0
libavdevice version: 52.0.0
built on Jan 25 2010 18:27:39, gcc: 4.3.2

Seems stream 1 codec frame rate differs from container frame rate: 30000.00 (30000/1) -> 14.99 (15000/1001)
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.mp4':
Duration: 00:51:32.2, start: 0.000000, bitrate: 303 kb/s
Stream #0.0(und): Audio: mpeg4aac, 44100 Hz, stereo
Stream #0.1(und): Video: mpeg4, yuv420p, 480x360 [PAR 1:1 DAR 4:3], 14.99 tb(r)
Stream #0.2(und): Data: mp4s / 0x7334706D
Stream #0.3(und): Data: mp4s / 0x7334706D
Output #0, wav, to '/usr/users/mckinney/sm/lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav':
Stream #0.0(und): Audio: pcm_s16le, 16000 Hz, mono, 256 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Press [q] to stop encoding
size= 96633kB time=3092.2 bitrate= 256.0kbits/s
video:0kB audio:96633kB global headers:0kB muxing overhead 0.000044%
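For reference, the log suggests the script's core conversion step is roughly equivalent to the following ffmpeg invocation (a sketch inferred from the output above, not the actual contents of create_wavefile.cmd):

ffmpeg -i lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 lectures/OCW-18.01-f07/L01/OCW-18.01-f07-L01.wav

Note that 16,000 Hz x 16 bits x 1 channel works out to exactly the required 256 kbit/s.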

Process the Audio file into a Transcript

SpokenMedia Transcript Editor
https://spokenmedia.mit.edu/2010/04/spokenmedia-transcript-editor/
Mon, 12 Apr 2010 14:30:32 +0000

We're working on a JavaScript-based transcript editor with our developer, Ryan Lee of voccs.com.

The goals of the editor project are:

  • Low- and high-accuracy editors: We believe the best approach to transcript editing involves separating the editing into two distinct phases. In cases where the transcript is mostly accurate, we want to retain the time/word relationships; that is, for every word, we want to make sure we retain the timecode associated with that word (a sketch of such a structure follows this list). In cases where the transcript is mostly inaccurate, we believe it's best to edit the transcript as a single block of text, then take the edited transcript and align it to the audio after the editing is completed. Unfortunately, this requires a time delay (best case is about 1:1.5) to reprocess the video.
  • Be simple and intuitive to use.
  • Have a clean design.
  • Support the user with a limited amount of extra mousing and/or clicking (this is the one compelling reason for us to have the “low” and “high” accuracy editors).
  • Integrate an audio/video player within the UI of the transcript editor (instead of running the video/audio as a separate application, or in a separate window, from the editor).
  • Implement an editing communication protocol between the server and the client browser.
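For the high-accuracy editor, retaining time/word relationships implies a data structure along these lines (a hypothetical format for illustration; the field names are ours, not the editor's actual schema):

{
  "words": [
    { "start": 12.34, "end": 12.62, "text": "welcome" },
    { "start": 12.62, "end": 12.90, "text": "to" },
    { "start": 12.90, "end": 13.55, "text": "lecture" }
  ]
}

Editing a word in place changes only its text and leaves the timecodes intact; the block-edit path discards the timings and regenerates them by re-aligning the edited text to the audio.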

We've seen some initial designs from Ryan; once this design phase is completed, we'll post the editors with transcripts and go into a testing phase.

Extending the Spiral Connect Player
https://spokenmedia.mit.edu/2010/04/extending-the-spiral-connect-player/
Mon, 05 Apr 2010 14:30:08 +0000

Christophe Battier, Jean Baptiste Nallet, and Alexandre Louys from ICAP at the Université de Lyon 1 in France visited the SpokenMedia team in February 2010.

They are working on a new version of their virtual learning environment (VLE) — a learning management system (LMS) in American-speak — that has an integrated video player with a number of features of interest to SpokenMedia.

The player is Flash-based and provides the ability for users to create “bubbles” — or annotations/bookmarks — that overlay the video. These bubbles can be seen along a timeline, and can be used to provide feedback from teacher to student or highlight interesting aspects of the video.

Here’s a screenshot from the current version of the Spiral player:

Spiral Player with Bubbles (Source: Christophe Battier/Spiral)

We discussed integrating aspects of the transcript display from the player we've been developing.

The user can watch the video and see the transcript with a "bouncing ball" highlighting the phrase being spoken. The user can switch between transcripts in multiple languages. And the user can search through the transcript and play back the video by double-clicking on a search result.
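That "bouncing ball" behavior amounts to keeping the highlighted phrase in sync with the player's clock. Here is a minimal JavaScript sketch, assuming an HTML5 video element and a phrases array of {start, end, el} objects built from a time-coded transcript; the Flash-based Spiral player and our own player differ in detail:

// Highlight the phrase whose time span contains the playhead.
video.addEventListener('timeupdate', function () {
  var t = video.currentTime;
  for (var i = 0; i < phrases.length; i++) {
    var p = phrases[i];
    p.el.className = (t >= p.start && t < p.end) ? 'current' : '';
  }
});

// Double-clicking a transcript search result seeks the video there.
function playFrom(phrase) {
  video.currentTime = phrase.start;
  video.play();
}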

We talked about how the SpiralConnect team might extend their player to integrate transcripts, and also to create annotations that could be displayed below or to the side of the video rather than only overlaying the video image.

Here’s a mockup of what we discussed.

SpiralConnect plus SpokenMedia Transcript Mockup (Source: Brandon Muramatsu)

SpokenMedia Player
https://spokenmedia.mit.edu/2010/03/spokenmedia-player/
Mon, 29 Mar 2010 14:30:20 +0000

We've developed a first pass at a new video player for SpokenMedia that integrates video playback and transcript display. (Well, OK, we did this in late December and initially demoed it in January with IIHS.)

Our goals with the player development:

  • JavaScript-based player
  • Play multiple videos on the same page
  • Highlight the phrase corresponding to the current point in the video
  • Be able to switch between multiple audio tracks (if they are available)
  • Be able to switch between transcripts in multiple languages (if they are available)
  • Be able to search through a transcript, and play the video by clicking on the search result
  • Be able to open source the player
  • Include the OEIT logo

We worked with a great developer, Ryan Lee of voccs.com, to build the player.

We used the player as part of our demo for the Indian Institute for Human Settlements.
