The target use case, then, is searching for words that occur in lecture transcripts, with the search results letting a user see and hear the lecture at the point where the word occurs. We can make this happen by integrating a number of independent processes.
Generating transcripts from lecture videos is described here.
The speech recognition process produces a variety of transcript files. Some are just text, some are text and time codes. None of the generated files include metadata describing the transcript they contain. For the transcripts to serve as search targets they must include additional data that the XSL can use to produce the search results page. This metadata is the minimum that the search target must include:
Google returns its search results as a structured document that contains no presentation markup. Each page that contains a search input control also specifies an XSL transformation to present the search results. This is an example of the markup that appears in the page:
<input name="proxystylesheet" type="hidden" value="/oeit/OcwWeb/search/google-ocw.xsl" />
The default transformation creates the familiar Google search results page, but this transformation may be customized. This is our strategy; our custom transformation will use GSA to search the transcripts and return the hits in a page that will allow viewing the lectures.
GSA indexes pages by following links. Since we will add a page to the site for each lecture transcript
Google provides an XSL stylesheet for customizing the search results page. Substitute your own custom XSL page for Google’s default stylesheet, and change the reference to the stylesheet on every page that contains a search box so that your customization takes effect.
These command lines make this change. (A Unix-like OS is assumed.) Each command line follows a similar pattern. It first uses the find utility to create a list of the files to change. This list is passed to the grep utility, which filters it down to those files that contain the text to change. The filtered list is passed to the perl utility, which makes the change with a global substitution.
The first changes the XSL file from the default to your customized XSL. Note that the argument to grep simply identifies those files that contain a search box. (This argument could as easily have been ‘google-ocw.xsl’, which would have been simpler and as effective.)
find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/google-ocw.xsl/google-greenfield.xsl/g'
[update: the replacement above is incorrect. It fails to update enough of the URL. Here’s a fix if you ran that replace script.]
find . -name 'index.htm' | xargs grep -l 'http://ocw.mit.edu/search/google-greenfield.xsl' | xargs perl -p -i.bak -w -e 's|http://ocw.mit.edu/search/google-greenfield.xsl|http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl|g'
The next changes the name of the site to search. You may not have to change this value. In our case, we created our own Google search “collection” that contained our copy of the OCW site.
find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/input type="hidden" value="ocw" name="site"/input type="hidden" value="oeit-ocw" name="site"/g'
The last specifies the URL of our Google search engine. The reason for this change is that the courses in our OCW instance were collected by using the “download course materials” zip that many OCW courses provide. Since these are intended for download, their search feature is disabled.
find . -name '*.htm*' | xargs grep -lE 'form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"' | xargs perl -p -i.bak -w -e 's|form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"|form method="get" action="http://search.mit.edu/search"|g'
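The pattern (../){1,3} is meant to match one to three leading ../ path segments. Perl reads it that way by default, but grep does so only in extended mode (-E); in basic mode the parentheses and braces are taken literally. A small check, using a made-up form line as an assumed example of what the pages contain:

```shell
# Hypothetical line of the kind found in a downloaded course page.
line='<form method="get" action="../../common/search/AdvancedSearch.htm">'

# grep -c prints the number of matching lines; it exits non-zero on zero
# matches, so "|| true" keeps the script going either way.
basic=$(printf '%s\n' "$line" | grep -c 'action="(../){1,3}common' || true)
extended=$(printf '%s\n' "$line" | grep -cE 'action="(\.\./){1,3}common' || true)

echo "basic: $basic, extended: $extended"
```

With GNU grep this reports no match for the basic form and one match for the extended form, so a file filter of this kind needs -E to find anything.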
find . -name '*.htm' | xargs perl -p -i.bak -w -e 's|<input type="hidden" name="proxystylesheet" value="/oeit/OcwWeb/search/google-ocw.xsl" />|<input type="hidden" name="proxystylesheet" value="http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl" />|g'
Along with Peter Wilkins, we developed a demonstration of SpokenMedia technology using automatic lecture transcription to transcribe videos from IIHS. We developed a new JavaScript player that allowed us to view and search transcripts, and that supports transcripts in multiple languages. View the demo.
In the uStream video below, the SpokenMedia presentation starts at about 19:30 in. The first part of the presentation is Mara Hancock from UC Berkeley talking about Opencast Matterhorn. (Unfortunately they forgot to start saving the stream at the start of her talk.)
The presentation to the IEEE-CS Bangalore Section was also the best presentation of the three. This presentation really wants to be an hour long, and we got great questions from the audience. Unfortunately I forgot to record the presentation; it would have made a great slidecast.
Embedded below is the presentation to the Technology for Education 2009 Conference, the one with a slidecast.