Customizing the Google Search Appliance for Greenfield OCW

MIT’s OpenCourseWare (OCW) publishes static content, so it doesn’t incorporate an integral search engine.  Search is instead provided through MIT’s Google Search Appliance (GSA), which indexes all OCW content and gives users a familiar search paradigm.  MIT supports customization of GSA results through XSL transformation.  This post describes how we plan to use GSA to search lecture transcripts and return results containing the lecture videos in which the search terms appear.

For the purpose of searching lecture transcripts, though, the default behavior is unsatisfactory.  Searching for words that appear in a lecture transcript returns not only the transcript pages containing those words but every occurrence of the words on the OCW site.  In addition, the current pages that contain lecture transcripts provide no way to see the portion of the lecture video where the words occur.  Changing the way GSA returns its results is one way to remedy this situation without changing the static nature of the OCW content.

The target use case, then, is searching for words that occur in lecture transcripts with the search results conveniently allowing a user to see and hear the lecture at the point that the word occurs.  We can make this happen by integrating a number of independent processes.

  • Obtain or generate a transcript for the lecture video.
  • Format the transcript so that it is an appropriate target for GSA search.
  • Modify the default Google XSL transformation so that it returns our custom results page.
  • Modify a video player so that it plays the lecture from near the point that the search term occurs.
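The last step depends on turning a transcript time code into a playback offset.  Here is a minimal sketch of that conversion, assuming the time codes look like hh:mm:ss (the post does not specify their format) and that the player accepts a start offset in seconds.  The function name and the lead-in value are hypothetical.

```shell
# Hypothetical helper: convert an hh:mm:ss time code into a start offset in
# seconds, backing up by a small lead-in so playback begins just before the
# matched word.  The hh:mm:ss format is an assumption.
timecode_to_start() {
    # $1: time code such as 00:12:34   $2: lead-in in seconds (default 5)
    printf '%s\n' "$1" | awk -F: -v lead="${2:-5}" '{
        start = $1 * 3600 + $2 * 60 + $3 - lead
        if (start < 0) start = 0   # never seek before the beginning
        print start
    }'
}
```

For example, `timecode_to_start 00:12:34 5` prints 749, which is 754 seconds minus the 5 second lead-in.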

Generate a Transcript for a Lecture Video

Generating transcripts from lecture videos is described here.

Format a Transcript as a GSA Search Target

The speech recognition process produces a variety of transcript files.  Some are just text; others are text and time codes.  None of the generated files include metadata describing the transcript they contain.  For the transcripts to serve as search targets, they must include additional data that the XSL can use to produce the search results page.  At a minimum, the search target must include this metadata:

  • path to lecture video file
  • speaker
  • order of lecture within its course
  • video title
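One way to carry the four fields listed above is to wrap each transcript in an HTML page with meta tags, since GSA can return meta tag values in its XML results when a request asks for them (the getfields parameter).  The sketch below assumes that approach.  The function names and the meta tag names are hypothetical, not an established OCW convention.

```shell
# Escape the characters that are special in HTML text and attribute values.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Hypothetical helper: print an HTML page that wraps a plain-text transcript
# and carries the minimum metadata in <meta> tags.
write_transcript_page() {
    # $1 video path  $2 speaker  $3 lecture order  $4 video title  $5 transcript file
    video=$(printf '%s\n' "$1" | html_escape)
    speaker=$(printf '%s\n' "$2" | html_escape)
    order=$(printf '%s\n' "$3" | html_escape)
    title=$(printf '%s\n' "$4" | html_escape)
    cat <<EOF
<html>
<head>
<title>$title</title>
<meta name="video_path" content="$video" />
<meta name="speaker" content="$speaker" />
<meta name="lecture_order" content="$order" />
<meta name="video_title" content="$title" />
</head>
<body>
<pre>
$(html_escape < "$5")
</pre>
</body>
</html>
EOF
}
```

Calling `write_transcript_page /videos/lec01.mp4 "A. Lecturer" 1 "Lecture 1" lec01.txt > lec01-transcript.htm` would produce one page per lecture.  All of the argument values here are made up for illustration.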

Customize Google Search XSL

Google returns its search results as a structured document that contains no presentation markup.  Each page that contains a search input control also specifies an XSL transformation to present the search results.  This is an example of the markup that appears in the page:

<input name="proxystylesheet" type="hidden" value="/oeit/OcwWeb/search/google-ocw.xsl" />

The default transformation creates the familiar Google search results page, but this transformation may be customized.  This is our strategy; our custom transformation will use GSA to search the transcripts and return the hits in a page that will allow viewing the lectures.
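To try a custom stylesheet without editing any pages, it can help to build the search request by hand.  The sketch below assembles a URL from the same fields the search form submits.  The host name and front end name are assumptions (substitute your own appliance's values), and the query encoding only handles spaces, so it suits plain-word queries.

```shell
# Assumed values: replace with your appliance's host and front end.
GSA_HOST="gsa.example.edu"
GSA_CLIENT="default_frontend"

# Hypothetical helper: print a GSA search URL that applies a given stylesheet.
build_search_url() {
    # $1 query terms  $2 collection (the "site" field)  $3 stylesheet path
    printf 'http://%s/search?q=%s&site=%s&client=%s&output=xml_no_dtd&proxystylesheet=%s\n' \
        "$GSA_HOST" \
        "$(printf '%s\n' "$1" | sed 's/ /+/g')" \
        "$2" "$GSA_CLIENT" "$3"
}
```

For example, `build_search_url "fourier transform" ocw /oeit/OcwWeb/search/google-ocw.xsl` prints a URL that can be pasted into a browser or passed to a tool such as curl.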

Edit Video Lecture Pages

GSA indexes pages by following links.  Since we will add a page to the site for each lecture transcript

Global Changes to Google Search Textbox

Google provides an XSL stylesheet for customizing the search results page.  To make your customization take effect, substitute your own custom XSL page for Google’s default stylesheet and change the reference to the stylesheet on every page that contains a search box.

The following shell commands make this change.  (A Unix-like OS is assumed.)  Each command follows a similar pattern.  It first uses the find utility to create a list of candidate files.  This list is passed to the grep utility, which filters it down to those files that contain the text to change.  The filtered list is then passed to the perl utility, which makes the change by performing a global substitution.
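Because perl edits the files in place, it is worth previewing the file list before running a substitution.  The sketch below covers only the find and grep stages of the pattern described above and prints the pages that would be touched.  The function name is hypothetical.  It also skips `*.bak` files, because the `*.htm*` pattern would otherwise match the backups (such as index.htm.bak) left by an earlier `perl -i.bak` run.

```shell
# Hypothetical helper: list the pages that contain the OCW search box,
# without modifying anything.
list_search_pages() {
    # $1: root directory of the site copy
    find "$1" -type f -name '*.htm*' ! -name '*.bak' \
        -exec grep -lF '<input type="hidden" value="ocw" name="site">' {} +
}
```

Running `list_search_pages . | wc -l` gives a quick count of the pages a substitution would affect.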

The first command changes the XSL file from the default to your customized XSL.  Note that the argument to grep simply identifies those files that contain a search box.  (This argument could as easily have been ‘google-ocw.xsl’, which would have been simpler and as effective.)

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/google-ocw\.xsl/google-greenfield.xsl/g'

[Update: the replacement above is incorrect.  It fails to update enough of the URL.  Here’s a fix if you ran that replace script.]
find . -name 'index.htm' | xargs grep -l '' | xargs perl -p -i.bak -w -e 's|||g'

The next command changes the name of the site to search.  You may not have to change this value.  In our case, we created our own Google search “collection” that contained our copy of the OCW site.

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/input type="hidden" value="ocw" name="site"/input type="hidden" value="oeit-ocw" name="site"/g'

The last commands specify the URL of our Google search engine.  This change is needed because the courses in our OCW instance were collected by using the “download course materials” zip that many OCW courses provide.  Since these are intended for download, their search feature is disabled.

find . -name '*.htm*' | xargs grep -E -l 'form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"' | xargs perl -p -i.bak -w -e 's|form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"|form method="get" action=""|g'

find . -name '*.htm' | xargs perl -p -i.bak -w -e 's|<input type="hidden" name="proxystylesheet" value="/oeit/OcwWeb/search/google-ocw.xsl" />|<input type="hidden" name="proxystylesheet" value="" />|g'
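Each of these commands leaves a `.bak` copy beside every file it edits, so a substitution that goes wrong can be undone.  A minimal sketch of a restore step follows.  The function name is hypothetical.

```shell
# Hypothetical helper: move every *.bak file back over the file it was
# copied from, undoing the most recent in-place edit.
restore_backups() {
    # $1: root directory of the site copy
    find "$1" -type f -name '*.bak' -exec sh -c '
        for bak in "$@"; do
            mv -- "$bak" "${bak%.bak}"   # index.htm.bak -> index.htm
        done
    ' sh {} +
}
```

Note that `perl -i.bak` overwrites an existing backup each time it edits a file, so after several passes the `.bak` file holds the state before the most recent pass, not the original page.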

Unless otherwise specified, the Spoken Media Website by the MIT Office of Digital Learning, Strategic Education Initiatives is licensed under a Creative Commons Attribution 4.0 International License.