MIT – SpokenMedia
Transcripts for Lecture Video
https://spokenmedia.mit.edu
Mon, 23 May 2016 13:11:39 +0000

Customizing the Google Search Appliance for Greenfield OCW
https://spokenmedia.mit.edu/2010/08/customizing-the-google-search-appliance-for-greenfield-ocw/
Tue, 24 Aug 2010 14:58:16 +0000

MIT’s OpenCourseWare (OCW) uses MIT’s Google Search Appliance (GSA) to search its content. MIT supports customization of GSA results through XSL transformation. This post describes how we plan to use GSA to search lecture transcripts and return results containing the lecture videos in which the search terms appear.

Since OCW publishes static content, it has no built-in search engine. Search is provided through the GSA, which indexes all OCW content and gives users a familiar search paradigm. For searching lecture transcripts, though, the default behavior is unsatisfactory: a search for words that appear in a lecture transcript returns not only the transcript pages that contain them but every occurrence of those words on the OCW site. In addition, the current pages that contain lecture transcripts provide no way to see the portion of the lecture video where the words occur. Changing the way GSA returns its results is one way to remedy this situation without changing the static nature of the OCW content.

The target use case, then, is searching for words that occur in lecture transcripts with the search results conveniently allowing a user to see and hear the lecture at the point that the word occurs.  We can make this happen by integrating a number of independent processes.

  • Obtain or generate a transcript for the lecture video.
  • Format the transcript so that it is an appropriate target for GSA search.
  • Modify the default Google XSL transformation so that it returns our custom results page.
  • Modify a video player so that it plays the lecture from near the point that the search term occurs.

Generate a Transcript for a Lecture Video

Generating transcripts from lecture videos is described here.

Format a Transcript as a GSA Search Target

The speech recognition process produces a variety of transcript files. Some contain only text; others contain text and time codes. None of the generated files includes metadata describing the transcript it contains. For the transcripts to serve as search targets, they must include additional data that the XSL can use to produce the search results page. At a minimum, the search target must include this metadata:

  • path to lecture video file
  • speaker
  • order of lecture within its course
  • video title
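One way to attach this metadata, assuming each transcript is published as an HTML page and that the appliance is configured to pass meta tag values through to the stylesheet, is to write it into meta tags. The sketch below is only an illustration: the function name, the meta tag names, and the page layout are hypothetical and not taken from the SpokenMedia code.

```shell
#!/bin/sh
# Sketch: wrap a plain-text transcript in an HTML page whose <meta> tags
# carry the minimum metadata listed above. Tag names are hypothetical.

# Escape the characters that are special in HTML text and attribute values.
html_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Usage: make_transcript_page VIDEO_PATH SPEAKER ORDER TITLE TRANSCRIPT_FILE
make_transcript_page() {
  video=$(printf '%s' "$1" | html_escape)
  speaker=$(printf '%s' "$2" | html_escape)
  order=$(printf '%s' "$3" | html_escape)
  title=$(printf '%s' "$4" | html_escape)
  cat <<EOF
<html>
<head>
<title>$title</title>
<meta name="video-path" content="$video" />
<meta name="speaker" content="$speaker" />
<meta name="lecture-order" content="$order" />
<meta name="video-title" content="$title" />
</head>
<body>
<pre>
$(html_escape < "$5")
</pre>
</body>
</html>
EOF
}
```

A call such as make_transcript_page 'videos/lec01.mp4' 'A. Speaker' 1 'Lecture 1' lec01.txt > lec01-transcript.htm would then produce one page per lecture; the file names here are placeholders.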

Customize Google Search XSL

Google returns its search results as a structured document that contains no presentation markup.  Each page that contains a search input control also specifies an XSL transformation to present the search results.  This is an example of the markup that appears in the page:

<input name="proxystylesheet" type="hidden" value="/oeit/OcwWeb/search/google-ocw.xsl" />

The default transformation creates the familiar Google search results page, but this transformation may be customized.  This is our strategy; our custom transformation will use GSA to search the transcripts and return the hits in a page that will allow viewing the lectures.

Edit Video Lecture Pages

GSA indexes pages by following links.  Since we will add a page to the site for each lecture transcript, each video lecture page must link to its transcript page so that GSA can find and index it.

Global Changes to Google search textbox

Google provides an XSL stylesheet for customizing the search results page.  To make your customization take effect, substitute your own custom XSL stylesheet for Google’s default and change the stylesheet reference on every page that contains a search box.

The following commands make these changes.  (A Unix-like OS is assumed.)  Each follows a similar pattern: the find utility creates a list of candidate files; that list is passed to grep, which narrows it to the files containing the text to change; and the filtered list is passed to perl, which makes the change with a global substitution.

The first command changes the XSL file from the default to your customized XSL.  Note that the argument to grep simply identifies those files that contain a search box.  (This argument could as easily have been ‘google-ocw.xsl’, which would have been simpler and just as effective.)

find .  -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/google-ocw.xsl/google-greenfield.xsl/g'

[update: the replacement above is incorrect. It fails to update enough of the URL. Here’s a fix if you ran that replace script.]
find . -name 'index.htm' | xargs grep -l 'http://ocw.mit.edu/search/google-greenfield.xsl' | xargs perl -p -i.bak -w -e 's|http://ocw.mit.edu/search/google-greenfield.xsl|http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl|g'

The next changes the name of the site to search.  You may not have to change this value.  In our case, we created our own Google search “collection” that contained our copy of the OCW site.

find . -name '*.htm*' | xargs grep -l '<input type="hidden" value="ocw" name="site">' | xargs perl -p -i.bak -w -e 's/input type="hidden" value="ocw" name="site"/input type="hidden" value="oeit-ocw" name="site"/g'

The last specifies the URL of our Google search engine.  The reason for this change is that the courses in our OCW instance were collected by using the “download course materials” zip that many OCW courses provide.  Since these are intended for download, their search feature is disabled.

# grep needs -E so that (\.\./){1,3} is read as a repeated group rather than literal text
find . -name '*.htm*' | xargs grep -lE 'form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"' | xargs perl -p -i.bak -w -e 's|form method="get" action="(\.\./){1,3}common/search/AdvancedSearch\.htm"|form method="get" action="http://search.mit.edu/search"|g'


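The following command replaces the entire proxystylesheet input element, rather than only the stylesheet file name, with one that references the custom stylesheet by its full URL.

find . -name '*.htm' | xargs perl -p -i.bak -w -e 's|<input type="hidden" name="proxystylesheet" value="/oeit/OcwWeb/search/google-ocw.xsl" />|<input type="hidden" name="proxystylesheet" value="http://greenfield.mit.edu/oeit/OcwWeb/search/google-greenfield.xsl" />|g'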

SpokenMedia at OCW Consortium Global 2010 Conference
https://spokenmedia.mit.edu/2010/05/spokenmedia-at-ocw-consortium-global-2010-conference/
Fri, 07 May 2010 03:57:22 +0000

Brandon Muramatsu presented on SpokenMedia at the OCW Consortium Global 2010 Conference in Hanoi, Vietnam on May 7, 2010.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, May 5). Opening Up IIHS Video with SpokenMedia. Presentation at OCWC Global 2010: Hanoi, Vietnam, May 5, 2010. Retrieved May 6, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/opening-up-iihs-video-with-spokenmedia

SpokenMedia at OER10
https://spokenmedia.mit.edu/2010/03/spokenmedia-at-oer10/
Tue, 23 Mar 2010 11:48:31 +0000

Brandon Muramatsu presented on SpokenMedia at the OER10 Conference in Cambridge, UK on March 23, 2010.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, March 23). Improving the OER Experience: Rich Media Notebooks of OER Video and Audio. Presentation at OER10: Cambridge, UK, March 23, 2010. Retrieved March 23, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/improving-the-oer-experience-enabling-rich-media-notebooks-of-oer-video-and-audio

SpokenMedia at the IIHS Curriculum Conference
https://spokenmedia.mit.edu/2010/01/spokenmedia-at-iihs-curriculum-conference/
Sat, 23 Jan 2010 16:41:26 +0000

Brandon Muramatsu and Andrew McKinney presented on SpokenMedia at the Indian Institute for Human Settlements (IIHS) Curriculum Conference in Bangalore, India on January 5, 2010.

Along with Peter Wilkins, we developed a demonstration of SpokenMedia technology using automatic lecture transcription to transcribe videos from IIHS. We developed a new JavaScript player that allowed us to view and search transcripts, and that supports transcripts in multiple languages. View the demo.

Cite as: Muramatsu, B., McKinney, A. & Wilkins, P. (2010, January 6). IIHS Open Framework-SpokenMedia. Presentation at the Indian Institute for Human Settlements Curriculum Conference: Bangalore, India, January 5, 2010. Retrieved January 23, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/iihs-open-frameworkspoken-media

SpokenMedia at EdTech Fair
https://spokenmedia.mit.edu/2009/10/spokenmedia-at-edtech-fair/
Fri, 16 Oct 2009 17:53:46 +0000

Brandon Muramatsu, Andrew McKinney and Phillip Long presented at the EdTech Fair at MIT in Cambridge, MA on October 14, 2009. We provided an ongoing demonstration of the automated lecture transcription, search and playback functions of the SpokenMedia project.

SpokenMedia: Content, Content Everywhere…What video? Where? at OpenEd 2009
https://spokenmedia.mit.edu/2009/08/spokenmedia-content-content-everywhere-what-video-where-at-opened-2009/
Thu, 13 Aug 2009 01:48:27 +0000

Brandon Muramatsu presented on SpokenMedia at the Open Education 2009 Conference in August 2009 in Vancouver, British Columbia, Canada.

Cite as: Muramatsu, B. (2009). SpokenMedia: Content, Content Everywhere…What video? Where?: Improving the Discoverability of OER video and audio lectures. Presentation at the Open Education 2009: Vancouver, British Columbia, August 12, 2009. Retrieved August 17, 2009 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/spokenmedia-content-content-everywherewhat-video-where-at-opened-2009#?type=presentation

In the uStream video below, the SpokenMedia presentation starts at about 19:30 in. The first part of the presentation is Mara Hancock from UC Berkeley talking about Opencast Matterhorn. (Unfortunately they forgot to start saving the stream at the start of her talk.)

Cite as: Muramatsu, B. (2009). SpokenMedia: Content, Content Everywhere…What video? Where? Presentation at Open Education 2009: Vancouver, British Columbia, August 12, 2009. Retrieved August 17, 2009 from uStream Web site: http://www.ustream.tv/flash/video/1972941

SpokenMedia Project: Media-Linked Transcripts and Rich Media Notebooks for Learning and Teaching at T4E 2009
https://spokenmedia.mit.edu/2009/08/spokenmedia-project-media-linked-transcripts-and-rich-media-notebooks-for-learning-and-teaching-at-t4e-2009/
Fri, 07 Aug 2009 01:39:40 +0000

Brandon Muramatsu presented on SpokenMedia three times in India in August 2009: at the 2009 Technology for Education Workshop, at Microsoft Research India, and to the IEEE Computer Society Bangalore Section.

The presentation to the IEEE-CS Bangalore Section was also the best presentation of the three. This presentation really wants to be an hour long, and we got great questions from the audience. Unfortunately, I forgot to record it; it would have made a great slidecast.

Embedded below is the presentation to the Technology for Education 2009 Conference, the one with a slidecast.

Cite as: Muramatsu, B. (2009). SpokenMedia Project: Media-Linked Transcripts and Rich Media Notebooks for Learning and Teaching at T4E 2009. Presentation at the Technology for Education Conference, Bangalore, India, August 4, 2009. Retrieved August 6, 2009 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/spokenmedia-project-medialinked-transcripts-and-rich-media-notebooks-for-learning-and-teaching

Building Community for Rich Media Notebooks: The SpokenMedia Project at NMC 2009
https://spokenmedia.mit.edu/2009/06/building-community-for-rich-media-notebooks-the-spokenmedia-project-at-nmc-2009/
Sat, 20 Jun 2009 19:43:55 +0000

Brandon Muramatsu and Phillip Long presented at the NMC Summer Conference in Monterey, CA on June 12, 2009.

Cite as: Muramatsu, B., McKinney, A., Long, P.D. & Zornig, J. (2009, June 12). Building Community for Rich Media Notebooks: The SpokenMedia Project. 2009 New Media Consortium Summer Conference, Monterey, CA on June 12, 2009. Retrieved October 13, 2009 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/building-community-for-rich-media-notebooks-the-spokenmedia-project

SpokenMedia Poster at AcademiX Conference at Duke
https://spokenmedia.mit.edu/2009/05/spokenmedia-poster-at-academix-conference-at-duke/
Mon, 25 May 2009 19:22:00 +0000

Andrew McKinney presented a poster at the AcademiX Conference at Duke University, Durham, NC on May 28, 2009.

SpokenMedia AcademiX Poster
Source: Andrew McKinney

Cite as: McKinney, A., Muramatsu, B., Zornig, J. & Long, P. (2009, May). SpokenMedia. Poster at the AcademiX Conference, Duke University, Durham, NC. May 28, 2009.

SpokenMedia at OCWC Global
https://spokenmedia.mit.edu/2009/04/spokenmedia-at-ocwc-global/
Thu, 23 Apr 2009 17:47:40 +0000

Brandon Muramatsu presented on SpokenMedia at the OpenCourseWare Consortium Global Conference in Monterrey, Mexico on April 22, 2009.

Cite as: Muramatsu, B. (2009, April). Automated Lecture Transcription. Presentation at the OpenCourseWare Consortium Global Conference in Monterrey, Mexico on April 22, 2009. Retrieved January 23, 2010 from Slideshare Web site: http://www.slideshare.net/bmuramatsu/automated-lecture-transcription