Using Lucene/Solr for Transcript Search

Overview

In any but a trivial implementation, searching lecture transcripts presents challenges not found in other search targets.  Chief among them is that each transcript word requires its own metadata (start and stop times).  Solr, a web application that derives its search muscle from Apache Lucene, has a query interface that is both rich and flexible.  It doesn’t hurt that it’s also very fast.  Properly configured, it provides an able platform to support lecture transcript searching.  Although Solr is the server, the search itself is performed by Lucene, so much of the discussion will address Lucene specifically.  The integration with the server will be discussed in a subsequent posting.

Objective

We want to implement an automated workflow that can take a file containing all the words spoken in a lecture, along with their start and stop times, and persist them into a repository that will allow us to:

  • Search all transcripts for a word, phrase, or keyword with faceted searches, word stemming, result ranking, and spelling correction.
  • Have the query result include metadata that will allow us to show a video clip mapping the word to the place in the video where it is uttered.
  • Allow a transcript editing application to modify the content of the word file, as well as the time codes, in real time.
  • Dependably maintain the mapping between words and their time codes.

Technique

Each WRD file[1] contains the transcript for a single lecture, one word to a line.  Preceding each word on its line are its start and stop times in milliseconds.  A snippet of the format appears below.
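
As an illustration of the format just described, here is a minimal Python sketch of a reader for WRD-style lines.  The names WrdEntry and parse_wrd are hypothetical and are not part of any existing tool; the sketch assumes every data line has exactly three whitespace-separated parts (start, stop, word) and silently skips anything else.

```python
from typing import List, NamedTuple


class WrdEntry(NamedTuple):
    """One transcript word with its time codes (hypothetical helper type)."""
    start_ms: int  # start time in milliseconds
    stop_ms: int   # stop time in milliseconds
    word: str


def parse_wrd(text: str) -> List[WrdEntry]:
    """Parse lines of the form '<start> <stop> <word>'."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not 'int int word',
        # for example the '...' elision markers used in this post.
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        entries.append(WrdEntry(int(parts[0]), int(parts[1]), parts[2]))
    return entries


sample = """...
6183 6288 in
6288 6868 physics
7186 7342 we
..."""
entries = parse_wrd(sample)
```

Keeping the time codes as integers next to each word makes it straightforward to produce whatever upload format a later step requires.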

To use Lucene/Solr, the first task is loading the data.  Lucene reads the data and creates an index.

The first transformation of the workflow is to convert the WRD file into a format that may be easily POSTed to Solr.  (Solr allows uploading documents by passing a URL to the Solr server with an appended query string specifying the upload document.)  I’m currently performing the transformation from WRD to the required XML format manually in an editor.  I’m taking a series of lines like this:

...
6183 6288 in
6288 6868 physics
7186 7342 we
7342 8013 explore
9091 9181 the
9181 9461 very
9461 9956 small
10741 10862 to
10862 10946 the
10946 11226 very
11226 11686 large
...

to this:

...
<field name="trans_word">In</field>
<field name="trans_word">physics</field>
<field name="trans_word">we</field>
<field name="trans_word">explore</field>
<field name="trans_word">the</field>
<field name="trans_word">very</field>
<field name="trans_word">small</field>
<field name="trans_word">to</field>
<field name="trans_word">the</field>
<field name="trans_word">very</field>
<field name="trans_word">large</field>
...

Automating the transformation shown above will be trivial, especially since there are libraries available for a variety of programming languages.  What you will have noticed is that the uploaded content doesn’t include the time codes.  Snap!
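
A rough sketch of what that automation could look like in Python follows.  The function name wrd_to_solr_xml is hypothetical; the field name trans_word comes from the example above, and the surrounding <add> and <doc> elements are the standard wrapper for Solr's XML update format.  Any other fields a schema might require (a unique document id, for instance) are left out, and the time codes are discarded exactly as in the manual conversion.

```python
from xml.sax.saxutils import escape


def wrd_to_solr_xml(wrd_text: str, field_name: str = "trans_word") -> str:
    """Build a Solr <add> document with one field per transcript word."""
    lines = ["<add>", "<doc>"]
    for line in wrd_text.splitlines():
        parts = line.split()
        # Skip blank lines and '...' markers; keep only 'int int word' lines.
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        word = parts[2]
        # Escape &, < and > so the word is safe inside XML element content.
        lines.append('<field name="%s">%s</field>' % (field_name, escape(word)))
    lines += ["</doc>", "</add>"]
    return "\n".join(lines)


sample = """6183 6288 in
6288 6868 physics
7186 7342 we"""
solr_xml = wrd_to_solr_xml(sample)
print(solr_xml)
```

The start and stop columns are read only to validate the line and are then dropped, which is precisely the limitation noted above.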

So it seems that it is easy to use Lucene/Solr to perform a full-featured search of transcripts, and it’s easy to use a database to search for single transcript words and retrieve their timecodes, but there isn’t a tool that allows me to integrate these two requirements out of the box.

Ideally, we want to store the word together with its time codes while still permitting Solr to work its search magic.

There are a couple of ways to perform this integration as an additional step, which isn’t my preference.[2]  Once the transcript is uploaded, I can use a test tool to return character offsets for words in transcripts.  That’s the raw data I need to work with individual words and still have useful full-text searching; now I’ve got to figure out what to do with it.

Here’s what the data looks like for the word “yes” in Lewin’s first lecture:
...
<lst name="yes">
<int name="tf">3</int>
<lst name="offsets">
<int name="start">6402</int>
<int name="end">6405</int>
<int name="start">20045</int>
<int name="end">20048</int>
<int name="start">22858</int>
<int name="end">22861</int>
</lst>
<int name="df">1</int>
</lst>
...

“tf” is the term frequency.
“offsets” are the start and end positions in characters.
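
To make the structure of that response concrete, here is a small Python sketch that pulls the term, its frequency, and its offset pairs out of one term element.  The function name term_offsets is hypothetical, and the sketch assumes the response fragment looks exactly like the example above, with start and end values alternating inside the offsets list.

```python
import xml.etree.ElementTree as ET


def term_offsets(term_xml: str):
    """Return (term, tf, [(start, end), ...]) from one term's <lst> element."""
    term = ET.fromstring(term_xml.strip())
    tf = int(term.find("int[@name='tf']").text)
    offsets_node = term.find("lst[@name='offsets']")
    values = [(child.get("name"), int(child.text)) for child in offsets_node]
    # Start and end values alternate; pair each start with the end after it.
    starts = [value for name, value in values if name == "start"]
    ends = [value for name, value in values if name == "end"]
    return term.get("name"), tf, list(zip(starts, ends))


sample = """
<lst name="yes">
<int name="tf">3</int>
<lst name="offsets">
<int name="start">6402</int>
<int name="end">6405</int>
<int name="start">20045</int>
<int name="end">20048</int>
<int name="start">22858</int>
<int name="end">22861</int>
</lst>
<int name="df">1</int>
</lst>
"""
term, tf, offsets = term_offsets(sample)
print(term, tf, offsets)
```

Note that these are character positions in the indexed text, not time codes, so a separate lookup would still be needed to get from an offset to a place in the video.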

Lucene Payloads

Lucene has an advanced feature, payloads, that stores additional metadata with each indexed token (in our case, that is a transcript word).  The payload feature is usually used to influence the ranking of search results, but it can be used for other purposes as well.[3]  What recommends it for our use is that it stores the data in the index, so it is readily available as part of the search result.  The preparation of the transcript would also change.  Using the WRD file example from the Technique section above, we would produce this file:

" in|6183 physics|6288 we|7186 explore|7342 the|9091 very|9181 small|9461 to|10741 the|10862 very|10946 large|11226 "
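
Producing that file from a WRD file could be sketched in Python as below.  The function names wrd_to_payload_text and split_payload_token are hypothetical.  The sketch follows the example above in using "|" as the delimiter and in keeping only the start time of each word; the stop time is dropped, and the leading and trailing spaces of the example are not reproduced.

```python
def wrd_to_payload_text(wrd_text: str, delimiter: str = "|") -> str:
    """Turn WRD lines into 'word|start' tokens separated by spaces."""
    tokens = []
    for line in wrd_text.splitlines():
        parts = line.split()
        # Skip blank lines and '...' markers; keep only 'int int word' lines.
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        start_ms, _stop_ms, word = parts
        tokens.append(f"{word}{delimiter}{start_ms}")
    return " ".join(tokens)


def split_payload_token(token: str, delimiter: str = "|"):
    """Split one 'word|start' token back into (word, start_ms)."""
    # rpartition splits on the last delimiter, so a word that itself
    # contains the delimiter character is still handled.
    word, _, payload = token.rpartition(delimiter)
    return word, int(payload)


sample = """6183 6288 in
6288 6868 physics
7186 7342 we
7342 8013 explore"""
payload_text = wrd_to_payload_text(sample)
print(payload_text)
```

If the stop time is also needed in the index, the payload would have to carry both values in some agreed encoding, which this sketch does not attempt.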


[1] The .WRD extension is an artifact of previous development and has no relationship to desktop software applications that may use this same file extension.

[2] The reason I don’t favor integrating these two as an extra step is that I would need to maintain referential integrity between two independent datastores.  (Remember that we also have the requirement of editing both transcript words and their timecodes.)

[3] Update: Lucene has a feature called payloads which will allow me to store the timecodes with the words.  Here is a link to a blog post that explains the technique.

Unless otherwise specified, the Spoken Media Website by the MIT Office of Digital Learning, Strategic Education Initiatives is licensed under a Creative Commons Attribution 4.0 International License.