BWT for NLP (2)

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference.

The task was to produce an automatic metric to evaluate machine-generated summaries (i.e., system summaries) against human-generated summaries for the TAC '09 Update Summarization Task. Clearly, the automatic metric is just some function that produces a similarity score between the system summary and the human-generated (the so-called model) summary.

The proposed metrics were evaluated by comparing their rankings of the system summaries from the different peers against the ranking produced by human judges.

Similarity Metric

We use an estimate of the conditional "compressibility" of the model summary given the system summary as the similarity metric. The conditional compressibility is defined as the increase in the compressibility of the model summary when the system summary has been observed.

In order to judge the similarity of the system summary $S$ to the model summary $M$, we propose to use the difference in the compressibility of $M$ when $S$ has not been seen versus when $S$ is given. This metric captures the reduction in the uncertainty in $M$ when $S$ is known.

We define the compressibility $c(M)$ of any string $M$ by

$$c(M) = \frac{H(M)}{|M|}$$

and the conditional compressibility of a string $M$ over an alphabet $A$, given another string $S$ over the same alphabet, as

$$c(M \mid S) = \frac{H(S + M) - H(S)}{|M|}$$

where $S + M$ is the concatenation of the strings $S$ and $M$, $H(S)$ is the entropy of the string $S$, and $|M|$ is the length of the string $M$.

The fractional increase in the compressibility of $M$ given $S$ can then be measured by

$$r(M \mid S) = \frac{c(M) - c(M \mid S)}{c(M)}.$$

We use $r(M \mid S)$ as the similarity metric to measure the similarity of a system summary $S$ to the model summary $M$.
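To make the definitions concrete, here is a minimal sketch in Python. It assumes an entropy estimator `H` (such as the BWT-based one defined in the next section) that maps a symbol sequence to a non-negative number; the function names are mine, not part of our submission.

```python
def compressibility(M, H):
    """c(M) = H(M) / |M|: estimated entropy per symbol of M."""
    return H(M) / len(M)

def conditional_compressibility(M, S, H):
    """c(M|S) = (H(S + M) - H(S)) / |M|: entropy per symbol that M
    adds once S has already been seen."""
    return (H(S + M) - H(S)) / len(M)

def similarity(M, S, H):
    """r(M|S) = (c(M) - c(M|S)) / c(M): fractional reduction in the
    entropy rate of M when S is known. Note the asymmetry in M and S."""
    c_m = compressibility(M, H)
    return (c_m - conditional_compressibility(M, S, H)) / c_m
```

Here `M` and `S` can be any sequences that support concatenation with `+`, e.g. plain strings or lists of word symbols.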

Our metric is similar to the one proposed by Li and Vitanyi and is theoretically well-justified from the perspective of algorithmic information theory. One peculiarity is that our similarity is asymmetric: in general, $r(M \mid S) \neq r(S \mid M)$.

The only thing needed to implement the above similarity metric is an estimate of the entropy $H(S)$ of a string $S$. We use the BWT for this estimate.

BWT-based String Entropy Estimate

We use the Move-To-Front (MTF) entropy of the Burrows-Wheeler transform of a given string $S$ as an estimate of its entropy $H(S)$.


The MTF encoding of a string is produced by traversing the string, emitting for each symbol the current position of that symbol in the alphabet, and then moving the symbol to the front of the alphabet. A sequence with many runs will therefore have many zeros in its MTF encoding.
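A straightforward MTF encoder might look like the following; `mtf_encode` and its table-based implementation are my own sketch (the linear scan of the table is inefficient but fine for illustration):

```python
def mtf_encode(seq, alphabet):
    """Move-To-Front coding: for each symbol, emit its current position
    in the (reordered) alphabet, then move that symbol to the front."""
    table = list(alphabet)              # current order of the alphabet
    codes = []
    for sym in seq:
        i = table.index(sym)            # position of sym right now
        codes.append(i)
        table.insert(0, table.pop(i))   # move sym to the front
    return codes
```

For example, `mtf_encode("aaabbb", ["a", "b"])` yields `[0, 0, 0, 1, 0, 0]`: the runs show up as zeros.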

In the paper, the MTF coding is used to define the MTF entropy (which the authors also call the local entropy) of a string $R$ as

$$MTFE(R) = \sum_i \log\left( MTF(R)_i + 1 \right)$$

where $MTF(R)_i$ is the $i$th symbol of the MTF coding of the string $R$.
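This is a direct transcription of the formula. I use the natural logarithm here; the base is an assumption on my part, and it is immaterial for the metric because the ratio $r(M \mid S)$ is invariant to rescaling $H$.

```python
import math

def mtf_entropy(codes):
    """MTFE(R) = sum_i log(MTF(R)_i + 1), applied to the MTF codes of R."""
    return sum(math.log(c + 1) for c in codes)
```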

Now we define $H(S)$, the entropy of the string $S$, as

$$H(S) = MTFE(BWT(S))$$

where $BWT(S)$ is the BWT of the string $S$.
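Putting the pieces together, here is a sketch of the entropy estimate, composing the `mtf_encode` and `mtf_entropy` sketches above. For clarity it builds the BWT by sorting all rotations, which is quadratic, rather than the linear-time suffix-array construction mentioned below; the sentinel symbol and the use of the sorted symbol set as the MTF alphabet are my assumptions.

```python
def bwt(seq, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (a quadratic
    sketch, not the linear-time suffix-array construction)."""
    s = list(seq) + [sentinel]              # unique end-of-sequence marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return [rot[-1] for rot in rotations]   # last column of the sorted matrix

def entropy(seq):
    """H(S) = MTFE(BWT(S))."""
    transformed = bwt(seq)
    alphabet = sorted(set(transformed))     # lexicographic order
    return mtf_entropy(mtf_encode(transformed, alphabet))
```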

Since the Burrows-Wheeler transform involves just the construction of a suffix array, the computation of our compression-based evaluation metric is linear in time and space in the lengths of the model and system summary strings.

Some Technical Details

For our implementation, we treated each word in a string as a separate symbol, so our alphabet of symbols contained all the words in the two strings being compared. The words were normalized by lower-casing them and removing punctuation. Because the BWT needs an ordered alphabet, we used the lexicographic order on the words in the alphabet.
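A minimal normalizer along these lines would do; the exact regular expression is an assumption on my part:

```python
import re

def tokenize(text):
    """Treat each lower-cased, punctuation-free word as one symbol."""
    return re.findall(r"[a-z0-9]+", text.lower())
```

With two hypothetical summaries, the whole pipeline (using the `similarity` and `entropy` sketches above) runs as:

```python
model  = tokenize("The senate passed the budget bill on Tuesday.")
system = tokenize("On Tuesday the senate passed a budget bill.")
print(similarity(model, system, entropy))
```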

Results
[Table 1: scores of the submitted metrics on the TAC 2009 AESOP task]

The results on the TAC AESOP task (above) show that the BWT-based metric (FraCC in the table) is reasonable for summarization evaluation, especially because it has very few knobs to tune. I obtained these results from Frank (who will present them at TAC next week). The "best metric" is the AESOP submission that seemed to have high scores across several measures.

Comments

  1. There is a tool, "complearn", to compute this kind of distance and then do clustering: http://www.complearn.org/

    This tool arises from the Ph.D. work of R. Cilibrasi, who worked with Vitanyi: http://cilibrar.com/

    E.

  2. Em,

    Thanks. I was aware of Cilibrasi and Vitanyi's work. Marcus Hutter is another guy who does similar things. Hopefully one day I'll get down to reading and understanding the theory behind this stuff better.

