BWT for NLP (2)

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference.

The task was to produce an automatic metric to evaluate machine generated summaries (i.e., system summaries) against human generated summaries for the TAC '09 Update Summarization Task. Clearly the automatic metric is just some function that produces a similarity score between the system summary and the human generated (the so-called model) summary.

The  proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to that of the ranking produced by human judges.
Similarity Metric

We use an estimate of the conditional "compressibility" of the model summary given the system summary as the similarity metric. The conditional compressibility is defined as the increase in the compressibility of the model summary when the system summary has been observed.

In order to judge the similarity of the system summary $latex S$, to the model summary $latex M$, we propose to use the difference in compressibility of $latex M$ when $latex S$ is not seen to when $latex S$ is given. This metric basically

captures the reduction in the uncertainty in $latex M$ when $latex S$ is known.

We define the compressibility $latex c(M)$ of any string $latex M$ by

$latex c(M) = \frac{H(M)}{|M|}$

and the conditional compressibility of string $latex M$ over an alphabet $latex \mathcal{A}$ given another string $latex S$ over the same alphabet as

$latex c(M|S) = \frac{H(S+M) - H(S)}{|M|}$

where $latex S+M$ is the concatenation of the strings $latex S$ and $latex M$, $latex H(S)$ is the entropy of string $latex S$, and $latex |M|$ is the length of the string $latex M$.
The fractional increase in compressibility of $latex M$ given $latex S$ can then measured by

$latex r(M|S) = \frac{c(M) - c(M|S)}{c(M)}$.

We use $latex r(M|S)$ as the similarity metric to measure the similarity of a system summary $latex S$ to the model summary $latex M$.

Our metric is similar to the one proposed by Li and Vitanyi and is theoretically well-justified from the perspective of algorithmic information theory. One peculiarity is that our similarity is asymmetric.

The only thing that is needed to implement the above similarity metric is an estimate of the entropy $latex H(S)$ for a string $latex S$. We use the BWT for this estimate.

BWT-based String Entropy Estimate

We use the Move-To-Front (MTF) entropy of the Burrows-Wheeler transform of a given string $latex S$ as an estimate for its entropy $H(S)$.


The MTF encoding of a string is performed by traversing the string and assigning to each symbol the position of that symbol in the alphabet and then moving the symbol to the front of the alphabet. Therefore a sequence with a lot of runs will  have a lot of zeros in its MTF encoding.

In this paper the MTF coding is used to define the MTF entropy (which the authors also call local entropy) of a string $latex R$ as

$latex \mbox{MTFE}(R) = \sum_i \mbox{log}(\mbox{MTF}(R)_i + 1)$

where $latex \mbox{MTF}(R)_i$ is the $latex i^{th}$ symbol of the MTF coding of the string $latex R$.

Now we define $latex H(S)$, the entropy of string $latex S$ as

$latex H(S) = \mbox{MTFE}(\mbox{BWT}(S))$

where $latex \mbox{BWT}(S)$ is the BWT of string $latex S$.

Since the Burrows-Wheeler transform involves just the construction of a suffix array, the computation of our compression based evaluation metric is linear in time and space in the length of the model and system summary strings.

Some Technical Details
For our implementation, we considered each word in a string as a separate symbol. Our alphabet of symbols therefore contained all the words in the two strings being compared. The words were normalized by lower casing and removing punctuation. Because BWT needs an ordered alphabet, we used the lexicographic order on the words in the alphabet.

Results
table1

The results on the TAC-AESOP task (above) show that the BWT based metric (FraCC in the table) is reasonable for summarization evaluation, especially because there are not very many knobs to tune. I obtained these results from Frank (who will present them at TAC next week). The "best metric" is the AESOP submission that seemed to have high scores across several measures.

Comments

  1. There is a tool, "complearn", to compute this kind of distances and then do clustering: http://www.complearn.org/

    This tool airises from the Ph.D. activity of R.Cilibrasi that worked with Vitanyi: http://cilibrar.com/

    E.

    ReplyDelete
  2. Em,

    Thanks. I was aware of Cilibrasi and Vitanyi's work. Marcus Hutter is another guy who does similar things. Hopefully one day I'll get down to reading and understanding the theory behind this stuff better.

    ReplyDelete

Post a Comment

Popular posts from this blog

Incremental complexity support vector machine

The Cult of Universality in Statistical Learning Theory

Regularized Minimax for Robust Learning