BWT for NLP (1)

The Burrows-Wheeler transform (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string processing for natural language processing.  I will present one such use.

BWT massages the original string into being more amenable to compression. Of course the transform doesn't alter the compressibility (entropy rate) of the original string. All it does is make the string more compressible by algorithms we know.

The reason string permutation by BWT (as opposed to say sorting the string, which makes it really compressible) is useful is that the reverse transform (undoing the permutation) can be done with very little additional information. Mark Nelson wrote a nice introduction to the transform.  Moreover, the BWT essentially involves the construction of the suffix array for the string, and therefore can be done in time and space linear in the length of the string.

Here is an example of the Burrows-Wheeler tranformation of the first stanza of Yeats' Sailing to Byzantium. I added some newlines to the transformed string, and the underscores represent spaces in the original string. Notice the long runs of characters in the transformed string.

Original string

THAT is no country for old men. The young In one another's arms, birds in the trees - Those dying generations - at their song, The salmon-falls, the mackerel-crowded seas, Fish, flesh, or fowl, commend all summer long Whatever is begotten, born, and dies. Caught in that sensual music all neglect Monuments of unageing intellect.


BWTransformed string

rsgnsnlhhs__lntsnH__T__.A____ss.,gt,.-gcd,es s,,,ode,yrgtsgrTredllssrn,edtrln,ntefemnu__fs___eh_hrC___ia__-eennlew_r_nshhhhslldrnbghrttmmgsmhvmnkielto-___nnnnna_ueesstWtTtTttTgsd__ye_teb__Fcweallolgfaaeaa_l


__mumoulr_reoeIiiueao_eouoii_aoeiueon__cm_sliM_


fbhngycrfeoeeoieiteaoctamleen'idit_o__ieu_n_cchaanta


____oa_nnosans_oomeoord_


A useful property

Effros et. al. showed that for a string generated by a finite-memory source, the BWT of the string is asymptotically (in the length of the string) indistinguishable from a piece-wise independent and identically distributed (i.i.d.) string. This is not surprising given that symbols with similar contexts appear sequentially in the BWT string, and for finite memory sources the current symbol is generated i.i.d. given a finite length context.

This property can be exploited to easily cluster words according to context by using BWT.

Word clustering

In this paper, among other things, Brown et.al. present a word clustering algorithm based on maximizing the average mutual information between the cluster ids of adjacent words. Some results are presented in Table 2 in the paper.

Such word clusters can be useful for feature engineering for sequence tagging tasks such as part-of-speech tagging or named-entity recognition. One of the most commonly used features for such tasks is one which checks if the current word is in a carefully constructed list of words.

Brown et. al. admit that, even after optimizations, their algorithm is slow and resort to approximations. (I realize that computers have gotten much faster since but still their algorithm is cubic in the size of the vocabulary.)

Word clustering based on BWT

We will cluster two words together if they appear independently given certain contexts (albeit with different probabilities). We first perform a BW transform on the input string of words (considering each word as a symbol, unlike in the example above) and measure whether the two words appear independently in an i.i.d. fragment.

Instead of actually trying to chop the BWT string into i.i.d. fragments before analysis, we adopt a proxy metric. We check if the number of times the two words are next to each other in the BWT string is large compared to what we would expect from their frequencies. We compute this as probability ratio with appropriate smoothing.

Another neat consequence of doing the clustering by BWT is that we only need to consider pairs of words that do appear next to each other in the BWT string. Therefore the selection of candidates for clustering is linear in the length of the string and not quadratic in the size of the vocabulary.

Some results

I ran this algorithm on about a month's worth of New York Times and Wall Street Journal news data and these are the pairs of words with the highest scores.
january february 0.177721578886

january march 0.143172972502

march february 0.142398170589

englandgeoneng jerseyusanj 0.141412321852

news becdnews 0.135642386152

finala final 0.131901568726

finala finalb 0.122728309966

finala finalc 0.113085215849

cafd cea 0.107549686029

february april 0.100734422316

january april 0.0993752546848

has have 0.0967101802923

march april 0.0929933503714

did does 0.0854452561942

has had 0.0833642704346

will would 0.0827179598199

have had 0.0773517518078

january february 0.177721578886


january march 0.143172972502


march february 0.142398170589


englandgeoneng jerseyusanj 0.141412321852


news becdnews 0.135642386152


finala final 0.131901568726


finala finalb 0.122728309966


finala finalc 0.113085215849


cafd cea 0.107549686029


february april 0.100734422316


january april 0.0993752546848


has have 0.0967101802923


march april 0.0929933503714


did does 0.0854452561942


has had 0.0833642704346


will would 0.0827179598199


have had 0.0773517518078


...


I constructed a graph by joining all word pairs that have a score above a threshold and ran a greedy maximal clique algorithm. These are some of the resulting word clusters.

older young younger


announced today yesterday said reported


month today week yesterday


days month months decade year weeks years


decades months decade weeks years


com org www


writing write wrote


directed edited produced


should will probably could would may might can


worries worried concerns


work worked working works


wearing wear wore


win lost losing


man people men


against like to about that by for on in with from of at


under by on with into over from of


baton moulin khmer


daughter husband sister father wife mother son


red green blue black


ice sour whipped


time days months year years day


eastern coast southeastern


bergen orange nassau westchester


east ivory west


goes gone go going went


known seen well


travel review leisure weekly editorial


cultural financial foreign editorial national metropolitan


thursdays wednesdays fridays sundays tuesdays


thursday today monday sunday yesterday wednesday saturday friday tuesday


...


Discussion

1. For the above results, I only did the clustering based on right contexts. We can easily extend the word-pair score to take into account left contexts as well by concatenating the BWT of the reversed string to the BWT of the original string, and calculating the scores on this double length transformed string.

2. The word clustering algorithm of Brown et. al. proceeds by iteratively merging the best pair of words and replacing the two words in the alphabet (and the string) by a merged word. We can imagine doing something similar with our approach, except, because BWT uses the order on the alphabet, we need to decide where to insert the merged word.

3. One thing that I should have done but didn't for the above results is to order the alphabet (of words) lexicographically. Instead I assign positive integers to the words based on their first appearance in the string, which is the order BWT uses to sort. Fixing this should improve the results.

Comments

Popular posts from this blog

Incremental complexity support vector machine

An effective kernelization of logistic regression

The Cult of Universality in Statistical Learning Theory