Unsupervised Learning of Natural Languages
Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze sentences into recursively structured phrasal constituents; at the same time, a distributional analysis of partially aligned sentential contexts reveals clusters in the lexicon that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to natural languages: recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. In this book, I address the problem of extracting patterns from natural sequential data and inferring the underlying rules that govern their production. This problem is relevant to both linguistics and bioinformatics, two fields that investigate sequential symbolic data that are hierarchical and context-sensitive.