3.2. Parsing Text with NLTK

(full code)

In this section we will parse a long written text, everyone’s favorite tale Alice’s Adventures in Wonderland by Lewis Carroll, and use it to create the state transitions for Markov chains. In this example we use NLTK for natural language processing (refer to the book for more detailed instructions on its usage). However, many of the parsing tasks done here with NLTK could also be achieved with fairly simple regular expressions.
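As a rough, purely illustrative sketch of such a regex-only approach (not used in the rest of this section; it misses edge cases such as abbreviations and quotation marks that NLTK handles):

import re

text = 'Follow the White Rabbit. It is late!'
# Split into "sentences" after sentence-ending punctuation followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', text)
# Pull out word-like tokens and common punctuation marks from each sentence
naive_tokens = [re.findall(r"[\w']+|[.,!?;]", s) for s in naive_sentences]
# naive_sentences == ['Follow the White Rabbit.', 'It is late!']
# naive_tokens == [['Follow', 'the', 'White', 'Rabbit', '.'], ['It', 'is', 'late', '!']]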

Note

NLTK should be installed in your environment and its Punkt tokenizer downloaded. NLTK is listed in requirements.txt. To install it separately, run pip install nltk while your virtual environment is activated.

To install the Punkt tokenizer, open the Python interpreter (with your virtual environment active) and run:

>>> import nltk
>>> nltk.download()

Then, in the window that opens, select the Models tab, choose Punkt Tokenizer Models, and click Download.
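Alternatively, you can skip the graphical tool and download the Punkt models directly by their package name:

>>> import nltk
>>> nltk.download('punkt')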

3.2.1. Downloading the Data

First, we need to get the data. Fortunately, our book of choice is available from Project Gutenberg, which offers thousands of free books. The natural choice is to download the book inside our script. However, to make this part a little more interesting, we download it only once; if the file is already present, we read it from disk instead.

import os
import re

import nltk

alice_file = 'alice.txt'
alice_raw = None

if not os.path.isfile(alice_file):
    # Download the book from Project Gutenberg and cache it locally
    from urllib import request
    url = 'http://www.gutenberg.org/cache/epub/19033/pg19033.txt'
    response = request.urlopen(url)
    alice_raw = response.read().decode('utf8')
    with open(alice_file, 'w', encoding='utf8') as f:
        f.write(alice_raw)
else:
    # Read the previously cached copy
    with open(alice_file, 'r', encoding='utf8') as f:
        alice_raw = f.read()

3.2.2. Remove the Excessive Parts

Now, we have the raw version of the book. Next, we are going to remove the “bloat” that Project Gutenberg adds to the beginning and the end of the book.

# Remove the bloat that Project Gutenberg adds to the start and end of the text
start = "I--DOWN THE RABBIT-HOLE"
end = "End of the Project Gutenberg"
start_index = alice_raw.find(start)
end_index = alice_raw.rfind(end)
alice = alice_raw[start_index:end_index]

# And replace more than one subsequent whitespace chars with one space
alice = re.sub(r'\s+', ' ', alice)
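As a quick illustration with an arbitrary string (not taken from the book), the substitution collapses newlines, tabs and repeated spaces into single spaces:

>>> re.sub(r'\s+', ' ', 'down the\nrabbit-hole,\t  never once')
'down the rabbit-hole, never once'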

3.2.3. Tokenize the Text

Our text is now ready to be tokenized with NLTK. First, we are going to split it into sentences, which is easy with the tools NLTK offers:

sentences = nltk.sent_tokenize(alice)
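To see what the sentence tokenizer does, here is a small made-up example (not taken from the book); it splits on sentence boundaries and keeps the punctuation with each sentence:

>>> nltk.sent_tokenize('The Rabbit took a watch out of its pocket. Alice started to her feet.')
['The Rabbit took a watch out of its pocket.', 'Alice started to her feet.']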

Next, we are going to tokenize each sentence using nltk.word_tokenize, which splits the text into ‘words’ (it also splits punctuation into separate tokens). Here is an example of its output:

>>> nltk.word_tokenize('Follow the "White Rabbit".')
['Follow', 'the', '``', 'White', 'Rabbit', "''", '.']

Here is the actual tokenization code:

tokenized_sentences = []
for s in sentences:
    w = nltk.word_tokenize(s)
    tokenized_sentences.append(w)
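The same loop can also be written as a single list comprehension, which produces exactly the same list of token lists:

tokenized_sentences = [nltk.word_tokenize(s) for s in sentences]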

Another frequently used NLP task is part-of-speech (POS) tagging. We are not going to use it here, but it is as simple as tokenization:

>>> tokens = nltk.word_tokenize('Follow the "White Rabbit".')
>>> nltk.pos_tag(tokens)
[('Follow', 'VB'),
 ('the', 'DT'),
 ('``', '``'),
 ('White', 'NNP'),
 ('Rabbit', 'NNP'),
 ("''", "''"),
 ('.', '.')]

Note

nltk.pos_tag needs a POS tagger, which does not come bundled with the basic NLTK installation. To download one, type nltk.download() in the Python interpreter and download the Averaged Perceptron Tagger from the Models section. In general, the download tool offers many useful models and corpora.
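As with the Punkt models, the tagger can also be downloaded non-interactively by its package name:

>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')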

3.2.4. Sanitation of the Tokenized Sentences

Lastly, we sanitize the tokenized sentences a bit so that punctuation does not clutter the Markov chains. For this purpose, we naively assume that a token is a proper word if it contains at least one Unicode word character. We also end every sentence with a dot to mark a natural pause in the text (one could also add a special token to the beginning).

is_word = re.compile(r'\w')
sanitized_sentences = []
for sent in tokenized_sentences:
    sanitized = [token for token in sent if is_word.search(token)] + ['.']
    sanitized_sentences.append(sanitized)
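Applied to the example tokens from earlier (assuming is_word from the snippet above is defined), the filter drops the quote tokens and the original full stop, and the final dot is then appended:

>>> tokens = ['Follow', 'the', '``', 'White', 'Rabbit', "''", '.']
>>> [t for t in tokens if is_word.search(t)] + ['.']
['Follow', 'the', 'White', 'Rabbit', '.']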

Now, sanitized_sentences should be ready for building the state transition probabilities. That, however, is left as an exercise, together with the actual generation of text.