Week 1 - Markov Chains, Parsing

(Tue 1.11. 15.45) The first week’s exercises are now in their final form.


  1. RETURN Familiarize yourself with Markov chains and how they can be created from data. See Text Generation with First-Order Markov Chain for an example of how to create a Markov chain from a set of toy data. (A small counting sketch follows the questions below.)

    Given the string babacccabac:

    1. What are the states for the first-order Markov chain?
    2. What are the states for the second-order Markov chain?
    3. Compute the state transition probabilities for the first-order Markov chain.
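
    As a refresher on the counting itself, here is a minimal character-level sketch in Python. It uses a different toy string (abcbab), so the answers to the questions above are not given away:

    from collections import Counter, defaultdict

    def char_transition_probabilities(text):
        """First-order transition probabilities between the characters of text."""
        counts = defaultdict(Counter)
        # Each adjacent character pair (current, successor) is one observed transition.
        for current, successor in zip(text, text[1:]):
            counts[current][successor] += 1
        return {
            state: {succ: n / sum(ctr.values()) for succ, n in ctr.items()}
            for state, ctr in counts.items()
        }

    print(char_transition_probabilities('abcbab'))
    # {'a': {'b': 1.0}, 'b': {'c': 0.5, 'a': 0.5}, 'c': {'b': 1.0}}
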
  2. Create a first-order Markov chain from the book Alice’s Adventures in Wonderland by Lewis Carroll. See Parsing Text with NLTK for the text-parsing instructions. The transitions should be counted from the sanitized_sentences data structure by looping over its list of tokenized sentences (which are themselves lists); that is, the last token of each sentence is not succeeded by any state. (One possible counting helper is sketched after the questions below.)

    For sanity checking, here are the five most probable state successors for the state ‘Alice’:

    [('.', 0.17751479289940827),
     ('was', 0.047337278106508875),
     ('and', 0.047337278106508875),
     ('in', 0.03550295857988166),
     ("'s", 0.03550295857988166)]
    

    What are the 5 most probable state transitions overall? What about the 100 most probable? Why is that?
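
    A minimal counting helper, assuming sanitized_sentences is the list of token lists produced in the parsing instructions (the helper name transition_probabilities is ours, not part of the exercise):

    from collections import Counter, defaultdict

    def transition_probabilities(sentences):
        """State transition probabilities from a list of tokenized sentences.

        Transitions are counted within each sentence only, so the last
        token of a sentence is not succeeded by any state.
        """
        counts = defaultdict(Counter)
        for tokens in sentences:
            for current, successor in zip(tokens, tokens[1:]):
                counts[current][successor] += 1
        return {
            state: {succ: n / sum(ctr.values()) for succ, n in ctr.items()}
            for state, ctr in counts.items()
        }

    # Sanity check against the numbers above:
    # probs = transition_probabilities(sanitized_sentences)
    # sorted(probs['Alice'].items(), key=lambda kv: kv[1], reverse=True)[:5]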

  3. Create a function markov_chain(raw_text), which takes a piece of raw text (str) and returns the state transition probabilities. The function first splits the raw text into sentences, then each sentence into tokens, and finally computes the state transition probabilities from the tokenized sentences.
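
    A sketch of one way to do this, assuming NLTK's default tokenizers (with the punkt data downloaded) and the transition_probabilities helper from the previous sketch:

    import nltk

    def markov_chain(raw_text):
        """State transition probabilities computed from a piece of raw text."""
        sentences = nltk.sent_tokenize(raw_text)                 # raw text -> sentences
        tokenized = [nltk.word_tokenize(s) for s in sentences]   # sentences -> tokens
        return transition_probabilities(tokenized)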

  4. RETURN Create a function sanitize(token_list), which takes as an argument a list of text tokens and sanitizes the list by removing inappropriate tokens. Use a different sanitization method than the one in the example, and try to make it more “intelligent”. Justify your choices and state them clearly in the function’s documentation.
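
    The sanitization policy is yours to design; as one possible (deliberately simple) direction, a sketch that keeps only tokens containing at least one letter:

    import re

    def sanitize(token_list):
        """Remove tokens that contain no alphabetic characters.

        One possible policy: drops pure punctuation and numbers, but keeps
        contraction fragments such as "n't". Your own version should state
        and justify its choices here.
        """
        return [token for token in token_list if re.search(r'[A-Za-z]', token)]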

  5. Alter the function markov_chain to accept an optional boolean argument sanitize, which defaults to True. If it is True, each tokenized sentence is sanitized (with your sanitize function) before the state transition probabilities are computed.
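
    One wrinkle worth noting: inside markov_chain the parameter sanitize shadows the sanitize function, so the function must be reachable under another name. A sketch, reusing the import and helpers from the earlier sketches:

    _sanitize = sanitize  # keep a reference; the parameter below shadows the name

    def markov_chain(raw_text, sanitize=True):
        """State transition probabilities from raw text, optionally sanitized."""
        sentences = nltk.sent_tokenize(raw_text)
        tokenized = [nltk.word_tokenize(s) for s in sentences]
        if sanitize:
            tokenized = [_sanitize(tokens) for tokens in tokenized]
        return transition_probabilities(tokenized)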

  6. RETURN Create a function generate(state_transition_probabilities, length=10, start=None), which returns a piece of text generated from the given state transition probabilities. Specification (a sketch follows the list):

    1. The piece of text should be at most length tokens long and concatenated into a single string.
    2. If a state with no successors is generated, the generation should end without an exception and return the string generated so far (including the state with no successor).
    3. If start is not None, the generation should start from that state; otherwise it starts from a random state.
    4. Handle the case where the state given by start is not in the transition probabilities (most likely by raising an exception).
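
    A minimal sketch satisfying the four points, using random.choices for weighted sampling (assuming the nested-dict representation from the earlier sketches):

    import random

    def generate(state_transition_probabilities, length=10, start=None):
        """Generate at most `length` tokens, joined into a single string."""
        if start is None:
            state = random.choice(list(state_transition_probabilities))
        elif start in state_transition_probabilities:
            state = start
        else:
            raise ValueError('start state {!r} not in transition probabilities'.format(start))
        tokens = [state]
        while len(tokens) < length:
            successors = state_transition_probabilities.get(state)
            if not successors:  # dead end: current state has no successors
                break
            state = random.choices(list(successors),
                                   weights=list(successors.values()))[0]
            tokens.append(state)
        return ' '.join(tokens)
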
  7. RETURN Dan Ventura formulates several “steps” towards more computationally creative systems in his article ‘Mere generation: Essential barometer or dated concept?’. Briefly give your opinion on which of his categories your generate function from the previous exercise falls into. Also answer the question: “How would you make it more creative?” 5-10 sentences should be enough.

  8. RETURN Create a function likelihood(text, state_transition_probabilities) that computes the (pseudo)likelihood of the text using the state transition probabilities. If some of the tokens (or state transitions) derived from text are not in state_transition_probabilities, the function raises an error. Otherwise it returns the computed (pseudo)likelihood. Justify your choices when computing the (pseudo)likelihood and state them clearly in the function’s docstring.
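
    A sketch under one particular set of choices (product of within-sentence transition probabilities; unseen tokens or transitions surface as a KeyError from the dict lookups):

    import nltk

    def likelihood(text, state_transition_probabilities):
        """(Pseudo)likelihood of text under the given chain.

        Choices made here: transitions are scored within each sentence and
        multiplied together; a token or transition missing from the model
        raises KeyError.
        """
        result = 1.0
        for sentence in nltk.sent_tokenize(text):
            tokens = nltk.word_tokenize(sentence)
            for current, successor in zip(tokens, tokens[1:]):
                result *= state_transition_probabilities[current][successor]
        return result

    For longer texts the raw product underflows toward zero, so summing math.log probabilities instead is a defensible alternative choice worth documenting.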