Recently, I wrote a twitter bot that uses a Markov chain to generate
random text in the style of questions and answers from Canada’s
Question Period. The bot is here. Here’s
how it works (although the approach is pretty simple and can be found
many places on the internet using Google).
The interesting part of the code is in two functions, build_corpus
and generate. build_corpus takes a bunch of source text (lists of
strings) and a level parameter. Then it generates a hash whose keys
are lists of level consecutive words which appear in the source
text; each level-tuple of words maps to the conditional probability
distribution of word frequencies given the occurrence of the tuple. We
compute this by splitting the input into a list of words and running
through the list, adding data for each tuple.
def build_corpus( fragments, level=2 )
corpus = {"level" => level}
fragments.each do |f|
words = [nil]*level + f.strip.split + [nil]*level
(words.length() -level).times do |i|
key = words[i..i+level-1]
corpus[key] ||= []
corpus[key] = corpus[key] << words[i+level]
end
end
return corpus
end
For instance, the 2-tuple ["having", "attempted"] maps to the list
["to"], so there is only one word, whereas ["the", "Conservative"]
maps to a list of 211 words with, e.g., 88 instances of “government”,
27 instances of “Party”, etc. Obviously this isn’t a very space-efficient
method, but it works and it’s fast to implement for a weekend project.
The next method generates text. It’s also pretty simple. It starts
with nil and then loops, each time selecting a word based on the
previously-generated words.
def generate( corpus )
level = corpus["level"]
output = [nil]*level
begin
key = output[-level .. -1]
output = output << corpus[key].sample
end while not output[-1] == nil
return output.join(" ")
end
The rest of the
code is
handling the tweeting and getting the source data — it uses a program
originally written for OpenParliament.ca
to parse the Hansards. It runs as a cron job on my server now, so it updates every hour.
Posted by Henry de Valence under
–
Programming