Project Overview

The project will give you an opportunity to use Python to model, analyze, and score the similarity of text samples. In essence, you will develop a statistical tool that will allow you to identify a particular author or style of writing!

Background

Statistical models of text are one way to quantify how similar one piece of text is to another. Such models were used as evidence that the book The Cuckoo's Calling was written by J. K. Rowling (using the name Robert Galbraith) in the summer of 2013. Details on that story and the analysis that was performed are available online.

The comparative analysis of Rowling's works used surprisingly simple techniques to model the author's style. Possible style signatures could include:

  • the distribution of word lengths in a document: the frequency of words of length one, length two, length three, etc. This is one of the metrics that was used in the Rowling case.
  • the distribution of words used by an author: the frequency of each unique word.
  • the distribution of word stems used by an author (e.g., spam and spamming would have the same stem): the frequency of each unique stem.
  • the distribution of sentence lengths used by an author: the frequency of sentences of length one, length two, length three, etc.

Other similar measures of style are also possible. Perhaps the most common application of this kind of feature-based text classification is spam detection and filtering (which you can read more about online, if you're interested).

Your task

To allow you to compare and classify texts, you will create a statistical model of a body of text using several different techniques. At a minimum, you should integrate five features:

  • word frequencies
  • word lengths
  • stem frequencies
  • sentence lengths
  • one other feature of your choice. You might choose one of the features used in the analysis that revealed Rowling's authorship of The Cuckoo's Calling. Or you could choose something else entirely (e.g., something based on punctuation); anything that you can compute/count using Python!

Computing word lengths from words should be fairly easy, but computing stems, sentence lengths, and your additional feature will be more challenging, and they offer plenty of room for algorithmic creativity!

In general, the project description below will offer fewer guidelines than the problem sets did. If the guidelines do not explicitly mandate or forbid certain behavior, you should feel free to be creative!

Part I: Building an initial text model

For the first part of the project, you will create an initial version of a TextModel class, which will serve as a blueprint for objects that model a body of text (i.e., a collection of one or more text documents).

To get started, open a new file in IDLE and name it finalproject.py. In this file, declare a class named TextModel. Add appropriate comments to the top of the file.

Write a constructor __init__(self, model_name) that constructs a new TextModel object by accepting a string model_name as a parameter and initializing the following three attributes:

  • name - a string that is a label for this text model, such as 'JKRowling' or 'Shakespeare'. This will be used in the filenames for saving and retrieving the model. Use the model_name that is passed in as a parameter.
  • words - a dictionary that records the number of times each word appears in the text.
  • word_lengths - a dictionary that records the number of times each word length appears.

Each of the dictionaries should be initialized to the empty dictionary ({}). In Part III, you will add dictionaries for the other three features as well.

Write a method __str__(self) that returns a string that includes the name of the model as well as the sizes of the dictionaries for each feature of the text.

For example, if a model for J. K. Rowling has been set up, the return value of this method may look like:

text model name: J. K. Rowling
number of words: 2103
number of word lengths: 17

Here again, information about the other feature dictionaries will eventually be included, but for now only the first two dictionaries are needed.

Notes:

  • Remember that the __str__ method should create a single string and return it. You should not use the print function in this method.
  • Since the returned string should have multiple lines, you will need to add in the newline character ('\n'). Below is a starting point for this method that already adds in some newline characters. You will have to expand upon this to finish the method:
def __str__(self):
    """Return a string representation of the TextModel."""
    s = 'text model name: ' + self.name + '\n'
    s += '  number of words: ' + str(len(self.words)) + '\n'
    return s
  • In the final version of your class, the returned string should not include the contents of the dictionaries because they will become very large. However, it may be a good idea to include their contents in the early stages of your code to facilitate small-scale testing.
  • You are welcome to also add a __repr__ method that allows you to obtain useful information about a TextModel object by evaluating it directly from the Python Shell. However, doing so is not required.

Write a helper function named clean_text(txt) that takes a string of text txt as a parameter and returns a list containing the words in txt after it has been cleaned. This function will be used when you need to process each word in a text individually, without having to worry about punctuation or special characters. One possible sketch of this function appears after the notes below.

Notes:

  • Because this is a regular function and not a method, you should define it outside of your TextModel class, e.g., before the class header. And when you call clean_text, you should just use the function name; you will not need to call it on an object. The reason for implementing clean_text as a function rather than a method is that it doesn't need to access the internals of a TextModel object.
  • The extent to which you clean the text is up to you. While you should at least remove common punctuation symbols (e.g., '.', ',', and '?'), you can decide how many other symbols you want to remove. You may also find it helpful to perform other operations on the string, such as converting all of the characters to lowercase (which you can do using the string method lower).
  • You may find it helpful to use the string method replace. To see how it works, try the following in the Python Shell:
>>> s = 'Mr. Smith programs.'
>>> s = s.replace('.', '')
>>> s
'Mr Smith programs'
  • Instead of using replace, you could use a loop to iteratively look at every character in the string and only keep the characters that are not punctuation. However, you should not use recursion to remove the punctuation, since for large files you will run out of memory from too many recursive function calls!
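
Here is one possible sketch of clean_text. It assumes that we only strip a handful of common punctuation symbols (that choice of symbols is up to you) and that we lowercase the text; you can certainly clean more aggressively:

def clean_text(txt):
    """Returns a list containing the words in txt
       after it has been cleaned.
    """
    txt = txt.lower()
    # Remove a few common punctuation symbols (extend as you see fit).
    for symbol in ['.', ',', '?', '!', ';', ':', '"']:
        txt = txt.replace(symbol, '')
    return txt.split()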

Write a method add_string(self, s) that adds a string of text s to the model by augmenting the feature dictionaries defined in the constructor.

Here is the beginning of the method to get you started. It includes code for updating the words dictionary:

def add_string(self, s):
    """Analyzes the string s and adds its pieces
       to all of the dictionaries in this text model.
    """
    # Add code to clean the text and split it into a list of words.
    # *Hint:* Call your clean_text helper function!
    words = ...

    # Code for updating the words dictionary.
    for w in words:
        if w not in self.words:
            self.words[w] = 1
        else:
            self.words[w] += 1

    # Add code to update the other feature dictionaries.

For now, you should add code to this method to update the word_lengths dictionary. Later, you will extend the method to update the other dictionaries as well.
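
The word_lengths update can follow the same pattern as the words update above, using each word's length as the key. Here is one possible version (assuming words holds the cleaned word list, as in the starter code):

# Code for updating the word_lengths dictionary.
for w in words:
    if len(w) not in self.word_lengths:
        self.word_lengths[len(w)] = 1
    else:
        self.word_lengths[len(w)] += 1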

Write a method add_file(self, filename) that adds all of the text in the file identified by filename to the model.

Hint: You have already written a method that adds a string of text to the model. To implement add_file, you only need to read the file into a string and make the appropriate method call to add the string of text to the model!

Important: When you open the file for reading, you should specify two additional arguments as follows:

f = open(filename, 'r', encoding='utf8', errors='ignore')

These encoding and errors arguments should allow Python to handle special characters (e.g., smart quotes) that may be present in your text files.
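
Putting these pieces together, add_file can be as short as the following sketch (which assumes that your add_string method is complete):

def add_file(self, filename):
    """Adds all of the text in the file identified by
       filename to the model.
    """
    f = open(filename, 'r', encoding='utf8', errors='ignore')
    text = f.read()    # Read the entire file into a single string.
    f.close()
    self.add_string(text)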

At this point, you are ready to run some initial tests on your methods. Try entering the following commands from the Shell, but remember that the contents of your dictionaries may be printed in a different order.

>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4

>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}

>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}

You should continue testing your code frequently from this point forward to make sure everything is working correctly at each step. Otherwise, if you reach the end and realize there are errors, it can be very difficult to determine the causes of those errors in such a large program!

Part II: Saving and retrieving a text model

Creating a text model can require a lot of computational power and time. Therefore, once we have created a model, we want to be able to save it for later use. The easiest way to do this is to write each of the feature dictionaries to a different file so that we can read them back in at a later time. In this part of the project, you will add methods to your TextModel class that allow you to save and retrieve a text model in this way.

To get you started, here is a function that defines a small dictionary and saves it to a file:

def sample_file_write(filename):
    """A function that demonstrates how to write a
       Python dictionary to an easily-readable file.
    """
    d = {'test': 1, 'foo': 42}   # Create a sample dictionary.
    f = open(filename, 'w')      # Open the file for writing.
    f.write(str(d))              # Write the dictionary to the file.
    f.close()                    # Close the file.

Notice that the file is opened for writing by using a second parameter of 'w' in the open function call. In addition, we write to the file by using the file handle's write method on a string representation of the dictionary.

Below is a function that reads in a string representing a dictionary from a file, and converts this string (which is a string that looks like a dictionary) to an actual dictionary object. The conversion is performed using a combination of two built-in functions: dict, the constructor for dictionary objects; and eval, which evaluates a string as if it were an expression.

def sample_file_read(filename):
    """A function that demonstrates how to read a
       Python dictionary from a file.
    """
    f = open(filename, 'r')   # Open the file for reading.
    d_str = f.read()          # Read in a string that represents a dict.
    f.close()

    d = dict(eval(d_str))     # Convert the string to a dictionary.

    print("Inside the newly-read dictionary, d, we have:")
    print(d)

Try saving these functions in a Python file (or even in finalproject.py), and then use the following calls from the Python Shell:

>>> filename = 'testfile.txt'
>>> sample_file_write(filename)
>>> sample_file_read(filename)
Inside the newly-read dictionary, d, we have:
{'test': 1, 'foo': 42}

There should also now be a file named testfile.txt in your current working directory that contains this dictionary.

Now that you know how to write dictionaries to files and read dictionaries from files, add the following two methods to the TextModel class:

Write a method save_model(self) that saves the TextModel object self by writing its various feature dictionaries to files. There will be one file written for each feature dictionary. For now, you just need to handle the words and word_lengths dictionaries.

In order to identify which model and dictionary is stored in a given file, you should use the name attribute concatenated with the name of the feature dictionary. For example, if name is 'JKR' (for J. K. Rowling), then we would suggest using the filenames:

  • 'JKR_words'
  • 'JKR_word_lengths'

In general, the filenames are self.name + '_' + name_of_dictionary. Taking this approach will ensure that you don't overwrite one model's dictionary files when you go to save another model.

It may help to use the code for sample_file_write as a starting point for this method, but don't forget that you should create a separate file for each dictionary.
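
For example, here is a sketch of save_model that handles just the words and word_lengths dictionaries, using the suggested naming scheme:

def save_model(self):
    """Saves the TextModel by writing its feature
       dictionaries to files.
    """
    f = open(self.name + '_words', 'w')
    f.write(str(self.words))           # Write the dict as a string.
    f.close()

    f = open(self.name + '_word_lengths', 'w')
    f.write(str(self.word_lengths))
    f.close()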

Write a method read_model(self) that reads the stored dictionaries for the called TextModel object from their files and assigns them to the attributes of the called TextModel.

This is the complementary method to save_model, and you should assume that the necessary files have filenames that follow the naming scheme used in save_model.

It may help to use the code for sample_file_read as a starting point for this method, but note that you should read a separate file for each dictionary.

Remember that you can use the dict and eval functions to convert a string that represents a dictionary to an actual dictionary object.
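
Here is a corresponding sketch of read_model for the same two dictionaries:

def read_model(self):
    """Reads the stored dictionaries from their files and
       assigns them to the attributes of the called TextModel.
    """
    f = open(self.name + '_words', 'r')
    self.words = dict(eval(f.read()))   # Convert the string to a dict.
    f.close()

    f = open(self.name + '_word_lengths', 'r')
    self.word_lengths = dict(eval(f.read()))
    f.close()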

Examples: You can test your save_model and read_model methods as follows:

# Create a model for a simple text, and save the resulting model.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> model.save_model()

# Create a new TextModel object with the same name as the original one,
# and assign it to a new variable.
>>> model2 = TextModel('A. Poor Righter')

# Read the dictionaries that were saved for the original model,
# and use them as the dictionaries of `model2`.
>>> model2.read_model()

>>> print(model2)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4

>>> model2.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}

>>> model2.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}

Part III: Adding features to the model

Update your __init__ method so that it initializes attributes for three additional dictionaries:

  • stems - a dictionary that records the number of times each word stem appears in the text.
  • sentence_lengths - a dictionary that records the number of times each sentence length appears.
  • an appropriately named dictionary that records the frequencies of whatever additional feature you have chosen to include in your TextModel (see the section above entitled 'Your task' for possible options).

Write a helper function named stem(s) that accepts a string as a parameter. The function should then return the stem of s. The stem of a word is the root part of the word, which excludes any prefixes and suffixes. For example:

>>> stem('party')
'parti'
>>> stem('parties')
'parti'
>>> stem('love')
'lov'
>>> stem('loving')
'lov'

Notes:

  • Like clean_text, this is a regular function and not a method, so you should define it outside of your TextModel class. When you call it, you should just use the function name; you will not need to prepend a called object.
  • The stem of a word is not necessarily a word itself!
  • This function does not have to work perfectly for all possible words and stems. Instead, you should handle a number of common cases so that the function works for many words, as shown in lecture. You will need to build on the starting code that has been given in the lecture notes; one rough sketch is given below.
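
As one rough illustration (not the lecture's starting code, and deliberately incomplete), here is a sketch of stem that handles the four examples above; the particular suffix rules and their ordering are our own assumptions, and you should extend and rearrange them as you see fit:

def stem(s):
    """Returns the stem of the word s."""
    if s.endswith('ies'):                  # parties -> parti
        return s[:-3] + 'i'
    elif s.endswith('ing'):                # loving -> lov
        return s[:-3]
    elif s.endswith('ers'):                # partiers -> parti
        return s[:-3]
    elif s.endswith('es'):
        return s[:-2]
    elif len(s) > 3 and s.endswith('e'):   # love -> lov, but the -> the
        return s[:-1]
    elif s.endswith('y'):                  # party -> parti
        return s[:-1] + 'i'
    elif s.endswith('s'):
        return s[:-1]
    return s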

Extend your add_string method to update the feature dictionaries for word stems, sentence lengths, and your chosen additional feature.

Notes:

  • You should update the sentence lengths dictionary before you clean the text. Once you remove the punctuation from the string, it will be difficult to count the sentences.
  • You should make use of the stem function that you wrote above, and you should define any additional helper functions/methods as you see fit. In particular, you may need one or more helper functions related to the dictionary for your chosen additional feature. One approach to counting sentence lengths is sketched below.
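
For instance, here is one possible sketch for counting sentence lengths, under the assumption that every sentence ends with a '.', '!', or '?'. It would run near the top of add_string, before the text is cleaned:

# Update sentence_lengths *before* cleaning the text.
count = 0                        # words in the current sentence so far
for w in s.split():
    count += 1
    if w[-1] in '.!?':           # w ends a sentence
        if count not in self.sentence_lengths:
            self.sentence_lengths[count] = 1
        else:
            self.sentence_lengths[count] += 1
        count = 0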

Update the following methods to incorporate the feature dictionaries for word stems, sentence lengths, and your chosen additional feature:

  • __str__
  • save_model
  • read_model

Test your new code by performing the following test. Here again, your dictionary contents may be printed in a different order. In addition, your dictionary for stems may be slightly different, depending on how you implemented the stem function:

>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4
number of stems: 4
number of sentence lengths: 1
# info for your additional feature goes here!

>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}

>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}

>>> model.stems
{'parti': 2, 'the': 2, 'pizza': 1, 'lov': 1}

>>> model.sentence_lengths
{6: 1}

Optional: Add extra functionality to your TextModel object as you see fit. This may include improving your algorithms for cleaning the strings or finding stems, or it may entail adding even more feature dictionaries beyond the ones that we have required.

Part IV: Adding methods for scoring and classification

In this part of the project, you will first implement the core algorithm that will allow you to compare bodies of text. This algorithm will produce a numeric similarity score that measures how similar one body of text is to another, based on one type of feature (e.g., word lengths). You will then compute scores of this type for all five of the features, and use them to classify a piece of text as being more likely to come from one source than another.

The similarity score that we will compute is based on a statistical model known as a Naive Bayes probability model. Despite the 'naive' in its name, scores computed using this model have been very successful in distinguishing spam email from non-spam ('ham') email, among other classification problems.

In essence, the Naive Bayes scoring algorithm works in the following way: You give it feature counts (e.g., word counts) from one body of text and feature counts from a second body of text. The algorithm will then compute the likelihood that the second body of text is from the same source as the first body of text! The reason that the algorithm is called 'naive' is that it makes the assumption that each item in a given feature set is independent. For example, it assumes that the appearance of the word 'spell' does not depend on the appearance of the word 'potter', and that this independence holds for all pairs of words. This assumption is certainly not true, but that turns out not to matter in many situations!

You can read more details about the use of Naive Bayes probabilities for classification on Wikipedia if you would like to know more, but all of the necessary information is summarized below.

How it works

To illustrate how the Bayesian scoring algorithm works, let's assume that the only features we care about are the individual words in the texts.

As you have already done in your TextModel class, we can use a Python dictionary to model all of the words in a text. The dictionary's keys are words, and the value for a given word is the number of times that it appears in the text.

For example, let's assume that we have two text documents:

  • a source text (which we are pretending was written by Shakespeare!) that has the following dictionary:

    shakespeare_dict = {'love': 50, 'spell': 8, 'thou': 42}

    This document has 100 words in all: 50 occurrences of the word love, 8 of spell, and 42 of thou.

  • a mystery text (author unknown) whose dictionary looks like this:

    mystery_dict = {'love': 3, 'thou': 1, 'potter': 2, 'spam': 4}

    This document has 10 words in all: three occurrences of love, one of thou, two of potter, and four of spam.

The Bayesian similarity score between these two texts attempts to measure the likelihood that the ten words in the mystery text come from the same class of text as the 100 words in the source text. (A given class of text could be based on a particular author or publication, or on other characteristics of the texts in question.)

To calculate the score, we first take each word in the mystery text and compute a probability for it that is based on the number of times that it occurs in the source text. If a word in the mystery text doesn't occur at all in the source text (which would lead to a probability of 0), we instead compute a probability that is based on a default word count of 0.5. This will allow us to avoid multiplying by 0 when we compute the final score.

Here are the probabilities for the words in our mystery text:

  • love has a probability of 50/100 or 0.5 (it occurs 50 times out of the 100 words in the source text)
  • thou has a probability of 42/100 or 0.42
  • potter has a probability of 0/100, but we change it to 0.5/100 or 0.005 to avoid a probability of 0
  • spam has a probability of 0/100, but we change it to 0.5/100 or 0.005

Important: These probabilities have denominators of 100 because the source text has 100 words in it. The denominators should not always be 100!

To compute the similarity score of the mystery text, we need to compute a product in which a given words probability is multiplied by itself n times, where n is the number of times that the word appears in the mystery text. In this case, we would do the following:

# 3 x "love", 1 x "thou", 2 x "potter", 4 x "spam"
sim_score = (.5*.5*.5) * (.42) * (.005*.005) * (.005*.005*.005*.005)

This similarity score is very small! In practice, these very small values are hard to work with, and they can become so small that Python's floating-point values cannot accurately represent them! Therefore, instead of using the probabilities themselves, we will use the logs of the probabilities. The log operation transforms multiplications into additions (and exponents into multiplications), so our log-based similarity score would be:

log_sim_score = 3*log(.5) + 1*log(.42) + 2*log(.005) + 4*log(.005)

This results in a more manageable value of around -34.737. (Note that Python's math.log function uses the natural log (base e) by default, which is fine for our purposes.)
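
You can check this computation yourself in the Shell (here we round to 3 places for readability):

>>> from math import log
>>> round(3*log(.5) + 1*log(.42) + 2*log(.005) + 4*log(.005), 3)
-34.737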

The resulting similarity score gives us a measure of how similar the mystery text is to the source text. To classify a new mystery text, we compute similarity scores between it and a collection of known texts in order to determine which of the known texts is most likely to be related to the mystery text.

For example, let's say that we also have the following model for texts by J.K. Rowling:

jkr = {'love': 25, 'spell': 275, 'potter': 700}

We can compute the similarity score in the same way as above. The source text now has 1000 words in all, so love has a probability of 25/1000 or 0.025, potter has a probability of 700/1000 or 0.7, and the default probability for the missing words (thou and spam) is 0.5/1000 or 0.0005:

sim_score = (.025*.025*.025) * (.0005) * (.7*.7) * (.0005*.0005*.0005*.0005)

This value is also very small! Using logs:

log_sim_score = 3*log(.025) + 1*log(.0005) + 2*log(.7) + 4*log(.0005)

Now the similarity score is approximately -49.784. This value is less than the value that we computed when comparing the mystery text to the Shakespeare text. Therefore, we can conclude that the mystery text is more likely to have come from Shakespeare than from J.K. Rowling.

Your tasks

Write a function (not a method, so it should be outside the class) named compare_dictionaries(d1, d2). It should take two feature dictionaries d1 and d2 as inputs, and it should compute and return their log similarity score. Here is some pseudocode for what you will need to do (a sketch based on this pseudocode appears after the list):

  • Start the score at zero.
  • Let total be the total number of words in d1: not only the distinct items, but all of the repetitions of all of the items as well. (For example, total for our example shakespeare_dict would be 100.)
  • For each item in d2:
      • Check if the item is in d1.
      • If so, add the log of the probability that the item would be chosen at random from everything in d1, multiplied by the number of times it appears in d2.
      • If not, add the log of the default probability (0.5 / total), multiplied by the number of times the item appears in d2.
  • Return the resulting score.
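
Here is a sketch that follows this pseudocode directly (remember to import the math module at the top of your file):

import math

def compare_dictionaries(d1, d2):
    """Returns the log similarity score of d2 compared to d1."""
    score = 0
    total = 0
    for item in d1:
        total += d1[item]      # count all repetitions of all items

    for item in d2:
        if item in d1:
            score += d2[item] * math.log(d1[item] / total)
        else:
            score += d2[item] * math.log(0.5 / total)
    return score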

Write a method similarity_scores(self, other) that computes and returns a list of log similarity scores measuring the similarity of self and other: one score for each type of feature (words, word lengths, stems, sentence lengths, and your additional feature). You should make repeated calls to compare_dictionaries, and put the resulting scores in a list that the method returns.

Important: In each call to compare_dictionaries, the dictionary belonging to self should be the second parameter of the call. For example:

word_score = compare_dictionaries(other.words, self.words)
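
Following that pattern, the body of similarity_scores can be a straightforward sequence of calls. Here is a sketch covering the four required features; you will need a fifth call for your chosen additional feature:

def similarity_scores(self, other):
    """Returns a list of log similarity scores measuring the
       similarity of self and other, one per feature.
    """
    word_score = compare_dictionaries(other.words, self.words)
    length_score = compare_dictionaries(other.word_lengths, self.word_lengths)
    stem_score = compare_dictionaries(other.stems, self.stems)
    sentence_score = compare_dictionaries(other.sentence_lengths,
                                          self.sentence_lengths)
    # Add a score for your additional feature here.
    return [word_score, length_score, stem_score, sentence_score]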

Finally, write a method classify(self, source1, source2) that compares the called TextModel object (self) to two other source TextModel objects (source1 and source2) and determines which of these other TextModels is the more likely source of the called TextModel.

You should begin by calling similarity_scores twice:

scores1 = self.similarity_scores(source1)
scores2 = self.similarity_scores(source2)

Next, print the two lists of scores, preceding each list of scores with the name of the source TextModel. For example:

scores for Shakespeare: [-34.737, ...]
scores for J.K. Rowling: [-49.784, ...]

(Note: If you like, you can use the round function to round the printed scores to 2 or 3 places after the decimal, but doing so is not required.)

You should then use these two lists of scores to determine whether the called TextModel is more likely to have come from source1 or source2. One way to do this is to compare corresponding pairs of scores, and determine which of the source TextModels has the larger number of higher scores.

For example, imagine that the two sets of scores are the following:

scores1: [-34.737, -25.132, -55.312, -10.715, -47.125]
scores2: [-49.784, -21.071, -60.154, -16.502, -43.675]

scores1 has a higher score for three of the features (the ones in positions 0, 2, and 3 of the lists), while scores2 has a higher score for only two of the features (the ones in positions 1 and 4). Thus, we conclude that self is more likely to have come from source1.

You should also feel free to take an alternative approach to using the two lists of scores. For example, you could compute a weighted sum of the scores in each list by doing something like this:

weighted_sum1 = 10*scores1[0] + 5*scores1[1] + 7*scores1[2] + ...
weighted_sum2 = 10*scores2[0] + 5*scores2[1] + 7*scores2[2] + ...

You could then base your classification on which source's weighted sum is larger. One advantage of this approach is that it allows you to adjust the relative importance of the different features, giving certain features a larger impact on the classification.

Your classify method should report its conclusions, using the names of the relevant TextModel objects. For example:

mystery is more likely to have come from Shakespeare

The method does not need to return anything.
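
Putting the pieces together, here is one possible sketch of classify that uses the pairwise-comparison approach described above (a weighted sum would work equally well):

def classify(self, source1, source2):
    """Determines which of source1 and source2 is the more
       likely source of the called TextModel (self).
    """
    scores1 = self.similarity_scores(source1)
    scores2 = self.similarity_scores(source2)
    print('scores for ' + source1.name + ':', scores1)
    print('scores for ' + source2.name + ':', scores2)

    # Count which source wins more of the pairwise comparisons.
    weight1 = 0
    weight2 = 0
    for i in range(len(scores1)):
        if scores1[i] > scores2[i]:
            weight1 += 1
        else:
            weight2 += 1

    if weight1 > weight2:
        print(self.name + ' is more likely to have come from ' + source1.name)
    else:
        print(self.name + ' is more likely to have come from ' + source2.name)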

Testing

Here is one function that you can use to test your TextModel implementation:

# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.
def test():
    """ your docstring goes here """
    source1 = TextModel('source1')
    source1.add_string('It is interesting that she is interested.')

    source2 = TextModel('source2')
    source2.add_string('I am very, very excited about this!')

    mystery = TextModel('mystery')
    mystery.add_string('Is he interested? No, but I am.')
    mystery.classify(source1, source2)

Here is what we get when we run this function using our TextModel implementation:

>>> test()
scores for source1: [-16.394, -9.92, -15.701, -1.386, -1.386]
scores for source2: [-17.087, -15.008, -17.087, -1.386, -3.466]
mystery is more likely to have come from source1

Some of your scores will be different than ours (e.g., the third scores, which depend on how you stem the words, and the fifth scores, which depend on which additional feature you include). Our conclusion is based on a pairwise comparison of the scores: because source1 has a larger number of higher scores, it is chosen as the more likely source. If you use the lists of scores in another way, you may come to a different conclusion, which is fine!
