Homework 6: Dictionaries, Data Representations
Due Friday, November 10, 11:59PM EST
In this assignment, you’ll use dictionaries and iteration to solve several problems involving text analysis: the use of computational tools to learn more about bodies of text. You’ll also further explore binary representations of data.
Below is a suggested timeline to complete the assignment, which leaves plenty of time before the due date for debugging if necessary:
Date | Tasks to Complete |
---|---|
11/04 | Read through the assignment and start problem 1 |
11/05 | Complete problem 1, pass problem 1 tests on the autograder |
11/06 | Complete problem 2, pass problem 2 tests on the autograder |
11/07 | Complete problem 3, pass problem 3 tests on the autograder |
11/08 | Complete problems 4/5, pass problem 4/5 tests on the autograder |
11/10 | DEADLINE! Finish debugging and submit your final code. Attend drop-in hours if you’ve had problems with any of the previous parts. Make sure to review your code for code quality. |
These are soft deadlines that are not part of your grade, but I encourage you to stick to this timeline if you’ve struggled to complete the homework assignments on time. Being ahead of this timeline is great!
Downloading starter file
Start by downloading the homework 6 starter file here.
Problem 1: to_binary
In class, we practiced converting integers between binary and decimal representations. In this problem, you’ll write a function called to_binary() that converts a decimal integer to a binary representation. This function should accept an integer argument and return a string of 0s and 1s. For example:
to_binary(16)
>>> "10000"
to_binary(12345)
>>> "11000000111001"
See slide 17 from October 26 for the algorithm for converting from decimal to binary.
HINT: This is a good application for a while-loop. It is possible to do this problem using recursion instead of a loop.
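One way to check your own results, as a rough sketch: Python's built-in format(n, "b") produces the binary digits of an integer. The helper names reference_binary and find_mismatches below are hypothetical and are not part of the starter file. The built-in is meant only for checking your answers, not as a replacement for implementing the algorithm yourself.

```python
def reference_binary(n):
    """Return the binary digits of a positive integer using Python's built-in formatting."""
    return format(n, "b")


def find_mismatches(candidate, values=(1, 2, 16, 12345)):
    """Return the inputs for which candidate(n) disagrees with reference_binary(n)."""
    return [n for n in values if candidate(n) != reference_binary(n)]


print(reference_binary(16))     # 10000
print(reference_binary(12345))  # 11000000111001
```

Calling find_mismatches(to_binary) with your own function should return an empty list if your results agree with the built-in on the sample values.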
Problem 2: count_words
Introduction
To start this problem, find a large file of plain text that you would like to study. One easy way to find such a file is to visit Project Gutenberg’s list of popular books.
Click a title you’d like to use, and find the Plain Text UTF-8 version. See an example of what it should look like here.
Choose File –> Save As, and save the file with the filename story.txt. You should save it in the same directory as your homework file. You can open this file in a text editor. You might wish to do so and delete the language at the beginning and end of the file about Project Gutenberg, so that you can focus on the text.
We are going to study word frequencies in this text. By the end of this sequence of problems, you’ll be able to print a list describing the most common words in the text.
The following code loads your file and divides it into a list of lowercase words. For example, if your file was very small and contained only the words “The cat sat on the mat”, we would have:
words = load_words("text.txt")
words
>>> ["the", "cat", "sat", "on", "the", "mat"]
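import string
def load_words(path):
    # Read the whole file into a single string.
    with open(path, "r", encoding = "utf8") as f:
        s = " ".join(list(f.readlines()))
    # Remove every ASCII punctuation character.
    for punc in string.punctuation:
        s = s.replace(punc, "")
    # Lowercase the text and split it on whitespace.
    s = s.lower()
    words = s.split()
    return words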
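To see what the cleaning steps do without needing a file, here is a small sketch that applies the same punctuation removal, lowercasing, and splitting to a string in memory. The function name clean_and_split is hypothetical and is not part of the starter file.

```python
import string


def clean_and_split(text):
    """Strip ASCII punctuation, lowercase the text, and split it on whitespace."""
    for punc in string.punctuation:
        text = text.replace(punc, "")
    return text.lower().split()


print(clean_and_split("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(clean_and_split("Don't stop!"))
# ['dont', 'stop']
```

Note that string.punctuation contains only ASCII characters, so curly quotes and similar typographic marks in a downloaded book are left in place by this approach.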
Your task
Now, write a function called count_words() whose argument is a list of words and whose return value is a dictionary where each key is a word and each key’s value is the number of times that word appears in the list. This is sometimes called a concordance. For example, using the words variable from above:
count_words(words)
>>> {"the" : 2, "cat" : 1, "sat" : 1, "on" : 1, "mat" : 1}
HINT: This problem is closely related to Example 3 from our reading on dictionaries.
You will likely find this example to be helpful in the next several problems.
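As an illustration of the general counting pattern with a dictionary (this may differ from the example in the reading), the sketch below counts the letters in a single word. The function name count_letters is hypothetical; adapting the idea to a list of words is left to you.

```python
def count_letters(word):
    """Count how many times each letter appears in a string."""
    counts = {}
    for letter in word:
        # dict.get returns 0 when the letter has not been seen yet.
        counts[letter] = counts.get(letter, 0) + 1
    return counts


print(count_letters("banana"))  # {'b': 1, 'a': 3, 'n': 2}
```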
HINT: This is a problem that could be addressed using recursion, but Python places a limit on how many times a function can be called recursively. Because of this, I recommend that you use a loop instead.
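To see the recursion limit in action, here is a small sketch using sys.getrecursionlimit() from the standard library. The function count_down is a hypothetical example that recurses once per unit of n, much as a recursive word counter would recurse once per word.

```python
import sys


def count_down(n):
    """Recurse n times and return n, one call per step."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)


limit = sys.getrecursionlimit()
try:
    # Ask for more nested calls than Python allows.
    count_down(limit + 100)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(limit, hit_limit)
```

A book-length text contains far more words than the limit allows nested calls, which is why a loop is the safer choice here.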
Problem 3: remove_stopwords
A stopword is a word that is considered to be uninteresting for text analysis. Examples of English stopwords include “the,” “but,” “and,” “her,” and so on. You can find a list of common stopwords in the hw6.py file (the variable STOPWORDS).
Write a function called remove_stopwords(). This function should take two arguments:
- A dictionary of counts, such as would be returned by count_words()
- A list of stopwords.
Your function should return a dictionary of counts, NOT INCLUDING stopwords.
With the example before:
stopwords = ["the", "on", "and"]
d = count_words(words)
d = remove_stopwords(d, stopwords)
d
>>> {"cat" : 1, "sat" : 1, "mat" : 1}
The counts for "the" and "on" have been removed because they are stopwords.
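One general way to filter a dictionary by its keys, sketched on unrelated data: build a new dictionary that keeps only the entries you want. The function name drop_keys and the inventory data are hypothetical. Building a new dictionary avoids deleting entries from a dictionary while looping over it, which raises a RuntimeError in Python.

```python
def drop_keys(d, unwanted):
    """Return a new dictionary containing the entries of d whose keys are not in unwanted."""
    unwanted = set(unwanted)  # set membership tests are fast
    return {key: value for key, value in d.items() if key not in unwanted}


inventory = {"apples": 3, "pears": 0, "plums": 7}
print(drop_keys(inventory, ["pears", "kiwis"]))  # {'apples': 3, 'plums': 7}
print(inventory)  # the original dictionary is unchanged
```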
Problem 4: print_top_words
Write a function called print_top_words(). This function will print (NOT return) the words with the highest counts in the data set. This function should accept two arguments:
- d, a dictionary of counts such as would be returned by count_words() or remove_stopwords()
- num_words, the number of words to print
The function should print the top num_words words in the data set, in descending order, along with their counts. For example, I obtained these counts on the text of Grimm’s Fairy Tales:
print_top_words(d, 10)
Output:
little 388
away 278
king 264
man 214
old 201
time 184
day 181
come 170
home 170
shall 168
For full credit, PLEASE USE RECURSION. It’s also possible to do this problem using a loop; loop-based solutions will receive most but not all of the credit. Your recursion-based solution might use both recursion and a loop - that’s fine in this case!
To do this recursively:
- If the input dictionary is not empty and num_words is positive:
  - Find the word with the highest count in the dictionary, and print it.
  - Remove that word from the dictionary.
  - Reduce num_words by 1.
  - Call print_top_words() with the modified dictionary and reduced num_words.
HINTS:
- The implementation has some similarities with what you did for matched_min on Homework 4.
- print(f"{word:20} {count}") will print your word and count in a pretty way, as shown above.
- It’s ok if your function prints MORE than the specified number of words, in case some words are tied for 10th.
- The largest value in the dictionary d can be found using max(d.values()).
- d.pop() can be used to remove key-value pairs from dictionaries.
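The tools named in these hints can be tried out on a small dictionary first. The sketch below uses hypothetical data (scores) and shows max(d.values()), d.pop(), and the f-string format in isolation; it is not a solution to the problem.

```python
scores = {"ada": 91, "grace": 97, "alan": 88}

# Largest value stored in the dictionary.
best = max(scores.values())

# Find a key whose value equals the largest value.
best_name = [name for name in scores if scores[name] == best][0]

# pop removes the key-value pair and returns the value.
removed = scores.pop(best_name)

# :20 pads the name with spaces to a width of 20 characters.
line = f"{best_name:20} {removed}"
print(line)
print(scores)  # {'ada': 91, 'alan': 88}
```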
Problem 5: putting it all together (summarize_file)
Using the functions that you implemented for problems 2-4, along with the load_words function, write a function summarize_file that takes a string filename and a list stopwords as input. Your function should:
- Use load_words to read the file and create a list words
- Use count_words to convert words to a dictionary counts
- Use remove_stopwords to create a new dictionary clean_counts without stopwords
- Use print_top_words to print the 10 top words from clean_counts
In the end, you should be printing the 10 most frequent words from filename that are not stopwords! Your function shouldn’t return anything.
Test your function on your "story.txt" file to make sure everything is working before submitting to Gradescope!