Background and Introduction:

Word counting is used in many applications, including text analysis, text indexing, text compression, and cryptography. In this programming assignment, you will read a text file and discover what words appear in that file and how many times each word appears.

In this assignment, you will create a dictionary (Map), inserting each word (String) that appears in the input text file into the dictionary. The dictionary will be implemented as a Map using classes from the JCF (TreeMap and HashMap). You will then determine which words appeared in the file most frequently. You will run the code on 2 input files of different sizes and different content to see which data structure performs better in practice.

Learning Objectives:

  • Increase your familiarity with the Map ADT.
  • Conduct experiments to compare the performance of data structures and their implementations.
  • Understand how different data structures and different inputs can affect application performance.

Tasks:

Write a program named WordCount.java to perform the tasks described below. (You may create additional classes to help perform this task).

1) The program should prompt for the name of a text file, read the words in the file, and build a List of those words. The program should also ask the user how many of the most frequently occurring words they would like the program to report about. For instance, the user might choose to ask for information about the top 20 most frequent words in the file. (This first part of your program is not timed.)

(The next steps in your program should be timed. Repeat the next steps 10 times and capture the 'best' time.)

2) Place the words from the word List into a TreeMap of frequency counts for the words.

3) Your program should then extract the top 'N' most frequent words (the number of words requested by the user of the program) and the frequency counts for those words. (As stated above, you should do steps 2 and 3 in a loop. Run the loop 10 times. In your code, record how long each iteration of the loop requires. In your code, save the 'best' (fastest) time.)

Next, repeat the above process using a HashMap instead of a TreeMap.

4) Your program must display the total number of words in the file, the top 'N' words and the count for each of those words, and the elapsed times for the TreeMap and HashMap implementations. If there are ties in the frequency counts for several words, then your program should break ties by listing the tied words alphabetically.

NOTE: You will experiment with 2 different text files to see how the performance of TreeMap and HashMap differ. As an example of a large piece of text, The Adventures of Sherlock Holmes is provided in the project archive (ASH.txt). Some other large text files were also provided for testing your program. These files were obtained from Project Gutenburg, whose website is http://www.gutenberg.org/. You must also run your test on the largest source code file in your project.

Make the program flexible.

  • It should be easy for a user to select any text file to process
  • It should be easy for a user to select the size of the top 'N' list

Your program might prompt for the user to enter these choices OR Your program might accept these choices as command line parameters.

Sample program output

(You may format your program output differently, but your program must provide the same information):

This program will display a list of the 'n' most frequently occurring words found in an input text file.
This program will also display the time in milliseconds
for the process of placing the words into a Map of frequency counts and then extracting the top 'n'.

The program displays the elapsed time for the algorithm using both HashMap and
TreeMap.

This program will prompt for the name of a file and the size 'n' for the most frequent list.

Please enter the name of a text file to read : abc
File not found. Please try again.

Please enter the name of a text file to read : Address.txt
Word list size: 271
Please enter a number between 1 and 100 for the top 'n' list: 10
Best elapsed time to populate the HashMap and extract the top 'n': 0.22 ms. Top 10 list: that (13) the (11) we (10) here (8) to (8) a (7) and (6) can (5) for (5) have (5)


Best elapsed time to populate the TreeMap and extract the top 'n': 0.27 ms.
Top 10 list: that (13) the (11) we (10) here (8) to (8) a (7) and (6) can (5) for (5) have (5)
Academic Honesty!
It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.
Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.