Problem Description

In many situations, we may need to nd a word based on one or more letters in it. For example, when making or completing a crossword you may want to nd a word that has 4 letters, starts with J and ends with A. In this assignment, the program you will write will create a lexicon of words from various sources (that is, from various text les) and will allow the user to search for words that t particular patterns.

The operational denition of what is taken to be a word will be given later in this handout. For each word, we also keep the following information:

  • The frequency: How many times the word appears in the input les
  • The list of neighbors: A neighbor of a word w is a word that is of the same length and differs from w by only one letter Note: The motivation for this requirement is to make the lexicon capable of supporting the following game. Two words are given let us called them start word and end word. The aim is to transform the start word into the end word by changing one letter at a time, subject to the condition that all the intermediate words are valid words. Obviously, this game is equivalent to nding a path from the start word to the end word along the edges connecting two neighbors.

Assignment Requirements

The assignment is divided into two parts. For part 1, you are required to write a simple version, where execution efciency is not taken into account. For part 2, you are required to write an optimized version where efciency, in addition to correctness, is the major concern.

With the exceptions of ArrayList and LinkedList, you are not permitted to use any of the classes in the Java Collections Framework, e.g. TreeMap, HashMap.

Task 1

For this task, you are to design and implement the program LexiconTester.java. This program will

  • Read two text les, sample1-pp.txt and sample2-zoo.txt, and construct a lexicon that contains words from the two les;
  • Write the words from the lexicon to the text le sample3-wordlist.txt.

Further information about the task is given below.

Building the lexicon

The rst text le to be read in is the le sample1-pp.txt shown below:

Pride and Prejudice by Jane Austen Chapter 1 It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer. "Do you not want to know who has taken it?" cried his wife impatiently. "YOU want to tell me, and I have no objection to hearing it." This was invitation enough.

The second text le to be read in (and its words added to the lexicon) is the le sample2-zoo.txt below:

You can see zebras and yaks at the zoo. Yes, zebras and yaks and wombats too. You will also meet a fat cat, wearing a hat, sitting on a mat, reading a map while drinking from a cup.

What is a word?

A word is a sequence of letters. Words will be treated in a case-insensitive manner. Numbers and punctuation must be removed. More specically,

  • If the word "and" occurs twice in the le, it is only stored once in the lexicon
  • If the word "you" also occurs as "YOU", only one version is stored.
  • Punctuation, such as commas, quotes, hyphens, full stops and question marks, are removed.
  • If the word "dont" or the word "second-hand" appears in the input texts, they would be stored without punctuation, i.e. "dont" and "secondhand" respectively.
  • The word "1", for example, will be ignored as it contains numeric characters.

Saving the lexicon

The words from the created lexicon are to be written to le sample3-wordlist.txt. These words must be in sorted in alphabetical order.

The information on each word, which is also written to the text le, includes the frequency and the list of its neighbors. The list of neighbors must also be in alphabetical order.

The format must be as shown in the example below for a number of words.

a 11 [i]
about 1 []
acknowledged 1 []
all 1 []
also 1 []
and 6 []
answer 1 []
at 2 [it]
austen 1 []
be 2 [by, he, me]
been 1 []
bennet 3 []
but 1 []
by 1 [be, my]
can 1 [cat, man]
cat 1 [can, fat, hat, mat]
chapter 1 []
considered 1 []
cried 1 []
cup 1 []

For each word, rst the spelling of the word is displayed, followed by the frequency, and then followed by the words neighbors (in alphabetical order, inside a pair of square brackets).

Task 2

For this task, you are to write the program WordMatchTester.java. This program will

  • Read two text les, sample1-pp.txt and sample2-zoo.txt, and construct a lexicon that contains words from the two les (as in Task 1);
  • Find the words that match a number of patterns, and display the results (both the patterns and matching words) on the screen and write them to the le sample4-results.txt. The test cases (patterns) should be carefully chosen, and the reasons for including them must be documented in the program.

Further information about the task is given below.

Finding matching words

The patterns to match the words in the lexicon are to be included (hard-coded) as strings in the test program. A pattern consists of a sequence of letters, with no spaces within the sequence. In addition, there are two wildcard characters allowed. A ? symbol in the pattern can match with any one character in a word, while a * symbol can match with any number of characters in a word (zero or more).

For each pattern, the pattern and the words that match the pattern are to be displayed on the screen and write to the text le. Each matching word, together with its frequency, appears on a separate line. The words are in alphabetical order.

For example, the pattern:

ma?

may result in the words below being displayed to the screen

man 2
map 1
mat 1
may 1

The pattern:

?o?

may result in the following words being displayed. Note the words are displayed in lexicographic order.

for 1
not 2
too 1
you 5
zoo 1

The pattern:

Mr*

may result in the output:

mr 3
mrs 1

The pattern:

i*

may result in the output:

i 1
impatiently 1
in 3
invitation 1
is 5
it 5

The pattern

?ear*

may result in the output:

dear 1
heard 1
hearing 1
wearing 1

Any pattern that has no matches will result in the message

No words in the lexicon match the pattern

output to the screen and written to the le.

The patterns should be designed for a comprehensive testing of the correctness of what you have implemented. As a suggestion, you should include at least 10 test cases (i.e. patterns) and the reasons for including them must be documented in the program.

Academic Honesty!
It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.
Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.