A concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts. Because of the time and difficulty and expense involved in creating a concordance in the pre-computer era, only works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, had concordances prepared for them. (From Wikipedia)

In this assignment, the program you will design and implement an application that performs the basic tasks of a concordance.

More specifically, the application will allow us to read in a book which is stored in a text file and then to search for a word and the sentences in which the word appears, if any.

Operational Definition of ‘Sentence’

To define a sentence, we need to define a paragraph. A paragraph is a sequence of characters from the text that

  • is terminated by a newline character (except possibly the last paragraph), and
  • contains at least a letter

The newline character will not be considered to be part of the paragraph.

A sentence will be defined as a sequence of characters from a paragraph (in the sense defined above) that

  • terminates with the period character, the question mark, or the exclamation mark ( „.‟, „?‟, „!‟), except possibly the last sentence of the paragraph
  • contains at least one letter
  • has all the preceding and trailing whitespace characters removed

In addition, the period character, the question mark, and the exclamation mark will not be considered to be part of the sentence.

Note that this is an operational definition. It approximates, but is not completely in agreement with, the usual concept of a sentence. It even has „strange‟ consequences. For example, the sentence (from Pride and Prejudice)

Mr. Bennet replied that he had not.

will be extracted as two sentences according to our definition.

You must use this definition so that everyone will extract the same list of sentences from a given text file.

Operational Definition of ‘Word’

A word is defined to be a sequence of letters that appears in a sentence (in the sense defined above), separated by non-letter characters.

Task 1

Write a Java program called Concordance.java, which provides a menu with the following options:

  • R) Read in a book and build the concordance
  • M) List the words that start with a given string
  • S) Search for a word and the sentences in which it appears
  • W) Write to a file the sentences in which a word appears
  • Q) Quit

The menu must accept the letters „R‟, „S‟, „L‟, „W‟ or „Q‟ and be case insensitive.

Option ‘R’

When the user selects option „R‟ from the menu, the program should prompt the user for the name of a text file to open.

If there is any error in reading the file, the program must warn the user and return to the menu. Otherwise, the program reads the text file and constructs all the required data structures about the words and sentences in the input text file.

All the words are to be stored in upper case. The sentences are stored in uppercase and lower case as they appear in the text.

Option ‘M’

When the user selects option „M‟, the program will ask for a non-empty string to be entered from the key board, and display on the screen all the words concordance that begins with that string.

For example, if the text file is the one listed in the Appendix, and the string entered by the user is „Th‟, the following output will be displayed

Looking for words starting with: Th
Number of words matched: 4
THAT, THE, THERE, THIS

The matching is not case-sensitive (that is, we get the same output when we enter “th” or “TH” or “Th”, etc.). The words are listed in alphabetical order and separated by commas.

Option ‘S’

When the user selects option „S‟, the program will ask for a word to be entered from the key board, and display on the screen the list of sentences in which the word appears.

As an example, if the text file is the one listed in the Appendix, and the search word is „truth‟, the following output will be displayed

Search word: truth
Number of sentences: 2
(1)[4]: It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife
(2)[5]: However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters

It is required that

  • The sentences must appear in the order they appear in the text.
  • Each sentence is preceded by two numbers: the first signifies the order it appears in the current list, the second the order it appears in the text
  • The sentences are separated by blank lines
  • If a word appears in the same sentence more than once, the sentence is listed only once
  • The matching of the search word with the words in a sentence is not case-sensitiv. (That is, we would have the same output when the user enters “TRUTH” or “tRutH”, etc.)
  • If the word is not in the text, an appropriate message should be displayed

Option ‘W’

When the user selects option „W‟, the program will ask for the search word, and display the search result on the screen as described for option „S‟. It will then the search result to a text file. If, for example, the search word is „truth‟, the text file will have the name

ContextOf-TRUTH.txt

Note that the search word appears in uppercase in the filename.

Restriction on the use of Java Collections Framework

You can use only classes ArrayList and LinkedList. Other classes, such as TreeMap for example, are not allowed.

Efficiency Issue

The efficiency with which the program performs various operations is a major concern. The files to be read in can be quite long. You will need to carefully consider which algorithms and data structures you use.

You can use any text file for input to this program. A good source of long text files is at the Gutenberg project (www.gutenberg.com) which is a project aimed to put into electronic form older literary works that are in the public domain. The extract from Jane Austen‟s book Pride and Prejudice used as the sample text file above was sourced from this web site. You should choose files of lengths suitable for providing good information about the efficiency of your program.

The program will be marked on correctness, style and efficiency.

Style will be worth 10 marks. This will include commenting, layout, choice of identifier names, choice of methods and/or classes, etc. Correctness will be allocated 25 marks. 35 marks will be allocated for efficiency in the program. Programs that are highly inefficient may receive no marks here.

Task 2

You are required to submit electronically a written report describing your solution and implementation.

The report should describe:

  • How the program works. You should focus on the data structures and algorithms used. Include in your discussion a comparison of at least three data structures/algorithms you considered using to solve the problem. (3 page limit)
  • The tests carried out on the program, and the results of these tests. Testing should be carried out for both correctness of algorithm and efficiency
  • Correctness testing should be reported in table form with fields for the test case‟s description, the test data used for this test, the expected outcome, and the actual outcome. At least 10 different test cases should be reported.
  • Efficiency testing should include the results of using the time command on different data sets and a discussion of these results in relation to the theory of the data structures/algorithms used and considered. (3 page limit)

The report does not have to be long but it should discuss all necessary material. Point form and tables are acceptable where appropriate.

It should be submitted electronically as a file with the name “report” (acceptable formats are: plain text, PDF or Word). Make sure your name and student number are on every page of your report. Putting them into a header or footer would be a good idea.

Task 3

Also include in the report for Task 2 your answer for Task 3 described below.

  • Consider the requirements for Task 1.Suppose we do not have option M (List the words that start with a given string). What data structures and algorithms would you choose in this case? Explain why you make those choices.
  • Consider again the requirements for Task 1. Suppose we do not have options S (Search for a word and the sentences in which it appears) and W (Write to a file the sentences in which a word appears). What data structures and algorithms would you choose in this case? Explain why you make those choices.

Appendix

An extract from Pride and Prejudice

Pride and Prejudice
by Jane Austen
Chapter 1
It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife.
However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.
"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"
Mr. Bennet replied that he had not.
"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer.
"Do you not want to know who has taken it?" cried his wife impatiently.
"YOU want to tell me, and I have no objection to hearing it."
This was invitation enough.
Academic Honesty!
It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.
Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.