Web search engines are programs that search webpages for specified keywords turns a list of relevant webpages where the keywords were found. The tu in tasks for search engines are matching and ranking. The matching phase determines which webpages match a query and the ranking phase determines the relevance of each of the matches. Note that search queries can produce thousands or millions of hits (matches) and a search engine must be capable of picking the most relevant results (ranking).

The concept of an index is the most fundamental idea behind any search engine The index for a search engine works the same way as a book's index. The "pages" of the book are now webpages and search engines assign a different page number to every single page on the web. The search engine builds up an index by first making a list of all the words that appear in any page, making a note of the page number in which each word appears, and then sorts that list in alphabetical order. So if a query is made on the word "computer", the search engine can quickly jump to the entry for "computer in the word list and extract the pages for that entry.

Rather than build an index for webpages, this lab will build an index, or a crossreference listing, for text-based documents. The lab will display the number of times each word appears in the document, the line numbers in which each word appears and the position of each word in the line. The lab will also allow us to query the document for specific words and will display the number of times the word appears in the document, the line numbers in which the word appears and the position of the word in the line.

At the outset we do not know how many different words are in the text, nor do we know in how many different lines a word may appear. Our approach will be to use an object binary search tree whose nodes provide for the following fields:

  • Word object
  • Two references to the left and right children of the current node

Note that while the ObjectTreeNode class is built to hold a generic Object, you'll be creating a Word class whose objects will be placed in the ObjectTreeNodes of the object binary search tree.

Each Word object should provide for the following fields:

  • a String for the word to be stored Note that a word is defined as any sequence of characters separated by a blank, tab, or newline. Punctuation at the end of a word should be removed. Therefore, is a hyphen considered Words should not be case sensitive.
  • an int to count the number of times the word appears in the text
  • a reference to an object linear linked list in which each object contains the line number in which the current word can be found as well as the position of the word on the line. Note that we're counting word-by-word position here and not character-by character position.

Here is a listing of the classes and interfaces that you will want to include in your project:

Xret ObjectBinarytree ObjectList
Word ObjectBinaryTreeInterface ObjectListInterface
LinePosition ObjectTreeNode ObjectListNode
Hash ObjectTreeNodeInterface ObjectListNodeInterface
Query TreeComparable Comparable

Specifically, your program should perform the following operations:

  • Input text from a file, getty.txt, and output each line of text prefixed by a line number.
  • Output a well-formatted cross-referenced listing alphabetically by performing an inorder traversal of the binary search tree. Each word, the number of times each word appears in the text, the line number(s), in ascending order, in which each word is found, as well as the position the word appears on each line should be output. Here's a look at the recommended output format:
our 2 1-7 11-10
people 3 20-11 21-1 21-4
perish 1 21-7
  • Allow the user of the program to perform run-time queries and perform a search for the following words:
dedicate men resolve vain
devotion not soldier war
gave people us

For each word searched, output the word, the number of times the word appears in the text, the line numbers in which the word appears and the position the word appears on each line.

To make this cross-reference table as simple and as useful as possible, we will not include all the words of the text in the index. The following words should be omitted:

a have that
after in the
all zis their
and it there
because it's to
every now was
for of were
from on which
had so with

To accomplish this task, we shall store each of the above words in a hash table of size 37 that resolves collisions by linear probing or chaining. After the hash table is created, we can select each word from the text, get its hash, and determine whether or not it appears in the hash table. If the word does appear in the hash table, we should simply ignore it: otherwise, we should store the word in ne binary search tree, incrementing the counter which keeps track of the number of times the word appears, and attaching a new node to the linear linked list which contains the line number in which the word appears along with its position within the line. Note that the line numbers stored in the linear linked list should be kept in increasing order for eventual output.

You may place the 27 unimportant words into a text file, read the words into your gram, and hash the words into their proper location within the hash table.

Note that these 27 unimportant words placed into the hash table are not the same as the Word objects that will be placed into the binary search tree!

Note that the hash table should be created before reading any data from the getty.txt file. The hash function signature should be as follows:

private int getHash (String s);

If you are using linear probing to resolve collisions, you should create a hash function that produces ten or fewer collisions and you should output the number of hash function collisions as well as the number of resolution collisions. The total of these two types of collisions should be ten or less. If you are using chaining to resolve collisions, your hash function should return five or fewer collisions and you should output the total number of collisions, the average chain size to two decimal places, as well as the maximum chain size.

Be sure to create your own Hash class rather than use the Hash class built into the Java library

For extra credit, develop a perfect hash function. For extra, extra credit, develop a minimal perfect hash function.

The following should be added to your program's output:

  • a description of your hash function
  • a display of the hash table including array indices of each item in the hash table

Be sure all output is sent to the terminal window as well as the csis.txt output file. Shown below are the contents of getty.txt.

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.

Binary Tree and Hashing Lab Note 1: Program Documentation

I've provided a document on the use of Javadoc that can be found in this link on Blackboard:

Docs | Java Review | Javadoc Program Documentation

In particular, note the following guidelines:

  • Every class must have a Javadoc class comment.
  • Every method must have a Javadoc method comment.
  • Every method parameter must have an @param tag.
  • Every method with a return statement must have an @return tag.
  • Generate the HTML Javadoc documentation for your project.

Binary Tree and Hashing Lab Note 2: Interfaces

Each data structure in your project must implement an interface for that data re. The interface files must be included in the zip archive for your code.

Academic Honesty!
It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.
Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.