In order to account for the difference in programming experience between the different students, the project is divided in three parts:
The joint part should be completed by all the students, regardless of whether they come from GP or OOP. It consists of creating the core of the search engine, and a fully functional program will be obtained at the end.
The advanced part is only mandatory for OOP students or students with similar programming experience. It will add some features and optimizations to the search engine.
Finally, the extensions are a list of optional features for the engines.
Three data files containing (parts of) old ITU web pages are available for use as field data for the project. The following example illustrates the file format: See image.
The idea is that a data file contains information about a series of web pages. The information for a web page starts with a *PAGE line which denotes at which address the page can be found. Such an address is called a URL. The list of words contained in the page follows the URL. The words are listed one word per line in the same order as they occur on the page. As seen in the example, each word can have several occurrences on the same page. The following three data files are available:
- The file itcwww-small.txt
- The file itcwww-medium.txt
- The file itcwww-big.txt
The program SearchCmd
As part of the background material, the program SearchCmd.java is available. It implements a small search engine. The program takes the name of the data file used for searching as a parameter. In the program, the given parameter is accessible in args where args is the argument to the main method.
The program then constructs a linked list (see data structures) with all the lines from the data file given inargs. The objects in the linked list contain the fields str and next. The field str contains a line from the data file and next points to the next object in the list. After reading the data file, a simple user interface launches, enabling the user to query the search engine: Given a word, the engine informs the user whether or not the word was present in the input.
The joint part
The joint part consists of solving the following 4 assignments.
- Compile and run the program SearchCmd.
- Modify SearchCmd so that if the user requests a search for a particular word (say, “Hans”), the URLs of all pages where the word “Hans” occurs are written to the screen. This has to be done by changing the search method in SearchCmd. It is not done by changing the data structure.
- Modify the construction of the data structure in such a way that a linked list of pairs of strings and urls is created. This linked list consists of objects containing three fields. The three fields are str, next andurls. The fields str and next are like in SearchCmd, but we will now only use the str field to store words (i.e. not urls) and each word only occurs once. In case an object contains the word “Hans” in the fieldstr, the object’s urls field will be a pointer to a list of URLs of all pages that contain the word “Hans”. A list of URLs consists of objects with two fields; next and url. The field next is a pointer to the next object in the list and the field url is (a pointer to) a URL. After construction of this data structure, you have to program the corresponding search procedure(s). The data structure created in this assignment has smaller size than the structure from SearchCmd and is faster to search through.
- Modify the solution from assignment 3 so as to use a hashtable (see data structures) instead of a linked list of words. You can create the data structure using chained hashing. This means that each entry contains a reference to a linked list of URLs. The use of a hashtable will have a dramatic effect on both the time it takes to initialize the search engine and the time it takes to perform individual searches. In the solution of the above assignments it is not allowed to use any packages other than java.io andjava.lang.
As a consequence it is not allowed to use any of the classes from the java.util package: One of the purposes of this project is to learn how to program various data structures yourself, as opposed to just using data structures programmed by others.
As you can see, you are allowed to use different methods on String objects. It is also allowed to use the method hashCode and to allocate arrays.
The advanced part
For the assignments after the joint part, there are no restrictions on the Java functionality you may use. We however recommend that you do not spend too much time looking for advanced and unnecessarily complicated features in the vast collection of java libraries.
The advanced part is mandatory for OOP students. The task is to add support for the boolean operators AND and OR to the search engine. The resulting search engine should accept three kinds of input strings:
- Ordinary words, as before. Example: “Hans”.
- A word AND another word. Example: “Hans AND Grethe”.
- A word OR another word. Example: “Hans OR Grethe”.
For example, the input “Hans AND Grethe” should result in a list of those pages where both words “Hans” and “Grethe” occur, while the input “Hans OR Grethe” should result in a list of those pages where either “Hans” or “Grethe” (or both) occurs.
You are not required to support more complex expressions containing several occurrences of AND and OR. (If you are interested in solving that problem, see one of the suggested extensions below.) As a starting point you should use your solution to assignment 4 of the joint part (i.e., the hashtable-based solution).
The advanced part is deliberately more open-ended than the joint part. However, we recommend that you proceed as follows:
- Modify the search method in SearchCmd such that it recognizes (but does not yet respond to) the three kinds of input strings listed above; you may want to use the java.util.StringTokenizer class. If the input is an ordinary word, a normal search is performed. If the input is, e.g., “Hans AND Grethe”, the method simply prints “And: Hans Grethe” instead of performing a search. If the input is, e.g., “Hans OR Grethe”, the method simply prints “Or: Hans Grethe” instead of performing a search.(This advice applies to everybody, not just new programmers. Failure to divide tasks into sub-tasks can result in spectacularly complicated debugging sessions where the error is finally traced to a supposedly “easy” part.)
- Modify the search method such that it responds correctly to “AND” input strings. For now, aim for the simplest and easiest solution you can think of, without worrying about performance. Do the same for “OR” input strings.
- (Optional.) Try to improve the design. The goal is to make searches faster and/or decrease the memory usage. You may want to look at the java.util.HashSet class or other classes from the Java Collections Framework.
The following are suggestions for assignments to do after the mandatory part(s). The ordering of these suggestions is random.
- Write a web crawler.
- Use the extended data file.
- Investigate how to easily support prefix-search. For example a search for “arm*” should find all pages with “arms”, “armoury”, “armadillo”, “armageddon”, “armistice” etc.This only requires an extra table of pointers to the words, where the table is ordered by words. • Create a graphical user interface.
- Make the program more efficient in terms of usage of space (mostly for people studying Introduction to algorithms and data structures).Idea: Instead of having pointers to URL-strings, you can put the various URL-strings in a table. The pointers can now be replaced by a number corresponding to the entry in the URL-table where the URL can be found. Since you don’t know how large the URL-table has to be, you double it’s size each time it runs full (this is described in CLRS). By using amortized analysis it can be shown that this doesn’t increase the running time significantly. Instead of using a list of URL’s with each word, a table of URL’s represented as numbers can now be used. This also requires doubling techniques and amortized analysis. For words that occur in almost all pages a third strategy can be used; namely a “bit vector”. The idea is that a word occurs in the page if and only if the corresponding bit has a value 1.
- Construct your own hashCode method. See the reference to hashing under Data structures.
- Find other data other than homepages to search through and modify your search engine to fit your situation.For example, you could be searching for grep patterns in a text file.
- Design a tool that, given a web page x, will search for web pages similar to x.
- Enable the user limit the search to URLs matching a pattern such as *.dk or itu.dk/dkm/*.
- Combine two user-imposed constraints on searching, for example limited-domain search and search for a particular prefix.
- Support complex boolean searches, for instance ((((a AND b) OR c) AND f OR b) OR g) AND NOT h.
- If a search produces no result, it might be because the user performed an erroneous keystroke, so that, e.g., an “a” became an “s”. In such cases you can choose to return pages which contain words almost matching the word searched for.
- Support free text searches. That is, give the opportunity to search for (fragments of) sentences. A strategy is to implement a free-text search as an AND boolean search. This limits the search to a small amount of pages which are then examined in closer detail. Another approach is to calculate hash values for sentences instead of words. A third possibility is a more compact representation of the text. Instead of having the word repeated a certain number of times, you remember at which positions in the homepage the word occurs. Find your own favorite strategy.
- How can the data structure be updated dynamically? You can choose to collect information on a lot of new homepages before updating. Another approach is to implement dynamic data structures such as dynamic search trees. The method chosen will be a compromise between space, search time and the time for updating the data structure.
- Implement the joint part using vectors, collections etc. and compare this (these) solution(s) to the solution from the joint part where these classes were not used.
In order to complete the assignments in the joint part you need (an understanding of) linked lists and hashing.
Linked lists are explained at the first project day.
The following are general requirements on the report:
- The report must contain a foreword telling who has fulfilled the assignment, what the assignment contains, where the assignment is made (ITU), during what period, and who the teachers are. Furthermore, a short statement on whether the project went as planned is required. The foreword should not exceed one page.
- The document “project-agreement” is used as the cover page for the assignment. Furthermore there are the following requirements on the report that are related to the three search engines implemented in the joint part:
- You have to describe how the different search engines work, which data structures are used and how they are used. This description must be in a natural language and Java code may only be used to support the description.
- You have to document your investigation of how effective the different search engines are by running them. Be careful to run each engine several times on the same input.
- You have to compare the different implementations to each other, by using your benchmarking results from the preceding clause and/or by applying algorithm analysis. People studying algorithms are expected to use O-notation in this comparison.
- You have to document correctness testing of the search engines. Both an internal test (also called white box/structural test) and an external test (also called black box/functional test) must be performed and documented, but you are not required to do a full internal test.
- The code for your search engines and documentation of tests etc. must be handed in as appendices to the project report. In addition, your code must be handed in on a CD or USB stick. What the report contains apart from the joint part, depends on the work done by the group and the chosen focus of this work.
- Cover page (the “project-agreement”)
- Walk-through of SearchCmd 1
- Walk-through of SearchCmd 2
- Walk-through of SearchCmd 3
- Walk-through of SearchCmd 4
- Benchmarking results for the different search engines together with theoretical explanations of the results.
- Extra assignments (including the advanced part if you carry it out) The description of SearchCmd can be presented as follows:
- Describe the Input/Output-relation (”What does the program do?”) without reference to code in such a way that it could be read by anyone.
- Describe the data structure without reference to the specific implementation.
- Describe ideas, invariants etc. for the code.
- Describe in general terms and without references to the code how it is structured.
- Describe details of the code if needed.
- Describe how much time and space the code uses.
- Compare to previously described search engines.