We are often faced with the need to process multipledocuments in order to determine how similarthese documents are. For example, in ECE15, the source code submitted by the students is routinely tested in order to prevent cheating: for each pair of students, the programs they submit are compared and the degree of similarity between these programs is determined. Performing such comparisons manually is impractical due to the sheer volume of the material (for instance, if there are 100 students in ECE15, onehas toperform almost5,000comparisonsbetween programs). Furthermore, variousdocumentsthat need to be compared are often much too complex for humans to analyze directly in order to determine similarity. Thus computer scientists often write programs designed to automate this task.
Given two text files, say fooA.txt and fooB.txt, how can one compute the degree of similarity between them? A simple approach is described in what follows. Suppose fooA.txt contains n1dif- ferent words, say x1, x2,..., xn1, and define See image.
for all i = 1,2,...,n1. In the same manner, suppose that fooB.txt contains n2different words, say y1, y2,..., yn2, and define See image.
for all i = 1,2,...,n2. If the sets of words that appear in fooA.txt and fooB.txt, namely the sets x1, x2,..., xn1and y1, y2,..., yn2, are disjoint, then similarity(fooA.txt,fooB.txt) = 0 by convention. Otherwise, suppose m different words, say z1, z2,..., zm, appear in both fooA.txt and fooB.txt. Then the similarity between fooA.txt and fooB.txt is defined as follows: See image.
Observe that for any two files fooA.txt and fooB.txt, the similarity between them is a real num- ber in the range [0,100]. This number is zero if and only if the two files have no words in common. On the other hand, if the two files are identical, then the similarity between them is 100.
As an example, consider the files inception1.txt and inception2.txt that are shown on the next page. Table1 and Table2, also on thenext page, showall the different words that appear in thefiles inception1.txtandinception2.txt,respectively,alongwiththecorrespondingwordcounts. Observe that these word counts are case insensitive. For example, the words The and the in the file inception1.txt are counted as the same word. Using Table1 and Table2, we can easily compute
The action film Inception was the biggest hit of 2010. The film was written and directed by Christopher Nolan.
Inception is a science fiction action film written and directed by Christopher Nolan which grossed 8 hundred million and was the biggest hit of 2010.
See figure result: See image.
Formula: See image.
Table3tabulates thosewords thatare commonto bothinception1.txt and inception2.txt, along with the number of times they appear in each file. Notice that Nolan is not a common word. For the purposes of this problem, a word is any sequence of consecutive non- whitespacecharactersterminatedbywhitespace. Thusinception1.txtcontainsthewordNolan. while inception2.txt contains the word Nolan, and these words are considered different. Fin- ally, using Table3, we can completethecomputationofthesimilaritybetween inception1.txtand inception2.txt as follows: See image.
This is a relatively high degree of similarity, which reflects the fact that the text in inception1.txt and in inception2.txt is about the same subject.
Write a C program, called similarity.c, that reads two text files, and computes the similarity be- tween them. The program should also output all the words that are common to both files along with the number of times each of these words appears in the two files (in a format similar to Table3). Details regarding the program input and output are described below.
The program should read its input from the files file1.txt and file2.txt. Both input files con- tain numerous sentences, each sentence consisting of several words. For the purposes of this problem, a word is any sequence of consecutive non-whitespace characters terminated by whitespace. Note that the total number of words in an input file is not limited in any way. However, the length of each indi- vidual word is limited: you can assume that a word will never exceed 255 characters. This limit should be introduced into your program as a symbolic constant (use #define MAX WORD LENGTH 255).
You can regard the files inception1.txt and inception2.txt shown on the previous page as sample input files. You do not need to check the validity of the input files: you can assume that they conform to the specifications detailed above.
The program shouldwriteits outputto thefile similarity.out,in the format described below. The first line of similarity.out should report the similarity between the two files. The second line should report the number of common words found. Following the second line, the common words should be printed out, along with their counts in file1.txt and file2.txt. The order in which the common words are printed is not important, you can print them in any order. Print one common word per line, followed by a single space, followed by the count of this word in file1.txt, followed by a single space, followed by the count of this word in file2.txt.
A sampleoutputfile similarity.out,which results upon processing thefiles inception1.txt and inception2.txt on the previous page, is shown on the next page. See image.