You want to become the next billionaire techie. The key to your fortune is a state-of-the-art Web search engine, which ranks Web pages against a given query better than your competitors. Your similarity measure treats both Web pages and queries as 1D arrays. Let us assume that our vocabulary, which contains all words that we will use, contains m words. For example, "apple", "atom", "and" and so on.

Each Web page (or document) d_i can be viewed as a 1D array:

For example, if the first three terms of Web page d_i were "1 0 1," then the Web page contains words "apple" and "and," but not the word "atom."

Similarly, the user query can be viewed as a 1D array:

Then, for example, if w_q2 is 1, the user entered "atom" in her query.

The relevance of Web page d_i and the query q is measured by the similarity function sim(d_i,q):

In other words, the numerator is the sum of the product of the corresponding terms in the 1D arrays of the Web page and query. The denominator is the squareroot of the sum of the square of the terms in the 1D array for the Web page multiplied with the squareroot of the sum of the square of the terms in the 1D array for the query. The higher the sim(d_i,q) score, the more relevant the Web page d_i is to query q.

Being very clever, instead of just using a 1 to indicate the word appears and a 0 to indicate it is missing, you decide to use any real number between 0 and 1 to give "weight" to each term. For example, "atom" might be more useful than "and," or a certain word might appear several times in the same document or query. Using weighted terms is your key to a fortune!

Exercise

For this lab, we will use a small vocabulary of size 7, that is, m=1, m=2, ...,m=7. There are three web pages (d1,d2,d3) and one query (q) issued, all of which already have been stored as 1D arrays. Complete this Web Search C++ program by implementing the similarity function. After calculating the three similarity scores sim(d1,q), sim(d2,q), sim(d3,q), your program should output the most relevant Web page number to the user.

  • Get a copy of the web_search.cpp
  • Here is a sample output.

    
    
    hercules[1]% web_search
    
    The denominator for the 1st document is 0.848999
    
    The similarity for the 1st document is: 0.777386
    
    The denominator for the 2nd document is 0.772172
    
    The similarity for the 2nd document is: 0.25901
    
    The denominator for the 3rd document is 0.919112
    
    The similarity for the 3rd document is: 0.312258
    
    Document number 1 is most relevant to the query.