You want to become the next billionaire techie. The key to your fortune is a state-of-the-art Web search engine, which ranks Web pages against a given query better than your competitors. Your similarity measure treats both Web pages and queries as 1D arrays. Let us assume that our vocabulary, which contains all words that we will use, contains m words. For example, "apple", "atom", "and" and so on.
Each Web page (or document) d_i can be viewed as a 1D array:
For example, if the first three terms of Web page d_i were "1 0 1," then the Web page contains words "apple" and "and," but not the word "atom."
Similarly, the user query can be viewed as a 1D array:
Then, for example, if w_q2 is 1, the user entered "atom" in her query.
The relevance of Web page d_i and the query q is
measured by the similarity function sim(d_i,q):
In other words, the numerator is the sum of the product of the corresponding
terms in the 1D arrays of the Web page and query. The denominator is the
squareroot of the sum of the square of the terms in the
1D array for the Web page
the squareroot of the sum of the square of the terms in the
1D array for the query.
The higher the sim(d_i,q) score, the more relevant the Web page d_i is to query
Being very clever, instead of just using a 1 to indicate the word appears and a 0 to indicate it is missing, you decide to use any real number between 0 and 1 to give "weight" to each term. For example, "atom" might be more useful than "and," or a certain word might appear several times in the same document or query. Using weighted terms is your key to a fortune!
Here is a sample output.
hercules% web_search The denominator for the 1st document is 0.848999 The similarity for the 1st document is: 0.777386 The denominator for the 2nd document is 0.772172 The similarity for the 2nd document is: 0.25901 The denominator for the 3rd document is 0.919112 The similarity for the 3rd document is: 0.312258 Document number 1 is most relevant to the query.