CS210 Lab: Hash Table


Highlights of This Lab:

Lab Exercise:


Click the little computer above for a detailed description.
For this excercise you will be asked to implement a password checking system to authenticate a users password.

1. Definition of a Hash Table

Before we get into the definition of Hash Tables, it is good to introduce WHY to use Hash tables.

Hash tables are good for doing a quick search on things.

For instance if we have an array full of data (say 100 items). If we knew the position that a specific item is stored in an array, then we could quickly access it. For instance, we just happen to know that the item we want it is at position 3, we can apply myitem=myarray[3];

With this, we don't have to search through each element in the array, we just access position 3.

The question is, how do we know that position 3 stores the data that we are interested in?

This is where hashing comes in handy. Given some key, we can apply a hash function to it to find an index or position that we want to access.

1.1 What is the hash function?

There are many different hash functions. Some hash functions will take an integer key and turn it into an index. A common one is the division method.

Let's learn through an example:

1.2 Division method (one hash method for integers)

Let's say you had the following numbers or keys that you wanted to map into an array of 10 elements:

123456
123467
123450

To apply the division method, you could divide the number by 10 (or the maximum number of elements in the array) and use the remainder (the modulo) as an index. The following would result:

123456 % 10 = 6 (the remainder is 6 when dividing by 10)
123467 % 10 = 7 (the remainder is 7)
123450 % 10 = 0 (the remainder is 0)

These numbers would be inserted into the array at positions 6, 7, and 0 respectively. It might look something like this:

The important thing with the division method is that the keys are integers.

1.3 What happens when the keys aren't integers?

You have to apply another hash function to turn them into integers. Effectively, you get two hash functions in one:

  1. function to get an integer
  2. function to apply a hash method from above to get an index to an array

What do we mean that the keys aren't integers? Well, let's say that the keys are people's names. Such as:

Sarah Jones
Tony Balognie
Tom Katz

The goal is to type in one of these names and get an index to an array in order to access that information. How do we do this?
The hash function has to do two things:

  1. Convert the names into integers
  2. Apply a hash method to get an index

1.4 What would that look like in the array?

The array is known as a hash table. It stores the key (used to find the index) along with associated values. In the above example, we might have a hash table that looked something like this:

Again, the idea is that we will insert items into the hash table using the key and applying the hash function(s) to get the index.

A problem occurs when two keys yield the same index. For Instance, say we wanted to include:
John Smith --> 948 % 10 --> 8

We have a collision because Sarah Jones is already stored at array index 8.

We need a method to resolve this. The resolution comes in how you create your hash table. There two major approaches given in the book:

  1. Linear Probing
  2. Chaining
The approach used in this lab is referred to as chaining.

The details are left as class material, but recognize that in chaining the hash table is basically an array of linked lists. These linked lists are called buckets. All the data in the same bucket, have colliding index values.

Consider a diagram of the above example. Remember, there was a collision with Sarah Jones and John Smith. Notice that John Smith is "chained" or "linked" after Sarah Jones; John Smith and Sarah Jones are in the same bucket.

1.5 Applications of a Hash Table

Hash tables are good in situations where you have enormous amounts of data from which you would like to quickly search and retrieve information. A few typical hash table implementations would be in the following situations:

 


2 STL Hash Table Based Containers

As of C++11, the STL provides four hash table based containers:

  1. unordered_set: The elements themselves are used as hash keys. Elements must be unique. Elements cannot be changed while in the hash table.
  2. unordered_multiset: Like unordered_set but duplicate elements (keys) are allowed.
  3. unordered_map: The elements are made up of a pair of types: a key_type and a mapped_type. Keys are stored separately from mapped values. Key values must be unique. The mapped value may be changed while in the hash table, while the key may not.
  4. unordered_multimap: Like unordered_map but duplicate keys are allowed.

Of these, the easiest to work wih is the unordered_map. Elements in the map are stored as a pair: the key and its corresponding value. Storage and retrieval of the elements requires that the key have defined methods for hashing and checking for equality. So long as the key is a simple built-in type like char, int, float or string, C++ and the STL provide this functionality. For more complex keys, you would need to define your own hashing function object and provide an overloaded operator==.

Before you can use unordered_maps you need to understand two new concepts: the STL pair type, and function objects.

2.1 STL Pair

This is a simple utility type that allows STL functions to return or accept two items as if they were one. You will see them used to indicate whether an operation was successful, or to pass and return a key and value as one.

Here is an sample program that declares a pair and accesses its two members: first and second.

// Using a pair. Notice that the pair has two members named first and second.
// The pair is templated, and you may specialize each member to separate or to
// the same types.
//
// Written by: Alex Clarke Nov. 7 2014
 
#include <iostream>
#include <string>
#include <utility> //pair lives in this header. 
                   //You do not normally include it... 
                   //it is included for you by containers that require it.
 
using namespace std;
 
template <class A, class B>
void printMyPair(pair<A, B>& p)
{
   cout << p.first << " " << p.second << endl;
}
 
int main()
{
   //declare, set and print a pair
   pair<string, int> a;
   a.first = "Pair number: ";
   a.second = 1;
   printMyPair(a);
 
   //alternate way to set a pair
   a = make_pair("New pair number: ", 2);
   printMyPair(a);
 
   //initialize the pair at construction time
   pair<doubledouble> b(2.2, 3.3);
   printMyPair(b);
 
   return 0;
}

2.2 Function Objects

The easiest way to define your own hash function for an unordered_map, is to make a class whose instances can be used as functions. To do this, you simply add an operator() to a class that accepts whatever argument you need. For our hash table exercise, we will need to hash a string. Although STL provides a superior hashing function for strings, it is good to know how to write your own. Here is a simple example of such a function.

// String hashing with a function object.
//
// Although STL provides a hashing function object for strings, it is nice to know
// how to define and use your own hashing function object should the need arise.
//
// This demo compares STL's string hashing function to our own simple one.
//
// Written by: Alex Clarke Nov. 7 2014
 
#include <iostream>
#include <string>
#include <functional> //STL hash lives in this header.
                      //You do not normally include it... 
                      //it is included for you by containers that require it.
 
using namespace std;
 
 
// Our special string hashing class
class stringHash
{
public:
   //this operator allows us to use stringHash objects as functions to hash strings.
   size_t operator()(const string& hashit) const
   {
      size_t val = 0;
 
      for (size_t i = 0; i < hashit.length(); i++)
         val += hashit[i];
      return val;
   }
};
 
 
int main()
{
   // This example shows how to instance and use a function object on its own.
   stringHash myHash;
   hash<string> stlHash;
   cout << "Hashes for the sentence \"What number will this produce?\"\n\n";
   cout << "stringHash:\t" << myHash("What number will this produce?") << endl;
   cout << "hash<string>:\t" << stlHash("What number will this produce?") << "\n\n\n";
 
   // This example shows how to use a function object to specialize a template.
   // This is similar to what you need to do with unordered_map, but the hash function
   // will be called for you by various member functions.
   pair<string, stringHash> mine;
   pair<string, hash<string> > STLs;
   cout << "Hashes for the word \"mary\"\n\n";
   mine.first = STLs.first = "mary";
   cout << "stringHash:\t" << mine.second(mine.first) << endl;
   cout << "hash<string>:\t" << STLs.second(STLs.first) << "\n\n\n";
 
   return 0;
}

2.3 Using STL unordered_map

In order to use STL unordered map, you must

#include <unordered_map>

The number of methods for creating and working with an unordered map are overwhelming at first. Here is an abbreviated list of some typical functions associated with the STL unordered map. Note that size_type type is an unsigned integral value.

create a map

unordered_map<key_type, mapped_type> name;
To create an unordered map you must provide at least a key type, and the mapped value type, like so:

unordered_map<string, float> mymap;

unordered_map<key_type, mapped_type, hasher> name;
You may also want to provide a custom hashing function, especially if your key type is not supported. For example, to use the stringHash class discussed above, you would write:

unordered_map<string, float, stringHash> mymap;

unordered_map<key_type, mapped_type> name(size_type n);
By default, an unordered map will resize itself (known as rehashing) when its size is equal to the number of buckets. Rehashing is expensive. If you know in advance how many items your map will likely hold, you can specify an appropriate minimum number n of buckets. The implementation may choose to use any number larger than n.

unordered_map<string, float> mymap( 64 ); // mymap will have at least 64 buckets
empty bool empty() const noexcept;
Returns a true value if the number of elements is zero, false otherwise.
if (!mymap.empty())
{
  ...
} 

create /
update an element

mapped_type& operator[](const key_type &key);
Used to insert new mapped values, or to retrieve and modify existing ones. You must be careful when using it, since if the key was not in the table, it will be added with a default value. An insert function exists, but this is much easier to use.

Note: Should not be used if you only want to get the mapped value of a key.

mymap["quarter"] = 0.25;
mymap["dime"] = 0.1;
mymap["nickel"] = 0.5;  //whoops - wrong value
mymap["nickle"] = 0.05; //double whoops - wrong spelling
mymap["nickel"] = 0.05; //partly fixed

erase

size_type erase (const key_type &key);
Searches the hash table for the data item with the key key. If the data item is found, it removes the data item and returns 1. Otherwise, it returns 0.
if (mymap.erase("nickle") == 1)
{
   //"nickel" was found and removed
}
else
{
   //"nickel" was not found
}

retrieve

mapped_type& at (const key_type &key);
Searches the hash table for the data item with the key key. If the data item is found, it returns a reference to the mapped value. Otherwise, it throws an out_of_range exception..
try
{
   float value = mymap.at("nickel");
   cout << "A nickle is worth " << 100*value << " cents." << endl;
}
catch (out_of_range)
{
   cerr << "Could not find the last key you searched for" << endl;
}
iterator find (const key_type &key);
Searches the hash table for the data item with the key key. If the data item is found, it returns a iterator to the stored pair. Otherwise, it returns an iterator to end().
unordered_set<string, float>::iterator findit;
findit = find("nickel");
if (findit != mymap.end())
{
   cout << "A " << findit->first 
        << " is worth " << 100*findit->second << " cents." << endl;
}
else
{
   cerr << "Could not find the last key you searched for" << endl;
}

begin

iterator begin() noexcept;
Returns an iterator that references the first element in the unordered map. This could be any element, since the unordered map container does not guarantee which element comes first.

end

iterator end() noexcept;
Returns an iterator that references a special element just past the last valid element in the table.

size size_type size() const noexcept;
Returns the number of elements (pairs) currently stored in the map. The size_type type is an unsigned integral value.
// Print the number of elements in the list.
 cout << mymap.size() << endl;
        
bucket count size_type bucket_count() const noexcept;
Returns the number of buckets in the map. If this number changes, you know the map has been rehashed. The size_type type is an unsigned integral value.
// Print the number of buckets in the list.
 cout << mymap.size() << endl;
bucket number size_type bucket(const key_type& key) const;
Returns the bucket number corresponding to the given key.
 cout << mymap.bucket(myset.begin()->first) << endl;

print contents /
show structure

This is not a member function, but rather a demonstration of using begin() and end() to print all members of the hash table. The order of elements between begin() and end() is not guaranteed. For this reason, I have asked for the bucket number of each element so we can better see the structure of the hash table.

for (unordered_set<string, float>::iterator i = myset.begin(); i != myset.end(); i++)
{
   cout << "Key: " << i->first << " Value: " << i->second 
        << " Bucket number: " << myset.bucket(i->first) << endl;
}

3. Application: Looking up Passwords

The following section outlines an algorithm for authenticating a user's password. Later, in the lab exercise, you will be given the skeleton code and asked to add lines to make it work.

One possible use for a hash table is to store computer user login usernames and passwords.

There are two major steps to this program:
  1. The program will load username/userinfo sets from the file password.dat and insert them into the hash table until the end of file is reached on password.dat. The password.dat file will look something like this with one username and its associated information per line:
    first last username password name name
    jack broken.crown Jack Hill jill tumbling!down Jill Hill mary contrary Mary Stuart bopeep sheep!lost Bo Peep ... etc.

  2. The program will then present a login prompt, read one username, present a password prompt, and after looking up the username's password in the hash table, will print either "Authentication succeeded." and some user information, or "Authentication failed.". The output might look something like this:
    Login: mary
    Password: contrary
    Authentication succeeded. Welcome Mary Stuart.
    
    Login: jim
    Password: contrary
    Authentication failed.
    
    Login: bopeep
    Password: sheeplost
    Authentication failed.
    
Step 2 will be repeated until the end of the input data (EOF) is reached on the console input stream (cin). The EOF character on the PC's is the CTRL Z character.

Let's fill in some of the details:

Here's a closer look at the algorithm used in the string hashing function we saw earlier. A simple way to convert from a string to an integer is to add the ASCII value of each character together. For instance, mary's conversion from string to integer might look something like this:
109('m') + 97('a') + 114('r') + 121('y')=441
The code will look like this:
    int hash(const string str) const
    {
        int val = 0;

        for (int i=0; i<str.length(); i++) 
            val += str[i];
        return val;
    }
We've converted the string to an integer, but we still need to convert the integer to an index. For an array of 16 elements we can divide by 16 and use the remainder as an index. This is what most implementations of unordered map do. Combining these two hash functions, we will get some code that looks like this:
   int index = hash ( searchKey ) % 16;
Therefore mary's index will be:
 441 % 16 = 9.

Notice that if there were a student named ramy, this hash function would generate exactly the same result as for mary because both names have the same letters. This is only one of the bad behaviours of this particular string hasher.


3. Lab Exercise

Steps include:

3.0 Explore

3.1 Load User Information

  1. Add in a line to insert passwords into the table.
  2. Add in code to print out the contents of the hash table including:

3.2 Authenticate Logins

3.3 Change the Hasher

  1. Add the string hashing class from Section 2.2 to your program

  2. make the passwords unordered map use it instead of the default hasher.

3.4 Explore and Answer

Now it's time to explore the behaviour of the hash table a little


4. Postlab Exercises

For postlab exercises, click here.

CS Dept Home Page
CS Dept Class Files
CS210 Class Files


Copyright: Department of Computer Science, University of Regina.