Skip to main content

skip to main content

developerWorks  >  Linux  >

GNOMEnclature: The wonders of GLib, Part 2

Hash tables, the iterator, the GScanner, and tokens

developerWorks
Document options

Document options requiring JavaScript are not displayed


Hey there! developerWorks is using Twitter

Follow us


Rate this page

Help us improve this content


Level: Intermediate

George Lebl (jirka@5z.com), developerWorks columnist

01 Jun 2000

In his second installment on GLib, George Lebl gets into a little more detail. He gives us the rundown on hash tables, and walks us through creating a table, inserting and looking up data, and using the iterator through entries. He also provides code and setup examples of how to use tokens and the GScanner.

Given the popularity of my previous GLib article, I've decided to write a new episode in the saga of GLib. Last time we looked at some very basic GLib concepts and went through some basic containers. This time we'll look at the slightly more exotic features. We will examine hash tables and the lexical scanner.

Hash tables

For those that don't know how hash tables work, here's a quick rundown. A hash table is for associating data with a key. You want to be able to find the data again quickly when given the key. It's a very time-consuming task to look through every possible option, so hash tables figure out a number from the key called the hash or the hash number. The hash table is just an array from 0 to the maximum hash number possible. Each entry in the array has a list of entries which have this hash number. So when the hash table wants to look something up, it takes the key, figures out its hash number, and then looks for the entry in the given location in the array. The reason this is so quick is that most of the time the array is larger then the number of elements. And each location in the array is not likely to have more than a few entries. Thus the time spent searching for an element is not dependent on the number of entries you put into the hash table.

When most people think of hash tables, they think of associating a piece of data with a string key and then retrieving the data quickly using that string key. GLib takes this concept further. There is no reason for keys to be only strings. The keys in GHashTable, the GLib type for a hash table, are simple void pointers. The keys can be any sort of data. Each hash table then needs to be able to give a hash number to a key. It also needs to be able to compare two keys to see if they are equal. This means that each hash table stores pointers for two functions: one that gives back an unsigned integer with the hash of the key, and one that compares two keys for equality and returns a boolean value.

You do not have to provide the table size to GHashTable. It is resized to an optimum value when necessary. This means that the hash table is always only as big as it really needs to be.



Back to top


Creating and destroying

OK then, let's start with the code. First you need to create a hash table and specify the two functions described above. GLib defines three types of standard functions. The first type is a double set for strings. This is g_str_hash for the hashing function and g_str_equal for comparing strings. The second type is g_int_hash and g_int_equal for hashes of integers. In this case you will need to use the GINT_TO_POINTER and GPOINER_TO_INT macros described in the previous article, when inserting or looking up things in the hash table. The third type, g_direct_hash and g_direct_equal, is for using actual pointers as keys. In this case the hash number is just the pointer value itself, hence the name of the functions. The function for creating a new hash table is g_hash_table_new. This takes the two functions as its arguments and returns a new and empty hash table. After you are done with the hash table, you call g_hash_table_destroy. This will free and destroy just the hash table itself, without the data you stored in it; you are responsible for that yourself.

  GHashTable *table;
  	table = g_hash_table_new(g_str_hash, g_str_equal);
  
  	/* here we would use the hash table */

  g_hash_table_destroy(table);      



Back to top


Inserting and looking up data

Now comes the somewhat tricky part. Imagine your keys have some dynamically allocated data. Most string hashes will work this way. So not only do you have to allocate and free the data that you store, you also have to allocate and free the key itself. For strings this is a simple g_strdup when you insert a piece of data, and g_free when you remove it or before you destroy the hash table. This is one area where it's easy to go wrong with the GHashTable. First, make sure that you don't g_free the string too early. That is, only free the string if you are just about to remove the entry or destroy the hash table. Second, when inserting, make sure that the entry does not exist yet. If it does, it will be replaced and the old key and value will be leaked. Now that we're through with the warnings, let's see the code. The first call is a call to insert. It takes the hash table, the key, and the value as arguments. Let's imagine we're reading from a tab-separated spreadsheet file and we want to insert the second column as data with the first column as key. We have created our hash table as described, and the variable's name is table.

  FILE *fp;
  char buf[1024];

  /* just a standard way to open files */
  fp = fopen("file.txt", "r");
  if(!fp) exit(1); /* if file does not exist, exit */

  /* read, file line by line */
  while(fgets(buf, sizeof(buf), fp)) {
          char *key, *value;
          /* get the first and the second field */
          key = strtok(buf, "\t");
          if(!key) continue;
          value = strtok(NULL, "\t");
          if(!value) continue;

          /* insert into our table */
          g_hash_table_insert(table, g_strdup(key), g_strdup(value));
  }
  fclose(fp);

This code opens the file and then reads each line into a buffer. It then uses the standard strtok function from the C library to fetch the first and second field by separating the line with tabs. This code violates one of the principles set above. Let's imagine the input file has two lists with exactly the same first field. When the second line gets inserted, the first two pointers will just get lost. There will be no way to find these pointers again and their allocated memory will never be freed. So what we want to do is look up the key first, and only insert the new value if we don't find any value with that key. The g_hash_table_lookup function takes the hash table and a key as its arguments. It returns a pointer either to the value, if it was found, or to NULL, if the key does not exist in the hash table.

  char *old_value;

  /* try looking up this key */
  old_value = g_hash_table_lookup(table, key);
  /* if we didn't find anything, insert the new key and value */
  if(old_value == NULL) {
          g_hash_table_insert(table, g_strdup(key), g_strdup(value));
  }

This is nice, but what if we want the second line to override the first line? If we do find a value by the same key as the line that we have just read, we want to free those old values and insert the new ones. Here we face a problem. The g_hash_table_lookup doesn't retrieve the key that was originally used to insert the value. We need to use another function, namely g_hash_table_lookup_extended. This function is a bit more complicated. It still takes the hash table and the lookup key as the first two arguments. But then it takes two more arguments that are pointers to gpointers. This is because in fact we want to return two pointers: the first is the original key that was used for insertion, and the second is the value. The g_hash_table_lookup_extended function also returns a boolean. It will return TRUE if the lookup was successful or FALSE if the lookup failed. Using this function we can just free the original key and value and then insert the new key and new value.

  char *old_key, *old_value;

  /* try looking up this key */
  if(g_hash_table_lookup_extended(table, key, &old_key, &old_value)) {

          /* insert the new value */
          g_hash_table_insert(table, g_strdup(key), g_strdup(value));

          /* just free the key and value */
          g_free(old_key);
          g_free(old_value);
  } else
          /* insert the new value */
          g_hash_table_insert(table, g_strdup(key), g_strdup(value));

Notice that we need to insert the new value before freeing the old key and value. This is because of the first principle, which we set earlier, not to free the key or the value too soon. The insertion will first try to look up the entry. So if we freed the key and value before inserting, it would try to examine a key which points to a non existent memory location. After the insert, this entry will have been overwritten by the new one and no longer exists, so we can free the old key and value.

A similar problem arises when we want to remove a value from the hash table. The function g_hash_table_remove takes the hash table and a key as arguments, and will remove that entry from the hash table. However, you are still responsible for freeing the data. So here is another use of g_hash_table_lookup_extended.

  /* try looking up this key */
  if(g_hash_table_lookup_extended(table, key, &old_key, &old_value)) {

          /* remove the entry in the hash table */
          g_hash_table_remove(table, key);

          /* just free the key and value */
          g_free(old_key);
          g_free(old_value);
  }

The above issues are only a concern when using dynamically allocated data as a key and/or value. For example, if we were using string constants, we wouldn't ever have to worry about freeing them. Another example would be if you were hashing by integers.



Back to top


Iterating through every entry

While hash tables are primarily designed for very fast lookup of specific data given one key, you sometimes may need to look at and do something with every entry. An example would be freeing every key and value in the hash table before destroying the table itself. The function for this is, similar to the case of linked lists, g_hash_table_foreach. It takes 3 arguments: the hash table, the iterating function to call on each entry, and a user data pointer to pass to the iterating function. Unlike linked lists, the iterating function here takes three pointers. First it takes the key, then the value, and then the user data pointer.

  /* this just frees the key and the value */
  static void
  free_a_hash_table_entry(gpointer key, gpointer value, gpointer user_data)
  {
          g_free(key);
          g_free(value);
  }
  /* somewhere else in the code */
  g_hash_table_foreach(table, free_a_hash_table_entry, NULL);
  g_hash_table_destroy(table);

You may also want to remove specific entries. Imagine wanting to remove every entry where the value starts with the letter 'A'. Doing it with the above function would involve working a bit of magic, because of a restriction on the foreach function. You shouldn't insert or remove any items while inside the iterating function. This could in fact crash your application or have unexpected results. But there is a similar foreach function designed specifically for removing members. This is the g_hash_table_foreach_remove. It behaves very much like the g_hash_table_foreach function, except for two differences. The first difference is that the iterating function should now return a boolean value. If it returns a TRUE, the entry will be deleted. If it returns FALSE, the entry will be kept. The second difference is that the function will now return the number of items deleted, in case you want to keep track of them.

  static gboolean
  remove_keys_with_A(gpointer key, gpointer value, gpointer user_data)
  {
          char *char_value = (char *)value;
          if(char_value[0] == 'A') {
                  g_free(key);
                  g_free(value);
                  return TRUE;
          } else
                  return FALSE;
  }

  /* somewhere else in the code */
  int deleted;

  deleted = g_hash_table_foreach_remove(table, remove_keys_with_A, NULL);

  printf("Deleted %d items!\n", deleted);



Back to top


The rest of GHashTable

That's almost it for hash tables. There are 3 more functions GLib provides. The first two are g_hash_table_freeze and g_hash_table_thaw. These two functions will stop the hash table from resizing after each insertion or removal. This can be useful if, for example, you remove a lot of entries, add a similar number of entries, and then you don't want the hash table to resize down and back up when it doesn't need to. However, if you do this for the initial insertion, it won't help you and can, in fact, hurt performance. But hashes are very fast, so this is not a big problem. The last function is g_hash_table_size, which will just return the number of entries in the hash table.



Back to top


The lexical scanner

We will now look at the lexical scanner that GLib provides. Since the scanner is quite a large and complicated beast, we will only cover the important parts. A lexical scanner takes textual input, scans it for individual "tokens," and hands you only the tokens one by one. A token in this case would be something like a number, a character, a punctuation mark, or an identifier. The classical example of a lexical scanner is lex. While lex is very flexible in some areas, its design makes it difficult to use and unsuitable for many modern programs. GScanner, the glib lexical scanner, provides an alternative. GScanner was not, however, designed to scan absolutely anything. It has many limitations. But for most types of input it will suffice.



Back to top


Setting up GScanner

GScanner uses a large configuration structure that you fill with your desired preferences. We will look at this structure later. For now let's just use the default configuration of GScanner. To create a new scanner, we use g_scanner_new and pass to it a pointer to our configuration structure. We can also pass it NULL and it will use the default configuration.

We then need to tell GScanner where it should read input from. We can either give it a standard Unix file descriptor, or we can give it a string to read from. This is done with the g_scanner_input_file or g_scanner_input_text. The g_scanner_input_file function takes an extra argument of the Unix file descriptor to read. The g_scanner_input_text takes two extra arguments: one is the pointer to the text buffer, and the other is the length of the text buffer. After we're done with scanning, we need to use g_scanner_destroy to destroy the scanner and free all its data.

  GScanner *scanner;
  int filefd;

  filefd = open("file.txt", O_RDONLY);
  if(filefd < 0) exit(1); /* we couldn't open the file */

  /* create a new scanner */
  scanner = g_scanner_new(NULL);

  /* set up the scanner to read from the file */
  g_scanner_input_file(scanner, filefd);

  /* Here we would do our scanning */
  ...

  /* destroy the scanner */
  g_scanner_destroy(scanner);



Back to top


Tokens

What GScanner will give you as tokens is actually an enumerated value of type GTokenType, which is basically just an integer. The types of tokens that you can get is given by the configuration of GScanner. By default, GScanner will return all separate ASCII characters as themselves, so you can use the returned token directly as a character if it is less than 256. There are some other tokens, however. The obvious one is G_TOKEN_EOF. This is given when the end of input is reached. On the default configuration you can also get: G_TOKEN_INT, which means that an integer was found and read on the input, G_TOKEN_FLOAT, which means a floating point value was read on the input, G_TOKEN_STRING, which means a string constant was read on the input (a string constant is any number of characters enclosed in quotes), G_TOKEN_SYMBOL, which means a symbol which you defined was read on the input, G_TOKEN_IDENTIFIER, which means that an identifier was read on the input, or, finally, G_TOKEN_ERROR, which means that an error has occurred while scanning the input.

These last few tokens also have an associated value with them. When GScanner finds one of these values, you can also get the GTokenValue type. GTokenValue is a union of different types depending on the token. The name of the element in this union is a lowercase name of the token with 'v_' prepended.

  typedef union _GTokenValue GTokenValue;
  union _GTokenValue
  {
    gpointer	v_symbol;
    gchar      *v_identifier;
    gulong      v_binary;
    gulong      v_octal;
    gulong      v_int;
    gdouble     v_float;
    gulong      v_hex;
    gchar      *v_string;
    gchar      *v_comment;
    guchar      v_char;
    guint       v_error;
  };

If you are wondering why all the integer values are unsigned long, the answer is simple. GScanner does not look at the number sign when scanning numbers. So if the input contains -32 you will get two tokens: one for the minus sign, and one for the integer 32.



Back to top


Getting tokens and values

Now we need to figure out how to get the token from the scanner. The basic function to use is g_scanner_get_next_token. This function returns the next token from the input and advances the scanner. This means that if you call the g_scanner_get_next_token in a loop, you will get all the tokens in order one by one. Sometimes you may want to look at the next token without advancing to it. The g_scanner_peek_next_token does just that. Depending on the context in your parser, this function may be useful to see what's next. Once you have your token, if you need to get the GTokenValue, you can call g_scanner_cur_value.

  /* while the next token is something else other than end of file */
  while(g_scanner_peek_next_token(scanner) != G_TOKEN_EOF) {

          /* get the next token */
          GTokenType token = g_scanner_get_next_token(scanner);
          if(token == G_TOKEN_IDENTIFIER) {
                /* if we have an identifier, print it out */
                GTokenValue value = g_scanner_cur_value(scanner);
                printf("%s\n", value.v_identifier);
          }
  }

You can also get the current token again with g_scanner_cur_token. If you also need the current line or the position on the line in the input, you can use g_scanner_cur_line or g_scanner_cur_position functions. These two functions return the desired integer. If you want to know if the scanner has already hit the end of the line, you can check with g_scanner_eof, which returns TRUE if that is the case.

I think this is enough to get you started. One other thing I will mention is that symbols are basically special identifiers that give you a void pointer instead of the string. You define each of the symbols and the void pointers that GScanner should give you. You can also have GScanner convert this pointer to integer and return it as a token. This way you can add your own tokens to GScanner. GScanner also has a setup for handling parsing errors, which I won't cover. We do not have space to go into these issues here. Maybe next time!



Back to top


The setup

The last thing about GScanner is its setup structure. This is quite a large structure and I will go through it field by field.

  typedef struct _GScannerConfig GScannerConfig;
  struct _GScannerConfig
  {
    /* Character sets
     */
    gchar         *cset_skip_characters;          /* default: " \t\n" */
    gchar         *cset_identifier_first;
    gchar         *cset_identifier_nth;
    gchar         *cpair_comment_single;          /* default: "#\n" */
    
    /* Should symbol lookup work case sensitive?
     */
    guint         case_sensitive : 1;
    
    /* Boolean values to be adjusted "on the fly"
     * to configure scanning behavior.
     */
    guint         skip_comment_multi : 1;         /* C like comment */
    guint         skip_comment_single : 1;        /* single line comment */
    guint         scan_comment_multi : 1;         /* scan multi line comments? */
    guint         scan_identifier : 1;
    guint         scan_identifier_1char : 1;
    guint         scan_identifier_NULL : 1;
    guint         scan_symbols : 1;
    guint         scan_binary : 1;
    guint         scan_octal : 1;
    guint         scan_float : 1;
    guint         scan_hex : 1;                   /* `0x0ff0' */
    guint         scan_hex_dollar : 1;            /* `$0ff0' */
    guint         scan_string_sq : 1;             /* string: 'anything' */
    guint         scan_string_dq : 1;             /* string: "\\-escapes!\n" */
    guint         numbers_2_int : 1;              /* bin, octal, hex => int */
    guint         int_2_float : 1;                /* int => G_TOKEN_FLOAT? */
    guint         identifier_2_string : 1;
    guint         char_2_token : 1;               /* return G_TOKEN_CHAR? */
    guint         symbol_2_token : 1;
    guint         scope_0_fallback : 1;           /* try scope 0 on lookups? */
  };

A couple of the string constants have been predefined to save you some typing. First we have G_CSET_A_2_Z and G_CSET_a_2_z, which are the standard ASCII characters, upper case and lower case respectively. Then there are G_CSET_LATINC and G_CSET_LATINS, which are the Latin 1 characters, upper and lower case respectively. Since these are just character constants, you can put them one after another.

cset_skip_characters
   This is a string with a set of characters which are skipped completely,
   default is " \t\n".

cset_identifier_first
   The first character of an identifier. The identifier must start with
   one of these characters.  By default this is: 
     G_CSET_a_2_z
     "_"
     G_CSET_A_2_Z

cset_identifier_nth
   Any other character of the identifier.  For a C identifier,
   this would include the numbers as well.  By default it is:
     G_CSET_a_2_z
     "_0123456789"
     G_CSET_A_2_Z
     G_CSET_LATINS
     G_CSET_LATINC

cpair_comment_single
   A two character string describing a comment.  The comment begins with the
   first character and ends with the second character.  If either one is NULL, the
   single style comments are ignored.  By default this is the standard hash
   mark until end of line comment style and thus it is "#\n".
    
case_sensitive
   When looking up symbols, should we be case sensitive?  By default this
   is FALSE.
    
skip_comment_multi
   If to skip C style comments.  Unfortunately the style of the comments is
   hardcoded in.  The default is TRUE.

skip_comment_single
   Skip the single style comments of the style set above.  The default is
   TRUE.

scan_comment_multi
   Should we scan for C style comments?  If you are not skipping C
   style comments then you will get a token of G_TOKEN_COMMENT_MULTI.  The
   default is TRUE.

scan_identifier
   Scan for identifiers.  If this is FALSE, then you will get the separate
   characters only.  If it is true, you will get the G_TOKEN_IDENTIFIER.
   The default is TRUE.

scan_identifier_1char
   Should we allow single character identifiers? If this is false, you will get
   the character itself.  Default is FALSE.

scan_identifier_NULL
   Scan for 'NULL' as an identifier.  If this is TRUE, you will get
   G_TOKEN_IDENTIFIER_NULL.  Default is FALSE.

scan_symbols
   Try to convert identifiers into one of the defined symbols.  The default
   is TRUE.

scan_binary
   Scan for binary style integers.  If it is TRUE, you will get the
   G_TOKEN_BINARY token.  The default is FALSE.

scan_octal
   Scan for octal style integers.  If it is TRUE, you will get the
   G_TOKEN_OCTAL token.  The default is TRUE.

scan_float
   Scan for floating point numbers.  If it is TRUE, you will get the
   G_TOKEN_FLOAT token.  The default is TRUE.

scan_hex
   Scan for hex style integers which begin with '0x'.  If it is TRUE,
   you will get the G_TOKEN_HEX token.  The default is TRUE.

scan_hex_dollar
   Scan for hex style integers which begin with '$'.  If it is TRUE,
   you will get the G_TOKEN_HEX token.  The default is FALSE.

scan_string_sq
   Scan for strings in single quotes.  Every character inside the quotes
   will be added to the string except for another single quote.  This
   will get you the G_TOKEN_STRING token.  Default is TRUE.

scan_string_dq
   Scan for string in double quotes.  This will interpret and replace
   the standard C backslash requests.  This will get you the G_TOKEN_STRING
   token.  Default is TRUE.

numbers_2_int
   Instead of returning all the integer types as different tokens, always
   return just G_TOKEN_INT.  Default is TRUE.

int_2_float
   Instead of returning G_TOKEN_INT, always convert to float and return
   G_TOKEN_FLOAT.  Default is FALSE.

identifier_2_string
   Return all identifiers as G_TOKEN_STRING.  Default is FALSE.

char_2_token
   If TRUE, this will return characters in the token itself.  Otherwise it will return
   G_TOKEN_CHAR, and you will have to get the character from GTokenValue.
   Default is TRUE.

symbol_2_token
   Convert the pointer associated with a symbol to an integer and return
   it as a token.  Otherwise you will have to get the pointer from
   GTokenValue.  Default is FALSE.

scope_0_fallback
   If TRUE, the scope number 0 will be tried unless a symbol can be found
   in the current scope.  Default is FALSE.



Back to top


Conclusion

I think this article should give you more insight into how some of the GLib features work. For the next article I will pick a different topic, so that people who aren't so fond of GLib will not get bored. We can return to GLib later, if people display further interest.



Resources



About the author

George Lebl

George (Jiri or Jirka in Czech) Lebl was born in Prague, Czechoslovakia, and now lives in San Diego, California, where he is trying to finish his degree. After several years using non-UNIX operating systems, he started using UNIX and became a UNIX bigot about four years ago, and a Free Software bigot about two years ago. He joined the GNOME project in the fall of 1997, and finally became a C bigot as well. And, most importantly, he's a VI user. He can be reached at jirka@5z.com.




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top