Synopsis: Improve SSM hashing speed by a factor 15. April 97
From erwin@pip.dknet.dk Tue May 27 15:48:30 1997
Date: Fri, 18 Apr 1997 17:16:37 +0200 (MET DST)
From: "Erwin S. Andreasen" <erwin@pip.dknet.dk>
To: MERC/Envy Mailing List <merc-l@webnexus.com>
Subject: SSM: More accurate hash function

Included is a more accurate hash function for the temp_hash_{find,add}
functions. It is based off a hash function that supposedly comes from
perl.

static inline unsigned get_string_hash( register const char *key, int len)
{
    register int i = UMIN(len,32);
    register unsigned hash = 0;

    while(i--)
        hash = hash * 33U + *key++;

    return hash % MAX_KEY_HASH;
}

The hash function calculates the hash based on the first 32 characters; I
found that setting that value to 8 characters caused a 50% increase in
number of strcmps in temp_hash_find; setting it beyond 32 did not decrease
it. 

The hash function takes the string plus its length as parameter: 
fread_string can figure out the length of the string so there is no reason
to recalculate it. 

The new temp_hash_find will look like this:

char *temp_hash_find( const char *str, int len )
{
    TempHash *ptr;
    int ihash;

    if( !fBootDb || !*str )
        return 0;

    ihash = get_string_hash(str,len);

    for( ptr = temp_string_hash[ ihash ]; ptr; ptr = ptr->next )
    {
        if( *ptr->str != *str )
            continue;
        else if( strcmp( ptr->str, str ) )
            continue;
        else return ptr->str;
    }

    return 0;
}

void temp_hash_add( char *str, int len )
{
    int ihash;
    TempHash *add;

    if( !fBootDb || !*str || ( str <= string_space && str >= top_string ))
        return;

    ihash = get_string_hash(str,len);

    add = (TempHash *)malloc( sizeof( TempHash ) );
    add->next = temp_string_hash[ ihash ];
    temp_string_hash[ ihash ] = add;
    add->str = str;
}


You must then modify fread_string to pass the length over to temp_hash
functions:

if( fBootDb )
{ 
	int len = ptr-buf;
	
	ptr = temp_hash_find( buf, len ); 
	if( ptr )
		return str_dup (ptr);

	ptr = str_dup( buf );
	temp_hash_add( ptr, len );
	return ptr;
}


Also, free_string must be changed too, to use the new hash algorithm
rather than string length when freeing strings at bootup, if you need
that.

A field called "len" in TempHash structure is also now obsolete.

Using this new hash method gives a significant improvement. Some profiling
results:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name


Old method:

 51.55      6.82     6.82   131335     0.05     0.05  temp_hash_find

New method (this was compiled without optimizations to get results
comparable to the old method, so get_string_hash was not inlined)

  4.82      3.21     0.30    63012     0.00     0.01  temp_hash_find
  4.02      3.46     0.25   106934     0.00     0.00  get_string_hash
[...]
  0.32      5.66     0.02    43922     0.00     0.00  temp_hash_add


Number of calls was lowered becuase the old mread_string I use (a
replacement for fread_string, using mmap) used to call temp_hash_find evne
on empty strings. This should not significantly change the results, since
temp_hash_find would immediately return when encountering an empty string.

So, 6.82 seconds the old way, then .30 (temp_hash_find) + roughly .14
(get_string_hash calls made from temp_hash_find) = .44 seconds - about 15
times speed up.

With -O2 enabled (only on ssm.c) the results are:

  6.31      3.50     0.39    63012     0.01     0.01  temp_hash_find

It's higher than before, but get_string_hash now is inlined, so that's .39
total (vs about .44 before).

The testing was done when loading about 7 megabytes area/help files; total
string space used after loading was 5696452 bytes. Machine is a Linux
2.0.30 system with 2 p166.