   file, but you are not obligated to do so.  If you do not wish to do
   so, delete this exception statement from your version.  */
/* With -DSTANDALONE, this file can be compiled outside Wget source
   tree.  To test, also use -DTEST.  */

#ifdef HAVE_CONFIG_H
# include <config.h>
#endif

/* If running without Autoconf, go ahead and assume presence of
   standard C89 headers.  */
#include <string.h>
#include <stdlib.h>
#include <assert.h>

#ifndef STANDALONE
/* Get Wget's utility headers. */
# include "wget.h"
# include "utils.h"
#else
/* Make do without them. */
# define xnew(x) xmalloc (sizeof (x))
# define xnew_array(type, x) xmalloc (sizeof (type) * (x))
# define xmalloc malloc         /* or something that exits
                                   if not enough memory */
# define xfree free
# define countof(x) (sizeof (x) / sizeof ((x)[0]))
# define TOLOWER(x) ('A' <= (x) && (x) <= 'Z' ? (x) - 32 : (x))
#endif
/* INTERFACE:

   Hash tables are a technique used to implement mapping between
   objects with near-constant-time access and storage.  The table
   associates keys to values, and a value can be very quickly
   retrieved by providing the key.  Fast lookup tables are typically
   implemented as hash tables.

   This file defines the following entry points:

     hash_table_new       -- creates the table.
     hash_table_destroy   -- destroys the table.
     hash_table_put       -- establishes or updates key->value mapping.
     hash_table_get       -- retrieves value of key.
     hash_table_get_pair  -- get key/value pair for key.
     hash_table_contains  -- test whether the table contains key.
     hash_table_remove    -- remove the key->value mapping for key.
     hash_table_map       -- iterate through table mappings.
     hash_table_clear     -- clear hash table contents.
     hash_table_count     -- return the number of entries in the table.

   The hash table grows internally as new entries are added and is not
   limited in size, except by available memory.  The table doubles
   with each resize, which ensures that the amortized time per
   operation remains constant.

   By default, tables created by hash_table_new consider the keys to
   be equal if their pointer values are the same.  You can use
   make_string_hash_table to create tables whose keys are considered
   equal if their string contents are the same.  In the general case,
   the criterion of equality used to compare keys is specified at
   table creation time with two callback functions, "hash" and "test".
   The hash function transforms the key into an arbitrary number that
   must be the same for two equal keys.  The test function accepts two
   keys and returns non-zero if they are to be considered equal.

   Note that neither keys nor values are copied when inserted into the
   hash table, so they must exist for the lifetime of the table.  This
   means that e.g. the use of static strings is OK, but objects with a
   shorter life-time need to be copied (with strdup or the like in the
   case of strings) before being inserted.  */
126
113
/* IMPLEMENTATION:
128
All the hash mappings (key-value pairs of pointers) are stored in a
129
contiguous array. The position of each mapping is determined by
130
the hash value of its key and the size of the table: location :=
131
hash(key) % size. If two different keys end up on the same
132
position (hash collision), the one that came second is placed at
133
the next empty position following the occupied place. This
134
collision resolution technique is called "linear probing".
136
There are more advanced collision resolution mechanisms (quadratic
115
The hash table is implemented as an open-addressed table with
116
linear probing collision resolution.
118
The above means that all the hash entries (pairs of pointers, key
119
and value) are stored in a contiguous array. The position of each
120
mapping is determined by the hash value of its key and the size of
121
the table: location := hash(key) % size. If two different keys end
122
up on the same position (collide), the one that came second is
123
placed at the next empty position following the occupied place.
124
This collision resolution technique is called "linear probing".
126
There are more advanced collision resolution methods (quadratic
137
127
probing, double hashing), but we don't use them because they incur
138
more non-sequential access to the array, which results in worse
128
more non-sequential access to the array, which results in worse CPU
139
129
cache behavior. Linear probing works well as long as the
140
fullness/size ratio is kept below 75%. We make sure to regrow or
141
rehash the hash table whenever this threshold is exceeded.
143
Collisions make deletion tricky because finding collisions again
144
relies on new empty spots not being created. That's why
145
hash_table_remove is careful to rehash the mappings that follow the
148
/* When hash table's fullness exceeds this threshold, the hash table
150
#define HASH_FULLNESS_THRESHOLD 0.75
152
/* The hash table size is multiplied by this factor with each resize.
153
This guarantees infrequent resizes. */
130
count/size ratio (fullness) is kept below 75%. We make sure to
131
grow and rehash the table whenever this threshold is exceeded.
133
Collisions make deletion tricky because clearing a position
134
followed by a colliding entry would make the position seem empty
135
and the colliding entry not found. One solution is to leave a
136
"tombstone" instead of clearing the entry, and another is to
137
carefully rehash the entries immediately following the deleted one.
138
We use the latter method because it results in less bookkeeping and
139
faster retrieval at the (slight) expense of deletion. */
141
/* Maximum allowed fullness: when hash table's fullness exceeds this
142
value, the table is resized. */
143
#define HASH_MAX_FULLNESS 0.75
145
/* The hash table size is multiplied by this factor (and then rounded
146
to the next prime) with each resize. This guarantees infrequent
154
148
#define HASH_RESIZE_FACTOR 2
typedef unsigned long (*hashfun_t) PARAMS ((const void *));
typedef int (*testfun_t) PARAMS ((const void *, const void *));

struct hash_table {
  hashfun_t hash_function;
  testfun_t test_function;

  struct mapping *mappings;     /* pointer to the table entries. */
  int size;                     /* size of the array. */

  int count;                    /* number of non-empty entries. */
  int resize_threshold;         /* after count exceeds this number of
                                   entries, resize the table. */
  int prime_offset;             /* the offset of the current prime in
                                   the prime table. */
};
/* We use the all-bits-set constant (INVALID_PTR) to mark a mapping as
   empty.  It is unaligned and therefore illegal as a pointer.
   INVALID_PTR_BYTE (0xff) is the one-byte value used to initialize
   the mappings array as empty.

   The all-bits-set value is a better choice than NULL because it
   allows the use of NULL/0 keys.  Since the keys are either integers
   or pointers, the only key that cannot be used is the integer value
   -1.  This is acceptable because it still allows the use of
   nonnegative integer keys.  */

#define INVALID_PTR ((void *) ~(unsigned long) 0)
#ifndef UCHAR_MAX
# define UCHAR_MAX 0xff
#endif
#define INVALID_PTR_BYTE UCHAR_MAX

#define NON_EMPTY(mp) ((mp)->key != INVALID_PTR)
#define MARK_AS_EMPTY(mp) ((mp)->key = INVALID_PTR)
/* "Next" mapping is the mapping after MP, but wrapping back to
   MAPPINGS when MP would reach MAPPINGS+SIZE.  */
#define NEXT_MAPPING(mp, mappings, size) \
  (mp != mappings + (size - 1) ? mp + 1 : mappings)

#define LOOP_NON_EMPTY(mp, mappings, size) \
  for (; NON_EMPTY (mp); mp = NEXT_MAPPING (mp, mappings, size))

/* #### We might want to multiply with the "golden ratio" here to get
   better randomness for keys that do not result from a good hash
   function.  This is currently not a problem in Wget because we only
   use the string hash tables.  */
/* Return the position of KEY in hash table SIZE large, hash function
   being HASHFUN.  */
#define HASH_POSITION(key, hashfun, size) ((hashfun) (key) % size)

/* Find a prime near, but greater than or equal to SIZE.  The primes
   are not calculated, but looked up from a table with a selection of
   primes convenient for this purpose.

   PRIME_OFFSET is a minor optimization: it specifies the start
   position for the search for a large enough prime.  The final offset
   is stored in the same variable.  That way the list of primes does
   not have to be scanned from the beginning each time around.  */
static int
prime_size (int size, int *prime_offset)
{
  static const int primes[] = {
    13, 19, 29, 41, 59, 79, 107, 149, 197, 263, 347, 457, 599, 787, 1031,
    1361, 1777, 2333, 3037, 3967, 5167, 6719, 8737, 11369, 14783,
    19219, 24989, 32491, 42257, 54941, 71429, 92861, 120721, 156941,
    10445899, 13579681, 17653589, 22949669, 29834603, 38784989,
    50420551, 65546729, 85210757, 110774011, 144006217, 187208107,
    243370577, 316381771, 411296309, 534685237, 695090819, 903618083,
    1174703521, 1527114613, 1837299131, 2147483647
  };
  int i;

  for (i = *prime_offset; i < countof (primes); i++)
    if (primes[i] >= size)
      {
        /* Set the offset to the next prime.  That is safe because,
           next time we are called, we will need a prime larger than
           this one anyway.  */
        *prime_offset = i + 1;
        return primes[i];
      }

  abort ();
  return -1;
}
struct hash_table *
hash_table_new (int items,
                unsigned long (*hash_function) (const void *),
                int (*test_function) (const void *, const void *))
{
  int size;
  struct hash_table *ht = xnew (struct hash_table);

  ht->hash_function = hash_function ? hash_function : hash_pointer;
  ht->test_function = test_function ? test_function : cmp_pointer;

  /* If the size of struct hash_table ever becomes a concern, this
     field can go.  (Wget doesn't create many hashes.)  */
  ht->prime_offset = 0;

  /* Calculate the size that ensures that the table will store at
     least ITEMS keys without the need to resize.  */
  size = 1 + items / HASH_MAX_FULLNESS;
  size = prime_size (size, &ht->prime_offset);
  ht->size = size;
  ht->resize_threshold = size * HASH_MAX_FULLNESS;
  /*assert (ht->resize_threshold >= items);*/

  ht->mappings = xnew_array (struct mapping, ht->size);
  /* Mark mappings as empty.  We use 0xff rather than 0 to mark empty
     keys because it allows us to use NULL/0 as keys.  */
  memset (ht->mappings, INVALID_PTR_BYTE, size * sizeof (struct mapping));

  ht->count = 0;

  return ht;
}
  ht->size = newsize;
  ht->resize_threshold = newsize * HASH_MAX_FULLNESS;

  mappings = xnew_array (struct mapping, newsize);
  memset (mappings, INVALID_PTR_BYTE, newsize * sizeof (struct mapping));
  ht->mappings = mappings;

  for (mp = old_mappings; mp < old_end; mp++)
    if (NON_EMPTY (mp))
      {
        struct mapping *new_mp;
        /* We don't need to test for uniqueness of keys because they
           come from the hash table and are therefore known to be
           unique.  */
        new_mp = mappings + HASH_POSITION (mp->key, hasher, newsize);
        LOOP_NON_EMPTY (new_mp, mappings, newsize)
          ;
        *new_mp = *mp;
      }
  int size = ht->size;
  struct mapping *mappings = ht->mappings;
  hashfun_t hasher = ht->hash_function;

      /* Rehash all the entries following MP.  The alternative
         approach is to mark the entry as deleted, i.e. create a
         "tombstone".  That speeds up removal, but leaves a lot of
         garbage and slows down hash_table_get and hash_table_put.  */

      mp = NEXT_MAPPING (mp, mappings, size);
      LOOP_NON_EMPTY (mp, mappings, size)
        {
          const void *key2 = mp->key;
          struct mapping *mp_new;

          /* Find the new location for the key. */
          mp_new = mappings + HASH_POSITION (key2, hasher, size);
          LOOP_NON_EMPTY (mp_new, mappings, size)
            if (key2 == mp_new->key)
              /* The mapping MP (key2) is already where we want it (in
                 MP_NEW's "chain" of mappings).  */
              goto next_rehash;

          *mp_new = *mp;
          MARK_AS_EMPTY (mp);

        next_rehash:
          ;
        }
   don't strictly belong to this file.  However, this is as good a
   place for them as any.  */
/* Guidelines for creating custom hash and test functions:

   - The test function returns non-zero for keys that are considered
     "equal", zero otherwise.

   - The hash function returns a number that represents the
     "distinctness" of the object.  In more precise terms, it means
     that for any two objects that test "equal" under the test
     function, the hash function MUST produce the same result.

     This does not mean that all different objects must produce
     different values (that would be "perfect" hashing), only that
     non-distinct objects must produce the same values!  For instance,
     a hash function that returns 0 for any given object is a
     perfectly valid (albeit extremely bad) hash function.  A hash
     function that hashes a string by adding up all its characters is
     another example of a valid (but quite bad) hash function.

     It is not hard to make hash and test functions agree about
     equality.  For example, if the test function compares strings
     case-insensitively, the hash function can lower-case the
     characters when calculating the hash value.  That ensures that
     two strings differing only in case will hash the same.

   - If you care about performance, choose a hash function with as
     good "spreading" as possible.  A good hash function will use all
     the bits of the input when calculating the hash, and will react
     to even small changes in input with a completely different
     output.  Finally, don't make the hash function itself overly
     slow, because you'll be incurring a non-negligible overhead to
     all hash table operations.  */
/* Support for hash tables whose keys are strings.  */