[2162] | 1 | <HTML>
|
---|
| 2 |
|
---|
| 3 | <HEAD>
|
---|
| 4 | <title>Implementation notes: sparse_hash, dense_hash, sparsetable</title>
|
---|
| 5 | </HEAD>
|
---|
| 6 |
|
---|
| 7 | <BODY>
|
---|
| 8 |
|
---|
| 9 | <h1>Implementation of sparse_hash_map, dense_hash_map, and
|
---|
| 10 | sparsetable</h1>
|
---|
| 11 |
|
---|
| 12 | This document contains a few notes on how the data structures in this
|
---|
| 13 | package are implemented. This discussion refers at several points to
|
---|
| 14 | the classic text in this area: Knuth, <i>The Art of Computer
|
---|
| 15 | Programming</i>, Vol 3, Hashing.
|
---|
| 16 |
|
---|
| 17 |
|
---|
| 18 | <hr>
|
---|
| 19 | <h2><tt>sparsetable</tt></h2>
|
---|
| 20 |
|
---|
| 21 | <p>For specificity, consider the declaration </p>
|
---|
| 22 |
|
---|
| 23 | <pre>
|
---|
| 24 | sparsetable<Foo> t(100); // a sparse array with 100 elements
|
---|
| 25 | </pre>
|
---|
| 26 |
|
---|
| 27 | <p>A sparsetable is a random container that implements a sparse array,
|
---|
| 28 | that is, an array that uses very little memory to store unassigned
|
---|
| 29 | indices (in this case, between 1-2 bits per unassigned index). For
|
---|
| 30 | instance, if you allocate an array of size 5 and assign a[2] = [big
|
---|
| 31 | struct], then a[2] will take up a lot of memory but a[0], a[1], a[3],
|
---|
| 32 | and a[4] will not. Array elements that have a value are called
|
---|
| 33 | "assigned". Array elements that have no value yet, or have had their
|
---|
| 34 | value cleared using erase() or clear(), are called "unassigned".
|
---|
| 35 | For assigned elements, lookups return the assigned value; for
|
---|
| 36 | unassigned elements, they return the default value, which for t is
|
---|
| 37 | Foo().</p>
|
---|
| 38 |
|
---|
| 39 | <p>sparsetable is implemented as an array of "groups". Each group is
|
---|
| 40 | responsible for M array indices. The first group knows about
|
---|
| 41 | t[0]..t[M-1], the second about t[M]..t[2M-1], and so forth. (M is 48
|
---|
| 42 | by default.) At construct time, t creates an array of (99/M + 1)
|
---|
| 43 | groups. From this point on, all operations -- insert, delete, lookup
|
---|
| 44 | -- are passed to the appropriate group. In particular, any operation
|
---|
| 45 | on t[i] is actually performed on (t.group[i / M])[i % M].</p>
|
---|
| 46 |
|
---|
| 47 | <p>Each group contains of a vector, which holds assigned values, and a
|
---|
| 48 | bitmap of size M, which indicates which indices are assigned. A
|
---|
| 49 | lookup works as follows: the group is asked to look up index i, where
|
---|
| 50 | i < M. The group looks at bitmap[i]. If it's 0, the lookup fails.
|
---|
| 51 | If it's 1, then the group has to find the appropriate value in the
|
---|
| 52 | vector.</p>
|
---|
| 53 |
|
---|
| 54 | <h3><tt>find()</tt></h3>
|
---|
| 55 |
|
---|
| 56 | <p>Finding the appropriate vector element is the most expensive part of
|
---|
| 57 | the lookup. The code counts all bitmap entries <= i that are set to
|
---|
| 58 | 1. (There's at least 1 of them, since bitmap[i] is 1.) Suppose there
|
---|
| 59 | are 4 such entries. Then the right value to return is the 4th element
|
---|
| 60 | of the vector: vector[3]. This takes time O(M), which is a constant
|
---|
| 61 | since M is a constant.</p>
|
---|
| 62 |
|
---|
| 63 | <h3><tt>insert()</tt></h3>
|
---|
| 64 |
|
---|
| 65 | <p>Insert starts with a lookup. If the lookup succeeds, the code merely
|
---|
| 66 | replaces vector[3] with the new value. If the lookup fails, then the
|
---|
| 67 | code must insert a new entry into the middle of the vector. Again, to
|
---|
| 68 | insert at position i, the code must count all the bitmap entries <= i
|
---|
| 69 | that are set to i. This indicates the position to insert into the
|
---|
| 70 | vector. All vector entries above that position must be moved to make
|
---|
| 71 | room for the new entry. This takes time, but still constant time
|
---|
| 72 | since the vector has size at most M.</p>
|
---|
| 73 |
|
---|
| 74 | <p>(Inserts could be made faster by using a list instead of a vector to
|
---|
| 75 | hold group values, but this would use much more memory, since each
|
---|
| 76 | list element requires a full pointer of overhead.)</p>
|
---|
| 77 |
|
---|
| 78 | <p>The only metadata that needs to be updated, after the actual value is
|
---|
| 79 | inserted, is to set bitmap[i] to 1. No other counts must be
|
---|
| 80 | maintained.</p>
|
---|
| 81 |
|
---|
| 82 | <h3><tt>delete()</tt></h3>
|
---|
| 83 |
|
---|
| 84 | <p>Deletes are similar to inserts. They start with a lookup. If it
|
---|
| 85 | fails, the delete is a noop. Otherwise, the appropriate entry is
|
---|
| 86 | removed from the vector, all the vector elements above it are moved
|
---|
| 87 | down one, and bitmap[i] is set to 0.</p>
|
---|
| 88 |
|
---|
| 89 | <p>Currently, the code uses memmove() to move vector elements around when
|
---|
| 90 | making space for inserts and deletes. This is why the value stored in
|
---|
| 91 | a sparsetable must be Plain Old Data. This requirement is easy to
|
---|
| 92 | remove, though it may come at a speed cost.</p>
|
---|
| 93 |
|
---|
| 94 | <h3>iterators</h3>
|
---|
| 95 |
|
---|
| 96 | <p>Sparsetable iterators pose a special burden. They must iterate over
|
---|
| 97 | unassigned array values, but the act of iterating should not cause an
|
---|
| 98 | assignment to happen -- otherwise, iterating over a sparsetable would
|
---|
| 99 | cause it to take up much more room. For const iterators, the matter
|
---|
| 100 | is simple: the iterator is merely programmed to return the default
|
---|
| 101 | value -- Foo() -- when dereferenced while pointing to an unassigned
|
---|
| 102 | entry.</p>
|
---|
| 103 |
|
---|
| 104 | <p>For non-const iterators, such simple techniques fail. Instead,
|
---|
| 105 | dereferencing a sparsetable_iterator returns an opaque object that
|
---|
| 106 | acts like a Foo in almost all situations, but isn't actually a Foo.
|
---|
| 107 | (It does this by defining operator=(), operator value_type(), and,
|
---|
| 108 | most sneakily, operator&().) This works in almost all cases. If it
|
---|
| 109 | doesn't, an explicit cast to value_type will solve the problem:</p>
|
---|
| 110 |
|
---|
| 111 | <pre>
|
---|
| 112 | printf("%d", static_cast<Foo>(*t.find(0)));
|
---|
| 113 | </pre>
|
---|
| 114 |
|
---|
| 115 | <p>To avoid such problems, consider using get() and set() instead of an
|
---|
| 116 | iterator:</p>
|
---|
| 117 |
|
---|
| 118 | <pre>
|
---|
| 119 | for (int i = 0; i < t.size(); ++i)
|
---|
| 120 | if (t.get(i) == ...) t.set(i, ...);
|
---|
| 121 | </pre>
|
---|
| 122 |
|
---|
| 123 | <p>Sparsetable also has a special class of iterator, besides normal and
|
---|
| 124 | const: nonempty_iterator. This only iterates over array values that
|
---|
| 125 | are assigned. This is particularly fast given the sparsetable
|
---|
| 126 | implementation, since it can ignore the bitmaps entirely and just
|
---|
| 127 | iterate over the various group vectors.</p>
|
---|
| 128 |
|
---|
| 129 | <h3>Resource use</h3>
|
---|
| 130 |
|
---|
| 131 | <p>The space overhead for an sparsetable of size N is N + 48N/M bits.
|
---|
| 132 | For the default value of M, this is exactly 2 bits per array entry.
|
---|
| 133 | A larger M would use less overhead -- approaching 1 bit per array
|
---|
| 134 | entry -- but take longer for inserts, deletes, and lookups. A smaller
|
---|
| 135 | M would use more overhead but make operations somewhat faster.</p>
|
---|
| 136 |
|
---|
| 137 | <p>You can also look at some specific <A
|
---|
| 138 | HREF="performance.html">performance numbers</A>.</p>
|
---|
| 139 |
|
---|
| 140 |
|
---|
| 141 | <hr>
|
---|
| 142 | <h2><tt>sparse_hash_set</tt></h2>
|
---|
| 143 |
|
---|
| 144 | <p>For specificity, consider the declaration </p>
|
---|
| 145 |
|
---|
| 146 | <pre>
|
---|
| 147 | sparse_hash_set<Foo> t; // Foo is a Plain Old Data type.
|
---|
| 148 | </pre>
|
---|
| 149 |
|
---|
| 150 | <p>sparse_hash_set is a hashtable. For more information on hashtables,
|
---|
| 151 | see Knuth. Hashtables are basically arrays with complicated logic on
|
---|
| 152 | top of them. sparse_hash_set uses a sparsetable to implement the
|
---|
| 153 | underlying array.</p>
|
---|
| 154 |
|
---|
| 155 | <p>In particular, sparse_hash_set stores its data in a sparsetable using
|
---|
| 156 | quadratic internal probing (see Knuth). Many hashtable
|
---|
| 157 | implementations use external probing, so each table element is
|
---|
| 158 | actually a pointer chain, holding many hashtable values.
|
---|
| 159 | sparse_hash_set, on the other hand, always stores at most one value in
|
---|
| 160 | each table location. If the hashtable wants to store a second value
|
---|
| 161 | at a given table location, it can't; it's forced to look somewhere
|
---|
| 162 | else.</p>
|
---|
| 163 |
|
---|
| 164 | <h3><tt>insert()</tt></h3>
|
---|
| 165 |
|
---|
| 166 | <p>As a specific example, suppose t is a new sparse_hash_set. It then
|
---|
| 167 | holds a sparsetable of size 32. The code for t.insert(foo) works as
|
---|
| 168 | follows:</p>
|
---|
| 169 |
|
---|
| 170 | <p>
|
---|
| 171 | 1) Call hash<Foo>(foo) to convert foo into an integer i. (hash<Foo> is
|
---|
| 172 | the default hash function; you can specify a different one in the
|
---|
| 173 | template arguments.)
|
---|
| 174 |
|
---|
| 175 | </p><p>
|
---|
| 176 | 2a) Look at t.sparsetable[i % 32]. If it's unassigned, assign it to
|
---|
| 177 | foo. foo is now in the hashtable.
|
---|
| 178 |
|
---|
| 179 | </p><p>
|
---|
| 180 | 2b) If t.sparsetable[i % 32] is assigned, and its value is foo, then
|
---|
| 181 | do nothing: foo was already in t and the insert is a noop.
|
---|
| 182 |
|
---|
| 183 | </p><p>
|
---|
| 184 | 2c) If t.sparsetable[i % 32] is assigned, but to a value other than
|
---|
| 185 | foo, look at t.sparsetable[(i+1) % 32]. If that also fails, try
|
---|
| 186 | t.sparsetable[(i+3) % 32], then t.sparsetable[(i+6) % 32]. In
|
---|
| 187 | general, keep trying the next triangular number.
|
---|
| 188 |
|
---|
| 189 | </p><p>
|
---|
| 190 | 3) If the table is now "too full" -- say, 25 of the 32 table entries
|
---|
| 191 | are now assigned -- grow the table by creating a new sparsetable
|
---|
| 192 | that's twice as big, and rehashing every single element from the
|
---|
| 193 | old table into the new one. This keeps the table from ever filling
|
---|
| 194 | up.
|
---|
| 195 |
|
---|
| 196 | </p><p>
|
---|
| 197 | 4) If the table is now "too empty" -- say, only 3 of the 32 table
|
---|
| 198 | entries are now assigned -- shrink the table by creating a new
|
---|
| 199 | sparsetable that's half as big, and rehashing every element as in
|
---|
| 200 | the growing case. This keeps the table overhead proportional to
|
---|
| 201 | the number of elements in the table.
|
---|
| 202 | </p>
|
---|
| 203 |
|
---|
| 204 | <p>Instead of using triangular numbers as offsets, one could just use
|
---|
| 205 | regular integers: try i, then i+1, then i+2, then i+3. This has bad
|
---|
| 206 | 'clumping' behavior, as explored in Knuth. Quadratic probing, using
|
---|
| 207 | the triangular numbers, avoids the clumping while keeping cache
|
---|
| 208 | coherency in the common case. As long as the table size is a power of
|
---|
| 209 | 2, the quadratic-probing method described above will explore every
|
---|
| 210 | table element if necessary, to find a good place to insert.</p>
|
---|
| 211 |
|
---|
| 212 | <p>(As a side note, using a table size that's a power of two has several
|
---|
| 213 | advantages, including the speed of calculating (i % table_size). On
|
---|
| 214 | the other hand, power-of-two tables are not very forgiving of a poor
|
---|
| 215 | hash function. Make sure your hash function is a good one! There are
|
---|
| 216 | plenty of dos and don'ts on the web (and in Knuth), for writing hash
|
---|
| 217 | functions.)</p>
|
---|
| 218 |
|
---|
| 219 | <p>The "too full" value, also called the "maximum occupancy", determines
|
---|
| 220 | a time-space tradeoff: in general, the higher it is, the less space is
|
---|
| 221 | wasted but the more probes must be performed for each insert.
|
---|
| 222 | sparse_hash_set uses a high maximum occupancy, since space is more
|
---|
| 223 | important than speed for this data structure.</p>
|
---|
| 224 |
|
---|
| 225 | <p>The "too empty" value is not necessary for performance but helps with
|
---|
| 226 | space use. It's rare for hashtable implementations to check this
|
---|
| 227 | value at insert() time -- after all, how will inserting cause a
|
---|
| 228 | hashtable to get too small? However, the sparse_hash_set
|
---|
| 229 | implementation never resizes on erase(); it's nice to have an erase()
|
---|
| 230 | that does not invalidate iterators. Thus, the first insert() after a
|
---|
| 231 | long string of erase()s could well trigger a hashtable shrink.</p>
|
---|
| 232 |
|
---|
| 233 | <h3><tt>find()</tt></h3>
|
---|
| 234 |
|
---|
| 235 | <p>find() works similarly to insert. The only difference is in step
|
---|
| 236 | (2a): if the value is unassigned, then the lookup fails immediately.</p>
|
---|
| 237 |
|
---|
| 238 | <h3><tt>delete()</tt></h3>
|
---|
| 239 |
|
---|
| 240 | <p>delete() is tricky in an internal-probing scheme. The obvious
|
---|
| 241 | implementation of just "unassigning" the relevant table entry doesn't
|
---|
| 242 | work. Consider the following scenario:</p>
|
---|
| 243 |
|
---|
| 244 | <pre>
|
---|
| 245 | t.insert(foo1); // foo1 hashes to 4, is put in table[4]
|
---|
| 246 | t.insert(foo2); // foo2 hashes to 4, is put in table[5]
|
---|
| 247 | t.erase(foo1); // table[4] is now 'unassigned'
|
---|
| 248 | t.lookup(foo2); // fails since table[hash(foo2)] is unassigned
|
---|
| 249 | </pre>
|
---|
| 250 |
|
---|
| 251 | <p>To avoid these failure situations, delete(foo1) is actually
|
---|
| 252 | implemented by replacing foo1 by a special 'delete' value in the
|
---|
| 253 | hashtable. This 'delete' value causes the table entry to be
|
---|
| 254 | considered unassigned for the purposes of insertion -- if foo3 hashes
|
---|
| 255 | to 4 as well, it can go into table[4] no problem -- but assigned for
|
---|
| 256 | the purposes of lookup.</p>
|
---|
| 257 |
|
---|
| 258 | <p>What is this special 'delete' value? The delete value has to be an
|
---|
| 259 | element of type Foo, since the table can't hold anything else. It
|
---|
| 260 | obviously must be an element the client would never want to insert on
|
---|
| 261 | its own, or else the code couldn't distinguish deleted entries from
|
---|
| 262 | 'real' entries with the same value. There's no way to determine a
|
---|
| 263 | good value automatically. The client has to specify it explicitly.
|
---|
| 264 | This is what the set_deleted_key() method does.</p>
|
---|
| 265 |
|
---|
| 266 | <p>Note that set_deleted_key() is only necessary if the client actually
|
---|
| 267 | wants to call t.erase(). For insert-only hash-sets, set_deleted_key()
|
---|
| 268 | is unnecessary.</p>
|
---|
| 269 |
|
---|
| 270 | <p>When copying the hashtable, either to grow it or shrink it, the
|
---|
| 271 | special 'delete' values are <b>not</b> copied into the new table. The
|
---|
| 272 | copy-time rehash makes them unnecessary.</p>
|
---|
| 273 |
|
---|
| 274 | <h3>Resource use</h3>
|
---|
| 275 |
|
---|
| 276 | <p>The data is stored in a sparsetable, so space use is the same as
|
---|
| 277 | for sparsetable. Time use is also determined in large part by the
|
---|
| 278 | sparsetable implementation. However, there is also an extra probing
|
---|
| 279 | cost in hashtables, which depends in large part on the "too full"
|
---|
| 280 | value. It should be rare to need more than 4-5 probes per lookup, and
|
---|
| 281 | usually significantly less will suffice.</p>
|
---|
| 282 |
|
---|
| 283 | <p>A note on growing and shrinking the hashtable: all hashtable
|
---|
| 284 | implementations use the most memory when growing a hashtable, since
|
---|
| 285 | they must have room for both the old table and the new table at the
|
---|
| 286 | same time. sparse_hash_set is careful to delete entries from the old
|
---|
| 287 | hashtable as soon as they're copied into the new one, to minimize this
|
---|
| 288 | space overhead. (It does this efficiently by using its knowledge of
|
---|
| 289 | the sparsetable class and copying one sparsetable group at a time.)</p>
|
---|
| 290 |
|
---|
| 291 | <p>You can also look at some specific <A
|
---|
| 292 | HREF="performance.html">performance numbers</A>.</p>
|
---|
| 293 |
|
---|
| 294 |
|
---|
| 295 | <hr>
|
---|
| 296 | <h2><tt>sparse_hash_map</tt></h2>
|
---|
| 297 |
|
---|
| 298 | <p>sparse_hash_map is implemented identically to sparse_hash_set. The
|
---|
| 299 | only difference is instead of storing just Foo in each table entry,
|
---|
| 300 | the data structure stores pair<Foo, Value>.</p>
|
---|
| 301 |
|
---|
| 302 |
|
---|
| 303 | <hr>
|
---|
| 304 | <h2><tt>dense_hash_set</tt></h2>
|
---|
| 305 |
|
---|
| 306 | <p>The hashtable aspects of dense_hash_set are identical to
|
---|
| 307 | sparse_hash_set: it uses quadratic internal probing, and resizes
|
---|
| 308 | hashtables in exactly the same way. The difference is in the
|
---|
| 309 | underlying array: instead of using a sparsetable, dense_hash_set uses
|
---|
| 310 | a C array. This means much more space is used, especially if Foo is
|
---|
| 311 | big. However, it makes all operations faster, since sparsetable has
|
---|
| 312 | memory management overhead that C arrays do not.</p>
|
---|
| 313 |
|
---|
| 314 | <p>The use of C arrays instead of sparsetables points to one immediate
|
---|
| 315 | complication dense_hash_set has that sparse_hash_set does not: the
|
---|
| 316 | need to distinguish assigned from unassigned entries. In a
|
---|
| 317 | sparsetable, this is accomplished by a bitmap. dense_hash_set, on the
|
---|
| 318 | other hand, uses a dedicated value to specify unassigned entries.
|
---|
| 319 | Thus, dense_hash_set has two special values: one to indicate deleted
|
---|
| 320 | table entries, and one to indicated unassigned table entries. At
|
---|
| 321 | construct time, all table entries are initialized to 'unassigned'.</p>
|
---|
| 322 |
|
---|
| 323 | <p>dense_hash_set provides the method set_empty_key() to indicate the
|
---|
| 324 | value that should be used for unassigned entries. Like
|
---|
| 325 | set_deleted_key(), set_empty_key() requires a value that will not be
|
---|
| 326 | used by the client for any legitimate purpose. Unlike
|
---|
| 327 | set_deleted_key(), set_empty_key() is always required, no matter what
|
---|
| 328 | hashtable operations the client wishes to perform.</p>
|
---|
| 329 |
|
---|
| 330 | <p>Since this implementation uses C arrays rather than an equivalent C++
|
---|
| 331 | data structure like vectors, data in it must have a Plain Old Data
|
---|
| 332 | type. This restriction is probably easy to remove -- using a vector
|
---|
| 333 | instead of an array would probably work fine -- but it has not been
|
---|
| 334 | done.</p>
|
---|
| 335 |
|
---|
| 336 | <h3>Resource use</h3>
|
---|
| 337 |
|
---|
| 338 | <p>This implementation is fast because even though dense_hash_set may not
|
---|
| 339 | be space efficient, most lookups are localized: a single lookup may
|
---|
| 340 | need to access table[i], and maybe table[i+1] and table[i+3], but
|
---|
| 341 | nothing other than that. For all but the biggest data structures,
|
---|
| 342 | these will frequently be in a single cache line.</p>
|
---|
| 343 |
|
---|
| 344 | <p>This implementation takes, for every unused bucket, space as big as
|
---|
| 345 | the key-type. Usually between half and two-thirds of the buckets are
|
---|
| 346 | empty.</p>
|
---|
| 347 |
|
---|
| 348 | <p>The doubling method used by dense_hash_set tends to work poorly
|
---|
| 349 | with most memory allocators. This is because memory allocators tend
|
---|
| 350 | to have memory 'buckets' which are a power of two. Since each
|
---|
| 351 | doubling of a dense_hash_set doubles the memory use, a single
|
---|
| 352 | hashtable doubling will require a new memory 'bucket' from the memory
|
---|
| 353 | allocator, leaving the old bucket stranded as fragmented memory.
|
---|
| 354 | Hence, it's not recommended this data structure be used with many
|
---|
| 355 | inserts in memory-constrained situations.</p>
|
---|
| 356 |
|
---|
| 357 | <p>You can also look at some specific <A
|
---|
| 358 | HREF="performance.html">performance numbers</A>.</p>
|
---|
| 359 |
|
---|
| 360 |
|
---|
| 361 | <hr>
|
---|
| 362 | <h2><tt>dense_hash_map</tt></h2>
|
---|
| 363 |
|
---|
| 364 | <p>dense_hash_map is identical to dense_hash_set except for what values
|
---|
| 365 | are stored in each table entry.</p>
|
---|
| 366 |
|
---|
| 367 | <hr>
|
---|
| 368 | <author>
|
---|
| 369 | Craig Silverstein<br>
|
---|
| 370 | Thu Jan 6 20:15:42 PST 2005
|
---|
| 371 | </author>
|
---|
| 372 |
|
---|
| 373 | </body>
|
---|
| 374 | </html>
|
---|