<HTML>

<HEAD>
<title>Implementation notes: sparse_hash, dense_hash, sparsetable</title>
</HEAD>

<BODY>

<h1>Implementation of sparse_hash_map, dense_hash_map, and
sparsetable</h1>

This document contains a few notes on how the data structures in this
package are implemented.  This discussion refers at several points to
the classic text in this area: Knuth, <i>The Art of Computer
Programming</i>, Vol 3, Hashing.


<hr>
<h2><tt>sparsetable</tt></h2>

<p>For specificity, consider the declaration </p>

<pre>
   sparsetable&lt;Foo&gt; t(100);        // a sparse array with 100 elements
</pre>

<p>A sparsetable is a random-access container that implements a sparse array,
that is, an array that uses very little memory to store unassigned
indices (in this case, between 1 and 2 bits per unassigned index).  For
instance, if you allocate an array of size 5 and assign a[2] = [big
struct], then a[2] will take up a lot of memory but a[0], a[1], a[3],
and a[4] will not.  Array elements that have a value are called
"assigned".  Array elements that have no value yet, or have had their
value cleared using erase() or clear(), are called "unassigned".
For assigned elements, lookups return the assigned value; for
unassigned elements, they return the default value, which for t is
Foo().</p>

<p>sparsetable is implemented as an array of "groups".  Each group is
responsible for M array indices.  The first group knows about
t[0]..t[M-1], the second about t[M]..t[2M-1], and so forth.  (M is 48
by default.)  At construct time, t creates an array of (99/M + 1)
groups.  From this point on, all operations -- insert, delete, lookup
-- are passed to the appropriate group.  In particular, any operation
on t[i] is actually performed on (t.group[i / M])[i % M].</p>
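
<p>A minimal sketch of that index arithmetic (illustrative only, not the
actual sparsetable source):</p>

<pre>
   const int M = 48;          // the default group size
   int i = 130;               // some table index
   int group_num = i / M;     // 130 / 48 = 2  -- the third group owns t[130]
   int offset    = i % M;     // 130 % 48 = 34 -- its position within that group
</pre>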

<p>Each group consists of a vector, which holds assigned values, and a
bitmap of size M, which indicates which indices are assigned.  A
lookup works as follows: the group is asked to look up index i, where
i &lt; M.  The group looks at bitmap[i].  If it's 0, the lookup fails.
If it's 1, then the group has to find the appropriate value in the
vector.</p>

<h3><tt>find()</tt></h3>

<p>Finding the appropriate vector element is the most expensive part of
the lookup.  The code counts all bitmap entries &lt;= i that are set to
1.  (There's at least 1 of them, since bitmap[i] is 1.)  Suppose there
are 4 such entries.  Then the right value to return is the 4th element
of the vector: vector[3].  This takes time O(M), which is a constant
since M is a constant.</p>
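
<p>A sketch of that counting step (assuming, purely for illustration, that
the group's bitmap is a plain array of bools and its assigned values live
in a std::vector of the element type Foo from the declaration above; the
real code packs the bitmap into machine words, but the logic is the same):</p>

<pre>
   #include &lt;vector&gt;

   // Returns true and fills *out if index i is assigned in this group.
   bool group_lookup(const bool bitmap[], const std::vector&lt;Foo&gt;&amp; values,
                     int i, Foo* out) {
     if (!bitmap[i]) return false;      // unassigned: the lookup fails
     int rank = 0;                      // count assigned entries &lt;= i
     for (int j = 0; j &lt;= i; ++j)
       if (bitmap[j]) ++rank;
     *out = values[rank - 1];           // the rank-th assigned value
     return true;
   }
</pre>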

<h3><tt>insert()</tt></h3>

<p>Insert starts with a lookup.  If the lookup succeeds, the code merely
replaces vector[3] with the new value.  If the lookup fails, then the
code must insert a new entry into the middle of the vector.  Again, to
insert at position i, the code must count all the bitmap entries &lt;= i
that are set to 1.  This indicates the position to insert into the
vector.  All vector entries above that position must be moved to make
room for the new entry.  This takes time, but still constant time
since the vector has size at most M.</p>

<p>(Inserts could be made faster by using a list instead of a vector to
hold group values, but this would use much more memory, since each
list element requires a full pointer of overhead.)</p>

<p>The only metadata that needs to be updated, after the actual value is
inserted, is to set bitmap[i] to 1.  No other counts must be
maintained.</p>
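
<p>A sketch of the new-entry case, under the same illustrative assumptions
as the lookup sketch above (plain bool bitmap, std::vector of values):</p>

<pre>
   // Insert a value at index i, which is currently unassigned.
   void group_insert(bool bitmap[], std::vector&lt;Foo&gt;* values,
                     int i, const Foo&amp; val) {
     int pos = 0;                       // count assigned entries below i
     for (int j = 0; j &lt; i; ++j)
       if (bitmap[j]) ++pos;
     values-&gt;insert(values-&gt;begin() + pos, val);   // shifts later entries up
     bitmap[i] = true;                  // the only metadata update needed
   }
</pre>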

<h3><tt>delete()</tt></h3>

<p>Deletes are similar to inserts.  They start with a lookup.  If it
fails, the delete is a noop.  Otherwise, the appropriate entry is
removed from the vector, all the vector elements above it are moved
down one, and bitmap[i] is set to 0.</p>

<p>Currently, the code uses memmove() to move vector elements around when
making space for inserts and deletes.  This is why the value stored in
a sparsetable must be Plain Old Data.  This requirement is easy to
remove, though it may come at a speed cost.</p>

<h3>iterators</h3>

<p>Sparsetable iterators pose a special burden.  They must iterate over
unassigned array values, but the act of iterating should not cause an
assignment to happen -- otherwise, iterating over a sparsetable would
cause it to take up much more room.  For const iterators, the matter
is simple: the iterator is merely programmed to return the default
value -- Foo() -- when dereferenced while pointing to an unassigned
entry.</p>

<p>For non-const iterators, such simple techniques fail.  Instead,
dereferencing a sparsetable_iterator returns an opaque object that
acts like a Foo in almost all situations, but isn't actually a Foo.
(It does this by defining operator=(), operator value_type(), and,
most sneakily, operator&().)  This works in almost all cases.  If it
doesn't, an explicit cast to value_type will solve the problem:</p>

<pre>
   printf("%d", static_cast&lt;Foo&gt;(*t.find(0)));
</pre>
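
<p>The opaque object looks roughly like the following (an illustration of
the idea only, not the class actually used in the package; the real proxy
also overloads operator&amp;()):</p>

<pre>
   // A reference-like proxy returned by dereferencing a non-const
   // iterator.  Reads convert to Foo; writes go through the table.
   class element_proxy {
    public:
     element_proxy(sparsetable&lt;Foo&gt;* table, int index)
       : table_(table), index_(index) { }
     operator Foo() const { return table_-&gt;get(index_); }    // read
     element_proxy&amp; operator=(const Foo&amp; val) {               // write
       table_-&gt;set(index_, val);
       return *this;
     }
    private:
     sparsetable&lt;Foo&gt;* table_;
     int index_;
   };
</pre>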

<p>To avoid such problems, consider using get() and set() instead of an
iterator:</p>

<pre>
   for (int i = 0; i &lt; t.size(); ++i)
      if (t.get(i) == ...)  t.set(i, ...);
</pre>

<p>Sparsetable also has a special class of iterator, besides normal and
const: nonempty_iterator.  This only iterates over array values that
are assigned.  This is particularly fast given the sparsetable
implementation, since it can ignore the bitmaps entirely and just
iterate over the various group vectors.</p>

<h3>Resource use</h3>

<p>The space overhead for a sparsetable of size N is N + 48N/M bits.
For the default value of M, this is exactly 2 bits per array entry.
A larger M would use less overhead -- approaching 1 bit per array
entry -- but take longer for inserts, deletes, and lookups.  A smaller
M would use more overhead but make operations somewhat faster.</p>

<p>You can also look at some specific <A
HREF="performance.html">performance numbers</A>.</p>


<hr>
<h2><tt>sparse_hash_set</tt></h2>

<p>For specificity, consider the declaration </p>

<pre>
   sparse_hash_set&lt;Foo&gt; t;           // Foo is a Plain Old Data type.
</pre>

<p>sparse_hash_set is a hashtable.  For more information on hashtables,
see Knuth.  Hashtables are basically arrays with complicated logic on
top of them.  sparse_hash_set uses a sparsetable to implement the
underlying array.</p>

<p>In particular, sparse_hash_set stores its data in a sparsetable using
quadratic internal probing (see Knuth).  Many hashtable
implementations use external probing, so each table element is
actually a pointer chain, holding many hashtable values.
sparse_hash_set, on the other hand, always stores at most one value in
each table location.  If the hashtable wants to store a second value
at a given table location, it can't; it's forced to look somewhere
else.</p>

<h3><tt>insert()</tt></h3>

<p>As a specific example, suppose t is a new sparse_hash_set.  It then
holds a sparsetable of size 32.  The code for t.insert(foo) works as
follows (a sketch of the resulting probe loop appears after the steps):</p>

<p>
1) Call hash&lt;Foo&gt;(foo) to convert foo into an integer i.  (hash&lt;Foo&gt; is
   the default hash function; you can specify a different one in the
   template arguments.)
</p>

<p>
2a) Look at t.sparsetable[i % 32].  If it's unassigned, assign it to
    foo.  foo is now in the hashtable.
</p>

<p>
2b) If t.sparsetable[i % 32] is assigned, and its value is foo, then
    do nothing: foo was already in t and the insert is a noop.
</p>

<p>
2c) If t.sparsetable[i % 32] is assigned, but to a value other than
    foo, look at t.sparsetable[(i+1) % 32].  If that is also taken, try
    t.sparsetable[(i+3) % 32], then t.sparsetable[(i+6) % 32].  In
    general, keep trying the next triangular number.
</p>

<p>
3) If the table is now "too full" -- say, 25 of the 32 table entries
   are now assigned -- grow the table by creating a new sparsetable
   that's twice as big, and rehashing every single element from the
   old table into the new one.  This keeps the table from ever filling
   up.
</p>

<p>
4) If the table is now "too empty" -- say, only 3 of the 32 table
   entries are now assigned -- shrink the table by creating a new
   sparsetable that's half as big, and rehashing every element as in
   the growing case.  This keeps the table overhead proportional to
   the number of elements in the table.
</p>
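
<p>Putting steps 1 and 2 together, the probe sequence looks roughly like
this (a sketch only: assigned(), assign(), and value_at() stand in for the
underlying sparsetable operations, and the deleted-entry handling and the
resize checks of steps 3 and 4 are omitted):</p>

<pre>
   size_t i = hash&lt;Foo&gt;()(foo);                 // step 1
   for (size_t probe = 0; probe &lt; table_size; ++probe) {
     // offsets 0, 1, 3, 6, 10, ... -- the triangular numbers
     size_t bucket = (i + probe * (probe + 1) / 2) % table_size;
     if (!assigned(bucket)) {                   // step 2a: empty, insert here
       assign(bucket, foo);
       break;
     }
     if (value_at(bucket) == foo)               // step 2b: already present
       break;
     // step 2c: occupied by a different value -- keep probing
   }
</pre>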

<p>Instead of using triangular numbers as offsets, one could just use
regular integers: try i, then i+1, then i+2, then i+3.  This has bad
'clumping' behavior, as explored in Knuth.  Quadratic probing, using
the triangular numbers, avoids the clumping while keeping cache
coherency in the common case.  As long as the table size is a power of
2, the quadratic-probing method described above will explore every
table element if necessary, to find a good place to insert.</p>

<p>(As a side note, using a table size that's a power of two has several
advantages, including the speed of calculating (i % table_size).  On
the other hand, power-of-two tables are not very forgiving of a poor
hash function.  Make sure your hash function is a good one!  There are
plenty of dos and don'ts for writing hash functions on the web (and in
Knuth).)</p>
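
<p>For instance, when table_size is a power of 2, the modulo above can be
computed with a single bitwise AND:</p>

<pre>
   // Equivalent to i % table_size whenever table_size is a power of 2.
   size_t bucket = i &amp; (table_size - 1);
</pre>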

<p>The "too full" value, also called the "maximum occupancy", determines
a time-space tradeoff: in general, the higher it is, the less space is
wasted but the more probes must be performed for each insert.
sparse_hash_set uses a high maximum occupancy, since space is more
important than speed for this data structure.</p>

<p>The "too empty" value is not necessary for performance but helps with
space use.  It's rare for hashtable implementations to check this
value at insert() time -- after all, how will inserting cause a
hashtable to get too small?  However, the sparse_hash_set
implementation never resizes on erase(); it's nice to have an erase()
that does not invalidate iterators.  Thus, the first insert() after a
long string of erase()s could well trigger a hashtable shrink.</p>

<h3><tt>find()</tt></h3>

<p>find() works similarly to insert.  The only difference is in step
(2a): if the value is unassigned, then the lookup fails immediately.</p>

<h3><tt>delete()</tt></h3>

<p>delete() is tricky in an internal-probing scheme.  The obvious
implementation of just "unassigning" the relevant table entry doesn't
work.  Consider the following scenario:</p>

<pre>
    t.insert(foo1);         // foo1 hashes to 4, is put in table[4]
    t.insert(foo2);         // foo2 hashes to 4, is put in table[5]
    t.erase(foo1);          // table[4] is now 'unassigned'
    t.find(foo2);           // fails: probing starts at the unassigned table[4]
</pre>

<p>To avoid these failure situations, delete(foo1) is actually
implemented by replacing foo1 by a special 'delete' value in the
hashtable.  This 'delete' value causes the table entry to be
considered unassigned for the purposes of insertion -- if foo3 hashes
to 4 as well, it can go into table[4] no problem -- but assigned for
the purposes of lookup.</p>

<p>What is this special 'delete' value?  The delete value has to be an
element of type Foo, since the table can't hold anything else.  It
obviously must be an element the client would never want to insert on
its own, or else the code couldn't distinguish deleted entries from
'real' entries with the same value.  There's no way to determine a
good value automatically.  The client has to specify it explicitly.
This is what the set_deleted_key() method does.</p>

<p>Note that set_deleted_key() is only necessary if the client actually
wants to call t.erase().  For insert-only hash-sets, set_deleted_key()
is unnecessary.</p>
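
<p>A typical usage pattern, using int elements for brevity (the deleted
key just has to be a value that will never be inserted for real):</p>

<pre>
   sparse_hash_set&lt;int&gt; s;
   s.set_deleted_key(-1);     // -1 will never be used as a real element
   s.insert(5);
   s.erase(5);                // the slot now holds the 'delete' value internally
</pre>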

<p>When copying the hashtable, either to grow it or shrink it, the
special 'delete' values are <b>not</b> copied into the new table.  The
copy-time rehash makes them unnecessary.</p>

<h3>Resource use</h3>

<p>The data is stored in a sparsetable, so space use is the same as
for sparsetable.  Time use is also determined in large part by the
sparsetable implementation.  However, there is also an extra probing
cost in hashtables, which depends in large part on the "too full"
value.  It should be rare to need more than 4-5 probes per lookup, and
usually significantly fewer will suffice.</p>

<p>A note on growing and shrinking the hashtable: all hashtable
implementations use the most memory when growing a hashtable, since
they must have room for both the old table and the new table at the
same time.  sparse_hash_set is careful to delete entries from the old
hashtable as soon as they're copied into the new one, to minimize this
space overhead.  (It does this efficiently by using its knowledge of
the sparsetable class and copying one sparsetable group at a time.)</p>

<p>You can also look at some specific <A
HREF="performance.html">performance numbers</A>.</p>


<hr>
<h2><tt>sparse_hash_map</tt></h2>

<p>sparse_hash_map is implemented identically to sparse_hash_set.  The
only difference is that, instead of storing just Foo in each table
entry, the data structure stores pair&lt;Foo, Value&gt;.</p>
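
<p>For example, with int keys (illustrative only; any key type with a
suitable hash function works the same way):</p>

<pre>
   sparse_hash_map&lt;int, Foo&gt; m;
   m.set_deleted_key(-1);     // only needed if erase() will be called
   m[42] = Foo();             // the table entry holds the pair (42, Foo())
   m.erase(42);
</pre>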


<hr>
<h2><tt>dense_hash_set</tt></h2>

<p>The hashtable aspects of dense_hash_set are identical to
sparse_hash_set: it uses quadratic internal probing, and resizes
hashtables in exactly the same way.  The difference is in the
underlying array: instead of using a sparsetable, dense_hash_set uses
a C array.  This means much more space is used, especially if Foo is
big.  However, it makes all operations faster, since sparsetable has
memory management overhead that C arrays do not.</p>

<p>The use of C arrays instead of sparsetables points to one immediate
complication dense_hash_set has that sparse_hash_set does not: the
need to distinguish assigned from unassigned entries.  In a
sparsetable, this is accomplished by a bitmap.  dense_hash_set, on the
other hand, uses a dedicated value to specify unassigned entries.
Thus, dense_hash_set has two special values: one to indicate deleted
table entries, and one to indicate unassigned table entries.  At
construct time, all table entries are initialized to 'unassigned'.</p>

<p>dense_hash_set provides the method set_empty_key() to indicate the
value that should be used for unassigned entries.  Like
set_deleted_key(), set_empty_key() requires a value that will not be
used by the client for any legitimate purpose.  Unlike
set_deleted_key(), set_empty_key() is always required, no matter what
hashtable operations the client wishes to perform.</p>
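
<p>For example, with int elements (the two special values only need to be
distinct from each other and from every real element):</p>

<pre>
   dense_hash_set&lt;int&gt; d;
   d.set_empty_key(-1);       // always required before using the table
   d.set_deleted_key(-2);     // additionally required before calling erase()
   d.insert(5);
   d.erase(5);
</pre>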

<p>Since this implementation uses C arrays rather than an equivalent C++
data structure like vectors, data in it must have a Plain Old Data
type.  This restriction is probably easy to remove -- using a vector
instead of an array would probably work fine -- but it has not been
done.</p>

<h3>Resource use</h3>

<p>This implementation is fast because even though dense_hash_set may not
be space efficient, most lookups are localized: a single lookup may
need to access table[i], and maybe table[i+1] and table[i+3], but
nothing other than that.  For all but the biggest data structures,
these will frequently be in a single cache line.</p>

<p>This implementation takes, for every unused bucket, space as big as
the key-type.  Usually between half and two-thirds of the buckets are
empty.</p>

<p>The doubling method used by dense_hash_set tends to work poorly
with most memory allocators.  This is because memory allocators tend
to have memory 'buckets' which are a power of two.  Since each
doubling of a dense_hash_set doubles the memory use, a single
hashtable doubling will require a new memory 'bucket' from the memory
allocator, leaving the old bucket stranded as fragmented memory.
Hence, this data structure is not recommended for insert-heavy use in
memory-constrained situations.</p>

<p>You can also look at some specific <A
HREF="performance.html">performance numbers</A>.</p>


<hr>
<h2><tt>dense_hash_map</tt></h2>

<p>dense_hash_map is identical to dense_hash_set except for what values
are stored in each table entry.</p>

<hr>
<author>
Craig Silverstein<br>
Thu Jan  6 20:15:42 PST 2005
</author>

</body>
</html>