Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Blame
Revision Log

implementation.html @ 2162

Revision 2162, 16.0 KB checked in by mattausch, 18 years ago (diff)
improved hash performance with google hashmap

Line
1	<HTML>
2
3	<HEAD>
4	<title>Implementation notes: sparse_hash, dense_hash, sparsetable</title>
5	</HEAD>
6
7	<BODY>
8
9	<h1>Implementation of sparse_hash_map, dense_hash_map, and
10	sparsetable</h1>
11
12	This document contains a few notes on how the data structures in this
13	package are implemented. This discussion refers at several points to
14	the classic text in this area: Knuth, <i>The Art of Computer
15	Programming</i>, Vol 3, Hashing.
16
17
18	<hr>
19	<h2><tt>sparsetable</tt></h2>
20
21	<p>For specificity, consider the declaration </p>
22
23	<pre>
24	sparsetable<Foo> t(100); // a sparse array with 100 elements
25	</pre>
26
27	<p>A sparsetable is a random container that implements a sparse array,
28	that is, an array that uses very little memory to store unassigned
29	indices (in this case, between 1-2 bits per unassigned index). For
30	instance, if you allocate an array of size 5 and assign a[2] = [big
31	struct], then a[2] will take up a lot of memory but a[0], a[1], a[3],
32	and a[4] will not. Array elements that have a value are called
33	"assigned". Array elements that have no value yet, or have had their
34	value cleared using erase() or clear(), are called "unassigned".
35	For assigned elements, lookups return the assigned value; for
36	unassigned elements, they return the default value, which for t is
37	Foo().</p>
38
39	<p>sparsetable is implemented as an array of "groups". Each group is
40	responsible for M array indices. The first group knows about
41	t[0]..t[M-1], the second about t[M]..t[2M-1], and so forth. (M is 48
42	by default.) At construct time, t creates an array of (99/M + 1)
43	groups. From this point on, all operations -- insert, delete, lookup
44	-- are passed to the appropriate group. In particular, any operation
45	on t[i] is actually performed on (t.group[i / M])[i % M].</p>
46
47	<p>Each group contains of a vector, which holds assigned values, and a
48	bitmap of size M, which indicates which indices are assigned. A
49	lookup works as follows: the group is asked to look up index i, where
50	i < M. The group looks at bitmap[i]. If it's 0, the lookup fails.
51	If it's 1, then the group has to find the appropriate value in the
52	vector.</p>
53
54	<h3><tt>find()</tt></h3>
55
56	<p>Finding the appropriate vector element is the most expensive part of
57	the lookup. The code counts all bitmap entries <= i that are set to
58	1. (There's at least 1 of them, since bitmap[i] is 1.) Suppose there
59	are 4 such entries. Then the right value to return is the 4th element
60	of the vector: vector[3]. This takes time O(M), which is a constant
61	since M is a constant.</p>
62
63	<h3><tt>insert()</tt></h3>
64
65	<p>Insert starts with a lookup. If the lookup succeeds, the code merely
66	replaces vector[3] with the new value. If the lookup fails, then the
67	code must insert a new entry into the middle of the vector. Again, to
68	insert at position i, the code must count all the bitmap entries <= i
69	that are set to i. This indicates the position to insert into the
70	vector. All vector entries above that position must be moved to make
71	room for the new entry. This takes time, but still constant time
72	since the vector has size at most M.</p>
73
74	<p>(Inserts could be made faster by using a list instead of a vector to
75	hold group values, but this would use much more memory, since each
76	list element requires a full pointer of overhead.)</p>
77
78	<p>The only metadata that needs to be updated, after the actual value is
79	inserted, is to set bitmap[i] to 1. No other counts must be
80	maintained.</p>
81
82	<h3><tt>delete()</tt></h3>
83
84	<p>Deletes are similar to inserts. They start with a lookup. If it
85	fails, the delete is a noop. Otherwise, the appropriate entry is
86	removed from the vector, all the vector elements above it are moved
87	down one, and bitmap[i] is set to 0.</p>
88
89	<p>Currently, the code uses memmove() to move vector elements around when
90	making space for inserts and deletes. This is why the value stored in
91	a sparsetable must be Plain Old Data. This requirement is easy to
92	remove, though it may come at a speed cost.</p>
93
94	<h3>iterators</h3>
95
96	<p>Sparsetable iterators pose a special burden. They must iterate over
97	unassigned array values, but the act of iterating should not cause an
98	assignment to happen -- otherwise, iterating over a sparsetable would
99	cause it to take up much more room. For const iterators, the matter
100	is simple: the iterator is merely programmed to return the default
101	value -- Foo() -- when dereferenced while pointing to an unassigned
102	entry.</p>
103
104	<p>For non-const iterators, such simple techniques fail. Instead,
105	dereferencing a sparsetable_iterator returns an opaque object that
106	acts like a Foo in almost all situations, but isn't actually a Foo.
107	(It does this by defining operator=(), operator value_type(), and,
108	most sneakily, operator&().) This works in almost all cases. If it
109	doesn't, an explicit cast to value_type will solve the problem:</p>
110
111	<pre>
112	printf("%d", static_cast<Foo>(*t.find(0)));
113	</pre>
114
115	<p>To avoid such problems, consider using get() and set() instead of an
116	iterator:</p>
117
118	<pre>
119	for (int i = 0; i < t.size(); ++i)
120	if (t.get(i) == ...) t.set(i, ...);
121	</pre>
122
123	<p>Sparsetable also has a special class of iterator, besides normal and
124	const: nonempty_iterator. This only iterates over array values that
125	are assigned. This is particularly fast given the sparsetable
126	implementation, since it can ignore the bitmaps entirely and just
127	iterate over the various group vectors.</p>
128
129	<h3>Resource use</h3>
130
131	<p>The space overhead for an sparsetable of size N is N + 48N/M bits.
132	For the default value of M, this is exactly 2 bits per array entry.
133	A larger M would use less overhead -- approaching 1 bit per array
134	entry -- but take longer for inserts, deletes, and lookups. A smaller
135	M would use more overhead but make operations somewhat faster.</p>
136
137	<p>You can also look at some specific <A
138	HREF="performance.html">performance numbers</A>.</p>
139
140
141	<hr>
142	<h2><tt>sparse_hash_set</tt></h2>
143
144	<p>For specificity, consider the declaration </p>
145
146	<pre>
147	sparse_hash_set<Foo> t; // Foo is a Plain Old Data type.
148	</pre>
149
150	<p>sparse_hash_set is a hashtable. For more information on hashtables,
151	see Knuth. Hashtables are basically arrays with complicated logic on
152	top of them. sparse_hash_set uses a sparsetable to implement the
153	underlying array.</p>
154
155	<p>In particular, sparse_hash_set stores its data in a sparsetable using
156	quadratic internal probing (see Knuth). Many hashtable
157	implementations use external probing, so each table element is
158	actually a pointer chain, holding many hashtable values.
159	sparse_hash_set, on the other hand, always stores at most one value in
160	each table location. If the hashtable wants to store a second value
161	at a given table location, it can't; it's forced to look somewhere
162	else.</p>
163
164	<h3><tt>insert()</tt></h3>
165
166	<p>As a specific example, suppose t is a new sparse_hash_set. It then
167	holds a sparsetable of size 32. The code for t.insert(foo) works as
168	follows:</p>
169
170	<p>
171	1) Call hash<Foo>(foo) to convert foo into an integer i. (hash<Foo> is
172	the default hash function; you can specify a different one in the
173	template arguments.)
174
175	</p><p>
176	2a) Look at t.sparsetable[i % 32]. If it's unassigned, assign it to
177	foo. foo is now in the hashtable.
178
179	</p><p>
180	2b) If t.sparsetable[i % 32] is assigned, and its value is foo, then
181	do nothing: foo was already in t and the insert is a noop.
182
183	</p><p>
184	2c) If t.sparsetable[i % 32] is assigned, but to a value other than
185	foo, look at t.sparsetable[(i+1) % 32]. If that also fails, try
186	t.sparsetable[(i+3) % 32], then t.sparsetable[(i+6) % 32]. In
187	general, keep trying the next triangular number.
188
189	</p><p>
190	3) If the table is now "too full" -- say, 25 of the 32 table entries
191	are now assigned -- grow the table by creating a new sparsetable
192	that's twice as big, and rehashing every single element from the
193	old table into the new one. This keeps the table from ever filling
194	up.
195
196	</p><p>
197	4) If the table is now "too empty" -- say, only 3 of the 32 table
198	entries are now assigned -- shrink the table by creating a new
199	sparsetable that's half as big, and rehashing every element as in
200	the growing case. This keeps the table overhead proportional to
201	the number of elements in the table.
202	</p>
203
204	<p>Instead of using triangular numbers as offsets, one could just use
205	regular integers: try i, then i+1, then i+2, then i+3. This has bad
206	'clumping' behavior, as explored in Knuth. Quadratic probing, using
207	the triangular numbers, avoids the clumping while keeping cache
208	coherency in the common case. As long as the table size is a power of
209	2, the quadratic-probing method described above will explore every
210	table element if necessary, to find a good place to insert.</p>
211
212	<p>(As a side note, using a table size that's a power of two has several
213	advantages, including the speed of calculating (i % table_size). On
214	the other hand, power-of-two tables are not very forgiving of a poor
215	hash function. Make sure your hash function is a good one! There are
216	plenty of dos and don'ts on the web (and in Knuth), for writing hash
217	functions.)</p>
218
219	<p>The "too full" value, also called the "maximum occupancy", determines
220	a time-space tradeoff: in general, the higher it is, the less space is
221	wasted but the more probes must be performed for each insert.
222	sparse_hash_set uses a high maximum occupancy, since space is more
223	important than speed for this data structure.</p>
224
225	<p>The "too empty" value is not necessary for performance but helps with
226	space use. It's rare for hashtable implementations to check this
227	value at insert() time -- after all, how will inserting cause a
228	hashtable to get too small? However, the sparse_hash_set
229	implementation never resizes on erase(); it's nice to have an erase()
230	that does not invalidate iterators. Thus, the first insert() after a
231	long string of erase()s could well trigger a hashtable shrink.</p>
232
233	<h3><tt>find()</tt></h3>
234
235	<p>find() works similarly to insert. The only difference is in step
236	(2a): if the value is unassigned, then the lookup fails immediately.</p>
237
238	<h3><tt>delete()</tt></h3>
239
240	<p>delete() is tricky in an internal-probing scheme. The obvious
241	implementation of just "unassigning" the relevant table entry doesn't
242	work. Consider the following scenario:</p>
243
244	<pre>
245	t.insert(foo1); // foo1 hashes to 4, is put in table[4]
246	t.insert(foo2); // foo2 hashes to 4, is put in table[5]
247	t.erase(foo1); // table[4] is now 'unassigned'
248	t.lookup(foo2); // fails since table[hash(foo2)] is unassigned
249	</pre>
250
251	<p>To avoid these failure situations, delete(foo1) is actually
252	implemented by replacing foo1 by a special 'delete' value in the
253	hashtable. This 'delete' value causes the table entry to be
254	considered unassigned for the purposes of insertion -- if foo3 hashes
255	to 4 as well, it can go into table[4] no problem -- but assigned for
256	the purposes of lookup.</p>
257
258	<p>What is this special 'delete' value? The delete value has to be an
259	element of type Foo, since the table can't hold anything else. It
260	obviously must be an element the client would never want to insert on
261	its own, or else the code couldn't distinguish deleted entries from
262	'real' entries with the same value. There's no way to determine a
263	good value automatically. The client has to specify it explicitly.
264	This is what the set_deleted_key() method does.</p>
265
266	<p>Note that set_deleted_key() is only necessary if the client actually
267	wants to call t.erase(). For insert-only hash-sets, set_deleted_key()
268	is unnecessary.</p>
269
270	<p>When copying the hashtable, either to grow it or shrink it, the
271	special 'delete' values are <b>not</b> copied into the new table. The
272	copy-time rehash makes them unnecessary.</p>
273
274	<h3>Resource use</h3>
275
276	<p>The data is stored in a sparsetable, so space use is the same as
277	for sparsetable. Time use is also determined in large part by the
278	sparsetable implementation. However, there is also an extra probing
279	cost in hashtables, which depends in large part on the "too full"
280	value. It should be rare to need more than 4-5 probes per lookup, and
281	usually significantly less will suffice.</p>
282
283	<p>A note on growing and shrinking the hashtable: all hashtable
284	implementations use the most memory when growing a hashtable, since
285	they must have room for both the old table and the new table at the
286	same time. sparse_hash_set is careful to delete entries from the old
287	hashtable as soon as they're copied into the new one, to minimize this
288	space overhead. (It does this efficiently by using its knowledge of
289	the sparsetable class and copying one sparsetable group at a time.)</p>
290
291	<p>You can also look at some specific <A
292	HREF="performance.html">performance numbers</A>.</p>
293
294
295	<hr>
296	<h2><tt>sparse_hash_map</tt></h2>
297
298	<p>sparse_hash_map is implemented identically to sparse_hash_set. The
299	only difference is instead of storing just Foo in each table entry,
300	the data structure stores pair<Foo, Value>.</p>
301
302
303	<hr>
304	<h2><tt>dense_hash_set</tt></h2>
305
306	<p>The hashtable aspects of dense_hash_set are identical to
307	sparse_hash_set: it uses quadratic internal probing, and resizes
308	hashtables in exactly the same way. The difference is in the
309	underlying array: instead of using a sparsetable, dense_hash_set uses
310	a C array. This means much more space is used, especially if Foo is
311	big. However, it makes all operations faster, since sparsetable has
312	memory management overhead that C arrays do not.</p>
313
314	<p>The use of C arrays instead of sparsetables points to one immediate
315	complication dense_hash_set has that sparse_hash_set does not: the
316	need to distinguish assigned from unassigned entries. In a
317	sparsetable, this is accomplished by a bitmap. dense_hash_set, on the
318	other hand, uses a dedicated value to specify unassigned entries.
319	Thus, dense_hash_set has two special values: one to indicate deleted
320	table entries, and one to indicated unassigned table entries. At
321	construct time, all table entries are initialized to 'unassigned'.</p>
322
323	<p>dense_hash_set provides the method set_empty_key() to indicate the
324	value that should be used for unassigned entries. Like
325	set_deleted_key(), set_empty_key() requires a value that will not be
326	used by the client for any legitimate purpose. Unlike
327	set_deleted_key(), set_empty_key() is always required, no matter what
328	hashtable operations the client wishes to perform.</p>
329
330	<p>Since this implementation uses C arrays rather than an equivalent C++
331	data structure like vectors, data in it must have a Plain Old Data
332	type. This restriction is probably easy to remove -- using a vector
333	instead of an array would probably work fine -- but it has not been
334	done.</p>
335
336	<h3>Resource use</h3>
337
338	<p>This implementation is fast because even though dense_hash_set may not
339	be space efficient, most lookups are localized: a single lookup may
340	need to access table[i], and maybe table[i+1] and table[i+3], but
341	nothing other than that. For all but the biggest data structures,
342	these will frequently be in a single cache line.</p>
343
344	<p>This implementation takes, for every unused bucket, space as big as
345	the key-type. Usually between half and two-thirds of the buckets are
346	empty.</p>
347
348	<p>The doubling method used by dense_hash_set tends to work poorly
349	with most memory allocators. This is because memory allocators tend
350	to have memory 'buckets' which are a power of two. Since each
351	doubling of a dense_hash_set doubles the memory use, a single
352	hashtable doubling will require a new memory 'bucket' from the memory
353	allocator, leaving the old bucket stranded as fragmented memory.
354	Hence, it's not recommended this data structure be used with many
355	inserts in memory-constrained situations.</p>
356
357	<p>You can also look at some specific <A
358	HREF="performance.html">performance numbers</A>.</p>
359
360
361	<hr>
362	<h2><tt>dense_hash_map</tt></h2>
363
364	<p>dense_hash_map is identical to dense_hash_set except for what values
365	are stored in each table entry.</p>
366
367	<hr>
368	<author>
369	Craig Silverstein<br>
370	Thu Jan 6 20:15:42 PST 2005
371	</author>
372
373	</body>
374	</html>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: GTP/trunk/Lib/Vis/Preprocessing/src/sparsehash/doc/implementation.html @ 2162

Download in other formats: