implementation.html @ 2162

Revision 2162, 16.0 KB checked in by mattausch, 18 years ago (diff)
improved hash performance with google hashmap

Rev	Line
[2162]	1	<HTML>
	2
	3	<HEAD>
	4	<title>Implementation notes: sparse_hash, dense_hash, sparsetable</title>
	5	</HEAD>
	6
	7	<BODY>
	8
	9	<h1>Implementation of sparse_hash_map, dense_hash_map, and
	10	sparsetable</h1>
	11
	12	This document contains a few notes on how the data structures in this
	13	package are implemented. This discussion refers at several points to
	14	the classic text in this area: Knuth, <i>The Art of Computer
	15	Programming</i>, Vol 3, Hashing.
	16
	17
	18	<hr>
	19	<h2><tt>sparsetable</tt></h2>
	20
	21	<p>For specificity, consider the declaration </p>
	22
	23	<pre>
	24	sparsetable<Foo> t(100); // a sparse array with 100 elements
	25	</pre>
	26
	27	<p>A sparsetable is a random-access container that implements a sparse
	28	array, that is, an array that uses very little memory to store unassigned
	29	indices (in this case, between 1 and 2 bits per unassigned index). For
	30	instance, if you allocate an array of size 5 and assign a[2] = [big
	31	struct], then a[2] will take up a lot of memory but a[0], a[1], a[3],
	32	and a[4] will not. Array elements that have a value are called
	33	"assigned". Array elements that have no value yet, or have had their
	34	value cleared using erase() or clear(), are called "unassigned".
	35	For assigned elements, lookups return the assigned value; for
	36	unassigned elements, they return the default value, which for t is
	37	Foo().</p>
	38
	39	<p>sparsetable is implemented as an array of "groups". Each group is
	40	responsible for M array indices. The first group knows about
	41	t[0]..t[M-1], the second about t[M]..t[2M-1], and so forth. (M is 48
	42	by default.) At construction time, t creates an array of (99/M + 1)
	43	groups. From this point on, all operations -- insert, delete, lookup
	44	-- are passed to the appropriate group. In particular, any operation
	45	on t[i] is actually performed on (t.group[i / M])[i % M].</p>
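<p>As a rough illustration of this index arithmetic, the sketch below maps
an array index to a group number and an offset inside that group. The names
kGroupSize, GroupPosition, num_groups and locate are hypothetical and are
not taken from the sparsetable source.</p>

```cpp
#include <cstddef>

// Hypothetical constant mirroring the default group size M from the text.
constexpr std::size_t kGroupSize = 48;

struct GroupPosition {
    std::size_t group;   // which group owns the index: i / M
    std::size_t offset;  // position inside that group: i % M
};

// Number of groups needed to cover n indices: (n - 1) / M + 1 for n > 0,
// using integer division.
constexpr std::size_t num_groups(std::size_t n) {
    return n == 0 ? 0 : (n - 1) / kGroupSize + 1;
}

// Maps array index i to the group and offset that store it.
constexpr GroupPosition locate(std::size_t i) {
    return GroupPosition{i / kGroupSize, i % kGroupSize};
}
```

<p>With these definitions, a table of 100 elements needs 3 groups, and
index 50 lands in group 1 at offset 2.</p>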
	46
	47	<p>Each group consists of a vector, which holds assigned values, and a
	48	bitmap of size M, which indicates which indices are assigned. A
	49	lookup works as follows: the group is asked to look up index i, where
	50	i < M. The group looks at bitmap[i]. If it's 0, the lookup fails.
	51	If it's 1, then the group has to find the appropriate value in the
	52	vector.</p>
	53
	54	<h3><tt>find()</tt></h3>
	55
	56	<p>Finding the appropriate vector element is the most expensive part of
	57	the lookup. The code counts all bitmap entries <= i that are set to
	58	1. (There's at least 1 of them, since bitmap[i] is 1.) Suppose there
	59	are 4 such entries. Then the right value to return is the 4th element
	60	of the vector: vector[3]. This takes time O(M), which is a constant
	61	since M is a constant.</p>
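<p>The lookup can be sketched as follows. This is a simplified model, not
the actual sparsetable code: the Group struct, its member names and the
rank() helper are assumptions made for illustration. It counts the set bits
strictly below i, which gives the same vector index as counting the bits
<= i and subtracting one.</p>

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 48;  // M, as in the text

// Hypothetical simplified group: a bitmap plus a packed vector of the
// assigned values, stored in index order.
template <typename T>
struct Group {
    std::bitset<kGroupSize> bitmap;
    std::vector<T> values;

    // Number of assigned entries at offsets strictly below i. O(M) time.
    std::size_t rank(std::size_t i) const {
        std::size_t count = 0;
        for (std::size_t b = 0; b < i; ++b) count += bitmap.test(b) ? 1 : 0;
        return count;
    }

    // Returns a pointer to the value at offset i, or nullptr if unassigned.
    const T* find(std::size_t i) const {
        if (!bitmap.test(i)) return nullptr;  // bitmap[i] == 0: lookup fails
        return &values[rank(i)];
    }
};
```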
	62
	63	<h3><tt>insert()</tt></h3>
	64
	65	<p>Insert starts with a lookup. If the lookup succeeds, the code merely
	66	replaces the vector element it found with the new value. If the lookup
	67	fails, then the code must insert a new entry into the middle of the vector.
	68	Again, to insert at position i, the code must count all the bitmap entries
	69	<= i that are set to 1. This count is the position to insert into the
	70	vector. All vector entries above that position must be moved to make
	71	room for the new entry. This takes time, but still constant time
	72	since the vector has size at most M.</p>
	73
	74	<p>(Inserts could be made faster by using a list instead of a vector to
	75	hold group values, but this would use much more memory, since each
	76	list element requires a full pointer of overhead.)</p>
	77
	78	<p>The only metadata that needs to be updated, after the actual value is
	79	inserted, is to set bitmap[i] to 1. No other counts must be
	80	maintained.</p>
	81
	82	<h3><tt>delete()</tt></h3>
	83
	84	<p>Deletes are similar to inserts. They start with a lookup. If it
	85	fails, the delete is a no-op. Otherwise, the appropriate entry is
	86	removed from the vector, all the vector elements above it are moved
	87	down one, and bitmap[i] is set to 0.</p>
	88
	89	<p>Currently, the code uses memmove() to move vector elements around when
	90	making space for inserts and deletes. This is why the value stored in
	91	a sparsetable must be Plain Old Data. This requirement is easy to
	92	remove, though it may come at a speed cost.</p>
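<p>The insert and delete steps for a single group might look like the
sketch below. It is an illustration under stated assumptions, not the real
implementation: the class and member names are hypothetical, and it uses
std::vector insert() and erase() rather than memmove(), so it does not
require Plain Old Data.</p>

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 48;  // M, as in the text

// Hypothetical simplified group supporting set, erase and get.
template <typename T>
class Group {
public:
    bool assigned(std::size_t i) const { return bitmap_.test(i); }
    std::size_t num_assigned() const { return values_.size(); }

    // Assigns value to offset i, overwriting any existing value.
    void set(std::size_t i, const T& value) {
        const std::ptrdiff_t pos = static_cast<std::ptrdiff_t>(rank(i));
        if (bitmap_.test(i)) {
            values_[static_cast<std::size_t>(pos)] = value;  // lookup succeeded
        } else {
            values_.insert(values_.begin() + pos, value);  // shifts later entries up
            bitmap_.set(i);                                // the only metadata update
        }
    }

    // Unassigns offset i; a no-op if it was not assigned.
    void erase(std::size_t i) {
        if (!bitmap_.test(i)) return;
        const std::ptrdiff_t pos = static_cast<std::ptrdiff_t>(rank(i));
        values_.erase(values_.begin() + pos);  // shifts later entries down
        bitmap_.reset(i);
    }

    // Returns the stored value, or a default-constructed T if unassigned.
    T get(std::size_t i) const {
        return bitmap_.test(i) ? values_[rank(i)] : T();
    }

private:
    // Number of assigned entries at offsets strictly below i.
    std::size_t rank(std::size_t i) const {
        std::size_t count = 0;
        for (std::size_t b = 0; b < i; ++b) count += bitmap_.test(b) ? 1 : 0;
        return count;
    }

    std::bitset<kGroupSize> bitmap_;
    std::vector<T> values_;
};
```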
	93
	94	<h3>iterators</h3>
	95
	96	<p>Sparsetable iterators pose a special burden. They must iterate over
	97	unassigned array values, but the act of iterating should not cause an
	98	assignment to happen -- otherwise, iterating over a sparsetable would
	99	cause it to take up much more room. For const iterators, the matter
	100	is simple: the iterator is merely programmed to return the default
	101	value -- Foo() -- when dereferenced while pointing to an unassigned
	102	entry.</p>
	103
	104	<p>For non-const iterators, such simple techniques fail. Instead,
	105	dereferencing a sparsetable_iterator returns an opaque object that
	106	acts like a Foo in almost all situations, but isn't actually a Foo.
	107	(It does this by defining operator=(), operator value_type(), and,
	108	most sneakily, operator&().) This works in almost all cases. If it
	109	doesn't, an explicit cast to value_type will solve the problem:</p>
	110
	111	<pre>
	112	printf("%d", static_cast<Foo>(*t.find(0)));
	113	</pre>
	114
	115	<p>To avoid such problems, consider using get() and set() instead of an
	116	iterator:</p>
	117
	118	<pre>
	119	for (int i = 0; i < t.size(); ++i)
	120	    if (t.get(i) == ...) t.set(i, ...);
	121	</pre>
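<p>The opaque object returned when dereferencing a non-const iterator can
be pictured as a small proxy class. The sketch below is only a model of the
idea: TinySparseArray and Proxy are hypothetical names, a std::map stands in
for the real storage, and operator&() is left out. Reading through the proxy
yields the default value without assigning anything; writing through it
assigns the element.</p>

```cpp
#include <cstddef>
#include <map>

// Hypothetical stand-in for a sparse array: only assigned indices are stored.
template <typename T>
class TinySparseArray {
public:
    class Proxy {
    public:
        Proxy(TinySparseArray* owner, std::size_t index)
            : owner_(owner), index_(index) {}

        // Writing through the proxy assigns the element.
        Proxy& operator=(const T& value) {
            owner_->assigned_[index_] = value;
            return *this;
        }

        // Reading converts to the stored value, or T() if unassigned,
        // without creating an entry.
        operator T() const { return owner_->get(index_); }

    private:
        TinySparseArray* owner_;
        std::size_t index_;
    };

    Proxy operator[](std::size_t i) { return Proxy(this, i); }

    T get(std::size_t i) const {
        typename std::map<std::size_t, T>::const_iterator it = assigned_.find(i);
        return it == assigned_.end() ? T() : it->second;
    }

    std::size_t num_assigned() const { return assigned_.size(); }

private:
    std::map<std::size_t, T> assigned_;
};
```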
	122
	123	<p>Sparsetable also has a special class of iterator, besides normal and
	124	const: nonempty_iterator. This only iterates over array values that
	125	are assigned. This is particularly fast given the sparsetable
	126	implementation, since it can ignore the bitmaps entirely and just
	127	iterate over the various group vectors.</p>
	128
	129	<h3>Resource use</h3>
	130
	131	<p>The space overhead for a sparsetable of size N is N + 48N/M bits.
	132	For the default value of M, this is exactly 2 bits per array entry.
	133	A larger M would use less overhead -- approaching 1 bit per array
	134	entry -- but take longer for inserts, deletes, and lookups. A smaller
	135	M would use more overhead but make operations somewhat faster.</p>
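<p>Dividing the formula by N gives a per-entry overhead of 1 + 48/M bits.
The helper below, whose name is made up for this note, simply evaluates
that expression so the tradeoff can be checked for other group sizes.</p>

```cpp
// Overhead in bits per array entry for group size m, from the formula
// (N + 48N/M) / N = 1 + 48/M.
constexpr double overhead_bits_per_entry(double m) {
    return 1.0 + 48.0 / m;
}
```

<p>For M = 48 this gives 2 bits per entry, for M = 96 it gives 1.5, and for
M = 24 it gives 3.</p>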
	136
	137	<p>You can also look at some specific <A
	138	HREF="performance.html">performance numbers</A>.</p>
	139
	140
	141	<hr>
	142	<h2><tt>sparse_hash_set</tt></h2>
	143
	144	<p>For specificity, consider the declaration </p>
	145
	146	<pre>
	147	sparse_hash_set<Foo> t; // Foo is a Plain Old Data type.
	148	</pre>
	149
	150	<p>sparse_hash_set is a hashtable. For more information on hashtables,
	151	see Knuth. Hashtables are basically arrays with complicated logic on
	152	top of them. sparse_hash_set uses a sparsetable to implement the
	153	underlying array.</p>
	154
	155	<p>In particular, sparse_hash_set stores its data in a sparsetable using
	156	quadratic internal probing (see Knuth). Many hashtable
	157	implementations use external probing, so each table element is
	158	actually a pointer chain, holding many hashtable values.
	159	sparse_hash_set, on the other hand, always stores at most one value in
	160	each table location. If the hashtable wants to store a second value
	161	at a given table location, it can't; it's forced to look somewhere
	162	else.</p>
	163
	164	<h3><tt>insert()</tt></h3>
	165
	166	<p>As a specific example, suppose t is a new sparse_hash_set. It then
	167	holds a sparsetable of size 32. The code for t.insert(foo) works as
	168	follows:</p>
	169
	170	<p>
	171	1) Call hash<Foo>(foo) to convert foo into an integer i. (hash<Foo> is
	172	the default hash function; you can specify a different one in the
	173	template arguments.)
	174
	175	</p><p>
	176	2a) Look at t.sparsetable[i % 32]. If it's unassigned, assign it to
	177	foo. foo is now in the hashtable.
	178
	179	</p><p>
	180	2b) If t.sparsetable[i % 32] is assigned, and its value is foo, then
	181	do nothing: foo was already in t and the insert is a no-op.
	182
	183	</p><p>
	184	2c) If t.sparsetable[i % 32] is assigned, but to a value other than
	185	foo, look at t.sparsetable[(i+1) % 32]. If that slot is also taken, try
	186	t.sparsetable[(i+3) % 32], then t.sparsetable[(i+6) % 32]. In
	187	general, keep trying the next triangular number.
	188
	189	</p><p>
	190	3) If the table is now "too full" -- say, 25 of the 32 table entries
	191	are now assigned -- grow the table by creating a new sparsetable
	192	that's twice as big, and rehashing every single element from the
	193	old table into the new one. This keeps the table from ever filling
	194	up.
	195
	196	</p><p>
	197	4) If the table is now "too empty" -- say, only 3 of the 32 table
	198	entries are now assigned -- shrink the table by creating a new
	199	sparsetable that's half as big, and rehashing every element as in
	200	the growing case. This keeps the table overhead proportional to
	201	the number of elements in the table.
	202	</p>
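<p>Steps 1 through 3 can be sketched as below. This is a minimal model
under several assumptions, not the sparse_hash_set code: ProbingSet is a
hypothetical class, keys are plain ints, ordinary vectors replace the
sparsetable, the "too full" threshold is taken to be 80%, and the shrink of
step 4 is omitted.</p>

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, simplified open-addressing set of ints.
class ProbingSet {
public:
    ProbingSet() : slots_(32, 0), used_(32, false), size_(0) {}

    // Returns true if key was newly inserted, false if it was already present.
    bool insert(int key) {
        if (!place(slots_, used_, key)) return false;  // step 2b: already there
        ++size_;
        // Step 3: grow when "too full". The 80% threshold is an assumption.
        if (size_ * 5 > slots_.size() * 4) grow();
        return true;
    }

    bool contains(int key) const {
        const std::size_t mask = slots_.size() - 1;
        std::size_t pos = std::hash<int>()(key) & mask;
        for (std::size_t probe = 1; used_[pos]; ++probe) {
            if (slots_[pos] == key) return true;
            pos = (pos + probe) & mask;
        }
        return false;  // reached an unassigned slot
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return slots_.size(); }

private:
    // Steps 1 and 2: hash, then probe offsets 0, 1, 3, 6, ... from the home slot.
    static bool place(std::vector<int>& slots, std::vector<bool>& used, int key) {
        const std::size_t mask = slots.size() - 1;  // size is a power of two
        std::size_t pos = std::hash<int>()(key) & mask;
        for (std::size_t probe = 1; used[pos]; ++probe) {
            if (slots[pos] == key) return false;
            pos = (pos + probe) & mask;  // add 1, then 2, then 3, ...
        }
        slots[pos] = key;
        used[pos] = true;
        return true;
    }

    // Rehashes every element into a table twice as large.
    void grow() {
        std::vector<int> bigger(slots_.size() * 2, 0);
        std::vector<bool> bigger_used(bigger.size(), false);
        for (std::size_t i = 0; i < slots_.size(); ++i)
            if (used_[i]) place(bigger, bigger_used, slots_[i]);
        slots_.swap(bigger);
        used_.swap(bigger_used);
    }

    std::vector<int> slots_;
    std::vector<bool> used_;
    std::size_t size_;
};
```

<p>Because the table grows before it can fill up, every probe loop is
guaranteed to reach an unassigned slot.</p>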
	203
	204	<p>Instead of using triangular numbers as offsets, one could just use
	205	regular integers: try i, then i+1, then i+2, then i+3. This has bad
	206	'clumping' behavior, as explored in Knuth. Quadratic probing, using
	207	the triangular numbers, avoids the clumping while keeping cache
	208	coherency in the common case. As long as the table size is a power of
	209	2, the quadratic-probing method described above will explore every
	210	table element if necessary, to find a good place to insert.</p>
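<p>The claim about power-of-two table sizes can be checked by brute force.
The function below is a small test helper written for this note, not part
of the library; it walks the triangular-number probe sequence and reports
whether every slot was reached.</p>

```cpp
#include <cstddef>
#include <vector>

// Returns true if probing home, home+1, home+3, home+6, ... (mod table_size)
// reaches every slot within table_size probes. Assumes table_size > 0.
bool triangular_probing_covers_table(std::size_t table_size, std::size_t home) {
    std::vector<bool> seen(table_size, false);
    std::size_t pos = home % table_size;
    for (std::size_t probe = 1; probe <= table_size; ++probe) {
        seen[pos] = true;
        pos = (pos + probe) % table_size;  // next triangular offset
    }
    for (std::size_t i = 0; i < table_size; ++i) {
        if (!seen[i]) return false;
    }
    return true;
}
```

<p>For table sizes 32 and 1024 the function returns true; for a size of 12,
which is not a power of two, the sequence revisits some slots and never
reaches others, so it returns false.</p>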
	211
	212	<p>(As a side note, using a table size that's a power of two has several
	213	advantages, including the speed of calculating (i % table_size). On
	214	the other hand, power-of-two tables are not very forgiving of a poor
	215	hash function. Make sure your hash function is a good one! There are
	216	plenty of dos and don'ts on the web (and in Knuth), for writing hash
	217	functions.)</p>
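<p>Both points in this side note can be seen in a few lines. The function
name slot_for is hypothetical. With a power-of-two table size the modulus
reduces to a bit mask, but the mask keeps only the low bits of the hash, so
a hash function whose low bits rarely vary sends many keys to the same
slot.</p>

```cpp
#include <cstddef>

// For a power-of-two table_size, hash % table_size equals
// hash & (table_size - 1). Only the low bits of the hash are used.
constexpr std::size_t slot_for(std::size_t hash, std::size_t table_size) {
    return hash & (table_size - 1);
}
```

<p>For example, hash values 32, 64 and 96 all map to slot 0 of a 32-entry
table.</p>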
	218
	219	<p>The "too full" value, also called the "maximum occupancy", determines
	220	a time-space tradeoff: in general, the higher it is, the less space is
	221	wasted but the more probes must be performed for each insert.
	222	sparse_hash_set uses a high maximum occupancy, since space is more
	223	important than speed for this data structure.</p>
	224
	225	<p>The "too empty" value is not necessary for performance but helps with
	226	space use. It's rare for hashtable implementations to check this
	227	value at insert() time -- after all, how will inserting cause a
	228	hashtable to get too small? However, the sparse_hash_set
	229	implementation never resizes on erase(); it's nice to have an erase()
	230	that does not invalidate iterators. Thus, the first insert() after a
	231	long string of erase()s could well trigger a hashtable shrink.</p>
	232
	233	<h3><tt>find()</tt></h3>
	234
	235	<p>find() works similarly to insert. The only difference is in step
	236	(2a): if the value is unassigned, then the lookup fails immediately.</p>
	237
	238	<h3><tt>delete()</tt></h3>
	239
	240	<p>delete() is tricky in an internal-probing scheme. The obvious
	241	implementation of just "unassigning" the relevant table entry doesn't
	242	work. Consider the following scenario:</p>
	243
	244	<pre>
	245	t.insert(foo1); // foo1 hashes to 4, is put in table[4]
	246	t.insert(foo2); // foo2 hashes to 4, is put in table[5]
	247	t.erase(foo1); // table[4] is now 'unassigned'
	248	t.lookup(foo2); // fails since table[hash(foo2)] is unassigned
	249	</pre>
	250
	251	<p>To avoid these failure situations, delete(foo1) is actually
	252	implemented by replacing foo1 by a special 'delete' value in the
	253	hashtable. This 'delete' value causes the table entry to be
	254	considered unassigned for the purposes of insertion -- if foo3 hashes
	255	to 4 as well, it can go into table[4] no problem -- but assigned for
	256	the purposes of lookup.</p>
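<p>The scheme can be sketched as follows. This is an illustrative model
only: TombstoneSet and its members are hypothetical, the table has a fixed
size of 32 with no resizing, and a vector of flags stands in for the
sparsetable bitmap. The caller-supplied deleted_key plays the role of the
special 'delete' value and must never be inserted as a real key.</p>

```cpp
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kTableSize = 32;  // fixed; a power of two
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical fixed-size set of ints that erases by leaving a tombstone.
class TombstoneSet {
public:
    // deleted_key must be a value the caller never inserts.
    explicit TombstoneSet(int deleted_key)
        : slots_(kTableSize, 0), assigned_(kTableSize, false),
          deleted_key_(deleted_key) {}

    bool contains(int key) const { return find_slot(key) != kNotFound; }

    // Returns false if key is already present or no reusable slot was found.
    bool insert(int key) {
        if (contains(key)) return false;
        std::size_t pos = home(key);
        for (std::size_t probe = 1; probe <= kTableSize; ++probe) {
            // Unassigned slots and tombstones are both free for insertion.
            if (!assigned_[pos] || slots_[pos] == deleted_key_) {
                slots_[pos] = key;
                assigned_[pos] = true;
                return true;
            }
            pos = (pos + probe) % kTableSize;
        }
        return false;
    }

    // Replaces the key with the tombstone value; a no-op if key is absent.
    void erase(int key) {
        const std::size_t pos = find_slot(key);
        if (pos != kNotFound) slots_[pos] = deleted_key_;
    }

private:
    static std::size_t home(int key) {
        return std::hash<int>()(key) % kTableSize;
    }

    std::size_t find_slot(int key) const {
        std::size_t pos = home(key);
        for (std::size_t probe = 1; probe <= kTableSize; ++probe) {
            if (!assigned_[pos]) return kNotFound;  // truly unassigned: stop
            if (slots_[pos] == key) return pos;     // tombstones keep the search going
            pos = (pos + probe) % kTableSize;
        }
        return kNotFound;
    }

    std::vector<int> slots_;
    std::vector<bool> assigned_;
    int deleted_key_;
};
```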
	257
	258	<p>What is this special 'delete' value? The delete value has to be an
	259	element of type Foo, since the table can't hold anything else. It
	260	obviously must be an element the client would never want to insert on
	261	its own, or else the code couldn't distinguish deleted entries from
	262	'real' entries with the same value. There's no way to determine a
	263	good value automatically. The client has to specify it explicitly.
	264	This is what the set_deleted_key() method does.</p>
	265
	266	<p>Note that set_deleted_key() is only necessary if the client actually
	267	wants to call t.erase(). For insert-only hash-sets, set_deleted_key()
	268	is unnecessary.</p>
	269
	270	<p>When copying the hashtable, either to grow it or shrink it, the
	271	special 'delete' values are <b>not</b> copied into the new table. The
	272	copy-time rehash makes them unnecessary.</p>
	273
	274	<h3>Resource use</h3>
	275
	276	<p>The data is stored in a sparsetable, so space use is the same as
	277	for sparsetable. Time use is also determined in large part by the
	278	sparsetable implementation. However, there is also an extra probing
	279	cost in hashtables, which depends in large part on the "too full"
	280	value. It should be rare to need more than 4-5 probes per lookup, and
	281	usually significantly fewer will suffice.</p>
	282
	283	<p>A note on growing and shrinking the hashtable: all hashtable
	284	implementations use the most memory when growing a hashtable, since
	285	they must have room for both the old table and the new table at the
	286	same time. sparse_hash_set is careful to delete entries from the old
	287	hashtable as soon as they're copied into the new one, to minimize this
	288	space overhead. (It does this efficiently by using its knowledge of
	289	the sparsetable class and copying one sparsetable group at a time.)</p>
	290
	291	<p>You can also look at some specific <A
	292	HREF="performance.html">performance numbers</A>.</p>
	293
	294
	295	<hr>
	296	<h2><tt>sparse_hash_map</tt></h2>
	297
	298	<p>sparse_hash_map is implemented identically to sparse_hash_set. The
	299	only difference is instead of storing just Foo in each table entry,
	300	the data structure stores pair<Foo, Value>.</p>
	301
	302
	303	<hr>
	304	<h2><tt>dense_hash_set</tt></h2>
	305
	306	<p>The hashtable aspects of dense_hash_set are identical to
	307	sparse_hash_set: it uses quadratic internal probing, and resizes
	308	hashtables in exactly the same way. The difference is in the
	309	underlying array: instead of using a sparsetable, dense_hash_set uses
	310	a C array. This means much more space is used, especially if Foo is
	311	big. However, it makes all operations faster, since sparsetable has
	312	memory management overhead that C arrays do not.</p>
	313
	314	<p>The use of C arrays instead of sparsetables points to one immediate
	315	complication dense_hash_set has that sparse_hash_set does not: the
	316	need to distinguish assigned from unassigned entries. In a
	317	sparsetable, this is accomplished by a bitmap. dense_hash_set, on the
	318	other hand, uses a dedicated value to specify unassigned entries.
	319	Thus, dense_hash_set has two special values: one to indicate deleted
	320	table entries, and one to indicate unassigned table entries. At
	321	construction time, all table entries are initialized to 'unassigned'.</p>
	322
	323	<p>dense_hash_set provides the method set_empty_key() to indicate the
	324	value that should be used for unassigned entries. Like
	325	set_deleted_key(), set_empty_key() requires a value that will not be
	326	used by the client for any legitimate purpose. Unlike
	327	set_deleted_key(), set_empty_key() is always required, no matter what
	328	hashtable operations the client wishes to perform.</p>
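<p>The role of the empty key can be illustrated with a small model. The
sketch below is not the dense_hash_set code: DenseSetSketch is a
hypothetical class with a fixed table of 32 ints, a std::vector in place of
a C array, no deleted key and no resizing. Every slot starts out holding the
empty key, and a probe stops as soon as it meets that value.</p>

```cpp
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kTableSize = 32;  // fixed; a power of two

// Hypothetical fixed-size dense set of ints using a sentinel for empty slots.
class DenseSetSketch {
public:
    // empty_key must be a value the caller never inserts.
    explicit DenseSetSketch(int empty_key)
        : empty_key_(empty_key), table_(kTableSize, empty_key) {}

    // Returns true if key was newly stored; false if present or the table is full.
    bool insert(int key) {
        std::size_t pos = std::hash<int>()(key) % kTableSize;
        for (std::size_t probe = 1; probe <= kTableSize; ++probe) {
            if (table_[pos] == key) return false;
            if (table_[pos] == empty_key_) {
                table_[pos] = key;
                return true;
            }
            pos = (pos + probe) % kTableSize;
        }
        return false;
    }

    bool contains(int key) const {
        std::size_t pos = std::hash<int>()(key) % kTableSize;
        for (std::size_t probe = 1; probe <= kTableSize; ++probe) {
            if (table_[pos] == key) return true;
            if (table_[pos] == empty_key_) return false;  // unassigned: stop
            pos = (pos + probe) % kTableSize;
        }
        return false;
    }

private:
    int empty_key_;
    std::vector<int> table_;
};
```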
	329
	330	<p>Since this implementation uses C arrays rather than an equivalent C++
	331	data structure like vectors, data in it must have a Plain Old Data
	332	type. This restriction is probably easy to remove -- using a vector
	333	instead of an array would probably work fine -- but it has not been
	334	done.</p>
	335
	336	<h3>Resource use</h3>
	337
	338	<p>This implementation is fast because even though dense_hash_set may not
	339	be space efficient, most lookups are localized: a single lookup may
	340	need to access table[i], and maybe table[i+1] and table[i+3], but
	341	nothing other than that. For all but the biggest data structures,
	342	these will frequently be in a single cache line.</p>
	343
	344	<p>This implementation takes, for every unused bucket, space as big as
	345	the key-type. Usually between half and two-thirds of the buckets are
	346	empty.</p>
	347
	348	<p>The doubling method used by dense_hash_set tends to work poorly
	349	with most memory allocators. This is because memory allocators tend
	350	to have memory 'buckets' which are a power of two. Since each
	351	doubling of a dense_hash_set doubles the memory use, a single
	352	hashtable doubling will require a new memory 'bucket' from the memory
	353	allocator, leaving the old bucket stranded as fragmented memory.
	354	Hence, it's not recommended this data structure be used with many
	355	inserts in memory-constrained situations.</p>
	356
	357	<p>You can also look at some specific <A
	358	HREF="performance.html">performance numbers</A>.</p>
	359
	360
	361	<hr>
	362	<h2><tt>dense_hash_map</tt></h2>
	363
	364	<p>dense_hash_map is identical to dense_hash_set except for what values
	365	are stored in each table entry.</p>
	366
	367	<hr>
	368	<author>
	369	Craig Silverstein<br>
	370	Thu Jan 6 20:15:42 PST 2005
	371	</author>
	372
	373	</body>
	374	</html>
