\chapter{Online Visibility Culling}
%*****************************************************************************
\label{chap:online}

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. Overdraw wastes computational effort in both the pixel and
vertex processing stages of modern graphics hardware.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for the cells of a fixed subdivision of the scene. At runtime, we
can quickly identify the PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of applying extra computations at each frame. To make these additional
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by few closest depth layers).

\begin{figure}%[htb]
  \centering \footnotesize
  \begin{tabular}{c}
  \includegraphics[width=0.6\textwidth, draft=\DRAFTFIGS]{images/ogre_terrain} \\
  \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_viewfrustum} \hfill \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_chc} \\
  \end{tabular}
  \caption{(top) The rendered terrain scene. (bottom) Visualization of the rendered/culled objects
    using view-frustum culling (left) vs. occlusion queries (right).
    The yellow boxes show the actually rendered scene objects, the
    red boxes depict the hierarchy nodes culled by view-frustum culling, and the
    blue boxes depict the hierarchy nodes culled by the occlusion queries.}
  \label{fig:online_culling_example}
\end{figure}

Recent graphics hardware natively supports an \emph{occlusion query}
to detect the visibility of an object against the current contents of
the z-buffer. Although the query itself is processed quickly using
the raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated in
existing real-time rendering packages on architectures supporting the
underlying occlusion query~\cite{wimmer05:gpugems}. In
Figure~\ref{fig:online_culling_example}, the same scene (top row) is
rendered using view-frustum culling (visualization in the bottom left
image) versus online culling using occlusion queries (visualization in
the bottom right image). It can be seen that with view-frustum culling
alone, many occluded objects are still rendered.

\section{Related Work}

With the demand for rendering scenes of ever increasing size, a
number of visibility culling methods have been developed in the
last decade. A comprehensive survey of visibility culling methods
was presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another
recent survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In our further discussion we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer allows us to quickly determine whether the geometry in question is
occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of fragments
in the graphics pipeline (e.g., the Hyper-Z technology of ATI or the Z-cull of
NVIDIA). However, the geometry still needs to be sent to the GPU,
transformed, and coarsely rasterized even if it is later determined
invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. The first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card, and it is now also available as an official ARB extension.

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. However, this method
does not exploit temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions to
balance the accuracy of visibility classification against the associated
computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion, and thus it
suffers from CPU stalls and GPU starvation.

On a theoretical level, our method is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above-mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. On the
contrary, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is also
vital for online occlusion culling, is to avoid repeated
visibility tests of interior nodes of the hierarchy. The problem with a
direct adoption of this method is that it is designed for use with
instantaneous CPU-based occlusion queries, whereas hardware occlusion
queries exhibit significant latency. The method presented herein
efficiently overcomes the problem of latency while keeping the
benefits of the generality and simplicity of the original hierarchical
technique. As a result, we obtain a simple and efficient occlusion
culling algorithm utilizing hardware occlusion queries.

\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

Hardware occlusion queries follow a simple pattern: To test visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and it need not be rendered.

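For concreteness, the following is a minimal sketch of this pattern
using the OpenGL interface of \arboq{} (the {\tt RenderBoundingBox}
and {\tt RenderGeometry} helpers are assumptions, not part of the API):

{\footnotesize
\begin{verbatim}
void QueryAndRender(Node *node)
{
    GLuint query;
    glGenQueriesARB(1, &query);

    // the bounding volume is only tested against the
    // z-buffer; it must not alter the frame buffer
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);

    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
    RenderBoundingBox(node);
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);

    // re-enable writes before rendering actual geometry
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    GLuint visiblePixels;
    glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB,
                           &visiblePixels);
    if (visiblePixels > 0)
        RenderGeometry(node);
}
\end{verbatim}
}
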
\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free, since we use the intermediate results of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}

\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware-supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
  \centerline{
    \includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
    }
  \caption{(top) Illustration of CPU stalls and GPU starvation.
    Qn, Rn, and Cn denote querying, rendering, and culling of object
    n, respectively. Note that object 5 is found invisible by Q5 and
    thus not rendered. (bottom) More efficient query scheduling. The
    scheduling assumes that objects 4 and 6 will be visible in the
    current frame and renders them without waiting for the results of
    the corresponding queries.}
  \label{fig:latency}
\end{figure}

\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and thus
to exploit spatial coherence of visibility and provide a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proven to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.

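A minimal recursive sketch of this method follows; the {\tt Node} type,
the front-to-back child ordering, and the query helpers are assumed:

{\footnotesize
\begin{verbatim}
void TraverseStopAndWait(Node *node)
{
    if (!IntersectsViewFrustum(node))
        return;
    // issue the query and block until the result arrives;
    // this wait is the source of the CPU stalls
    GLuint query = IssueOcclusionQuery(node);
    if (GetQueryResult(query) == 0)
        return;                       // node is occluded: cull
    if (node->IsLeaf())
        RenderGeometry(node);
    else
        for (Node *child : node->ChildrenFrontToBack(viewpoint))
            TraverseStopAndWait(child);
}
\end{verbatim}
}
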
The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.

\section{Coherent Hierarchical Culling}
\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

\subsection{Algorithm Overview}
\label{sec:overview}

Our method is based on exploiting temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
 \item We initiate occlusion queries on nodes of the hierarchy where
   the traversal terminated in the last frame. Thus we avoid queries
   on all previously visible interior nodes~\cite{bittner01:jvca}.
 \item We assume that a previously visible leaf node remains visible
   and render the associated geometry without waiting for the result
   of the corresponding occlusion query.
 \item Issued occlusion queries are stored in a query queue until they
   are known to have been carried out by the GPU. This allows
   interleaving the queries with the rendering of visible geometry.
\end{itemize}

The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$, and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
  the occlusion query and store it in the query queue. Then we
  immediately render the associated geometry without waiting for the
  result of the query.
\item For a previously invisible node, we issue the query and store
  it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
  \input{code/pseudocode2}
 }
  \caption{Pseudo-code of coherent hierarchical culling.}
  \label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.

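The following condensed sketch summarizes this loop. The traversal
stack, the query queue (which remembers whether a node was previously
visible), and the helpers are assumed; {\tt TraverseNode} renders the
geometry of a leaf or pushes the children of an interior node onto the
stack in front-to-back order:

{\footnotesize
\begin{verbatim}
while (!stack.Empty() || !queryQueue.Empty()) {
    // fetch query results; if the stack is empty we must wait
    while (!queryQueue.Empty() &&
           (ResultAvailable(queryQueue.Front()) || stack.Empty())) {
        auto [node, wasVisible] = queryQueue.Dequeue();
        if (GetQueryResult(node) > 0) {
            PullUpVisibility(node);  // mark path to root visible
            if (!wasVisible)         // pull-down: the node just
                TraverseNode(node);  // became visible
        }
    }
    if (!stack.Empty()) {
        Node *node = stack.Pop();
        if (IntersectsViewFrustum(node)) {
            bool wasVisible = node->visible &&
                              node->lastVisited == frameId - 1;
            node->visible = false;       // invisible by default,
            node->lastVisited = frameId; // pulled up on demand
            // skip queries on opened (visible interior) nodes
            if (!wasVisible || node->IsLeaf())
                queryQueue.Enqueue(IssueOcclusionQuery(node),
                                   wasVisible);
            if (wasVisible)              // render leaf now or
                TraverseNode(node);      // push children
        }
    }
}
\end{verbatim}
}
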
In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

\begin{figure*}[htb]
  \centerline{
    \includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
  \caption{(left) Visibility classification of a node of the kD-tree and
    the termination nodes. (right) Visibility classification after the
    application of the occlusion test and the new set of termination nodes.
    Nodes on which occlusion queries were applied are depicted with a
    solid outline. Note the pull-up and pull-down due to visibility changes.}
  \label{fig:cut}
\end{figure*}

The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node its visibility classification and the id of
the last frame in which it was visited. A node is an opened node
if it is a visible interior node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.

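A sketch of the corresponding pull-up follows: since all visited nodes
were preset to invisible, the pass only switches nodes back to visible
and never needs to inspect siblings.

{\footnotesize
\begin{verbatim}
void PullUpVisibility(Node *node)
{
    // stop as soon as an ancestor is already marked visible
    while (node != NULL && !node->visible) {
        node->visible = true;
        node = node->parent;
    }
}
\end{verbatim}
}
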
\subsection{Reduction of the number of queries}

Our method reduces the number of visibility queries in two ways:
Firstly, as in other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Secondly, by avoiding queries on opened nodes, we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

\item {\em Temporal coherence.} If visibility remains constant, the set
of termination nodes needs no adaptation. If an occluded node becomes
visible, we recursively process its children (pull-down). If a visible
node becomes occluded, we propagate the change higher in the hierarchy
(pull-up). A pull-down reflects the spatial growth of visible
regions; similarly, a pull-up reflects the spatial growth of occluded
regions.

\end{itemize}

By avoiding queries on the opened nodes, we save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize them. In particular,
assuming that the sum of the screen-space projected areas of the nodes at
each level of the kD-tree is equal and the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.

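To illustrate the last claim: let $F$ denote the fill rate required to
rasterize the bounding boxes of all nodes at one level of the kD-tree,
which by assumption is the same for every level. Querying the whole
subtree would rasterize the opened nodes on levels $0,\dots,d$ plus
the termination nodes on level $d+1$, i.e., $d+2$ levels at a cost of
$(d+2)F$, whereas our method queries only the termination nodes at a
cost of $F$; hence the factor of $d+2$.
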
\subsection{Reduction of CPU stalls and GPU starvation}
\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminate the requirement of
waiting for the query result during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in between, and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture a
possible occlusion of $A$ by $B$, since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

\subsection{Front-to-back scene traversal}

For kD-trees, the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost, we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the viewpoint and
the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
scene in a breadth-first manner, we process the scene in depth
layers. Within each layer, the node processing order is practically
independent, which minimizes the problem of occlusion query
dependence. The breadth-first traversal interleaves
occlusion-independent nodes, which can provide a more accurate
visibility classification if visibility changes quickly. In
particular, it reduces the problem of false classifications due to
missed occlusion by nodes waiting in the query queue (discussed in
Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
breadth-first traversal, we are no longer restricted to the
kD-tree. Instead, we can use an arbitrary spatial data structure
such as a bounding volume hierarchy, octree, grid, hierarchical grid,
etc. Once we compute the distance from a node to the viewpoint, the
node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which the maintenance of a spatial classification of moving objects can be
costly.

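A small sketch of this traversal using the C++ standard library
follows; the {\tt Node} type and the box-to-viewpoint distance
function are assumed:

{\footnotesize
\begin{verbatim}
#include <queue>

struct NodeEntry {
    float distance;   // min. distance of box to viewpoint
    Node *node;
    bool operator<(const NodeEntry &other) const {
        // invert: the closest node gets the highest priority
        return distance > other.distance;
    }
};

std::priority_queue<NodeEntry> queue;
queue.push(NodeEntry{Distance(root->box, viewpoint), root});
while (!queue.empty()) {
    Node *node = queue.top().node;
    queue.pop();
    // ... process node as in the main loop; children of an
    // opened node are pushed with their own distances
}
\end{verbatim}
}
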
\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of the
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible, and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node to match the expected
availability of the query result as precisely as possible.

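With \arboq{}, this non-blocking check corresponds to polling
{\tt GL\_QUERY\_RESULT\_AVAILABLE\_ARB}, e.g.:

{\footnotesize
\begin{verbatim}
GLint available;
glGetQueryObjectivARB(oldestQuery,
                      GL_QUERY_RESULT_AVAILABLE_ARB,
                      &available);
if (available) {
    GLuint visiblePixels;   // fetching no longer stalls
    glGetQueryObjectuivARB(oldestQuery, GL_QUERY_RESULT_ARB,
                           &visiblePixels);
}
\end{verbatim}
}
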
\section{Further Optimizations}
\label{sec:optimizations}

This section discusses several optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on user-specified parameters that should be tuned for a
particular scene and hardware configuration.

\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within this
number of frames, we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, a large number of visible nodes is
being tested, which does not provide any benefit, since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have become occluded since the last occlusion test was
issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more sophisticated scenario, this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.

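A sketch of the test deciding whether a query is needed; the
{\tt assumedVisibleFrames} constant (or a value derived from the query
history) is the user-specified parameter:

{\footnotesize
\begin{verbatim}
bool QueryNeeded(const Node *node, unsigned frameId,
                 unsigned assumedVisibleFrames)
{
    if (!node->visible)
        return true;    // invisible nodes are always tested
    // skip the query while the node is assumed visible;
    // node->lastQueried is updated whenever a query is issued
    return frameId >= node->lastQueried + assumedVisibleFrames;
}
\end{verbatim}
}
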
\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively in
cases where only a small part of the node is visible: we compare the
number of visible pixels returned by the occlusion query with a
user-specified constant and cull the node if this number drops
below the constant.

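The corresponding change to the visibility classification is a
one-line test against the user-specified threshold:

{\footnotesize
\begin{verbatim}
// conservative culling for pixelThreshold == 0,
// approximate culling for pixelThreshold > 0
bool IsVisible(GLuint visiblePixels, GLuint pixelThreshold)
{
    return visiblePixels > pixelThreshold;
}
\end{verbatim}
}
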
\subsection{Complete elimination of CPU stalls}
\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, the algorithm stalls the CPU by waiting for the query result. To
completely eliminate CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
the whole subtree if it is an interior node). The node is marked as
rendered, but the associated occlusion query is kept in the queue so we can
fetch its result later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query result arrives in an optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent by waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.

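A sketch of this speculative strategy; the rendering cost estimate and
the {\tt renderCostLimit} constant are assumed, user-supplied inputs:

{\footnotesize
\begin{verbatim}
if (stack.Empty() && !queryQueue.Empty() &&
    !ResultAvailable(queryQueue.Front())) {
    Node *node = queryQueue.NodeWithLowestRenderCost();
    if (EstimatedRenderCost(node) < renderCostLimit) {
        RenderSubtree(node);     // visibility still undecided
        node->rendered = true;   // keep its query in the queue
    } else {
        WaitForResult(queryQueue.Front());  // stalling is cheaper
    }
}
\end{verbatim}
}
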
\begin{figure}%[htb]
{\footnotesize
  \input{code/pseudocode_batching}
 }
  \caption{Pseudo-code of coherent hierarchical culling using multiple queries.}
  \label{fig:pseudocode_batching}
\end{figure}

\subsection{Initial depth pass}
\label{sec:initial}

\begin{figure}
\centering
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_wrong}
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_delayed}
\caption{(left) All passes are rendered with CHC. Note that the soldiers are
visible through the tree. (right) Only the solid passes are rendered using CHC; the transparent passes are rendered afterwards.}
\label{fig:online_transparency}
\end{figure}

To achieve maximal performance on modern GPUs, one has to take care of a number of issues.
First, it is very important to reduce material switching. Thus, modern rendering engines sort the
objects (or patches) by material in order to eliminate material switching as far as possible.

Next, materials can be very costly; sometimes complicated shaders have to be evaluated several times per batch. Hence, rendering the full material for fragments which eventually turn out to be occluded should be avoided. This can be achieved
by rendering an initial depth pass (i.e., enabling only depth writes to fill the depth
buffer). Afterwards, the geometry is rendered again, this time with full shading. Because the
depth buffer is already established, invisible fragments will be discarded before any shading is calculated.

This approach can be naturally adapted for use with the CHC algorithm. Only the initial depth pass is rendered
in front-to-back order using the CHC algorithm. The initial pass is sufficient to fill the depth buffer and determine the visible geometry. Then only the visible geometry is rendered again, exploiting the full optimization and material sorting capability of the rendering engine.

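In OpenGL, the two passes can be sketched as follows (the two
rendering routines stand for the engine-specific traversals):

{\footnotesize
\begin{verbatim}
// pass 1: fill the z-buffer only, no shading
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
RenderDepthPassWithCHC();    // front-to-back, CHC

// pass 2: full shading, depth buffer already established
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);       // no further depth writes needed
glDepthFunc(GL_LEQUAL);      // let equal depth values pass
RenderVisibleGeometrySortedByMaterial();
\end{verbatim}
}
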
If the materials require several rendering passes, we can use a variant of the depth-pass method: we render
only the first passes using the algorithm (e.g., the solid passes), thereby determining the visibility of the patches, and
render all the other passes afterwards. This approach can be used when there are passes which require a special
kind of sorting to be rendered correctly (e.g., transparent passes, shadow passes). In Figure~\ref{fig:online_transparency},
artifacts occur if the transparent passes are not rendered in the correct order (left image), whereas rendering
them after the hierarchical algorithm produces the correct result (right image). In a similar fashion, we are able to handle shadows (see Figure~\ref{fig:chc_shadows}).

\begin{figure}
\centering
\includegraphics[width=0.4\textwidth,draft=\DRAFTIMAGES]{images/chc_shadows}
\caption{We can correctly handle shadow volumes together with CHC.}
\label{fig:chc_shadows}
\end{figure}

\subsection{Batching multiple queries}
\label{sec:batching}

When occlusion queries are interleaved with the rendering of geometry, there is always a state change involved. To
reduce state changes, it is beneficial not to execute one query at a time, but multiple queries at once~\cite{Staneker:2004:EMO}.
Instead of immediately executing a query for a node when we fetch it from the traversal stack, we add it to a {\em pending queue}. Once {\em n} of these queries are accumulated in the queue, we execute them all at once.
To obtain an optimal value for {\em n}, several heuristics can be applied, e.g., a fraction of the number of queries issued in the last frame. The pseudo-code of the algorithm including the batching is given in Figure~\ref{fig:pseudocode_batching}.

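A sketch of the batching; the surrounding state changes (color and
depth masks) are applied only once per batch by the assumed helpers:

{\footnotesize
\begin{verbatim}
pendingQueue.Enqueue(node);
if (pendingQueue.Size() >= n) {
    SetOcclusionQueryState();     // disable color/depth writes
    while (!pendingQueue.Empty())
        queryQueue.Enqueue(
            IssueOcclusionQuery(pendingQueue.Dequeue()));
    RestoreRenderState();         // single state change per batch
}
\end{verbatim}
}
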
\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4, 1GB of memory, and a
GeForce FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.

Although the teapot scene would intuitively offer good occlusion, it
is a complicated case for occlusion culling to handle. Firstly, the
teapots consist of small triangles, and so only the effect of fused
occlusion due to a large number of visible triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, and so we can see almost
half of the teapots along the longer side of the arrangement.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good occlusion. This scene is
an interesting candidate for testing not only due to its size, but
also due to significant changes in visibility and depth complexity in
its different parts.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

\begin{figure}
\centering \includegraphics[width=0.4\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}

\subsection{Basic tests}

We have measured the frame times for rendering with only view-frustum culling
(VFC), the hierarchical stop-and-wait method (S\&W), and our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

For each test scene, we have constructed a walkthrough, which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement of teapots along its longer
side (frames 25--90). In the city, we elevated the
viewpoint above the roofs and gained sight over most of the city
(frames 1200--1800). The power plant walkthrough contains several
viewpoints from which a large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity where occlusion culling produces a
significant speedup over VFC (e.g., frame 3800).

As we can see for a number of frames in the walkthroughs, the CHC method
can produce a speedup of more than one order of magnitude compared to
VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W, and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|c|}
 \hline\hline
scene                  & method   & \#queries & wait time [ms] & rendered triangles & frame time [ms]  & speedup \\\hline\hline

Teapots                & VFC      & ---        &  ---     & 11,139,928  & 310.42  & 1.0 \\\hhline{~|-|-|-|-|-|-|-|}
11,520,000 triangles   & S\&W     & 4704     & 83.19  & 2,617,801  & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|-|}
21,639 kD-tree nodes   & {\bf CHC} &{\bf 2827} & {\bf 1.31}  & {\bf 2,852,514 } & {\bf 81.18 } & {\bf 4.6 } \\\hhline{~|-|-|-|-|-|-|-|}
                       & Ideal    & ---        &  ---     & 2,617,801  & 72.19  & 5.2 \\
                       \hline\hline
City                   & VFC      & ---        &  ---     & 156,521    & 19.79 & 1.0   \\\hhline{~|-|-|-|-|-|-|-|}
1,036,146 triangles    & S\&W     & 663      &  9.49  & 30,594     & 19.9  & 1.5 \\\hhline{~|-|-|-|-|-|-|-|}
33,195 kD-tree nodes   &{\bf CHC} & {\bf 345} & {\bf 0.18}   & {\bf 31,034}  & {\bf 8.47}  & {\bf 4.0}  \\\hhline{~|-|-|-|-|-|-|-|}
                       & Ideal    & ---        &   ---   & 30,594     & 4.55  & 6.6  \\
                       \hline\hline
Power Plant            & VFC      & ---    &  ---   & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|-|}
12,748,510 triangles   & S\&W    & 485   & 16.16  & 392,962  & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|-|}
18,719 kD-tree nodes   &{\bf CHC} & {\bf 263}  & {\bf 0.70}  & {\bf 397,920} & {\bf 38.73}  & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|-|}
                       & Ideal    &    ---      &  ---   &   392,962    &  36.34  & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values
  are averages over all frames (including the speedup).}
\label{tab:averages}
\end{table*}

Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.

We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0, and
4.7 over view-frustum culling and average speedups of 2.0, 2.6, and
1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning the
accuracy, the increase in the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This increase
was always recovered by the reduction of CPU stalls for the tested
walkthroughs.

\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait times in
Table~\ref{tab:averages}). We did not find constants that could
produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued queries
and the average frame time as functions of the number of frames a node is
assumed visible and the pixel threshold of approximate visibility. We
have observed that the effectiveness of the optimizations depends strongly on the scene. If
the hierarchy is deep and the geometry associated with a leaf node is
not too complex, the conservative visibility testing produces a
significant speedup (city and power plant). For the teapot scene, the
penalty for falsely rendering actually occluded objects became larger
than the savings achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by introducing errors in the image
proportional to the pixel threshold used.

\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
 \hline\hline
scene                    & $t_{av}$ &  $n_{vp}$ & \#queries & frame time [ms]   \\ \hline\hline
\multirow{3}{*}{Teapots} & 0        &    0         & 2827      & 81.18  \\ \hhline{~----}
                         & 2        &    0         & 1769      &  86.31 \\ \hhline{~----}
                         & 2        &    25        & 1468      & 55.90  \\ \hline\hline
\multirow{3}{*}{City}    & 0        &    0         & 345        & 8.47  \\ \hhline{~----}
                         & 2        &    0         & 192        & 6.70 \\ \hhline{~----}
                         & 2        &    25        & 181       & 6.11  \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0    &    0         & 263        & 38.73  \\ \hhline{~----}
                         & 2        &    0         & 126        & 31.17 \\ \hhline{~----}
                         & 2        &    25        & 120       & 36.62  \\ \hline\hline
\end{tabular}

\caption{Influence of the optimizations on the CHC method.
  $t_{av}$ is the number of assumed visibility frames for conservative
  visibility testing, $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}

\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26ms faster per frame on average, our method does not require costly
precomputation and can be used at any general 3D position in the
model, not only in a predefined view space.

\section{Summary}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
in order to minimize the number of queries and their latency.
This is achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation
are almost completely eliminated, at the cost of a slight increase in
the number of rendered objects.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward, as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometrical operations or data structures. The algorithm
is suitable for application to scenes of arbitrary structure, and it
requires no preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
have obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and 1.6--2.6 compared to the hierarchical
stop-and-wait application of occlusion queries.

\begin{figure*}[htb]
  \centerline{
    \hfill
    \includegraphics[height=0.3\textwidth,draft=\DRAFTFIGS]{images/teapots}
    \hfill
    \includegraphics[height=0.3\textwidth,draft=\DRAFTFIGS]{images/city}
    \hfill
  }
  \centerline{
    \includegraphics[width=0.35\textwidth,draft=\DRAFTFIGS]{images/pplant}
  }
  \caption{The test scenes: the teapots, the city, and the power plant.}
  \label{fig:scenes}
\end{figure*}