\chapter{Online Visibility Culling}
%*****************************************************************************
\label{chap:online}

\section{Introduction}

Visibility culling is one of the major acceleration techniques for the
real-time rendering of complex scenes. The ultimate goal of visibility
culling techniques is to prevent invisible objects from being sent to
the rendering pipeline. A standard visibility-culling technique is
{\em view-frustum culling}, which eliminates objects outside of the
current view frustum. View-frustum culling is a fast and simple
technique, but it does not eliminate objects inside the view frustum
that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area is covered more than
once. Overdraw wastes computational effort in both the pixel and the
vertex processing stages of modern graphics hardware. The elimination
of occluded objects is addressed by {\em occlusion culling}. In an
optimized rendering pipeline, occlusion culling complements other
rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for the cells of a fixed subdivision of the scene. At runtime,
we can quickly identify the PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and
(4) an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of applying extra computations at each frame. To make these additional
computations efficient, most online occlusion culling methods rely on
a number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by a few closest depth layers).

\begin{figure}%[htb]
  \centering \footnotesize
  \begin{tabular}{c}
  \includegraphics[width=0.6\textwidth, draft=\DRAFTFIGS]{images/ogre_terrain} \\
  \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_viewfrustum} \hfill \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_chc} \\
  \end{tabular}
  \caption{(top) The rendered terrain scene. (bottom) Visualization of the rendered/culled objects
    using view-frustum culling (left image) vs. occlusion queries (right image).
    The yellow boxes show the actually rendered scene objects. The
    red boxes depict hierarchy nodes culled by view-frustum culling, the blue
    boxes depict hierarchy nodes culled by occlusion queries.}
  \label{fig:online_culling_example}
\end{figure}

Recent graphics hardware natively supports an \emph{occlusion query}
that detects the visibility of an object against the current contents
of the z-buffer. Although the query itself is processed quickly using
the raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated into
existing real-time rendering packages on architectures supporting the
underlying occlusion query~\cite{wimmer05:gpugems}. In
Figure~\ref{fig:online_culling_example}, the same scene (top row) is
rendered using view-frustum culling (visualization in the bottom left
image) versus online culling using occlusion queries (visualization in
the bottom right image). It can be seen that with view-frustum culling
alone, many occluded objects are still rendered.

\section{Related Work}

With the demand for rendering scenes of ever-increasing size, a number
of visibility culling methods have been developed over the last
decade. A comprehensive survey of visibility culling methods was
presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another recent
survey by Bittner and Wonka~\cite{bittner03:jep} discusses visibility
culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In the discussion below, we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest
one in the window corresponding to the adjacent finer level. The
hierarchical z-buffer makes it possible to quickly determine whether
the geometry in question is occluded. To a certain extent this idea is
used in the current generation of graphics hardware in the form of
early z-tests of fragments in the graphics pipeline (e.g., the Hyper-Z
technology of ATI or the Z-cull of NVIDIA). However, the geometry
still needs to be sent to the GPU, transformed, and coarsely
rasterized even if it is later determined invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead make
use of hardware texturing. The hierarchical occlusion map is computed
on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to the online computation of
from-region visibility executed on a server~\cite{wonka:01:eg}.

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. The first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3
graphics card, and it is now also available as an official ARB
extension.

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm which
employs the \nvoq. They subdivide the scene using a uniform grid. The
cubes are then traversed in slabs roughly perpendicular to the
viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions to
balance the accuracy of visibility classification against the
associated computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus suffers
from CPU stalls and GPU starvation.

On a theoretical level, our method is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above-mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. In contrast,
Bittner et al.~\cite{bittner01:jvca} presented a general acceleration
technique for exploiting spatial and temporal coherence in
hierarchical visibility algorithms. The central idea, which is also
vital for online occlusion culling, is to avoid repeated visibility
tests of interior nodes of the hierarchy. The problem with a direct
adoption of this method is that it was designed for use with
instantaneous CPU-based occlusion queries, whereas hardware occlusion
queries exhibit significant latency. The method presented herein
efficiently overcomes the latency problem while keeping the generality
and simplicity of the original hierarchical technique. As a result, we
obtain a simple and efficient occlusion culling algorithm utilizing
hardware occlusion queries.

\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

Hardware occlusion queries follow a simple pattern: To test the
visibility of an occludee, we send its bounding volume to the GPU. The
volume is rasterized and its fragments are compared to the current
contents of the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and need not be rendered.
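
In terms of the OpenGL~1.5 / \arboq{} interface, this pattern can be
sketched in C++ as follows (assuming headers exposing the OpenGL~1.5
query entry points; the NV and HP variants are analogous). The
bounding-box rasterization routine is a hypothetical placeholder, and
the immediate read-back of the result at the end is precisely the
source of the stalls discussed below:

{\footnotesize
\begin{verbatim}
#include <GL/gl.h>

// Query the number of fragments of a bounding volume that pass the
// depth test. renderBoundingBox() is a placeholder for rasterizing
// the volume of the occludee.
GLuint queryBoundingVolume(void (*renderBoundingBox)())
{
    GLuint query, sampleCount;
    glGenQueries(1, &query);

    // The query only reads the z-buffer; disable all buffer writes.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);

    glBeginQuery(GL_SAMPLES_PASSED, query);
    renderBoundingBox();             // rasterize the bounding volume
    glEndQuery(GL_SAMPLES_PASSED);

    // Re-enable buffer writes for normal rendering.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    // Caution: this immediate read-back stalls the CPU.
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &sampleCount);
    glDeleteQueries(1, &query);
    return sampleCount;              // 0 => occludee is invisible
}
\end{verbatim}
}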

\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free, since we use the intermediate result of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}

\subsection{Problems of hardware occlusion queries}

Currently, there are two main hardware-supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be issued
before asking for their results, while only one HP test is allowed at
a time, which severely limits its possible algorithmic usage.
Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility classification.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of its result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus, the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
  \centerline{
    \includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
    }
  \caption{(top) Illustration of CPU stalls and GPU starvation.
    Qn, Rn, and Cn denote querying, rendering, and culling of object
    n, respectively. Note that object 5 is found invisible by Q5 and
    thus not rendered. (bottom) More efficient query scheduling. The
    scheduling assumes that objects 4 and 6 will be visible in the
    current frame and renders them without waiting for the results of
    the corresponding queries.}
  \label{fig:latency}
\end{figure}

\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and
thus to exploit spatial coherence of visibility, which provides a key
to achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proven efficient for point location, ray tracing,
and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well,
though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for occlusion
by issuing the occlusion query and waiting for its result. If the node
is found visible, we continue by recursively testing its children in a
front-to-back order. If the node is a leaf, we render its associated
objects.
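
A minimal C++ sketch of this traversal is given below. The node type
and the engine hooks declared at the top are assumptions of the
sketch, not a fixed interface; in particular, fetchQueryResult() is a
blocking read-back of the occlusion query result:

{\footnotesize
\begin{verbatim}
#include <GL/gl.h>
#include <vector>

struct KdNode {
    std::vector<KdNode*> children;           // empty for leaf nodes
    bool isLeaf() const { return children.empty(); }
};

// Hypothetical engine hooks:
bool   insideViewFrustum(KdNode*);      // view-frustum test
void   issueOcclusionQuery(KdNode*);    // rasterize the bounding box
GLuint fetchQueryResult(KdNode*);       // BLOCKS until available
void   renderGeometry(KdNode*);         // render objects of a leaf
void   sortFrontToBack(std::vector<KdNode*>&);  // w.r.t. viewpoint

// Hierarchical stop-and-wait method: the traversal cannot continue
// past a node before the result of its occlusion query arrives.
void stopAndWait(KdNode* node)
{
    if (!insideViewFrustum(node))
        return;
    issueOcclusionQuery(node);
    if (fetchQueryResult(node) == 0)    // CPU stall happens here
        return;                         // occluded: cull the subtree
    if (node->isLeaf()) {
        renderGeometry(node);
        return;
    }
    sortFrontToBack(node->children);
    for (KdNode* child : node->children)
        stopAndWait(child);
}
\end{verbatim}
}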

The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.

\section{Coherent Hierarchical Culling}
\label{sec:hoq}

In this section, we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

\subsection{Algorithm Overview}
\label{sec:overview}

Our method is based on exploiting the temporal coherence of visibility
classification. In particular, it is centered on the following three
ideas:

\begin{itemize}
 \item We initiate occlusion queries on nodes of the hierarchy where
   the traversal terminated in the last frame. Thus we avoid queries
   on all previously visible interior nodes~\cite{bittner01:jvca}.
 \item We assume that a previously visible leaf node remains visible
   and render the associated geometry without waiting for the result
   of the corresponding occlusion query.
 \item Issued occlusion queries are stored in a query queue until they
   are known to have been carried out by the GPU. This allows
   interleaving the queries with the rendering of visible geometry.
\end{itemize}

The algorithm performs a traversal of the hierarchy that is terminated
either at leaf nodes or at nodes that are classified as invisible. Let
us call such nodes the {\em termination nodes}, and interior nodes
that have been classified visible the {\em opened nodes}. We denote
the sets of termination and opened nodes in the $i$-th frame by
$\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the $i$-th
frame, we traverse the kD-tree in a front-to-back order, skip all
nodes of $\mathcal{O}_{i-1}$, and apply occlusion queries first on the
termination nodes $\mathcal{T}_{i-1}$. When reaching a termination
node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
  the occlusion query and store it in the query queue. Then we
  immediately render the associated geometry without waiting for the
  result of the query.
\item For a previously invisible node, we issue the query and store it
  in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
  \input{code/pseudocode2}
 }
  \caption{Pseudo-code of coherent hierarchical culling.}
  \label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.

In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.
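
A condensed C++ sketch of this traversal loop follows. It mirrors the
description above rather than the exact pseudo-code of
Figure~\ref{fig:pseudocode}; the node fields and the engine hooks are
assumptions of the sketch:

{\footnotesize
\begin{verbatim}
#include <queue>
#include <stack>
#include <vector>

struct KdNode {
    bool visible = true;            // visibility classification
    int  lastVisited = -1;          // id of the frame of the last visit
    bool renderedThisFrame = false; // guards against double rendering
    KdNode* parent = nullptr;
    std::vector<KdNode*> children;  // empty for leaf nodes
    bool isLeaf() const { return children.empty(); }
};

// Hypothetical engine hooks (compare the stop-and-wait sketch):
bool insideViewFrustum(KdNode*);
void issueOcclusionQuery(KdNode*);
bool resultAvailable(KdNode*);       // non-blocking check
unsigned fetchQueryResult(KdNode*);  // blocks if not yet available
void renderGeometry(KdNode*);
// Pushes children so that the closest one is popped first:
void pushChildrenFrontToBack(KdNode*, std::stack<KdNode*>&);

// Pull-up: an invisible node becoming visible makes all its
// ancestors visible as well.
void pullUpVisibility(KdNode* n)
{
    while (n && !n->visible) { n->visible = true; n = n->parent; }
}

// Renders a leaf or pushes its children in front-to-back order.
void traverseNode(KdNode* n, std::stack<KdNode*>& stack)
{
    if (n->isLeaf()) { renderGeometry(n); n->renderedThisFrame = true; }
    else             pushChildrenFrontToBack(n, stack);
}

void chcRenderFrame(KdNode* root, int frameId)
{
    std::stack<KdNode*> stack;    // front-to-back traversal
    std::queue<KdNode*> queries;  // issued, not yet fetched queries
    stack.push(root);

    while (!stack.empty() || !queries.empty()) {
        // Fetch results; block only if there is nothing to traverse.
        while (!queries.empty() &&
               (resultAvailable(queries.front()) || stack.empty())) {
            KdNode* n = queries.front(); queries.pop();
            if (fetchQueryResult(n) > 0) {
                pullUpVisibility(n);
                if (!n->renderedThisFrame)  // pull-down of the cut
                    traverseNode(n, stack);
            }
        }
        if (stack.empty()) continue;

        KdNode* n = stack.top(); stack.pop();
        if (!insideViewFrustum(n)) continue;

        bool wasVisible = n->visible && n->lastVisited == frameId - 1;
        n->visible = false;              // re-established by pull-up
        n->lastVisited = frameId;
        n->renderedThisFrame = false;

        if (n->isLeaf() || !wasVisible) { // termination node: query it
            issueOcclusionQuery(n);
            queries.push(n);
        }
        if (wasVisible)                   // opened node or visible leaf:
            traverseNode(n, stack);       // proceed without the result
    }
}
\end{verbatim}
}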

\begin{figure*}[htb]
  \centerline{
    \includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
  \caption{(left) Visibility classification of a node of the kD-tree and
    the termination nodes. (right) Visibility classification after the
    application of the occlusion test and the new set of termination nodes.
    Nodes on which occlusion queries were applied are depicted with a
    solid outline. Note the pull-up and pull-down due to visibility changes.}
  \label{fig:cut}
\end{figure*}

The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by storing
with each node its visibility classification and the id of the last
frame in which it was visited. A node is an opened node if it is a
visible interior node that was visited in the last frame (line 23 in
the pseudo-code). Note that in the actual implementation of the pull
up, we can set all visited nodes to invisible by default and then pull
up any changes from invisible to visible (lines 25 and 12 in
Figure~\ref{fig:pseudocode}). This modification eliminates checking
children for invisibility during the pull up.

\subsection{Reduction of the number of queries}

Our method reduces the number of visibility queries in two ways:
First, as in other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (the opened nodes plus the termination
nodes). Second, by avoiding queries on opened nodes, we eliminate part
of the overhead of identifying this subtree. These reductions reflect
the following coherence properties of scene visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects the spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

\item {\em Temporal coherence.} If visibility remains constant, the
set of termination nodes needs no adaptation. If an occluded node
becomes visible, we recursively process its children (pull-down). If a
visible node becomes occluded, we propagate the change higher in the
hierarchy (pull-up). A pull-down reflects a spatial growing of visible
regions. Similarly, a pull-up reflects a spatial growing of occluded
regions.

\end{itemize}

By avoiding queries on the opened nodes, we can save $1/k$ of the
queries for a hierarchy with branching factor $k$ (assuming visibility
remains constant). Thus for the kD-tree, up to half of the queries can
be saved. The actual savings in the total query time are even larger:
the higher we are in the hierarchy, the larger the boxes we would have
to check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize them. In particular, assume that
the sum of the screen-space projected areas of the nodes at each level
of the kD-tree is equal and that the opened nodes form a complete
binary subtree of depth $d$. The stop-and-wait method then rasterizes
query boxes on all $d+1$ levels of opened nodes as well as on the
level of termination nodes, whereas our method queries only the
termination nodes; hence the fill rate is reduced $(d+2)$ times.

\subsection{Reduction of CPU stalls and GPU starvation}
\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible
geometry. The immediate rendering of previously visible termination
nodes and the subsequent issuing of occlusion queries eliminate the
need to wait for query results during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in between, and we thus completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture the
possible occlusion of $A$ by $B$, since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.
\subsection{Front-to-back scene traversal}

For kD-trees, the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost, we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the
viewpoint and the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.
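
Such an order falls out of a standard priority queue, as the following
C++ sketch shows; the node type and the distance computation are
assumptions of the sketch:

{\footnotesize
\begin{verbatim}
#include <queue>
#include <vector>

struct KdNode;  // any spatial data structure node with a bounding box

// Hypothetical hook: distance from the viewpoint to the node's
// bounding box (0 if the viewpoint lies inside the box).
float distanceToViewpoint(const KdNode* node);

struct FartherFromViewpoint {
    bool operator()(const KdNode* a, const KdNode* b) const {
        // std::priority_queue pops the largest element, so the
        // comparison is inverted to pop the closest node first.
        return distanceToViewpoint(a) > distanceToViewpoint(b);
    }
};

using FrontToBackQueue =
    std::priority_queue<KdNode*, std::vector<KdNode*>,
                        FartherFromViewpoint>;

// Usage: push the root, then pop() always yields the unprocessed
// node closest to the viewpoint, whatever the hierarchy type.
\end{verbatim}
}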

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
scene in a breadth-first manner, we process the scene in depth
layers. Within each layer, the node processing order is practically
independent, which minimizes the problem of occlusion query
dependence. The breadth-first traversal interleaves
occlusion-independent nodes, which can provide a more accurate
visibility classification if visibility changes quickly. In
particular, it reduces the problem of false classifications due to
missed occlusion by nodes waiting in the query queue (discussed in
Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
breadth-first traversal, we are no longer restricted to the
kD-tree. Instead, we can use an arbitrary spatial data structure such
as a bounding volume hierarchy, octree, grid, hierarchical grid,
etc. Once we compute the distance from a node to the viewpoint, the
node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which the maintenance of a spatial classification of moving objects
can be costly.

\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of the
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our practical
experiments have shown that the cost of this check is negligible and
thus it can be used frequently without any performance penalty. If the
cost of this check were significantly higher, we could delay asking
for the query result by a time established by empirical measurements
for the particular hardware. This delay should also reflect the size
of the queried node in order to match the expected availability of the
query result as precisely as possible.
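
With the OpenGL~1.5 query interface, this non-blocking check can be
sketched as follows:

{\footnotesize
\begin{verbatim}
#include <GL/gl.h>

// Non-blocking check whether the result of an occlusion query is
// ready. GL_QUERY_RESULT_AVAILABLE returns immediately, in contrast
// to GL_QUERY_RESULT, which would stall until the GPU catches up.
bool resultAvailable(GLuint query)
{
    GLint available = 0;
    glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
    return available != 0;
}
\end{verbatim}
}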

\section{Further Optimizations}
\label{sec:optimizations}

This section discusses a couple of optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on user-specified parameters that should be tuned for a
particular scene and hardware configuration.

\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within the
given number of frames, we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, a large number of visible nodes is tested,
which does not provide any benefit, since most of them remain
visible. On the other hand, we do not immediately capture all changes
from visibility to invisibility, and thus we may render objects that
have already become invisible since the moment the last occlusion test
was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more sophisticated scenario, this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.
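
A sketch of the corresponding test, assuming each node stores the id
of the frame in which it last became visible (both fields below are
hypothetical):

{\footnotesize
\begin{verbatim}
// User-tuned constant: number of frames for which a node, once
// found visible, is assumed to remain visible.
const int kAssumedVisibleFrames = 2;

// Hypothetical per-node state for this sketch.
struct NodeState { bool visible; int lastVisibleFrame; };

// True if the occlusion query of a visible node can be skipped.
bool canSkipQuery(const NodeState& s, int frameId)
{
    return s.visible &&
           frameId - s.lastVisibleFrame < kAssumedVisibleFrames;
}
\end{verbatim}
}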

\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We can
easily modify the algorithm to cull nodes more aggressively in cases
when only a small part of the node is visible. We compare the number
of visible pixels returned by the occlusion query with a
user-specified constant and cull the node if this number drops below
the constant.
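
In terms of the query result, this is a one-line change (sketch; the
threshold is the user-specified constant $n_{vp}$ of
Table~\ref{tab:assumed}):

{\footnotesize
\begin{verbatim}
// User-specified pixel threshold; 0 gives the conservative test.
const unsigned kPixelThreshold = 25;

// Approximate visibility: cull the node unless enough fragments
// of its bounding volume passed the depth test.
bool nodeVisible(unsigned visiblePixels)
{
    return visiblePixels > kPixelThreshold;
}
\end{verbatim}
}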

\subsection{Complete elimination of CPU stalls}
\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, the algorithm stalls the CPU by waiting for the query
result. To completely eliminate the CPU stalls, we can speculatively
render some nodes with undecided visibility. In particular, we select
a node from the query queue and render the geometry associated with
the node (or the whole subtree if it is an interior node). The node is
marked as rendered, but the associated occlusion query is kept in the
queue so that its result can be fetched later. If we are unlucky and
the node remains invisible, the effort of rendering the node's
geometry is wasted. On the other hand, if the node has become visible,
we have used the time slot before the next query result arrives in an
optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.
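
One way of picking the candidate node, assuming each node carries an
estimated rendering cost (e.g., its triangle count) and the pending
queries are kept in an iterable container:

{\footnotesize
\begin{verbatim}
#include <deque>

// User-specified constant: above this cost, speculative rendering
// is considered too risky and we rather wait for the query result.
const float kMaxSpeculativeCost = 5000.0f;

// Hypothetical per-node state for this sketch.
struct Node { float estimatedCost; bool renderedThisFrame; };

// Pick the cheapest pending node, or nullptr if waiting is safer.
Node* pickSpeculativeNode(std::deque<Node*>& pendingQueries)
{
    Node* best = nullptr;
    for (Node* n : pendingQueries)
        if (!n->renderedThisFrame &&
            (!best || n->estimatedCost < best->estimatedCost))
            best = n;
    return (best && best->estimatedCost <= kMaxSpeculativeCost)
               ? best : nullptr;
}
\end{verbatim}
}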

\begin{figure}%[htb]
{\footnotesize
  \input{code/pseudocode_batching}
 }
  \caption{Pseudo-code of coherent hierarchical culling using multiple queries.}
  \label{fig:pseudocode_batching}
\end{figure}

\subsection{Initial depth pass}
\label{sec:initial}

\begin{figure}
\centering
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_wrong}
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_delayed}
\caption{(left) All passes are rendered with CHC; note that the soldiers are
incorrectly visible through the tree. (right) Only the solid passes are rendered
using CHC; the transparent passes are rendered afterwards.}
\label{fig:online_transparency}
\end{figure}

To achieve maximal performance on modern GPUs, one has to take care of
a number of issues. First, it is very important to reduce material
switching. Modern rendering engines therefore sort the objects (or
patches) by material in order to eliminate material switching as much
as possible.

Next, materials can be very costly; sometimes complicated shaders have
to be evaluated several times per batch. Hence, rendering the full
material for fragments which eventually turn out to be occluded should
be avoided. This can be achieved by rendering an initial depth pass
(i.e., enabling only depth writes to fill the depth
buffer). Afterwards the geometry is rendered again, this time with
full shading. Because the depth buffer is already established,
invisible fragments are discarded before any shading is calculated.

This approach can be naturally adapted for use with the CHC
algorithm. Only the initial depth pass is rendered in front-to-back
order using the CHC algorithm. This pass is sufficient to fill the
depth buffer and to determine the visible geometry. Then only the
visible geometry is rendered again, exploiting the full optimization
and material sorting capability of the rendering engine.
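
In OpenGL, the two passes can be set up as follows (sketch;
renderWithCHC() and renderVisibleGeometry() are placeholders for the
engine's CHC traversal and its material-sorted rendering):

{\footnotesize
\begin{verbatim}
#include <GL/gl.h>

// Hypothetical engine hooks:
void renderWithCHC();           // CHC front-to-back depth pass
void renderVisibleGeometry();   // material-sorted shading pass

void renderFrame()
{
    // Pass 1: depth only. Fill the z-buffer with the CHC traversal;
    // color writes are disabled, so no expensive shading is done.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    renderWithCHC();

    // Pass 2: full shading of the visible geometry, sorted by
    // material. GL_LEQUAL lets fragments matching the stored depth
    // pass; the depth buffer itself is already complete.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    renderVisibleGeometry();
}
\end{verbatim}
}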

If the materials require several rendering passes, we can use a
variant of the depth-pass method. We render only the first passes
(e.g., the solid passes) using the algorithm, thereby determining the
visibility of the patches, and render all the other passes
afterwards. This approach can be used when there are passes which
require a special kind of sorting to be rendered correctly (e.g.,
transparent passes, shadow passes). In
Figure~\ref{fig:online_transparency}, artifacts occur in the left
image because the transparent passes are not rendered in the correct
order; rendering them after applying the hierarchical algorithm (right
image) produces the correct result. In a similar fashion, we are able
to handle shadows (see Figure~\ref{fig:chc_shadows}).

\begin{figure}
\centering
\includegraphics[width=0.4\textwidth,draft=\DRAFTIMAGES]{images/chc_shadows}
\caption{We can correctly handle shadow volumes together with CHC.}
\label{fig:chc_shadows}
\end{figure}

\subsection{Batching multiple queries}
\label{sec:batching}

When occlusion queries are interleaved with the rendering of geometry,
there is always a state change involved. To reduce state changes, it
is beneficial not to execute one query at a time, but multiple queries
at once~\cite{Staneker:2004:EMO}. Instead of immediately executing a
query for a node when we fetch it from the traversal stack, we add it
to a {\em pending queue}. Once $n$ of these queries are accumulated in
the queue, we execute them at once. To obtain an optimal value for
$n$, several heuristics can be applied, e.g., a fraction of the number
of queries issued in the last frame. The pseudo-code of the algorithm
including the batching is given in
Figure~\ref{fig:pseudocode_batching}.
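
A sketch of the pending queue follows; beginQueryState() and
endQueryState() stand for the engine's state changes around a batch of
queries, and the choice of $n$ is a hypothetical heuristic:

{\footnotesize
\begin{verbatim}
#include <cstddef>
#include <vector>

struct Node;

// Hypothetical engine hooks:
void beginQueryState();            // disable color/depth writes etc.
void endQueryState();              // restore the rendering state
void issueOcclusionQuery(Node*);   // rasterize the bounding box

std::vector<Node*> pendingQueue;   // queries waiting to be batched

// Accumulate queries; once n are pending, switch state once and
// issue them all together.
void enqueueQuery(Node* node, std::size_t n)
{
    pendingQueue.push_back(node);
    if (pendingQueue.size() >= n) {
        beginQueryState();         // single state change per batch
        for (Node* q : pendingQueue)
            issueOcclusionQuery(q);
        endQueryState();
        pendingQueue.clear();
    }
}

// n can be chosen, e.g., as a fraction of the number of queries
// issued in the last frame:
//   std::size_t n = std::max<std::size_t>(1, queriesLastFrame / 10);
\end{verbatim}
}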

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4, 1GB of memory, and a GeForce
FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons), an urban environment (1M
polygons), and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.

Although the teapot scene would intuitively offer good occlusion, it
is a complicated case for occlusion culling to handle. Firstly, the
teapots consist of small triangles, and so only the effect of fused
occlusion due to a large number of visible triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, and so we can see almost half of the
teapots along the longer side of the arrangement.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good occlusion. This scene is
an interesting candidate for testing not only due to its size, but
also due to the significant changes in visibility and depth complexity
in its different parts.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

\begin{figure}
\centering \includegraphics[width=0.4\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}

\subsection{Basic tests}

We have measured the frame times for rendering with only view-frustum
culling (VFC), for the hierarchical stop-and-wait method (S\&W), and
for our new coherent hierarchical culling method (CHC). Additionally,
we have evaluated the time for an ``ideal'' algorithm. The ideal
algorithm renders the visible objects found by the S\&W algorithm
without performing any visibility tests. This is an optimal solution
with respect to the given hierarchy, i.e., no occlusion culling
algorithm operating on the same hierarchy can be faster. For the basic
tests we did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement along its longer side (frames
25--90). In the city, we elevated the viewpoint above the roofs and
gained sight over most of the city (frames 1200--1800). The power
plant walkthrough contains several viewpoints from which a large part
of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low
depth complexity (holes in Figure~\ref{fig:plant_walkthrough} where
all algorithms are fast), and viewpoints inside the power plant with
high depth complexity, where occlusion culling produces a significant
speedup over VFC (e.g., frame 3800).

As we can see, for a number of frames in the walkthroughs the CHC
method produces a speedup of more than one order of magnitude compared
to VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W, and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios, the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline\hline
scene                  & method   & \#queries & wait time [ms] & rendered triangles & frame time [ms]  & speedup \\\hline\hline
Teapots                & VFC      & ---        &  ---     & 11,139,928  & 310.42  & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles   & S\&W     & 4704     & 83.19  & 2,617,801  & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes   & {\bf CHC} &{\bf 2827} & {\bf 1.31}  & {\bf 2,852,514} & {\bf 81.18} & {\bf 4.6} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &  ---     & 2,617,801  & 72.19  & 5.2 \\
                       \hline\hline
City                   & VFC      & ---        &  ---     & 156,521    & 19.79 & 1.0   \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles    & S\&W     & 663      &  9.49  & 30,594     & 19.9  & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes   &{\bf CHC} & {\bf 345} & {\bf 0.18}   & {\bf 31,034}  & {\bf 8.47}  & {\bf 4.0}  \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &   ---   & 30,594     & 4.55  & 6.6  \\
                       \hline\hline
Power Plant            & VFC      & ---    &  ---   & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles   & S\&W    & 485   & 16.16  & 392,962  & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes   &{\bf CHC} & {\bf 263}  & {\bf 0.70}  & {\bf 397,920} & {\bf 38.73}  & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    &    ---      &  ---   &   392,962    &  36.34  & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values
  are averages over all frames (including the speedup).}
\label{tab:averages}
\end{table*}

Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.

We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0,
and 4.7 over view-frustum culling, and average speedups of 2.0, 2.6,
and 1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning the
accuracy, the increase in the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This
increase was always recovered by the reduction of CPU stalls for the
tested walkthroughs.
\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait
time in Table~\ref{tab:averages}). We did not find constants that
could produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued
queries and the average frame time as functions of the number of
frames a node is assumed visible and of the pixel threshold of
approximate visibility. We have observed that the effectiveness of the
optimizations depends strongly on the scene. If the hierarchy is deep
and the geometry associated with a leaf node is not too complex, the
conservative visibility testing produces a significant speedup (city
and power plant). For the teapot scene, the penalty for falsely
rendering actually occluded objects became larger than the savings
achieved by the reduction of the number of queries. On the other hand,
since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by errors in the image
proportional to the pixel threshold used.
\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
 \hline\hline
scene                        & $t_{av}$ & $n_{vp}$ & \#queries & frame time [ms] \\ \hline\hline
\multirow{3}{*}{Teapots}     & 0        &  0       & 2827      & 81.18  \\ \hhline{~----}
                             & 2        &  0       & 1769      & 86.31  \\ \hhline{~----}
                             & 2        &  25      & 1468      & 55.90  \\ \hline\hline
\multirow{3}{*}{City}        & 0        &  0       & 345       & 8.47   \\ \hhline{~----}
                             & 2        &  0       & 192       & 6.70   \\ \hhline{~----}
                             & 2        &  25      & 181       & 6.11   \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0        &  0       & 263       & 38.73  \\ \hhline{~----}
                             & 2        &  0       & 126       & 31.17  \\ \hhline{~----}
                             & 2        &  25      & 120       & 36.62  \\ \hline\hline
\end{tabular}

\caption{Influence of the optimizations on the CHC method.
  $t_{av}$ is the number of frames a node is assumed visible for
  conservative visibility testing, $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}

\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26ms faster per frame on average, our method does not require a
costly precomputation and can be used at any general 3D position in
the model, not only in a predefined view space.

\section{Summary}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
in order to minimize the number of queries and their latency. This is
achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation are
almost completely eliminated at the cost of a slight increase in the
number of rendered objects.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward, as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometrical operations or data structures. The algorithm
is suitable for application to scenes of arbitrary structure, and it
requires no preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We have
obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and of 1.6--2.6 compared to the hierarchical stop-and-wait
application of occlusion queries.
\begin{figure*}[htb]
  \centerline{
    \hfill
    \includegraphics[height=0.3\textwidth,draft=\DRAFTFIGS]{images/teapots}
    \hfill
    \includegraphics[height=0.3\textwidth,draft=\DRAFTFIGS]{images/city}
    \hfill
  }
  \centerline{
    \includegraphics[width=0.35\textwidth,draft=\DRAFTFIGS]{images/pplant}
  }

  \caption{The test scenes: the teapots, the city, and the power plant.}
  \label{fig:scenes}
\end{figure*}