\chapter{Online Culling Tool}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. Overdraw wastes computational effort in both the pixel and the
vertex processing stages of modern graphics hardware.
%that is, we render each pixel in the image several times. For many complex
%scenes with high occlusion, the effort wasted on overdraw cannot simply be
%eliminated by increasing the raw computational power of the hardware.
%This problem becomes even more apparent on recent graphics hardware with
%programmable vertex and pixel shaders: all complex per-vertex or per-pixel
%computations on an object are completely wasted if the object is occluded.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can either be applied offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for cells of a fixed subdivision of the scene. At runtime, we
can quickly identify a PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of applying extra computations at each frame. To make these additional
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by a few closest depth layers).

\begin{figure}%[htb]
  %\centerline
        \centering \footnotesize
        \begin{tabular}{c}
        \includegraphics[width=0.6\textwidth, draft=\DRAFTFIGS]{images/ogre_terrain} \\
        %\hline
        \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_viewfrustum} \hfill \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_chc} \\
        \end{tabular}
%\label{tab:online_culling_example}
  \caption{(top) The rendered terrain scene. (bottom) Visualization of the rendered / culled objects
    using view-frustum culling (left image) vs. occlusion queries (right image).
    The yellow boxes show the actually rendered scene objects. The
        red boxes depict the view-frustum culled hierarchy nodes, the blue boxes depict the
        occlusion query culled hierarchy nodes.}
  \label{fig:online_culling_example}
\end{figure}

Recent graphics hardware natively supports an \emph{occlusion query}
to detect the visibility of an object against the current contents of the
z-buffer. Although the query itself is processed quickly using the
raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated in
existing real-time rendering packages on architectures supporting the
underlying occlusion query.
In Figure~\ref{fig:online_culling_example}, the same scene (top row) is rendered using view-frustum
culling (visualization in the bottom left image) versus online culling using occlusion queries (visualization
in the bottom right image). It can be seen that with view-frustum culling alone, many occluded objects are still rendered.
%Using spatial and assuming temporal coherence

\section{Related Work}

With the demand for rendering scenes of ever increasing size, there
have been a number of visibility culling methods developed in the
last decade. A comprehensive survey of visibility culling methods
was presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another
recent survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In our further discussion we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer allows us to quickly determine whether the geometry in question is
occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of
fragments in the graphics pipeline (e.g., Hyper-Z technology of ATI or
Z-cull of NVIDIA). However, the geometry still needs to be sent to the
GPU, transformed, and coarsely rasterized even if it is later
determined invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

%% In
%% parallel to rendering, the visibility server calculates visibility for
%% the neighborhood of the given viewpoint and sends it back to the
%% display host.

%% A similar method was designed by Aila and
%% Miettinen~\cite{aila:00:msc}. The incremental occlusion maps are
%% created on the CPU, which takes CPU computation time but does
%% not suffer from the problem of slow read-back from the GPU.

%Klosowski:2001:ECV

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. A first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card and is now also available as an official ARB extension.

%Both
%extensions have been used in several occlusion culling
%algorithms~\cite{Meissner01,Hillesland02,Staneker:2004:OCO}.
%\cite{Meissner01}

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. Then the cubes are traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions to
balance the accuracy of visibility classification and the associated
computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus
suffers from CPU stalls and GPU starvation.

%% The methods proposed in this chapter can be used to make use of temporal
%% and spatial coherence in the scope of existing visibility algorithms
%% that utilise a spatial hierarchy. Examples of these are algorithms
%% based on hierarchical occlusion maps~\cite{Zhang97}, coverage
%% masks~\cite{Greene:1996:HPT}, shadow frusta~\cite{Hudson97}, and
%% occlusion trees~\cite{bittner98b_long}.

On a theoretical level, our work is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. On the
contrary, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is also
vital for the online culling tool, is to avoid repeated visibility tests of
interior nodes of the hierarchy. The problem with a direct adoption of
this method is that it is designed for use with instantaneous CPU-based
occlusion queries, whereas hardware occlusion queries exhibit
significant latency. The method presented herein efficiently overcomes
the problem of latency while keeping the benefits of the generality and
simplicity of the original hierarchical technique. As a result we
obtain a simple and efficient occlusion culling algorithm utilizing
hardware occlusion queries.


%In this chapter we show how to overcome this problem while
%keeping the benefits of exploiting coherence.

The rest of the chapter is organized as follows:
Section~\ref{sec:hwqueries} discusses hardware supported occlusion
queries and a basic application of these queries using a kD-tree.
Section~\ref{sec:hoq} presents our new algorithm and
Section~\ref{sec:optimizations} describes several additional
optimizations. Section~\ref{sec:results} presents results obtained by
experimental evaluation of the method and discusses its
behavior. Finally, Section~\ref{sec:conclusion} concludes the chapter.




\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

Hardware occlusion queries follow a simple pattern: To test the visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and need not be rendered.
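
This pattern maps directly to the OpenGL occlusion query interface. The
following minimal sketch uses the ARB extension; the helper
\texttt{RenderBoundingBox()} is an assumed application-provided routine
that rasterizes the occludee's bounding volume:

{\footnotesize
\begin{verbatim}
// Issue an occlusion query for the bounding volume of a node (sketch).
GLuint query;
glGenQueriesARB(1, &query);

// The bounding volume itself must not alter the frame or depth buffer.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
RenderBoundingBox(node);   // rasterized and tested against the z-buffer
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Fetching the result immediately may stall the CPU (see below).
GLuint visiblePixels = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
if (visiblePixels > 0)
  RenderGeometry(node);    // the occludee is at least partly visible
\end{verbatim}
}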

\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free since we use the intermediate result of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}


\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

%% Issuing multiple independent NV queries is crucial for
%% the design of algorithms that strive to exploit the power of the GPU.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
  \centerline{
    \includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
    }

  \caption{(top) Illustration of CPU stalls and GPU starvation.
    Qn, Rn, and Cn denote querying, rendering, and culling of object
    n, respectively. Note that object 5 is found invisible by Q5 and
    thus not rendered. (bottom) More efficient query scheduling. The
    scheduling assumes that objects 4 and 6 will be visible in the
    current frame and renders them without waiting for the result of
    the corresponding queries.}
  \label{fig:latency}
\end{figure}


\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and thus
to exploit spatial coherence of visibility, and it provides a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proven to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.

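In code, the stop-and-wait traversal might look as follows. This is a
sketch only: \texttt{IssueOcclusionQuery()} wraps the query issue shown
earlier, and \texttt{GetQueryResult()} blocks until the number of visible
pixels is available:

{\footnotesize
\begin{verbatim}
// Hierarchical stop-and-wait occlusion culling (sketch).
void TraverseStopAndWait(KdNode *node)
{
  if (!InsideViewFrustum(node))
    return;                          // view-frustum culled

  IssueOcclusionQuery(node);         // query the node's bounding box
  if (GetQueryResult(node) == 0)     // blocks: the CPU stalls here
    return;                          // node is occluded

  if (node->IsLeaf())
    RenderGeometry(node);
  else
    for (KdNode *child : node->ChildrenFrontToBack(viewpoint))
      TraverseStopAndWait(child);
}
\end{verbatim}
}
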
The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.


\section{Coherent Hierarchical Culling}

\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

%and present a
%generalization of the method to other spatial data structures.

\subsection{Algorithm Overview}

\label{sec:overview}


Our method is based on exploiting temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
 \item We initiate occlusion queries on nodes of the hierarchy where
   the traversal terminated in the last frame. Thus we avoid queries
   on all previously visible interior nodes~\cite{bittner01:jvca}.
 \item We assume that a previously visible leaf node remains visible
   and render the associated geometry without waiting for the result
   of the corresponding occlusion query.
\item Issued occlusion queries are stored in a query queue until their
  results become available. This allows
  interleaving the queries with the rendering of visible geometry.
\end{itemize}


The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$ and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
the occlusion query and store it in the query queue. Then we
immediately render the associated geometry without waiting for the
result of the query.
 \item For a previously invisible node, we issue the query and store
 it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
  \input{code/pseudocode2}
 }
  \caption{Pseudo-code of coherent hierarchical culling.}
  \label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.


In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

\begin{figure*}[htb]
  \centerline{
    \includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
  \caption{(left) Visibility classification of the nodes of the kD-tree and
    the termination nodes. (right) Visibility classification after the
    application of the occlusion test and the new set of termination nodes.
    Nodes on which occlusion queries were applied are depicted with a
    solid outline. Note the pull-up and pull-down due to visibility changes.}
  \label{fig:cut}
\end{figure*}


The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node information about its visibility and the
id of the last frame in which it was visited. A node is an opened node
if it is an interior visible node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.

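A minimal per-node record sufficient for this implicit identification
might look as follows (a sketch; the member names are illustrative):

{\footnotesize
\begin{verbatim}
// Per-node data used to identify opened and termination nodes implicitly.
struct NodeVisibility
{
  bool isVisible;    // visibility classification from the last visit
  int  lastVisited;  // id of the frame in which the node was last visited
};

// A node is an opened node if it is an interior node that was
// classified visible when it was visited in the previous frame.
bool IsOpened(const KdNode *node, int frameId)
{
  return !node->IsLeaf()
      && node->visibility.isVisible
      && node->visibility.lastVisited == frameId - 1;
}
\end{verbatim}
}
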
%% \begin{figure*}[htb]
%% \centerline{\includegraphics[width=0.8\textwidth,draft=\DRAFTFIGS]{figs/basic}}
%% \caption{Illustration of the hierarchy traversal. Initially the algorithm
%% starts at the root of the hierarchy (left). In the second frame the
%% opened nodes $\mathcal{O}_0$ are skipped and the occlusion queries
%% are first applied on the termination nodes $\mathcal{T}_0$. Visibility changes are propagated upwards in the hierarchy and a
%% new set of termination nodes $\mathcal{T}_1$ is established.}
%% \label{fig:basic}
%% \end{figure*}


\subsection{Reduction of the number of queries}

%Skipping the opened nodes and issuing occlusion queries for the
%termination nodes assists in exploiting two main characteristics of
%scene visibility:

%Identifying the set of termination nodes assists in finding a subdivision

Our method reduces the number of visibility queries in two ways:
Firstly, like other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Secondly, by avoiding queries on opened nodes we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

\begin{itemize}


\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

%, and visible termination
%nodes correspond to the smallest unoccluded regions (visible
%leafs)

\item {\em Temporal coherence.} If visibility remains constant, the set
of termination nodes needs no adaptation. If an occluded node becomes
visible, we recursively process its children (pull-down). If a visible
node becomes occluded, we propagate the change higher in the hierarchy
(pull-up). A pull-down reflects a spatial growing of visible
regions. Similarly, a pull-up reflects a spatial growing of occluded
regions.


%The first case corresponds to pull-down of the cut, the latter to pull-up.

\end{itemize}

By avoiding queries on the opened nodes, we can save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher up we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would be required to rasterize them. In particular,
assuming that the sum of the screen space projected area for nodes at
each level of the kD-tree is equal and the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.

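To make this estimate explicit, let $A$ denote the sum of the screen
space projected areas of the nodes at a single level of the kD-tree.
Querying the complete subtree of opened nodes of depth $d$ together
with the termination nodes below them rasterizes a total area of
$(d+2)\,A$ (one area $A$ per level, for levels $0,\dots,d+1$), whereas
querying only the termination nodes rasterizes an area of $A$. The
ratio of the two is the stated factor of $(d+2)$.
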

\subsection{Reduction of CPU stalls and GPU starvation}

\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminates the need to wait
for query results during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in between and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture a
possible occlusion of $A$ by $B$ since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

%% It is possible that we have already traversed all previously
%% visible nodes and we must stall the CPU by waiting for the result
%% of the oldest occlusion query. A technique completely eliminating
%% CPU stalls at the cost of an increase in the number of rendered
%% objects will be discussed in Section~\ref{sec:full}.

\subsection{Front-to-back scene traversal}

For kD-trees the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the viewpoint and
the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.
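
Such a traversal can be implemented with a standard priority queue
ordered by this distance (a sketch; \texttt{Distance()} is an assumed
helper computing the minimal distance between the viewpoint and a
bounding box):

{\footnotesize
\begin{verbatim}
#include <queue>
#include <vector>

// Order nodes by increasing distance from the viewpoint (sketch).
struct FartherFromViewpoint
{
  bool operator()(const KdNode *a, const KdNode *b) const
  { return a->distance > b->distance; }
};

std::priority_queue<KdNode*, std::vector<KdNode*>,
                    FartherFromViewpoint> traversalQueue;

root->distance = Distance(viewpoint, root->boundingBox);
traversalQueue.push(root);
while (!traversalQueue.empty())
{
  KdNode *node = traversalQueue.top();
  traversalQueue.pop();
  // ... process node; compute distances of its children and push them ...
}
\end{verbatim}
}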

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
scene in a breadth-first manner, we process the scene in depth
layers. Within each layer, the node processing order is practically
independent, which minimizes the problem of occlusion query
dependence. The breadth-first traversal interleaves
occlusion-independent nodes, which can provide a more accurate
visibility classification if visibility changes quickly. In
particular, it reduces the problem of false classifications due to
missed occlusion by nodes waiting in the query queue (discussed in
Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
breadth-first traversal, we are no longer restricted to the
kD-tree. Instead we can use an arbitrary spatial data structure
such as a bounding volume hierarchy, octree, grid, hierarchical grid,
etc. Once we compute the distance from a node to the viewpoint, the
node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which maintenance of a spatial classification of moving objects can be
costly.
\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of the
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node to match the expected
availability of the query result as precisely as possible.

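With the ARB extension, this non-blocking check maps to a single call
(a sketch; \texttt{queryQueue} is the query queue of our algorithm):

{\footnotesize
\begin{verbatim}
// Non-blocking check whether the oldest query result is ready (sketch).
GLuint available = GL_FALSE;
glGetQueryObjectuivARB(queryQueue.front().id,
                       GL_QUERY_RESULT_AVAILABLE_ARB, &available);
if (available)
{
  GLuint visiblePixels = 0;
  glGetQueryObjectuivARB(queryQueue.front().id,
                         GL_QUERY_RESULT_ARB, &visiblePixels);
  // ... classify the node and pull up / pull down as needed ...
}
// otherwise: keep traversing other nodes instead of stalling the CPU
\end{verbatim}
}
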

\section{Further Optimizations}

\label{sec:optimizations}

This section discusses a couple of optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on some user specified parameters that should be tuned for a
particular scene and hardware configuration.


\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within the
given number of frames we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, there is a large number of visible nodes
being tested, which does not provide any benefit since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have already become invisible since the moment when the
last occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more complicated scenario this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.

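A sketch of the simplest variant, assuming each node additionally stores
the frame id of its last query (\texttt{assumedVisibleFrames} is the user
specified constant):

{\footnotesize
\begin{verbatim}
// Decide whether a previously visible node needs a new occlusion query
// (sketch). Invisible nodes are always queried.
bool QueryReasonable(const KdNode *node, int frameId,
                     int assumedVisibleFrames)
{
  if (!node->visibility.isVisible)
    return true;
  return frameId - node->lastQueried >= assumedVisibleFrames;
}
\end{verbatim}
}
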

\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively in
cases when only a small part of the node is visible. We compare the
number of visible pixels returned by the occlusion query with a
user specified constant and cull the node if this number drops
below this constant.

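In code this amounts to replacing the comparison against zero by a
comparison against the user specified threshold (a sketch):

{\footnotesize
\begin{verbatim}
// Approximate visibility: treat nearly occluded nodes as invisible
// (sketch). visibilityThreshold = 0 gives the conservative algorithm.
bool isVisible = visiblePixels > visibilityThreshold;
\end{verbatim}
}
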

\subsection{Complete elimination of CPU stalls}

\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, it stalls the CPU by waiting for the query result. To
completely eliminate the CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
the whole subtree if it is an interior node). The node is marked as
rendered but the associated occlusion query is kept in the queue to
fetch its result later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query result arrives in an optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent by waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.
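
A sketch of this speculative rendering step; \texttt{EstimatedCost()} is
an assumed heuristic (e.g., based on the triangle count of the subtree)
and \texttt{maxSpeculativeCost} is the user specified constant:

{\footnotesize
\begin{verbatim}
// When the traversal stack is empty and no query result is ready,
// speculatively render the cheapest node with undecided visibility
// (sketch).
if (traversalStack.empty() && !ResultAvailable(queryQueue.front()))
{
  PendingQuery &q = CheapestPendingQuery(queryQueue);
  if (EstimatedCost(q.node) < maxSpeculativeCost)
  {
    RenderSubtree(q.node);          // render now ...
    q.alreadyRendered = true;       // ... but keep the query in the queue
  }
  else
  {
    WaitForResult(queryQueue.front());  // too risky: accept the stall
  }
}
\end{verbatim}
}
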
\begin{figure}%[htb]
{\footnotesize
  \input{code/pseudocode_batching}
 }
  \caption{Pseudo-code of coherent hierarchical culling using multiple queries.}
  \label{fig:pseudocode_batching}
\end{figure}

\subsection{Initial depth pass}
\label{sec:initial}

\begin{figure}
\centering
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_wrong}
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_delayed}
\caption{(left) All passes are rendered with CHC. Note that the soldiers are
visible through the tree. (right) Only the solid passes are rendered using CHC, followed by the transparent passes.}
\label{fig:online_transparency}
\end{figure}


To achieve maximal performance on modern GPUs, one has to take care of a number of issues.
First, it is very important to reduce material switching. Thus modern rendering engines sort the
objects (or patches) by material in order to eliminate material switching as much as possible.

Next, materials can be very costly; sometimes complicated shaders have to be evaluated several times per batch. Hence, rendering the full material for fragments which eventually turn out to be occluded should be avoided. This can be achieved
by rendering an initial depth pass (i.e., enabling only depth writes to fill the depth
buffer). Afterwards the geometry is rendered again, this time with full shading. Because the
depth buffer is already established, invisible fragments are discarded before any shading is calculated.

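In OpenGL, such a depth-only pass might be set up as follows (a sketch;
the two render calls stand for the application's scene rendering):

{\footnotesize
\begin{verbatim}
// Pass 1: fill the depth buffer only -- no color writes, no shading.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
RenderScene();                  // e.g., front-to-back with CHC

// Pass 2: full shading; only fragments that pass the depth test survive.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);          // depth buffer is already complete
glDepthFunc(GL_LEQUAL);         // accept fragments at the stored depth
RenderVisibleGeometry();        // material-sorted shading passes
\end{verbatim}
}
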
This approach can be naturally adapted for use with the CHC algorithm. Only an initial depth pass is rendered
in front-to-back order using the CHC algorithm. The initial pass is sufficient to fill the depth buffer and determine the visible geometry. Then only the visible geometry is rendered again, exploiting the full optimization and material sorting capability of the rendering engine.

If a material requires several rendering passes, we can use a variant of the depth pass method. We render
only the first passes using the algorithm (e.g., the solid passes), determining the visibility of the patches, and
render all the other passes afterwards. This approach can be used when there are passes which require a special
kind of sorting to be rendered correctly (e.g., transparent passes, shadow passes). In Figure~\ref{fig:online_transparency},
we can see that artifacts occur in the left image if the transparent passes are not rendered in the correct order after
applying the hierarchical algorithm (right image). In a similar fashion, we are able to handle shadows (see Figure~\ref{fig:chc_shadows}).

\begin{figure}
\centering
\includegraphics[width=0.35\textwidth,draft=\DRAFTIMAGES]{images/chc_shadows}
\caption{We can correctly handle shadow volumes together with CHC.}
\label{fig:chc_shadows}
\end{figure}

\subsection{Batching multiple queries}
\label{sec:batching}

When occlusion queries are interleaved with the rendering of geometry, there is always a state change involved. To
reduce state changes, it is beneficial not to execute one query at a time, but multiple queries at once~\cite{Staneker:2004:EMO}.
Instead of immediately executing a query for a node when we fetch it from the traversal stack, we add it to the {\em pending queue}. Once {\em n} of these queries are accumulated in the queue, we execute them at once.
To obtain an optimal value for {\em n}, several heuristics can be applied, e.g., a fraction of the number of queries issued in the last frame. The pseudo-code of the algorithm including the batching is given in Figure~\ref{fig:pseudocode_batching}.

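A sketch of the batching step (\texttt{pendingQueue} collects nodes
fetched from the traversal stack; \texttt{n} is the batch size chosen by
the heuristic):

{\footnotesize
\begin{verbatim}
// Accumulate query candidates and issue them in one batch so that the
// render state is changed only once per batch (sketch).
pendingQueue.push(node);
if (pendingQueue.size() >= n)
{
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_FALSE);
  while (!pendingQueue.empty())
  {
    KdNode *q = pendingQueue.front();
    pendingQueue.pop();
    IssueOcclusionQuery(q);   // adds q to the query queue
  }
  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_TRUE);
}
\end{verbatim}
}
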

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4 processor, 1GB of memory, and a
GeForce FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.


Although the teapot scene would intuitively offer good occlusion, it
is a complicated case to handle for occlusion culling. Firstly, the
teapots consist of small triangles, and so only the effect of fused
occlusion due to a large number of visible triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, and so we can see almost
half of the teapots along the longer side of the arrangement.

%Although the teapot scene would intuitively seem to offer good
%occlusion, it is one of the more complicated cases for occlusion
%culling. Due to the high number of very dense objects, a very fine
%kD-tree subdivision was necessary, which in turn leads to higher costs
%for testing many nodes in the hierarchy. Additionally, misclassifying
%a single teapot can lead to a significant increase in frame time due
%to the high polygon density (2,304 polygons per teapot). There are also
%many locations where it is possible to see very far into the block of
%teapots. This can be seen in the kD-tree node classification which is
%shown in the accompanying video.

The complete power plant model is quite challenging even to load into memory,
but on the other hand it offers good
occlusion. This scene is an interesting candidate for testing not only
due to its size, but also due to significant changes in visibility and depth complexity in
its different parts.

%as can also be seen in the
%walkthrough. A relatively coarse subdivision of the model triangles
%into about 4,600 nodes was sufficient to capture most occlusion
%interdependencies.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

%that
%could further

%% However, we still included a section in the walkthrough where the
%% viewpoint is elevated above the building roofs in order to show that
%% the algorithm continues to work without modification, albeit at some
%% expense. It will be an interesting topic of research to automatically
%% recognize such situations and decrease the frequency of visibility
%% queries or turn the occlusion culling algorithm off completely.


\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}


\subsection{Basic tests}


We have measured the frame times for rendering with only view-frustum culling
(VFC), the hierarchical stop-and-wait method (S\&W), and our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user specified
parameters.

%% In order to obtain repeatable and comparable timings, the process
%% priority of our application was set to high, timings were obtained
%% using the CPU cycle counter instruction, and the OpenGL {\tt Finish()}
%% instruction was executed before each invocation of the timer at the
%% start and the end of a frame. Furthermore, we did not include clear
%% and buffer swapping times in our results, since they depend strongly
%% on the windowing mode used.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion it becomes slower than VFC (near frame 2200).
The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement of teapots along the longer
side of the arrangement (frames 25--90). In the city we elevated the
viewpoint above the roofs and gained sight over most of the city
(frames 1200--1800). The power plant walkthrough contains several
viewpoints from which a large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity where occlusion culling produces a
significant speedup over VFC (e.g., frame 3800).

As we can see for a number of frames in the walkthroughs, the CHC method
can produce a speedup of more than one order of magnitude compared to
VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline\hline
scene                  & method   & \#queries & wait time [ms] & rendered triangles & frame time [ms]  & speedup \\\hline\hline

Teapots                & VFC      & ---        &  ---     & 11,139,928  & 310.42  & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles    & S\&W     & 4704     & 83.19  & 2,617,801  & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes   & {\bf CHC} &{\bf 2827} & {\bf 1.31}  & {\bf 2,852,514 } & {\bf 81.18 } & {\bf 4.6 } \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &  ---     & 2,617,801  & 72.19  & 5.2 \\
                       \hline\hline
City                   & VFC      & ---        &  ---     & 156,521    & 19.79 & 1.0   \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles    & S\&W     & 663      &  9.49  & 30,594     & 19.9  & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes   &{\bf CHC} & {\bf 345} & {\bf 0.18}   & {\bf 31,034}  & {\bf 8.47}  & {\bf 4.0}  \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &   ---   & 30,594     & 4.55  & 6.6  \\
                       \hline\hline
Power Plant            & VFC      & ---    &  ---   & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles    & S\&W    & 485   & 16.16  & 392,962  & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes    &{\bf CHC} & {\bf 263}  & {\bf 0.70}  & {\bf 397,920} & {\bf 38.73}  & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    &    ---      &  ---   &   392,962    &  36.34  & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
is a perfect method with respect to the given hierarchy. All values are averages over
all frames (including the speedup).}
\label{tab:averages}
\end{table*}

Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.


We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0, and
4.7 over view-frustum culling and average speedups of 2.0, 2.6, and
1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning the
accuracy, the increase in the average number of rendered triangles for the
CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This increase
was always recovered by the reduction of CPU stalls for the tested
walkthroughs.

%% While this number may be very low, the algorithm also incurs a
%% non-negligible overhead which is caused by the graphics card
%% reconfiguration required for the queries and rasterization of the
%% bounding volume geometries.

\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait times in
Table~\ref{tab:averages}). We did not find constants that could
produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued queries
and the average frame time in dependence on the number of frames a node is
assumed visible and the pixel threshold of approximate visibility. We
have observed that the effectiveness of the optimizations depends strongly on the scene. If
the hierarchy is deep and the geometry associated with a leaf node is
not too complex, conservative visibility testing produces a
significant speedup (city and power plant). For the teapot scene the
penalty for falsely rendering actually occluded objects became larger
than the savings achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is however paid for by introducing errors in the image
proportional to the pixel threshold used.

%Good parameters for these
%optimizations have to be estimated on a scene-by-scene basis.




%% , where we compare the reference frame time and
%% number of queries already shown in the previous table to the
%% approximate visibility optimization using a higher pixel threshold,
%% and to the conservative visibility testing optimization assuming a
%% node remains visible for 2 frames, both for the teapot and the city
%% scene. These tests clearly show that a high visibility threshold
%% improves frame times, albeit at the expense of image quality. The
%% benefit of the conservative visibility optimization is, however,
%% limited. While the number of queries can be reduced significantly, the
%% number of misclassified objects rises and increases the frame time
%% again.



%% \begin{table}
%% \centering \footnotesize
%% \begin{tabular}{|l|c|c|}
%%  \hline\hline
%%                 & Teapots & City \\\hline\hline
%%  frame time CHC & 130.42  & 13.91\\\hline
%%  \#queries      & 2267    & 94.04\\\hline\hline
%%  \multicolumn{3}{|l|}{assume 2 frames visible}\\\hline
%%  frame time      & 123.84 & 13.27 \\\hline
%%  \#queries      & 1342    &  42 \\\hline\hline
%%  \multicolumn{3}{|l|}{pixel threshold 50 pixels and assume 2 frames visible}\\\hline
%%  frame time     & 73.22 & 11.47  \\\hline
%%  \#queries      &  1132 &  50 \\\hline\hline
%% \end{tabular}
%% \caption{Influence of optimizations on two test scenes. All values are averages
%% over all frames, frame times are in [ms].}\label{tab:assumed}
%% \end{table}

\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
 \hline\hline
scene                    & $t_{av}$ &  $n_{vp}$ & \#queries & frame time [ms]   \\ \hline\hline
\multirow{3}{*}{Teapots} & 0        &    0         & 2827      & 81.18  \\ \hhline{~----}
                         & 2       &    0         & 1769      &  86.31 \\ \hhline{~----}
                         & 2       &    25         & 1468      & 55.90  \\ \hline\hline
\multirow{3}{*}{City}    & 0        &    0         & 345        & 8.47  \\ \hhline{~----}
                         & 2        &    0         & 192        & 6.70 \\ \hhline{~----}
                         & 2       &    25         & 181       & 6.11  \\ \hline\hline
\multirow{3}{*}{Power Plant}    & 0  &    0        & 263        & 38.73  \\ \hhline{~----}
                         & 2        &    0         & 126        & 31.17 \\ \hhline{~----}
                         & 2       &    25         & 120       & 36.62  \\ \hline\hline
\end{tabular}

\caption{Influence of optimizations on the CHC method.
  $t_{av}$ is the number of assumed visibility frames for conservative
  visibility testing, $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}


\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26ms faster per frame on average, our method does not require costly
precomputation and can be used at any general 3D position in the
model, not only in a predefined view space.


\section{Conclusion}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
in order to minimize the number of queries and their latency.
This is achieved by exploiting spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation
are almost completely eliminated at the cost of a slight increase in
the number of rendered objects.

%Additionally it schedules the queries in an order
%minimizing the overhead due to the latency of availability of query
%results.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometrical operations or data structures. The algorithm
is suitable for application on scenes of arbitrary structure and it
requires no preprocessing or scene dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer grade graphics hardware. We
have obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and 1.6--2.6 compared to the hierarchical
stop-and-wait application of occlusion queries.


The major potential for improving the method is a better estimation
of changes in the visibility classification of hierarchy nodes. If
nodes tend to be mostly visible, we could automatically decrease the
frequency of occlusion tests and thus better adapt the method to the
actual occlusion in the scene. Another possibility for improvement is
better tuning for particular graphics hardware by means of more
accurate rendering cost estimation. Skipping occlusion tests for
simpler geometry can be faster than issuing comparably expensive
occlusion queries.


\begin{figure*}[htb]
  \centerline{
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
     \hfill
   }
   \caption{The test scenes: the teapots, the city, and the power plant.}
   \label{fig:scenes}
\end{figure*}