\chapter{Online Visibility Culling}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to
significant {\em overdraw}, i.e., the same image area gets covered
more than once. Overdraw wastes computational effort in both the
vertex and pixel processing stages of modern graphics hardware.
%that is, we render each pixel in the image several times. For many complex
%scenes with high occlusion, the effort wasted on overdraw cannot simply be
%eliminated by increasing the raw computational power of the hardware.
%This problem becomes even more apparent on recent graphics hardware with
%programmable vertex and pixel shaders: all complex per-vertex or per-pixel
%computations on an object get completely wasted if the object is occluded.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling
complements other rendering acceleration techniques such as levels of
detail or impostors.

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for cells of a fixed subdivision of the scene. At runtime, we
can quickly identify a PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive,
and (4) an accurate PVS computation is difficult to implement for
general scenes. Online occlusion culling can solve these problems at
the cost of applying extra computations at each frame. To make these
additional computations efficient, most online occlusion culling
methods rely on a number of assumptions about the scene structure and
its occlusion characteristics (e.g., presence of large occluders,
occluder connectivity, occlusion by a few closest depth layers).

Recent graphics hardware natively supports an \emph{occlusion query}
to detect the visibility of an object against the current contents of
the z-buffer. Although the query itself is processed quickly using
the raw power of the graphics processing unit (GPU), its result is
not available immediately due to the delay between issuing the query
and its actual processing in the graphics pipeline. As a result, a
naive application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated into
existing real-time rendering packages on architectures supporting the
underlying occlusion query.

%Using spatial and assuming temporal coherence

\section{Related Work}

With the demand for rendering scenes of ever-increasing size, a number
of visibility culling methods have been developed over the last
decade. A comprehensive survey of visibility culling methods was
presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another recent
survey, by Bittner and Wonka~\cite{bittner03:jep}, discusses
visibility culling in the broader context of other visibility
problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In the following discussion we focus on online occlusion culling
methods that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest
one in the window corresponding to the adjacent finer level. The
hierarchical z-buffer makes it possible to quickly determine whether
the geometry in question is occluded. To a certain extent this idea
is used in the current generation of graphics hardware by applying
early z-tests of fragments in the graphics pipeline (e.g., the
Hyper-Z technology of ATI or the Z-cull of NVIDIA). However, the
geometry still needs to be sent to the GPU, transformed, and coarsely
rasterized even if it is later determined invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead make
use of hardware texturing. The hierarchical occlusion map is computed
on the GPU by rasterizing and downsampling a given set of occluders.
The occlusion map is used for overlap tests, whereas the depths are
compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

%% In
%% parallel to rendering, the visibility server calculates visibility for
%% the neighborhood of the given viewpoint and sends them back to the
%% display host.

%% A similar method was designed by Aila and
%% Miettinen~\cite{aila:00:msc}. The incremental occlusion maps are
%% created on the CPU, which takes CPU computational time but does
%% not suffer from the problem of slow read-back from the GPU.

%Klosowski:2001:ECV

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. The first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3
graphics card and is now also available as an official ARB extension.

%Both
%extensions have been used in several occlusion culling
%algorithms~\cite{Meissner01,Hillesland02,Staneker:2004:OCO}.
%\cite{Meissner01}

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method,
however, does not exploit temporal and spatial coherence of
visibility, and it is restricted to regular subdivision data
structures. Our new method addresses both of these problems and
provides natural extensions to balance the accuracy of visibility
classification and the associated computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph framework.
Their technique uses occupancy maps maintained in software to avoid
queries on visible scene graph nodes, and temporal coherence to
reduce the number of occlusion queries. The drawback of the method is
that it performs the queries in a serial fashion and thus suffers
from CPU stalls and GPU starvation.

%% The methods proposed in this paper can be used to make use of temporal
%% and spatial coherence in the scope of existing visibility algorithms,
%% that utilise a spatial hierarchy. Examples of these are algorithms
%% based on hierarchical occlusion maps~\cite{Zhang97}, coverage
%% masks~\cite{Greene:1996:HPT}, shadow frusta~\cite{Hudson97}, and
%% occlusion trees~\cite{bittner98b_long}.

On a theoretical level, our work is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above-mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. In
contrast, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is
also vital for this chapter, is to avoid repeated visibility tests of
interior nodes of the hierarchy. The problem with a direct adoption of
this method is that it is designed for use with instantaneous
CPU-based occlusion queries, whereas hardware occlusion queries
exhibit significant latency. The method presented herein efficiently
overcomes the problem of latency while keeping the benefits of the
generality and simplicity of the original hierarchical technique. As
a result we obtain a simple and efficient occlusion culling algorithm
utilizing hardware occlusion queries.


%In this paper we show how to overcome this problem while
%keeping the benefits of exploiting coherence.

The rest of the chapter is organized as follows:
Section~\ref{sec:hwqueries} discusses hardware supported occlusion
queries and a basic application of these queries using a kD-tree.
Section~\ref{sec:hoq} presents our new algorithm and
Section~\ref{sec:optimizations} describes several additional
optimizations. Section~\ref{sec:results} presents results obtained by
an experimental evaluation of the method and discusses its behavior.
Finally, Section~\ref{sec:conclusion} concludes the chapter.



\section{Hardware Occlusion Queries}
\label{sec:hwqueries}


Hardware occlusion queries follow a simple pattern: To test the
visibility of an occludee, we send its bounding volume to the GPU.
The volume is rasterized and its fragments are compared to the
current contents of the z-buffer. The GPU then returns the number of
visible fragments. If there is no visible fragment, the occludee is
invisible and it need not be rendered.
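
As an illustration, the following fragment sketches a single
occlusion test using the OpenGL occlusion query interface (assuming a
current OpenGL context on hardware exposing \arboq{}, e.g., via GLEW);
\texttt{renderBoundingBox()} and \texttt{occludee} are illustrative
names, not part of any particular API:

{\footnotesize
\begin{verbatim}
// A minimal sketch of one (blocking) occlusion test using
// GL_ARB_occlusion_query. renderBoundingBox() is a hypothetical
// helper that rasterizes the occludee's bounding volume.
GLuint query;
glGenQueriesARB(1, &query);

// The bounding volume should only be depth-tested, not drawn:
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
renderBoundingBox(occludee);
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Blocks until the GPU has processed the query:
GLuint visiblePixels = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
if (visiblePixels == 0) {
    // the occludee is occluded and need not be rendered
}
\end{verbatim}
}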

\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene
geometry as occluders, since the queries use the current contents of
the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion.
Additionally, this fusion comes for free since we use the intermediate
result of the rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage
of the high fill rates and internal parallelism provided by modern
GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}



\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware-supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be issued
before asking for their results, while only one HP test is allowed at
a time, which severely limits its possible algorithmic usage.
Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility classification.

%% Issuing multiple independent NV queries is crucial for
%% the design of algorithms that strive to exploit the power of the GPU.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
  \centerline{
    \includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
    }

  \caption{(top) Illustration of CPU stalls and GPU starvation.
    Qn, Rn, and Cn denote querying, rendering, and culling of object
    n, respectively. Note that object 5 is found invisible by Q5 and
    thus not rendered. (bottom) More efficient query scheduling. The
    scheduling assumes that objects 4 and 6 will be visible in the
    current frame and renders them without waiting for the result of
    the corresponding queries. }
  \label{fig:latency}
\end{figure}




\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and
thus to exploit spatial coherence of visibility, providing a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proven efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well,
though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for occlusion
by issuing the occlusion query and waiting for its result. If the
node is found visible, we continue by recursively testing its
children in a front-to-back order. If the node is a leaf, we render
its associated objects.
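
In code, the traversal might look as follows (a sketch only: the
kD-tree interface and the helpers \texttt{issueQueryAndWait()} and
\texttt{renderObjects()} are assumptions, with the former issuing the
occlusion query for the node's bounding box and blocking until the
result arrives):

{\footnotesize
\begin{verbatim}
// Hierarchical stop-and-wait culling (sketch).
void hierarchicalStopAndWait(KDNode *node, const Camera &cam)
{
    if (!cam.intersectsViewFrustum(node->bounds))
        return;                        // view-frustum culling

    if (issueQueryAndWait(node) == 0)  // the CPU stalls here
        return;                        // occluded: cull whole subtree

    if (node->isLeaf())
        renderObjects(node);           // render associated objects
    else                               // front-to-back recursion
        for (KDNode *child : node->childrenFrontToBack(cam.viewpoint))
            hierarchicalStopAndWait(child, cam);
}
\end{verbatim}
}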

The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.


\section{Coherent Hierarchical Culling}

\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

%and present a
%generalization of the method to other spatial data structures.

\subsection{Algorithm Overview}

\label{sec:overview}


Our method is based on exploiting temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
 \item We initiate occlusion queries on nodes of the hierarchy where
   the traversal terminated in the last frame. Thus we avoid queries
   on all previously visible interior nodes~\cite{bittner01:jvca}.
 \item We assume that a previously visible leaf node remains visible
   and render the associated geometry without waiting for the result
   of the corresponding occlusion query.
 \item Issued occlusion queries are stored in a query queue until
   they are known to have been carried out by the GPU. This allows
   interleaving the queries with the rendering of visible geometry.
\end{itemize}


The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the
$i$-th frame by $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In
the $i$-th frame, we traverse the kD-tree in a front-to-back order,
skip all nodes of $\mathcal{O}_{i-1}$, and apply occlusion queries
first on the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
the occlusion query and store it in the query queue. Then we
immediately render the associated geometry without waiting for the
result of the query.
\item For a previously invisible node, we issue the query and store
it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
  \input{code/pseudocode2}
 }
  \caption{Pseudo-code of coherent hierarchical culling.}
  \label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.


In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

\begin{figure*}[htb]
  \centerline{
    \includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
  \caption{(left) Visibility classification of a node of the kD-tree and
    the termination nodes. (right) Visibility classification after the
    application of the occlusion test and the new set of termination nodes.
    Nodes on which occlusion queries were applied are depicted with a
    solid outline. Note the pull-up and pull-down due to visibility changes.}
  \label{fig:cut}
\end{figure*}



The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node information about its visibility and the
id of the last frame in which it was visited. A node is an opened node
if it is an interior visible node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and 12
in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.
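
A minimal sketch of the pull-up pass under this modification is given
below (the node fields are illustrative):

{\footnotesize
\begin{verbatim}
// Pull-up (sketch): nodes visited in the current frame default to
// invisible; a node found visible propagates the change towards the
// root. The loop stops at the first ancestor already marked visible,
// since its ancestors are then visible as well.
void pullUpVisibility(KDNode *node)
{
    while (node != NULL && !node->visible) {
        node->visible = true;
        node = node->parent;
    }
}
\end{verbatim}
}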



%% \begin{figure*}[htb]
%% \centerline{\includegraphics[width=0.8\textwidth,draft=\DRAFTFIGS]{figs/basic}}
%% \caption{Illustration of the hierarchy traversal. Initially the algorithm
%% starts at the root of the hierarchy (left). In the second frame the
%% opened nodes $\mathcal{ O}_0$ are skipped and the occlusion queries
%% are first applied on the termination nodes $\mathcal{T}_0$. Visibility changes are propagated upwards in the hierarchy and a
%% new set of termination nodes $\mathcal{ T}_1$ is established.}
%% \label{fig:basic}
%% \end{figure*}


\subsection{Reduction of the number of queries}

%Skipping the opened nodes and issuing occlusion queries for the
%termination nodes assists in exploiting two main characteristics of
%scene visibility:

%Identifying the set of termination nodes assists in finding a subdivision

Our method reduces the number of visibility queries in two ways:
Firstly, as in other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Secondly, by avoiding queries on opened nodes, we eliminate
part of the overhead of identifying this subtree. These reductions
reflect the following coherence properties of scene visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

%, and visible termination
%nodes correspond to the smallest unoccluded regions (visible
%leafs)

\item {\em Temporal coherence.} If visibility remains constant, the
set of termination nodes needs no adaptation. If an occluded node
becomes visible, we recursively process its children (pull-down). If
a visible node becomes occluded, we propagate the change higher in
the hierarchy (pull-up). A pull-down reflects a spatial growing of
visible regions; similarly, a pull-up reflects a spatial growing of
occluded regions.

%The first case corresponds to pull-down of the cut, the latter to pull-up.

\end{itemize}

By avoiding queries on the opened nodes, we can save $1/k$ of the
queries for a hierarchy with branching factor $k$ (assuming visibility
remains constant). Thus for the kD-tree, up to half of the queries can
be saved. The actual savings in the total query time are even larger:
the higher we are in the hierarchy, the larger the boxes we would have
to check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize them. In particular, assuming
that the sum of the screen-space projected areas of the nodes at each
level of the kD-tree is equal and the opened nodes form a complete
binary subtree of depth $d$, the fill rate is reduced $(d+2)$ times.
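
To make the factor explicit, let $A$ denote the total screen-space
projected area of the nodes at one level of the kD-tree. The
stop-and-wait method rasterizes bounding boxes on all $d+1$ levels of
opened nodes as well as on the single level of termination nodes,
whereas our method queries only the termination nodes:
\[
  \frac{(d+1)\,A + A}{A} = d+2.
\]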



\subsection{Reduction of CPU stalls and GPU starvation}

\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry.
The immediate rendering of previously visible termination nodes and
the subsequent issuing of occlusion queries eliminates the need to
wait for query results while processing the initial depth layers
containing previously visible nodes. In the optimal case, new query
results become available in the meantime and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If the visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture a
possible occlusion of $A$ by $B$ since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase of the number of rendered objects compared to the
stop-and-wait method is usually very small.

%% It is possible that we have already traversed all previously
%% visible nodes and we must stall the CPU by waiting for the result
%% of the oldest occlusion query. A technique completely eliminating
%% CPU stalls at the cost of an increase of the number of rendered
%% objects will be discussed in Section~\ref{sec:full}.

\subsection{Front-to-back scene traversal}

For kD-trees the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the
viewpoint and the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.
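
A possible realization of this traversal is sketched below (the
vector, box, and node types are illustrative); the queue is keyed on
the squared minimal distance between the viewpoint and the node's
axis-aligned bounding box:

{\footnotesize
\begin{verbatim}
#include <algorithm>
#include <queue>

// Squared minimal distance between point p and axis-aligned box b.
float minDistSqr(const Vec3 &p, const AABB &b)
{
    float d2 = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float v = std::max(std::max(b.min[i] - p[i], 0.0f),
                           p[i] - b.max[i]);
        d2 += v * v;
    }
    return d2;
}

struct QueueEntry {
    KDNode *node;
    float   distSqr;
    // closer nodes compare "greater" -> processed first
    bool operator<(const QueueEntry &e) const
    { return distSqr > e.distSqr; }
};

std::priority_queue<QueueEntry> traversalQueue;

void enqueueNode(KDNode *node, const Vec3 &viewpoint)
{
    QueueEntry e = { node, minDistSqr(viewpoint, node->bounds) };
    traversalQueue.push(e);
}
\end{verbatim}
}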

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
scene in a breadth-first manner, we process the scene in depth
layers. Within each layer, the node processing order is practically
independent, which minimizes the problem of occlusion query
dependence. The breadth-first traversal interleaves
occlusion-independent nodes, which can provide a more accurate
visibility classification if visibility changes quickly. In
particular, it reduces the problem of false classifications due to
missed occlusion by nodes waiting in the query queue (discussed in
Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
breadth-first traversal, we are no longer restricted to the
kD-tree. Instead we can use an arbitrary spatial data structure
such as a bounding volume hierarchy, octree, grid, hierarchical grid,
etc. Once we compute a distance from a node to the viewpoint, the
node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which maintenance of a spatial classification of moving objects can be
costly.

\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of the
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node to match the expected
availability of the query result as precisely as possible.
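
With \arboq{}, the non-blocking check corresponds to asking for
\texttt{GL\_QUERY\_RESULT\_AVAILABLE\_ARB} before fetching the actual
result, e.g. (a sketch; \texttt{oldestQuery} is the id of the oldest
query in the queue):

{\footnotesize
\begin{verbatim}
GLuint available = 0;
glGetQueryObjectuivARB(oldestQuery,
                       GL_QUERY_RESULT_AVAILABLE_ARB, &available);
if (available) {
    GLuint visiblePixels = 0;   // fetching now does not block
    glGetQueryObjectuivARB(oldestQuery,
                           GL_QUERY_RESULT_ARB, &visiblePixels);
    // classify the node and pull up / pull down as described above
} else {
    // keep processing other nodes instead of stalling the CPU
}
\end{verbatim}
}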


\section{Further Optimizations}

\label{sec:optimizations}

This section discusses several optimizations of our method that can
further improve the overall rendering performance. In contrast to the
basic algorithm from the previous section, these optimizations rely
on user-specified parameters that should be tuned for a particular
scene and hardware configuration.


\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within
this number of frames we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, a large number of visible nodes would
otherwise be tested without providing any benefit, since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have already become invisible since the moment the last
occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more sophisticated scenario, this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.
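
The corresponding decision might be sketched as follows (the per-node
fields and the parameter, $t_{av}$ of Table~\ref{tab:assumed}, are
illustrative):

{\footnotesize
\begin{verbatim}
// Conservative visibility testing (sketch): skip the occlusion query
// for a node that is still within its "assumed visible" interval.
bool queryReasonable(const KDNode *node, unsigned frameId,
                     unsigned assumedVisibleFrames)
{
    if (!node->visible)
        return true;   // invisible nodes are always re-tested
    return frameId >= node->lastTestedFrame + assumedVisibleFrames;
}
\end{verbatim}
}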


\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively in
cases when only a small part of the node is visible. We compare the
number of visible pixels returned by the occlusion query with a
user-specified threshold and cull the node if this number drops
below the threshold.
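
In terms of the query result, the only change is a thresholded
comparison, e.g. (a sketch; the threshold corresponds to $n_{vp}$ of
Table~\ref{tab:assumed}):

{\footnotesize
\begin{verbatim}
// Approximate visibility (sketch): cull the node unless more than
// pixelThreshold fragments of its bounding volume passed the z-test.
GLuint visiblePixels = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
bool nodeVisible = visiblePixels > pixelThreshold;
\end{verbatim}
}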


\subsection{Complete elimination of CPU stalls}

\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, it stalls the CPU by waiting for the query result. To
completely eliminate the CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
the whole subtree if it is an interior node). The node is marked as
rendered but the associated occlusion query is kept in the queue to
fetch its result later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query result arrives in an optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent by waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.
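
The following fragment sketches this heuristic; the cost estimate and
the queue interface are illustrative:

{\footnotesize
\begin{verbatim}
// Speculative rendering (sketch): executed only when the traversal
// stack is empty and the oldest query result is not yet available.
QueuedNode *candidate = cheapestUndecidedNode(queryQueue);
if (candidate != NULL &&
    estimatedRenderCost(candidate->node) < maxSpeculativeCost) {
    renderSubtree(candidate->node);  // render without the result
    candidate->rendered = true;      // the query stays in the queue;
                                     // its result is fetched later
} else {
    waitForOldestQueryResult(queryQueue);  // stalling is cheaper
}
\end{verbatim}
}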

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2~GHz Pentium~4, 1~GB of memory, and a
GeForce FX~5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.


Although the teapot scene would intuitively offer good occlusion, it
is a complicated case to handle for occlusion culling. Firstly, the
teapots consist of small triangles, and so only the effect of fused
occlusion due to a large number of visible triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots.
Thirdly, the arrangement is long and thin, and so we can see almost
half of the teapots along the longer side of the arrangement.

%Although the teapot scene would intuitively seem to offer good
%occlusion, it is one of the more complicated cases for occlusion
%culling. Due to the high number of very dense objects, a very fine
%kD-tree subdivision was necessary, which in turn leads to higher costs
%for testing many nodes in the hierarchy. Additionally, misclassifying
%a single teapot can lead to a significant increase in frame time due
%to the high polygon density (2,304 polygons per teapot). There are also
%many locations where it is possible to see very far into the block of
%teapots. This can be seen in the kD-tree node classification which is
%shown in the accompanying video.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good occlusion. This scene is
an interesting candidate for testing not only due to its size, but
also due to significant changes in visibility and depth complexity in
its different parts.

%as can also be seen in the
%walkthrough. A relatively coarse subdivision of the model triangles
%into about 4,600 nodes was sufficient to capture most occlusion
%interdependencies.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

%that
%could further



%% However, we still included a section in the walkthrough where the
%% viewpoint is elevated above the building roofs in order to show that
%% the algorithm continues to work without modification, albeit at some
%% expense. It will be an interesting topic of research to automatically
%% recognize such situations and decrease the frequency of visibility
%% queries or turn the occlusion culling algorithm off completely.


\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}


\subsection{Basic tests}


We have measured the frame times for rendering with only view-frustum
culling (VFC), the hierarchical stop-and-wait method (S\&W), and our
new coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

%% In order to obtain repeatable and comparable timings, the process
%% priority of our application was set to high, timings were obtained
%% using the CPU cycle counter instruction, and the OpenGL {\tt Finish()}
%% instruction was executed before each invocation of the timer at the
%% start and the end of a frame. Furthermore, we did not include clear
%% and buffer swapping times in our results, since they depend strongly
%% on the windowing mode used.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
CHC can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement along its longer side
(frames 25--90). In the city we elevated the viewpoint above the
roofs and gained sight over most of the city (frames 1200--1800). The
power plant walkthrough contains several viewpoints from which a
large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low
depth complexity (dips in Figure~\ref{fig:plant_walkthrough} where
all algorithms are fast), and viewpoints inside the power plant with
high depth complexity where occlusion culling produces a significant
speedup over VFC (e.g., frame 3800).

As we can see, for a number of frames in the walkthroughs the CHC
method produces a speedup of more than one order of magnitude compared
to VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline\hline
scene                  & method   & \#queries & wait time [ms] & rendered triangles & frame time [ms]  & speedup \\\hline\hline

Teapots                & VFC      & ---        &  ---     & 11,139,928  & 310.42  & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles   & S\&W     & 4704     & 83.19  & 2,617,801  & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes   & {\bf CHC} &{\bf 2827} & {\bf 1.31}  & {\bf 2,852,514} & {\bf 81.18} & {\bf 4.6} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &  ---     & 2,617,801  & 72.19  & 5.2 \\
                       \hline\hline
City                   & VFC      & ---        &  ---     & 156,521    & 19.79 & 1.0   \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles    & S\&W     & 663      &  9.49  & 30,594     & 19.9  & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes   &{\bf CHC} & {\bf 345} & {\bf 0.18}   & {\bf 31,034}  & {\bf 8.47}  & {\bf 4.0}  \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &   ---   & 30,594     & 4.55  & 6.6  \\
                       \hline\hline
Power Plant            & VFC      & ---    &  ---   & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles   & S\&W    & 485   & 16.16  & 392,962  & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes   &{\bf CHC} & {\bf 263}  & {\bf 0.70}  & {\bf 397,920} & {\bf 38.73}  & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    &    ---      &  ---   &   392,962    &  36.34  & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values are averages over
  all frames (including the speedup).}
\label{tab:averages}
\end{table*}

Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.

We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0,
and 4.7 over view-frustum culling and average speedups of 2.0, 2.6,
and 1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning the
accuracy, the increase of the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This
increase was always recovered by the reduction of CPU stalls for the
tested walkthroughs.

%% While this number may be very low, the algorithm also incurs a
%% non-negligible overhead which is caused by the graphics card
%% reconfiguration required for the queries and rasterization of the
%% bounding volume geometries.
\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait
times in Table~\ref{tab:averages}). We did not find constants that
could produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued
queries and the average frame time as functions of the number of
frames a node is assumed visible and the pixel threshold of
approximate visibility. We have observed that the effectiveness of
the optimizations depends strongly on the scene. If the hierarchy is
deep and the geometry associated with a leaf node is not too complex,
the conservative visibility testing produces a significant speedup
(city and power plant). For the teapot scene, the penalty for falsely
rendering actually occluded objects became larger than the savings
achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by introducing errors in the image
proportional to the pixel threshold used.

%Good parameters for these
%optimizations have to be estimated on a scene-by-scene basis.




%% , where we compare the reference frame time and
%% number of queries already shown in the previous table to the
%% approximate visibility optimization using a higher pixel threshold,
%% and to the conservative visibility testing optimization assuming a
%% node remains visible for 2 frames, both for the teapot and the city
%% scene. These tests clearly show that a high visibility threshold
%% improves frame times, albeit at the expense of image quality. The
%% benefit of the conservative visibility optimization, is however
%% limited. While the number of queries can be reduced significantly, the
%% number of misclassified objects rises and increases the frame time
%% again.



%% \begin{table}
%% \centering \footnotesize
%% \begin{tabular}{|l|c|c|}
%%  \hline\hline
%%                 & Teapots & City \\\hline\hline
%%  frame time CHC & 130.42  & 13.91\\\hline
%%  \#queries      & 2267    & 94.04\\\hline\hline
%%  \multicolumn{3}{|l|}{assume 2 frames visible}\\\hline
%%  frame time      & 123.84 & 13.27 \\\hline
%%  \#queries      & 1342    &  42 \\\hline\hline
%%  \multicolumn{3}{|l|}{pixel threshold 50 pixels and assume 2 frames visible}\\\hline
%%  frame time     & 73.22 & 11.47  \\\hline
%%  \#queries      &  1132 &  50 \\\hline\hline
%% \end{tabular}
%% \caption{Influence of optimizations on two test scenes. All values are averages
%% over all frames, frame times are in [ms].}\label{tab:assumed}
%% \end{table}


\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
 \hline\hline
scene                    & $t_{av}$ &  $n_{vp}$ & \#queries & frame time [ms]   \\ \hline\hline
\multirow{3}{*}{Teapots} & 0        &    0          & 2827      & 81.18  \\ \hhline{~----}
                         & 2        &    0          & 1769      & 86.31 \\ \hhline{~----}
                         & 2        &    25         & 1468      & 55.90  \\ \hline\hline
\multirow{3}{*}{City}    & 0        &    0          & 345       & 8.47  \\ \hhline{~----}
                         & 2        &    0          & 192       & 6.70 \\ \hhline{~----}
                         & 2        &    25         & 181       & 6.11  \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0    &    0          & 263       & 38.73  \\ \hhline{~----}
                         & 2        &    0          & 126       & 31.17 \\ \hhline{~----}
                         & 2        &    25         & 120       & 36.62  \\ \hline\hline
\end{tabular}

\caption{Influence of optimizations on the CHC method.
  $t_{av}$ is the number of frames a node is assumed visible for conservative
  visibility testing; $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}


\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26~ms faster per frame on average, our method does not require
costly precomputation and can be used at any general 3D position in
the model, not only in a predefined view space.

\section{Conclusion}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of
hardware-accelerated occlusion queries. The method schedules occlusion
queries in order to minimize the number of queries and their latency.
This is achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation are
almost completely eliminated, at the cost of a slight increase in the
number of rendered objects.

%Additionally it schedules the queries in an order
%minimizing the overhead due to the latency of availability of query
%results.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometric operations or data structures. The algorithm
is suitable for application to scenes of arbitrary structure and it
requires no preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
have obtained an average speedup of 4.0--4.7 compared to pure
view-frustum culling and 1.6--2.6 compared to the hierarchical
stop-and-wait application of occlusion queries.

The major potential for improving the method lies in a better
estimation of changes in the visibility classification of hierarchy
nodes. If nodes tend to be mostly visible, we could automatically
decrease the frequency of occlusion tests and thus better adapt the
method to the actual occlusion in the scene. Another possibility for
improvement is better tuning for particular graphics hardware by
means of a more accurate rendering cost estimation. Skipping occlusion
tests for simpler geometry can be faster than issuing comparably
expensive occlusion queries.



\begin{figure*}[htb]
  \centerline{
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
     \hfill
   }
   \caption{The test scenes: the teapots, the city, and the power plant.}
   \label{fig:scenes}
\end{figure*}

%%  \begin{figure}[htb]
%%    \centerline{
%%      \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
%%   }
%%   \caption{The city scene.}
%%   \label{fig:images1}
%% \end{figure}

%%  \begin{figure}[htb]
%%    \centerline{
%%      \includegraphics[width=0.25\textwidth,draft=\DRAFTFIGS]{images/teapots}
%%   }
%%   \caption{The teapot scene.}
%%   \label{fig:images2}
%% \end{figure}

%%  \begin{figure}[htb]
%%    \centerline{
%%      \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
%%   }
%%   \caption{The power plant scene.}
%%   \label{fig:images3}
%% \end{figure}
