\chapter{Online Culling Tool}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. Overdraw wastes computational effort in both the pixel and the
vertex processing stages of modern graphics hardware.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for the cells of a fixed subdivision of the scene. At runtime, we
can quickly identify the PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of applying extra computations at each frame. To make these additional
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by a few closest depth layers).

\begin{figure}%[htb]
  %\centerline
  \centering \footnotesize
  \begin{tabular}{c}
  \includegraphics[width=0.6\textwidth, draft=\DRAFTFIGS]{images/ogre_terrain} \\
  %\hline
  \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_viewfrustum} \hfill \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_chc} \\
  \end{tabular}
%\label{tab:online_culling_example}
  \caption{(top) The rendered terrain scene. (bottom) Visualization of the rendered and culled objects
    using view-frustum culling (left) versus occlusion queries (right).
    The yellow boxes show the actually rendered scene objects, the
    red boxes depict the view-frustum culled hierarchy nodes, and the blue boxes depict the
    occlusion query culled hierarchy nodes.}
  \label{fig:online_culling_example}
\end{figure}

Recent graphics hardware natively supports an \emph{occlusion query}
to detect the visibility of an object against the current contents of
the z-buffer. Although the query itself is processed quickly using
the raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated into
existing real-time rendering packages on architectures supporting the
underlying occlusion query. In
Figure~\ref{fig:online_culling_example}, the same scene (top row) is
rendered using view-frustum culling (visualization in the bottom left
image) versus online culling using occlusion queries (visualization
in the bottom right image). It can be seen that with view-frustum
culling alone, many occluded objects are still rendered.

\section{Related Work}

With the demand for rendering scenes of ever increasing size, a
number of visibility culling methods have been developed in the last
decade. A comprehensive survey of visibility culling methods was
presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another recent
survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In our further discussion we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer makes it possible to quickly determine whether the geometry in
question is occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of fragments
in the graphics pipeline (e.g., the Hyper-Z technology of ATI or the
Z-cull of NVIDIA). However, the geometry still needs to be sent to the
GPU, transformed, and coarsely rasterized even if it is later determined
invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. The first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card and is now also available as an official ARB extension.

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions to
balance the accuracy of visibility classification against the associated
computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus
suffers from CPU stalls and GPU starvation.

On a theoretical level, this chapter is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. In
contrast, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is also
vital for the online culling tool, is to avoid repeated visibility tests of
interior nodes of the hierarchy. The problem with a direct adoption of
this method is that it is designed for use with instantaneous CPU-based
occlusion queries, whereas hardware occlusion queries exhibit
significant latency. The method presented herein efficiently overcomes
the problem of latency while keeping the benefits of the generality and
simplicity of the original hierarchical technique. As a result we
obtain a simple and efficient occlusion culling algorithm utilizing
hardware occlusion queries.

The rest of this chapter is organized as follows:
Section~\ref{sec:hwqueries} discusses hardware supported occlusion
queries and a basic application of these queries using a kD-tree.
Section~\ref{sec:hoq} presents our new algorithm and
Section~\ref{sec:optimizations} describes several additional
optimizations. Section~\ref{sec:results} presents results obtained by
experimental evaluation of the method and discusses its
behavior. Finally, Section~\ref{sec:conclusion} concludes the chapter.

\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

Hardware occlusion queries follow a simple pattern: To test the visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and need not be rendered.

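A minimal C++/OpenGL sketch of this pattern using the ARB occlusion query
extension is given below. The sketch assumes a helper
\texttt{RenderBoundingBox()} that rasterizes the bounding volume of a node;
color and depth writes are disabled so that the query has no visible side
effects:

{\footnotesize
\begin{verbatim}
// Issue a hardware occlusion query for the bounding volume of a node.
GLuint query;
glGenQueriesARB(1, &query);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); // no color writes
glDepthMask(GL_FALSE);                               // no depth writes

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
RenderBoundingBox(node);   // rasterize the bounding volume only
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Fetch the number of visible fragments (this call blocks until the
// result is available; see the discussion of latency below).
GLuint visiblePixels = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
if (visiblePixels == 0) {
    // the occludee is invisible and need not be rendered
}
\end{verbatim}
}
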
\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free since we use the intermediate result of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}

\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware-supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.
A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
  \centerline{
    \includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
    }

  \caption{(top) Illustration of CPU stalls and GPU starvation.
    Qn, Rn, and Cn denote querying, rendering, and culling of object
    n, respectively. Note that object 5 is found invisible by Q5 and
    thus not rendered. (bottom) More efficient query scheduling. The
    scheduling assumes that objects 4 and 6 will be visible in the
    current frame and renders them without waiting for the result of
    the corresponding queries. }
  \label{fig:latency}
\end{figure}

\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and thus
to exploit spatial coherence of visibility and provide a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proved to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.

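For illustration, a recursive C++ sketch of this naive method is given
below; the query calls follow the pattern of Section~\ref{sec:hwqueries},
and \texttt{IssueOcclusionQuery()}, \texttt{GetQueryResult()}, and the node
interface are assumed helpers:

{\footnotesize
\begin{verbatim}
// Hierarchical stop-and-wait culling: the CPU blocks on every query.
void TraverseStopAndWait(KDNode *node, const Camera &cam)
{
    if (node == NULL || !cam.InsideViewFrustum(node->bbox))
        return;                                   // view-frustum culled

    GLuint query = IssueOcclusionQuery(node);     // rasterize bounding box
    GLuint visiblePixels = GetQueryResult(query); // stalls until available

    if (visiblePixels == 0)
        return;                                   // occluded: cull subtree

    if (node->IsLeaf())
        RenderGeometry(node);                     // render associated objects
    else {                                        // recurse front to back
        TraverseStopAndWait(node->GetNearChild(cam.viewpoint), cam);
        TraverseStopAndWait(node->GetFarChild(cam.viewpoint), cam);
    }
}
\end{verbatim}
}
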
The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.

\section{Coherent Hierarchical Culling}

\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

\subsection{Algorithm Overview}

\label{sec:overview}

Our method is based on exploiting temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
 \item We initiate occlusion queries on nodes of the hierarchy where
   the traversal terminated in the last frame. Thus we avoid queries
   on all previously visible interior nodes~\cite{bittner01:jvca}.
 \item We assume that a previously visible leaf node remains visible
   and render the associated geometry without waiting for the result
   of the corresponding occlusion query.
 \item Issued occlusion queries are stored in a query queue until they
   are known to have been carried out by the GPU. This allows
   interleaving the queries with the rendering of visible geometry.
\end{itemize}


The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame by $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$ and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
the occlusion query and store it in the query queue. Then we
immediately render the associated geometry without waiting for the
result of the query.
\item For a previously invisible node, we issue the query and store
it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
  \input{code/pseudocode2}
 }
  \caption{Pseudo-code of coherent hierarchical culling.}
  \label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.

In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

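The traversal loop sketched below condenses these rules into C++. It is a
simplified sketch of the pseudocode in Figure~\ref{fig:pseudocode}, not a
literal transcription; the node fields and the helpers
(\texttt{QueryResultAvailable()}, \texttt{GetQueryResult()},
\texttt{IssueOcclusionQuery()}, \texttt{PullUpVisibility()},
\texttt{PushChildrenFrontToBack()}) are assumptions of this sketch:

{\footnotesize
\begin{verbatim}
#include <stack>
#include <queue>

// Renders a leaf or schedules its children in front-to-back order.
void TraverseNode(KDNode *n, std::stack<KDNode*> &stack, Camera &cam)
{
    if (n->IsLeaf())
        RenderGeometry(n);               // render associated objects
    else
        PushChildrenFrontToBack(stack, n, cam);
}

void RenderFrameCHC(KDNode *root, Camera &cam, int frameId)
{
    std::stack<KDNode*> traversalStack;
    std::queue<KDNode*> queryQueue;      // queries in flight
    traversalStack.push(root);

    while (!traversalStack.empty() || !queryQueue.empty()) {
        // Part 1: consume finished queries; block only when there is
        // nothing left to traverse.
        while (!queryQueue.empty() &&
               (QueryResultAvailable(queryQueue.front()) ||
                traversalStack.empty())) {
            KDNode *n = queryQueue.front();
            queryQueue.pop();
            if (GetQueryResult(n->query) > 0) {  // node proved visible
                PullUpVisibility(n);             // propagate change upwards
                if (!n->wasVisible)              // pull-down: traverse nodes
                    TraverseNode(n, traversalStack, cam); // that became visible
            }
        }

        // Part 2: front-to-back traversal of the hierarchy.
        if (traversalStack.empty())
            continue;
        KDNode *n = traversalStack.top();
        traversalStack.pop();
        if (!cam.InsideViewFrustum(n->bbox))
            continue;

        n->wasVisible = n->visible && (n->lastVisited == frameId - 1);
        n->visible = false;              // invisible by default; the
        n->lastVisited = frameId;        // pull-up restores visibility

        if (!n->wasVisible || n->IsLeaf()) {     // termination node
            n->query = IssueOcclusionQuery(n);
            queryQueue.push(n);
        }
        if (n->wasVisible)               // opened node or visible leaf:
            TraverseNode(n, traversalStack, cam); // process immediately
    }
}
\end{verbatim}
}
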
\begin{figure*}[htb]
  \centerline{
    \includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
  \caption{(left) Visibility classification of a node of the kD-tree and
    the termination nodes. (right) Visibility classification after the
    application of the occlusion test and the new set of termination nodes.
    Nodes on which occlusion queries were applied are depicted with a
    solid outline. Note the pull-up and pull-down due to visibility changes.}
  \label{fig:cut}
\end{figure*}

The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node information about its visibility and the
id of the last frame in which it was visited. A node is an opened node
if it is a visible interior node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.

\subsection{Reduction of the number of queries}

Our method reduces the number of visibility queries in two ways:
Firstly, as in other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (the opened nodes plus the termination
nodes). Secondly, by avoiding queries on opened nodes we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects the spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

\item {\em Temporal coherence.} If visibility remains constant, the set
of termination nodes needs no adaptation. If an occluded node becomes
visible, we recursively process its children (pull-down). If a visible
node becomes occluded, we propagate the change higher in the hierarchy
(pull-up). A pull-down reflects a spatial growing of visible
regions. Similarly, a pull-up reflects a spatial growing of occluded
regions.

\end{itemize}

By avoiding queries on the opened nodes, we can save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize the boxes. In particular,
assuming that the sum of the screen space projected area for nodes at
each level of the kD-tree is equal and the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.

\subsection{Reduction of CPU stalls and GPU starvation}

\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminates the need to
wait for the query result during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in between and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture a
possible occlusion of $A$ by $B$, since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

\subsection{Front-to-back scene traversal}

For kD-trees the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the viewpoint
and the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.

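A sketch of such a priority-queue driven traversal is given below;
\texttt{Distance()} is an assumed helper returning the minimal distance
between the current viewpoint and the bounding box of a node:

{\footnotesize
\begin{verbatim}
#include <queue>
#include <vector>

// Order nodes by increasing distance from the viewpoint.
struct FrontToBack {
    bool operator()(const Node *a, const Node *b) const {
        return Distance(a) > Distance(b);  // nearer node = higher priority
    }
};

void TraverseFrontToBack(Node *root)
{
    std::priority_queue<Node*, std::vector<Node*>, FrontToBack> pq;
    pq.push(root);
    while (!pq.empty()) {
        Node *n = pq.top();
        pq.pop();
        ProcessNode(n);                 // culling logic as described above
        for (Node *child : n->children) // works for any spatial hierarchy,
            pq.push(child);             // not only for kD-trees
    }
}
\end{verbatim}
}
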

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
scene in a breadth-first manner, we process the scene in depth
layers. Within each layer, the node processing order is practically
independent, which minimizes the problem of occlusion query
dependence. The breadth-first traversal interleaves
occlusion-independent nodes, which can provide a more accurate
visibility classification if visibility changes quickly. In
particular, it reduces the problem of false classifications due to
missed occlusion by nodes waiting in the query queue (discussed in
Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
breadth-first traversal, we are no longer restricted to the
kD-tree. Instead we can use an arbitrary spatial data structure
such as a bounding volume hierarchy, octree, grid, hierarchical grid,
etc. Once we compute the distance from a node to the viewpoint, the
node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which the maintenance of a spatial classification of moving objects can
be costly.

\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of an
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node to match the expected
availability of the query result as precisely as possible.

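With the ARB extension, this non-blocking check maps to a single call, as
the following sketch shows (\texttt{query} denotes the query object of the
oldest entry in the query queue):

{\footnotesize
\begin{verbatim}
// Check whether the result of the oldest query is ready, without blocking.
GLint available = 0;
glGetQueryObjectivARB(query, GL_QUERY_RESULT_AVAILABLE_ARB, &available);
if (available) {
    GLuint visiblePixels = 0;
    glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
    // ... update the visibility classification of the node
} else {
    // keep traversing other nodes instead of stalling the CPU
}
\end{verbatim}
}
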

\section{Further Optimizations}

\label{sec:optimizations}

This section discusses a couple of optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on some user-specified parameters that should be tuned for a
particular scene and hardware configuration.

\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within the
given number of frames we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, there is a large number of visible nodes
being tested, which does not provide any benefit since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have already become invisible from the moment when the
last occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more complicated scenario this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.

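A minimal sketch of this test, assuming a per-node
\texttt{lastQueriedFrame} field and a user-specified constant
\texttt{assumedVisibleFrames}:

{\footnotesize
\begin{verbatim}
// Skip the occlusion query if the node was recently proven visible.
bool NeedsQuery(const KDNode *n, int frameId, int assumedVisibleFrames)
{
    if (n->visible && frameId - n->lastQueriedFrame < assumedVisibleFrames)
        return false;  // conservatively assume the node is still visible
    return true;
}
\end{verbatim}
}
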

\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively in
cases where only a small part of the node is visible. We compare the
number of visible pixels returned by the occlusion query with a
user-specified constant and cull the node if this number drops
below this constant.

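In code, this optimization only changes the test on the query result; in
the following sketch, \texttt{visibilityThreshold} is the user-specified
pixel count:

{\footnotesize
\begin{verbatim}
// A threshold of 0 reproduces the conservative behavior of the
// basic algorithm; larger values cull nodes more aggressively.
bool NodeVisible(GLuint visiblePixels, GLuint visibilityThreshold)
{
    return visiblePixels > visibilityThreshold;
}
\end{verbatim}
}
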
\subsection{Complete elimination of CPU stalls}

\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, it stalls the CPU by waiting for the query result. To
completely eliminate the CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
the whole subtree if it is an interior node). The node is marked as
rendered but the associated occlusion query is kept in the queue to
fetch its result later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query arrives in an optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent by waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.

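A sketch of this speculative step; \texttt{EstimatedRenderCost()},
\texttt{RenderSubtree()}, the queue interface, and the constant
\texttt{maxSpeculativeCost} are assumptions of this sketch:

{\footnotesize
\begin{verbatim}
// Traversal stack is empty and the oldest query is not ready yet:
// speculatively render the cheapest pending node instead of stalling.
void AvoidStall(QueryQueue &queue, float maxSpeculativeCost)
{
    KDNode *cheapest = queue.FindLowestEstimatedCost();
    if (cheapest && EstimatedRenderCost(cheapest) < maxSpeculativeCost) {
        RenderSubtree(cheapest);            // render now, but keep the
        cheapest->renderedThisFrame = true; // query in the queue
    } else {
        WaitForQueryResult(queue.Front());  // too risky: accept the stall
    }
}
\end{verbatim}
}
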
\begin{figure}%[htb]
{\footnotesize
  \input{code/pseudocode_batching}
 }
  \caption{Pseudo-code of coherent hierarchical culling using multiple queries.}
  \label{fig:pseudocode_batching}
\end{figure}

\subsection{Initial depth pass}
\label{sec:initial}

\begin{figure}
\centering
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_wrong}
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_delayed}
\caption{(left) All passes are rendered with CHC. Note that the soldiers are
visible through the tree. (right) Only the solid passes are rendered using CHC; the transparent passes are rendered afterwards.}
\label{fig:online_transparency}
\end{figure}

To achieve maximal performance on modern GPUs, one has to take care of a number of issues.
First, it is very important to reduce material switching. Thus modern rendering engines sort the
objects (or patches) by material in order to eliminate material switching as much as possible.

Second, materials can be very costly: sometimes complicated shaders have to be evaluated several times per batch. Hence, rendering the full material for fragments which eventually turn out to be occluded should be avoided. This can be achieved
by rendering an initial depth pass (i.e., enabling only depth writes to fill the depth
buffer). Afterwards the geometry is rendered again, this time with full shading. Because the
depth buffer is already established, invisible fragments will be discarded before any shading is calculated.

This approach can be naturally adapted for use with the CHC algorithm. Only the initial depth pass is rendered
in front-to-back order using the CHC algorithm. The initial pass is sufficient to fill the depth buffer and determine the visible geometry. Then only the visible geometry is rendered again, exploiting the full optimization and material sorting capability of the rendering engine.

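In OpenGL, the two passes can be set up as sketched below;
\texttt{RenderWithCHC()} and \texttt{RenderVisibleGeometry()} are assumed
hooks into the rendering engine:

{\footnotesize
\begin{verbatim}
// Pass 1: depth only, traversed front to back with CHC.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); // no color writes
glDepthMask(GL_TRUE);                                // fill depth buffer
glDepthFunc(GL_LESS);
RenderWithCHC();             // establishes depth and the visible set

// Pass 2: full shading of the visible geometry only.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);       // depth buffer is already established
glDepthFunc(GL_EQUAL);       // shade only the surviving fragments
RenderVisibleGeometry();     // material-sorted by the engine
\end{verbatim}
}
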
If a material requires several rendering passes, we can use a variant of the depth pass method. We render
only the first passes using the algorithm (e.g., the solid passes), determining the visibility of the patches, and
render all the other passes afterwards. This approach can be used when there are passes which require a special
kind of sorting to be rendered correctly (e.g., transparent passes, shadow passes). In Figure~\ref{fig:online_transparency},
we can see that artifacts occur (left image) if the transparent passes are not rendered in the correct order after
applying the hierarchical algorithm (right image). In a similar fashion, we are able to handle shadows (Figure~\ref{fig:chc_shadows}).

\begin{figure}
\centering
\includegraphics[width=0.35\textwidth,draft=\DRAFTIMAGES]{images/chc_shadows}
\caption{We can correctly handle shadow volumes together with CHC.}
\label{fig:chc_shadows}
\end{figure}

\subsection{Batching multiple queries}
\label{sec:batching}

When occlusion queries are rendered interleaved with geometry, there is always a state change involved. To
reduce state changes, it is beneficial not to execute one query at a time, but multiple queries at once~\cite{Staneker:2004:EMO}.
Instead of immediately executing a query for a node when we fetch it from the traversal stack, we add it to a {\em pending queue}. Once {\em n} of these queries are accumulated in the queue, we execute them all at once.
To obtain an optimal value for {\em n}, several heuristics can be applied, e.g., a fraction of the number of queries issued in the last frame. The pseudo-code of the algorithm including the batching is given in Figure~\ref{fig:pseudocode_batching}.

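A sketch of the batching step is given below; the pending queue, the query
helpers, and \texttt{batchSize} (corresponding to {\em n}) are assumptions
of this sketch:

{\footnotesize
\begin{verbatim}
// Accumulate query candidates and issue them in one batch in order to
// reduce state changes.
void AddToPendingQueue(std::deque<KDNode*> &pending, KDNode *node,
                       std::queue<KDNode*> &queryQueue, size_t batchSize)
{
    pending.push_back(node);
    if (pending.size() >= batchSize) {
        SetQueryRenderState();          // one state change for the batch
        for (KDNode *n : pending) {
            IssueOcclusionQuery(n);     // render bounding volume with query
            queryQueue.push(n);
        }
        RestoreRenderState();
        pending.clear();
    }
}
\end{verbatim}
}
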

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4 processor, 1GB of memory, and a
GeForce FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface area
heuristic~\cite{MacDonald90}.

Although the teapot scene would intuitively offer good occlusion, it
is a complicated case to handle for occlusion culling. Firstly, the
teapots consist of small triangles, so only the effect of fused
occlusion due to a large number of triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, and so we can see almost
half of the teapots along the longer side of the arrangement.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good occlusion. This scene is
an interesting candidate for testing not only due to its size, but
also due to significant changes in visibility and depth complexity in
its different parts.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}

\subsection{Basic tests}

We have measured the frame times for rendering with only view-frustum culling
(VFC), the hierarchical stop-and-wait method (S\&W), and our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement of teapots along its longer
side (frames 25--90). In the city we elevated the
viewpoint above the roofs and gained sight over most of the city
(frames 1200--1800). The power plant walkthrough contains several
viewpoints from which a large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity where occlusion culling produces a
significant speedup over VFC (e.g., frame 3800).

As we can see for a number of frames in the walkthroughs, the CHC method
can produce a speedup of more than one order of magnitude compared to
VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline\hline
scene                  & method   & \#queries & wait time [ms] & rendered triangles & frame time [ms]  & speedup \\\hline\hline

Teapots                & VFC      & ---        &  ---     & 11,139,928  & 310.42  & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles   & S\&W     & 4704     & 83.19  & 2,617,801  & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes   & {\bf CHC} &{\bf 2827} & {\bf 1.31}  & {\bf 2,852,514 } & {\bf 81.18 } & {\bf 4.6 } \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &  ---     & 2,617,801  & 72.19  & 5.2 \\
                       \hline\hline
City                   & VFC      & ---        &  ---     & 156,521    & 19.79 & 1.0   \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles    & S\&W     & 663      &  9.49  & 30,594     & 19.9  & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes   &{\bf CHC} & {\bf 345} & {\bf 0.18}   & {\bf 31,034}  & {\bf 8.47}  & {\bf 4.0}  \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    & ---        &   ---   & 30,594     & 4.55  & 6.6  \\
                       \hline\hline
Power Plant            & VFC      & ---    &  ---   & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles   & S\&W    & 485   & 16.16  & 392,962  & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes   &{\bf CHC} & {\bf 263}  & {\bf 0.70}  & {\bf 397,920} & {\bf 38.73}  & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
                       & Ideal    &    ---      &  ---   &   392,962    &  36.34  & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values
  are averages over all frames (including the speedup).}
\label{tab:averages}
\end{table*}

Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.

We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0, and
4.7 over view-frustum culling and average speedups of 2.0, 2.6, and
1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning the
accuracy, the increase in the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This increase
was always outweighed by the reduction of CPU stalls for the tested
walkthroughs.

\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait time in
Table~\ref{tab:averages}). We did not find constants that could
produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued queries
and the average frame time as a function of the number of frames a node is
assumed visible and the pixel threshold of approximate visibility. We
have observed that the effectiveness of the optimizations depends strongly on the scene. If
the hierarchy is deep and the geometry associated with a leaf node is
not too complex, the conservative visibility testing produces a
significant speedup (city and power plant). For the teapot scene the
penalty for false rendering of actually occluded objects became larger
than the savings achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by introducing errors in the image
proportional to the pixel threshold used.

\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
 \hline\hline
scene                    & $t_{av}$ &  $n_{vp}$ & \#queries & frame time [ms]   \\ \hline\hline
\multirow{3}{*}{Teapots} & 0        &    0         & 2827      & 81.18  \\ \hhline{~----}
                         & 2       &    0         & 1769      &  86.31 \\ \hhline{~----}
                         & 2       &    25         & 1468      & 55.90  \\ \hline\hline
\multirow{3}{*}{City}    & 0        &    0         & 345        & 8.47  \\ \hhline{~----}
                         & 2        &    0         & 192        & 6.70 \\ \hhline{~----}
                         & 2       &    25         & 181       & 6.11  \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0  &    0        & 263        & 38.73  \\ \hhline{~----}
                         & 2        &    0         & 126        & 31.17 \\ \hhline{~----}
                         & 2       &    25         & 120       & 36.62  \\ \hline\hline
\end{tabular}

\caption{Influence of optimizations on the CHC method.
  $t_{av}$ is the number of assumed visibility frames for conservative
  visibility testing, $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}

\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26ms faster per frame on average, our method does not require costly
precomputation and can be used at any general 3D position in the
model, not only in a predefined view space.


\section{Conclusion}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
in order to minimize the number of the queries and their latency.
This is achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation
are almost completely eliminated at the cost of a slight increase in
the number of rendered objects.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometrical operations or data structures. The algorithm
is suitable for application on scenes of arbitrary structure and it
requires no preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
have obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and of 1.6--2.6 compared to the hierarchical
stop-and-wait application of occlusion queries.

The major potential for improving the method lies in a better estimation
of changes in the visibility classification of hierarchy nodes. If
nodes tend to be mostly visible, we could automatically decrease the
frequency of occlusion tests and thus better adapt the method to the
actual occlusion in the scene. Another possibility for improvement is
better tuning for particular graphics hardware by means of a more
accurate rendering cost estimation: skipping occlusion tests for
simple geometry can be faster than issuing comparably expensive
occlusion queries.

\begin{figure*}[htb]
  \centerline{
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
     \hfill
     \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
     \hfill
   }
   \caption{The test scenes: the teapots, the city, and the power plant.}
   \label{fig:scenes}
\end{figure*}