\chapter{Online Visibility Culling}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. Overdraw wastes computational effort in both the pixel and the
vertex processing stages of modern graphics hardware.
%that is, we render each pixel in the image several times. For many complex
%scenes with high occlusion, the effort wasted on overdraw cannot simply be
%eliminated by increasing the raw computational power of the hardware.
%This problem becomes even more apparent on recent graphics hardware with
%programmable vertex and pixel shaders: all complex per-vertex or per-pixel
%computations on an object are completely wasted if the object is occluded.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.
---|

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for cells of a fixed subdivision of the scene. At runtime, we
can quickly identify a PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of performing extra computations in each frame. To make these additional
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by a few closest depth layers).
---|

Recent graphics hardware natively supports an \emph{occlusion query}
that detects the visibility of an object against the current contents of the
z-buffer. Although the query itself is processed quickly using the
raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated into
existing real-time rendering packages on architectures supporting the
underlying occlusion query.
---|

%Using spatial and assuming temporal coherence

\section{Related Work}

---|
With the demand for rendering scenes of ever increasing size,
a number of visibility culling methods have been developed in the
last decade. A comprehensive survey of visibility culling methods
was presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another
recent survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

---|
According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In the following discussion we focus on online occlusion culling methods
that exploit graphics hardware.
---|

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer allows us to quickly determine whether the geometry in question is
occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of
fragments in the graphics pipeline (e.g., the Hyper-Z technology of ATI or
the Z-cull of NVIDIA). However, the geometry still needs to be sent to the
GPU, transformed, and coarsely rasterized even if it is later
determined invisible.
---|

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.
---|

%% In
%% parallel to rendering, the visibility server calculates visibility for
%% the neighborhood of the given viewpoint and sends them back to the
%% display host.

%% A similar method designed by Aila and
%% Miettinen~\cite{aila:00:msc}. The incremental occlusion maps are
%% created on the CPU, which takes the CPU computational time but it does
%% not suffer the problem of slow read-back from the GPU.

%Klosowski:2001:ECV

---|
Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. A first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card and is now also available as an official ARB extension.

%Both
%extensions have been used in several occlusion culling
%algorithms~\cite{Meissner01,Hillesland02,Staneker:2004:OCO}.
%\cite{Meissner01}

---|
Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit the temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions to
balance the accuracy of visibility classification and the associated
computational costs.

---|
Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus
suffers from CPU stalls and GPU starvation.

---|
%% The methods proposed in this paper can be used to make use of temporal
%% and spatial coherence in the scope of existing visibility algorithms,
%% that utilise a spatial hierarchy. Examples of these are algorithms
%% based on hierarchical occlusion maps~\cite{Zhang97}, coverage
%% masks~\cite{Greene:1996:HPT}, shadow frusta~\cite{Hudson97}, and
%% occlusion trees~\cite{bittner98b_long}.

---|
On a theoretical level, our work is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

---|
The above mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. In
contrast, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is also
vital for this chapter, is to avoid repeated visibility tests of
interior nodes of the hierarchy. The problem with a direct adoption of
this method is that it is designed for use with instantaneous CPU-based
occlusion queries, whereas hardware occlusion queries exhibit
significant latency. The method presented herein efficiently overcomes
the problem of latency while keeping the benefits of the generality and
simplicity of the original hierarchical technique. As a result we
obtain a simple and efficient occlusion culling algorithm utilizing
hardware occlusion queries.


%In this chapter we show how to overcome this problem while
%keeping the benefits of exploiting.

---|
The rest of the chapter is organized as follows:
Section~\ref{sec:hwqueries} discusses hardware supported occlusion
queries and a basic application of these queries using a kD-tree.
Section~\ref{sec:hoq} presents our new algorithm and
Section~\ref{sec:optimizations} describes several additional
optimizations. Section~\ref{sec:results} presents results obtained by
an experimental evaluation of the method and discusses its
behavior. Finally, Section~\ref{sec:conclusion} concludes the chapter.


---|
\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

---|
Hardware occlusion queries follow a simple pattern: to test the visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and need not be rendered.

---|
\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

---|
\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free since we use the intermediate result of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}


---|
\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

---|
%% Issuing multiple independent NV queries is crucial for
%% the design of algorithms that strive to exploit the power of the GPU.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

---|

\begin{figure}[htb]
\centerline{
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
}

\caption{(top) Illustration of CPU stalls and GPU starvation.
Qn, Rn, and Cn denote querying, rendering, and culling of object
n, respectively. Note that object 5 is found invisible by Q5 and
thus not rendered. (bottom) More efficient query scheduling. The
scheduling assumes that objects 4 and 6 will be visible in the
current frame and renders them without waiting for the result of
the corresponding queries.}
\label{fig:latency}
\end{figure}


---|
\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, to
exploit the spatial coherence of visibility, and thereby provides a key
to achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proved to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.

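A minimal sketch of this traversal, with a mock blocking query standing in for the hardware test (the node type and names are our own illustration):

```cpp
#include <cassert>
#include <vector>

// Hierarchical stop-and-wait traversal: query a node, block on the
// result, then either cull the subtree, recurse, or render a leaf.
struct Node {
    std::vector<Node*> children;   // empty for a leaf
    bool visible = false;          // mocked query result
    int objectCount = 0;           // geometry stored in a leaf
};

int renderedObjects = 0;

// Stands in for issuing a hardware occlusion query and waiting for
// its result; on real hardware this wait is what stalls the CPU.
bool issueQueryAndWait(const Node* n) {
    return n->visible;
}

void stopAndWait(Node* n) {
    if (!issueQueryAndWait(n))              // CPU stalls here on real HW
        return;                             // occluded: cull whole subtree
    if (n->children.empty()) {
        renderedObjects += n->objectCount;  // visible leaf: render it
        return;
    }
    for (Node* c : n->children)             // assumed front-to-back order
        stopAndWait(c);
}
```

Every single query blocks the traversal, which is precisely the weakness discussed next.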
---|
The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting the temporal coherence of
visibility classifications.


---|
\section{Coherent Hierarchical Culling}
\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

%and present a
%generalization of the method to other spatial data structures.

\subsection{Algorithm Overview}
\label{sec:overview}

Our method is based on exploiting the temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

---|
\begin{itemize}
\item We initiate occlusion queries on nodes of the hierarchy where
the traversal terminated in the last frame. Thus we avoid queries
on all previously visible interior nodes~\cite{bittner01:jvca}.
\item We assume that a previously visible leaf node remains visible
and render the associated geometry without waiting for the result
of the corresponding occlusion query.
\item Issued occlusion queries are stored in a query queue until they
are known to have been carried out by the GPU. This allows
interleaving the queries with the rendering of visible geometry.
\end{itemize}


---|
The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$ and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

---|
\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
the occlusion query and store it in the query queue. Then we
immediately render the associated geometry without waiting for the
result of the query.
\item For a previously invisible node, we issue the query and store
it in the query queue.
\end{itemize}

---|
\begin{figure}[htb]
{\footnotesize
\input{code/pseudocode2}
}
\caption{Pseudo-code of coherent hierarchical culling.}
\label{fig:pseudocode}
\end{figure}

---|
When the query queue is not empty, we check whether the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.

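The traversal and queue handling described above can be sketched as follows. This is our own simplified model, not the pseudo-code of Figure~\ref{fig:pseudocode}: query latency is mocked as a fixed number of polls before a result becomes available, and the pull-up of invisibility is reduced to updating per-node flags.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Node {
    std::vector<Node*> children;   // empty for a leaf
    bool wasVisible = false;       // classification from the last frame
    bool queryResult = false;      // what the mock GPU will report
    int latency = 0;               // polls until the result is ready
};

std::deque<Node*> queryQueue;      // issued, not yet finished queries
std::vector<Node*> rendered;       // leaves whose geometry was drawn

void issueQuery(Node* n) { queryQueue.push_back(n); }

// Mock of "is the query result available?"; each poll advances time.
bool resultAvailable(Node* n) { return n->latency-- <= 0; }

void traverse(Node* n);            // forward declaration

void processResult(Node* n) {      // consume a finished query
    bool was = n->wasVisible;
    n->wasVisible = n->queryResult;
    if (!n->queryResult) return;                 // invisible: cull subtree
    if (n->children.empty()) {
        if (!was) rendered.push_back(n);         // newly visible leaf
    } else {
        for (Node* c : n->children) traverse(c); // pull-down
    }
}

void traverse(Node* n) {
    if (n->children.empty() && n->wasVisible) {
        issueQuery(n);                 // previously visible leaf:
        rendered.push_back(n);         // render without waiting
    } else if (!n->wasVisible) {
        issueQuery(n);                 // previously invisible node
    } else {
        for (Node* c : n->children)    // opened node: skip the query
            traverse(c);
    }
    // Interleave: consume query results that are already available.
    while (!queryQueue.empty() && resultAvailable(queryQueue.front())) {
        Node* done = queryQueue.front();
        queryQueue.pop_front();
        processResult(done);
    }
}

void finishFrame() {                   // at frame end, wait for the rest
    while (!queryQueue.empty()) {
        Node* done = queryQueue.front();
        queryQueue.pop_front();
        processResult(done);
    }
}
```

Note how a previously visible leaf is rendered immediately while its query is still in flight; only at the end of the frame does the algorithm block on the remaining results.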
---|
In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: an interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

---|
\begin{figure*}[htb]
\centerline{
\includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
\caption{(left) Visibility classification of a node of the kD-tree and
the termination nodes. (right) Visibility classification after the
application of the occlusion test and the new set of termination nodes.
Nodes on which occlusion queries were applied are depicted with a
solid outline. Note the pull-up and pull-down due to visibility changes.}
\label{fig:cut}
\end{figure*}


---|
The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by storing
with each node its visibility classification and the id of the last
frame in which it was visited. A node is an opened node
if it is a visible interior node that was visited in the last frame
(line 23 in the pseudo-code). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.

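The described pull-up trick can be sketched with a minimal node type of our own: every visited node starts the frame invisible, and a visible classification is propagated upwards until an already-visible ancestor is reached, so children never need to be checked for invisibility.

```cpp
#include <cassert>

struct Node {
    Node* parent = nullptr;
    bool visible = false;
    int lastVisited = -1;   // id of the last frame this node was visited
};

// Called when a node is first reached in frame `frameId`:
// mark it invisible by default.
void markVisited(Node* n, int frameId) {
    n->visible = false;
    n->lastVisited = frameId;
}

// Called when a query classifies `n` visible: pull the change up,
// stopping at the first ancestor that is already visible.
void pullUpVisible(Node* n) {
    while (n && !n->visible) {
        n->visible = true;
        n = n->parent;
    }
}
```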
---|
%% \begin{figure*}[htb]
%% \centerline{\includegraphics[width=0.8\textwidth,draft=\DRAFTFIGS]{figs/basic}}
%% \caption{Illustration of the hierarchy traversal. Initially the algorithm
%% starts at the root of the hierarchy (left). In the second frame the
%% opened nodes $\mathcal{O}_0$ are skipped and the occlusion queries
%% are first applied on the termination nodes $\mathcal{T}_0$. Visibility changes are propagated upwards in the hierarchy and a
%% new set of termination nodes $\mathcal{T}_1$ is established.}
%% \label{fig:basic}
%% \end{figure*}


\subsection{Reduction of the number of queries}

%Skipping the opened nodes and issuing occlusion queries for the
%termination nodes assists in exploiting two main characteristics of
%scene visibility:

%Identifying the set of termination nodes assists in finding a subdivision

---|
Our method reduces the number of visibility queries in two ways:
first, like other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Second, by avoiding queries on opened nodes we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

---|
\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
approximate the occluded part of the scene with the smallest number of
nodes with respect to the given hierarchy, i.e., each invisible
termination node has a visible parent. This induces an adaptive
spatial subdivision that reflects the spatial coherence of visibility,
more precisely the coherence of occluded regions. The adaptive nature
of the subdivision allows us to minimize the number of subsequent
occlusion queries by applying the queries on the largest spatial
regions that are expected to remain occluded.

%, and visible termination
%nodes correspond to the smallest unoccluded regions (visible
%leafs)

\item {\em Temporal coherence.} If visibility remains constant, the set
of termination nodes needs no adaptation. If an occluded node becomes
visible, we recursively process its children (pull-down). If a visible
node becomes occluded, we propagate the change higher in the hierarchy
(pull-up). A pull-down reflects a spatial growing of visible
regions. Similarly, a pull-up reflects a spatial growing of occluded
regions.

%The first case corresponds to pull-down of the cut, the latter to pull-up.

\end{itemize}

---|
By avoiding queries on the opened nodes, we can save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize the boxes. In particular,
assuming that the sum of the screen space projected areas of the nodes at
each level of the kD-tree is equal and that the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.


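The $(d+2)$ factor follows from a short calculation of our own under the stated assumptions: let $A$ be the total projected screen-space area of the nodes at any single level of the tree (equal for all levels by assumption). Querying every node of the opened subtree (levels $0$ through $d$) plus the termination nodes (level $d+1$) rasterizes $d+2$ levels, whereas querying only the termination nodes rasterizes one:

```latex
\[
  \frac{\mbox{fill rate without skipping opened nodes}}
       {\mbox{fill rate of our method}}
  \;=\; \frac{\sum_{l=0}^{d+1} A}{A}
  \;=\; \frac{(d+2)\,A}{A}
  \;=\; d+2 .
\]
```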
---|
\subsection{Reduction of CPU stalls and GPU starvation}
\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminate the need to wait
for a query result during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in the meantime and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If the visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture the
possible occlusion of $A$ by $B$, since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

---|
%% It is possible that we have already traversed all previously
%% visible nodes and we must stall the CPU by waiting for the result
%% of the oldest occlusion query. A technique completely eliminating
%% CPU stalls at the cost of an increase of the number of rendered
%% objects will be discussed in Section~\ref{sec:full}.

\subsection{Front-to-back scene traversal}

---|
539 | implemented using a depth first |
---|
540 | traversal~\cite{bittner01:jvca}. However, at a modest increase in |
---|
541 | computational cost we can also use a more general breadth-first |
---|
542 | traversal based on a priority queue. The priority of the node then |
---|
543 | corresponds to an inverse of the minimal distance of the viewpoint and |
---|
544 | the bounding box associated with the given node of the |
---|
545 | kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}. |
---|
546 | |
---|
547 | In the context of our culling algorithm, there are two main advantages |
---|
548 | of the breadth-first front-to-back traversal : |
---|
549 | |
---|
550 | \begin{itemize} |
---|
551 | \item {\em Better query scheduling.} By spreading the traversal of the |
---|
552 | scene in a breadth-first manner, we process the scene in depth |
---|
553 | layers. Within each layer, the node processing order is practically |
---|
554 | independent, which minimizes the problem of occlusion query |
---|
555 | dependence. The breadth-first traversal interleaves |
---|
556 | occlusion-independent nodes, which can provide a more accurate |
---|
557 | visibility classification if visibility changes quickly. In |
---|
558 | particular, it reduces the problem of false classifications due to |
---|
559 | missed occlusion by nodes waiting in the query queue (discussed in |
---|
560 | Section~\ref{sec:latency}). |
---|
561 | |
---|
562 | \item {\em Using other spatial data structures.} By using a |
---|
563 | breadth-first traversal, we are no longer restricted to the |
---|
564 | kD-tree. Instead we can use an arbitrary spatial data structure |
---|
565 | such as a bounding volume hierarchy, octree, grid, hierarchical grid, |
---|
566 | etc. Once we compute a distance from a node to the viewpoint, the |
---|
567 | node processing order is established by the priority queue. |
---|
568 | \end{itemize} |
---|
569 | |
---|
570 | When using the priority queue, our culling algorithm can also be |
---|
571 | applied directly to the scene graph hierarchy, thus avoiding the |
---|
572 | construction of any auxiliary data structure for spatial |
---|
573 | partitioning. This is especially important for dynamic scenes, in |
---|
574 | which maintenance of a spatial classification of moving objects can be |
---|
575 | costly. |
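A minimal sketch of such a priority-queue traversal follows. The dict-based node representation with \texttt{min}/\texttt{max} box corners is an assumption for illustration; note that a min-heap keyed by the minimal distance is equivalent to the max-heap on its inverse described above.

```python
import heapq
import math

def min_dist(viewpoint, box_min, box_max):
    """Minimal distance from the viewpoint to an axis-aligned bounding box
    (zero if the viewpoint lies inside the box)."""
    d2 = 0.0
    for v, lo, hi in zip(viewpoint, box_min, box_max):
        if v < lo:
            d2 += (lo - v) ** 2      # viewpoint below the slab on this axis
        elif v > hi:
            d2 += (v - hi) ** 2      # viewpoint above the slab on this axis
    return math.sqrt(d2)

def front_to_back(nodes, viewpoint):
    """Yield nodes in front-to-back order using a distance-keyed min-heap."""
    heap = [(min_dist(viewpoint, n["min"], n["max"]), i)
            for i, n in enumerate(nodes)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        yield nodes[i]

nodes = [
    {"name": "far",    "min": (3, 0, 0),    "max": (4, 1, 1)},
    {"name": "inside", "min": (-1, -1, -1), "max": (1, 1, 1)},
    {"name": "near",   "min": (1.5, 0, 0),  "max": (2, 1, 1)},
]
order = [n["name"] for n in front_to_back(nodes, (0.0, 0.0, 0.0))]
```

Because only the per-node distance computation is specific to the data structure, the same loop works unchanged for bounding volume hierarchies, octrees, or a scene graph.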

\subsection{Checking the query result}

The presented algorithm repeatedly checks whether the result of an
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our practical
experiments have shown that the cost of this check is negligible, and
thus it can be used frequently without any performance penalty. If the
cost of this check were significantly higher, we could delay asking
for the query result by a time established by empirical measurements
for the particular hardware. This delay should also reflect the size
of the queried node, to match the expected availability of the query
result as precisely as possible.


\section{Further Optimizations}

\label{sec:optimizations}

This section discusses several optimizations of our method that can
further improve the overall rendering performance. In contrast to the
basic algorithm from the previous section, these optimizations rely on
user-specified parameters that should be tuned for a particular scene
and hardware configuration.


\subsection{Conservative visibility testing}

The first optimization reduces the number of visibility tests at the
cost of a possible increase in the number of rendered objects. It is
based on the idea of skipping some occlusion tests of visible nodes:
we assume that whenever a node becomes visible, it remains visible for
a number of frames. Within the given number of frames we avoid issuing
occlusion queries and simply assume the node remains
visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied to visible nodes of the hierarchy. Especially in sparsely
occluded scenes, a large number of visible nodes is tested without any
benefit, since most of them remain visible. On the other hand, we do
not immediately capture all changes from visibility to invisibility,
and thus we may render objects that have become invisible since the
moment the last occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more elaborate scenario this number
should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.
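The skipping rule reduces to a small predicate; the \texttt{NodeState} fields and the \texttt{assume\_visible} constant below are illustrative assumptions, not fields of the paper's implementation.

```python
class NodeState:
    """Per-node visibility bookkeeping (assumed fields, for illustration)."""
    def __init__(self):
        self.visible = False        # classification from the last query
        self.last_queried = -1      # frame in which the last query was issued

def needs_query(node, frame_id, assume_visible=2):
    """Query invisible nodes every frame; re-test visible nodes only once
    their last result is at least `assume_visible` frames old."""
    if not node.visible:
        return True
    return frame_id - node.last_queried >= assume_visible

n = NodeState()
n.visible, n.last_queried = True, 10
skip = needs_query(n, 11)   # queried last frame: assume it is still visible
retest = needs_query(n, 12)  # result too old: issue a new query
```

A frame-dependent or history-dependent value of \texttt{assume\_visible} would implement the more elaborate variant mentioned above.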


\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We can
easily modify the algorithm to cull nodes more aggressively in cases
when only a small part of a node is visible: we compare the number of
visible pixels returned by the occlusion query with a user-specified
constant and cull the node if this number drops below the constant.
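In code the modification is a single comparison on the query result; the \texttt{threshold} parameter is the user-specified constant, and the values used here are illustrative.

```python
def is_visible(visible_pixels, threshold=0):
    """threshold = 0 keeps the conservative classification (exact with
    respect to the z-buffer resolution); a positive threshold culls nodes
    that contribute only a few pixels to the image."""
    return visible_pixels > threshold

conservative = is_visible(10)                # 10 visible pixels: keep the node
approximate = is_visible(10, threshold=25)   # culled under a 25-pixel threshold
```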


\subsection{Complete elimination of CPU stalls}

\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, the algorithm stalls the CPU by waiting for the query
result. To eliminate CPU stalls completely, we can speculatively
render some nodes with undecided visibility. In particular, we select
a node from the query queue and render the geometry associated with
the node (or the whole subtree if it is an interior node). The node is
marked as rendered, but the associated occlusion query is kept in the
queue so that its result can be fetched later. If we are unlucky and
the node remains invisible, the effort of rendering the node's
geometry is wasted. On the other hand, if the node has become visible,
we have used the time slot before the next query result arrives in an
optimal manner.

To avoid spending more time on rendering invisible nodes than would be
spent waiting for the query result, we select the node with the lowest
estimated rendering cost and compare this cost with a user-specified
constant. If the cost is larger than the constant, we conclude that it
is too risky to render the node and wait until the result of the query
becomes available.
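The selection rule can be sketched as follows; \texttt{cost} is an estimated rendering cost function and \texttt{max\_cost} the user-specified constant, both assumptions for illustration.

```python
def pick_speculative(query_queue, cost, max_cost):
    """If the traversal stack is empty and no query result is ready yet,
    return the pending node that is cheapest to render, but only when its
    estimated cost does not exceed `max_cost`; otherwise return None,
    meaning it is safer to wait for the query result."""
    candidate = min(query_queue, key=cost, default=None)
    if candidate is not None and cost(candidate) <= max_cost:
        return candidate
    return None

# Hypothetical pending nodes with estimated rendering costs (e.g. triangle counts).
pending = ["large_subtree", "single_lamp"]
est = {"large_subtree": 120.0, "single_lamp": 8.0}.__getitem__
choice = pick_speculative(pending, est, max_cost=50.0)    # cheap enough: render it
too_risky = pick_speculative(pending, est, max_cost=5.0)  # too expensive: wait
```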

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2\,GHz Pentium~4, 1\,GB of memory, and
a GeForce FX~5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface area
heuristic~\cite{MacDonald90}.


Although the teapot scene would intuitively offer good occlusion, it
is a complicated case for occlusion culling. Firstly, the teapots
consist of small triangles, so only the effect of fused occlusion due
to a large number of visible triangles can bring a culling
benefit. Secondly, there are many thin holes through which it is
possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, so we can see almost half of the
teapots along its longer side.

%Although the teapot scene would intuitively seem to offer good
%occlusion, it is one of the more complicated cases for occlusion
%culling. Due to the high number of very dense objects, a very fine
%kD-tree subdivision was necessary, which in turn leads to higher costs
%for testing many nodes in the hierarchy. Additionally, misclassifying
%a single teapot can lead to a significant increase in frame time due
%to the high polygon density (2,304 polygons per teapot). There are also
%many locations where it is possible to see very far into the block of
%teapots. This can be seen in the kD-tree node classification which is
%shown in the accompanying video.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good occlusion. This scene is
an interesting test candidate not only due to its size, but also due
to the significant changes in visibility and depth complexity across
its different parts.

%as can also be seen in the
%walkthrough. A relatively coarse subdivision of the model triangles
%into about 4,600 nodes was sufficient to capture most occlusion
%interdependencies.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.



%% However, we still included a section in the walkthrough where the
%% viewpoint is elevated above the building roofs in order to show that
%% the algorithm continues to work without modification, albeit at some
%% expense. It will be an interesting topic of research to automatically
%% recognize such situations and decrease the frequency of visibility
%% queries or turn the occlusion culling algorithm off completely.



\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}


\subsection{Basic tests}


We have measured frame times for rendering with view-frustum culling
only (VFC), the hierarchical stop-and-wait method (S\&W), and our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

%% In order to obtain repeatable and comparable timings, the process
%% priority of our application was set to high, timings were obtained
%% using the CPU cycle counter instruction, and the OpenGL {\tt Finish()}
%% instruction was executed before each invocation of the timer at the
%% start and the end of a frame. Furthermore, we did not include clear
%% and buffer swapping times in our results, since they depend strongly
%% on the windowing mode used.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variation in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
  the spike around frame 1600, where the viewpoint was elevated above the roofs,
  practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
  The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
  The CHC method can keep up even in these situations, and at the same time it exploits occlusion when it appears
  (e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement along its longer side (frames
25--90). In the city we elevated the viewpoint above the roofs and
gained sight over most of the city (frames 1200--1800). The power
plant walkthrough contains several viewpoints from which a large part
of the model is visible (spikes in Figure~\ref{fig:plant_walkthrough}
where all algorithms are slow), viewpoints along the border of the
model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity, where
occlusion culling produces a significant speedup over VFC (e.g., frame
3800).

As we can see for a number of frames in the walkthroughs, the CHC
method can produce a speedup of more than one order of magnitude
compared to VFC. The maximum speedup for the teapot, city, and power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W, and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline\hline
scene & method & \#queries & wait time [ms] & rendered triangles & frame time [ms] & speedup \\\hline\hline

Teapots & VFC & --- & --- & 11,139,928 & 310.42 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles & S\&W & 4704 & 83.19 & 2,617,801 & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes & {\bf CHC} & {\bf 2827} & {\bf 1.31} & {\bf 2,852,514} & {\bf 81.18} & {\bf 4.6} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 2,617,801 & 72.19 & 5.2 \\
\hline\hline
City & VFC & --- & --- & 156,521 & 19.79 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles & S\&W & 663 & 9.49 & 30,594 & 19.9 & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes & {\bf CHC} & {\bf 345} & {\bf 0.18} & {\bf 31,034} & {\bf 8.47} & {\bf 4.0} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 30,594 & 4.55 & 6.6 \\
\hline\hline
Power Plant & VFC & --- & --- & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles & S\&W & 485 & 16.16 & 392,962 & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes & {\bf CHC} & {\bf 263} & {\bf 0.70} & {\bf 397,920} & {\bf 38.73} & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 392,962 & 36.34 & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with view-frustum culling only, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values are averages over
  all frames (including the speedup).}
\label{tab:averages}
\end{table*}
Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.


We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0,
and 4.7 over view-frustum culling and average speedups of 2.0, 2.6,
and 1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning
accuracy, the increase in the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. For the
tested walkthroughs, this increase was always outweighed by the
reduction of CPU stalls.

%% While this number may be very low, the algorithm also incurs a
%% non-negligible overhead which is caused by the graphics card
%% reconfiguration required for the queries and rasterization of the
%% bounding volume geometries.




\subsection{Optimizations}

First of all, we observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope: for all our tests, the stalls were already almost
completely eliminated by the basic algorithm (see the wait times in
Table~\ref{tab:averages}), and we did not find constants that would
produce an additional speedup with this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued
queries and the average frame time in dependence on the number of
frames a node is assumed visible and on the pixel threshold of
approximate visibility. We observed that the effectiveness of the
optimizations depends strongly on the scene. If the hierarchy is deep
and the geometry associated with a leaf node is not too complex,
conservative visibility testing produces a significant speedup (city
and power plant). For the teapot scene, the penalty for falsely
rendering actually occluded objects became larger than the savings
achieved by the reduction of the number of queries. On the other hand,
since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by errors in the image
proportional to the pixel threshold used.

%Good parameters for these
%optimizations have to be estimated on a scene-by-scene basis.




%% , where we compare the reference frame time and
%% number of queries already shown in the previous table to the
%% approximate visibility optimization using a higher pixel threshold,
%% and to the conservative visibility testing optimization assuming a
%% node remains visible for 2 frames, both for the teapot and the city
%% scene. These tests clearly show that a high visibility threshold
%% improves frame times, albeit at the expense of image quality. The
%% benefit of the conservative visibility optimization, is however
%% limited. While the number of queries can be reduced significantly, the
%% number of misclassified objects rises and increases the frame time
%% again.



%% \begin{table}
%% \centering \footnotesize
%% \begin{tabular}{|l|c|c|}
%% \hline\hline
%% & Teapots & City \\\hline\hline
%% frame time CHC & 130.42 & 13.91\\\hline
%% \#queries & 2267 & 94.04\\\hline\hline
%% \multicolumn{3}{|l|}{assume 2 frames visible}\\\hline
%% frame time & 123.84 & 13.27 \\\hline
%% \#queries & 1342 & 42 \\\hline\hline
%% \multicolumn{3}{|l|}{pixel threshold 50 pixels and assume 2 frames visible}\\\hline
%% frame time & 73.22 & 11.47 \\\hline
%% \#queries & 1132 & 50 \\\hline\hline
%% \end{tabular}
%% \caption{Influence of optimizations on two test scenes. All values are averages
%% over all frames, frame times are in [ms].}\label{tab:assumed}
%% \end{table}

\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
\hline\hline
scene & $t_{av}$ & $n_{vp}$ & \#queries & frame time [ms] \\ \hline\hline
\multirow{3}{*}{Teapots} & 0 & 0 & 2827 & 81.18 \\ \hhline{~----}
 & 2 & 0 & 1769 & 86.31 \\ \hhline{~----}
 & 2 & 25 & 1468 & 55.90 \\ \hline\hline
\multirow{3}{*}{City} & 0 & 0 & 345 & 8.47 \\ \hhline{~----}
 & 2 & 0 & 192 & 6.70 \\ \hhline{~----}
 & 2 & 25 & 181 & 6.11 \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0 & 0 & 263 & 38.73 \\ \hhline{~----}
 & 2 & 0 & 126 & 31.17 \\ \hhline{~----}
 & 2 & 25 & 120 & 36.62 \\ \hline\hline
\end{tabular}

\caption{Influence of the optimizations on the CHC method.
  $t_{av}$ is the number of frames a node is assumed visible for conservative
  visibility testing, and $n_{vp}$ is the pixel threshold for
  approximate visibility.}\label{tab:assumed}
\end{table}


\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precomputed visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26\,ms faster per frame on average, our method does not require
costly precomputation and can be used at any 3D position in the
model, not only within a predefined view space.


\section{Conclusion}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
so as to minimize both the number of queries and their latency. This
is achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation are
almost completely eliminated, at the cost of a slight increase in the
number of rendered objects.

%Additionally it schedules the queries in an order
%minimizing the overhead due to the latency of availability of query
%results.

Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward, as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometric operations or data structures. The algorithm is
suitable for scenes of arbitrary structure and requires no
preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and of 1.6--2.6 compared to the hierarchical stop-and-wait
application of occlusion queries.

The major potential for improving the method lies in a better
estimation of changes in the visibility classification of hierarchy
nodes. If nodes tend to be mostly visible, we could automatically
decrease the frequency of occlusion tests and thus better adapt the
method to the actual occlusion in the scene. Another possibility for
improvement is better tuning for particular graphics hardware by means
of a more accurate rendering cost estimation: skipping occlusion tests
for simple geometry can be faster than issuing comparably expensive
occlusion queries.


\begin{figure*}[htb]
\centerline{
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots}
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
\hfill
}
\caption{The test scenes: the teapots, the city, and the power plant.}
\label{fig:scenes}
\end{figure*}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
%% }
%% \caption{The city scene.}
%% \label{fig:images1}
%% \end{figure}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.25\textwidth,draft=\DRAFTFIGS]{images/teapots}
%% }
%% \caption{The teapot scene.}
%% \label{fig:images2}
%% \end{figure}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
%% }
%% \caption{The power plant scene.}
%% \label{fig:images3}
%% \end{figure}