\chapter{Online Culling Tool}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. This overdraw wastes computational effort in both the
pixel and the vertex processing stages of modern graphics hardware.
%that is, we render each pixel in the image several times. For many complex
%scenes with high occlusion, the effort wasted on overdraw cannot simply be
%eliminated by the increasing the raw computational power of the hardware.
%This problem becomes even more apparent on recent graphics hardware with
%programmable vertex and pixel shaders: all complex per-vertex or per-pixel
%computations on object get completely wasted if the object is occluded.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can be applied either offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for cells of a fixed subdivision of the scene. At runtime, we
can quickly identify a PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling can solve these problems at the cost
of applying extra computations at each frame. To make these additional
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., the presence of large occluders, occluder
connectivity, or occlusion by a few closest depth layers).

\begin{figure}%[htb]
%\centerline
\centering \footnotesize
\begin{tabular}{c}
\includegraphics[width=0.6\textwidth, draft=\DRAFTFIGS]{images/ogre_terrain} \\
%\hline
\includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_viewfrustum} \hfill \includegraphics[width=0.3\textwidth, draft=\DRAFTFIGS]{images/vis_chc} \\
\end{tabular}
%\label{tab:online_culling_example}
\caption{(top) The rendered terrain scene. (bottom) Visualization of the rendered / culled objects
using view-frustum culling (left) vs. occlusion queries (right).
The yellow boxes show the actually rendered scene objects. The
red boxes depict the view-frustum culled hierarchy nodes; the blue boxes depict the
occlusion query culled hierarchy nodes.}
\label{fig:online_culling_example}
\end{figure}

Recent graphics hardware natively supports an \emph{occlusion query}
to detect the visibility of an object against the current contents of the
z-buffer. Although the query itself is processed quickly using the
raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating the CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated in
existing real-time rendering packages on architectures supporting the
underlying occlusion query.
In Figure~\ref{fig:online_culling_example}, the same scene (top row) is rendered using view-frustum
culling (visualized in the bottom left image) versus online culling using occlusion queries (visualized
in the bottom right image). It can be seen that with view-frustum culling alone, many occluded objects are still rendered.
%Using spatial and assuming temporal coherence

\section{Related Work}

With the demand for rendering scenes of ever increasing size, there
have been a number of visibility culling methods developed in the
last decade. A comprehensive survey of visibility culling methods
was presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another
recent survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In our further discussion we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer makes it possible to quickly determine whether the geometry in
question is occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of
fragments in the graphics pipeline (e.g., the Hyper-Z technology of ATI or
the Z-cull of NVIDIA). However, the geometry still needs to be sent to the
GPU, transformed, and coarsely rasterized even if it is later
determined invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

%% In
%% parallel to rendering, the visibility server calculates visibility for
%% the neighborhood of the given viewpoint and sends them back to the
%% display host.

%% A similar method designed by Aila and
%% Miettinen~\cite{aila:00:msc}. The incremental occlusion maps are
%% created on the CPU, which takes the CPU computational time but it does
%% not suffer the problem of slow read-back from the GPU.

%Klosowski:2001:ECV

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. A first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card and it is now also available as an official ARB extension.

%Both
%extensions have been used in several occlusion culling
%algorithms~\cite{Meissner01,Hillesland02,Staneker:2004:OCO}.
%\cite{Meissner01}

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
which employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both these problems and provides natural extensions to
balance the accuracy of visibility classification and the associated
computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus
suffers from CPU stalls and GPU starvation.

%% The methods proposed in this chapter can be used to make use of temporal
%% and spatial coherence in the scope of existing visibility algorithms,
%% that utilise a spatial hierarchy. Examples of these are algorithms
%% based on hierarchical occlusion maps~\cite{Zhang97}, coverage
%% masks~\cite{Greene:1996:HPT}, shadow frusta~\cite{Hudson97}, and
%% occlusion trees~\cite{bittner98b_long}.

On a theoretical level, our work is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.
189 | The above mentioned methods for exploiting temporal coherence are |
---|
190 | tightly interwoven with the particular culling algorithm. On the |
---|
191 | contrary, Bittner et al.~\cite{bittner01:jvca} presented a general |
---|
192 | acceleration technique for exploiting spatial and temporal coherence |
---|
193 | in hierarchical visibility algorithms. The central idea, which is also |
---|
194 | vital for the online culling tool, is to avoid repeated visibility tests of |
---|
195 | interior nodes of the hierarchy. The problem of direct adoption of |
---|
196 | this method is that it is designed for the use with instantaneous CPU |
---|
197 | based occlusion queries, whereas hardware occlusion queries exhibit |
---|
198 | significant latency. The method presented herein efficiently overcomes |
---|
199 | the problem of latency while keeping the benefits of a generality and |
---|
200 | simplicity of the original hierarchical technique. As a result we |
---|
201 | obtain a simple and efficient occlusion culling algorithm utilizing |
---|
202 | hardware occlusion queries. |
---|
203 | |
---|
204 | |
---|
205 | %In this paper we show how to overcome this problem while |
---|
206 | %keeping the benefits of exploiting. |
---|
207 | |
---|
208 | |
---|
209 | The rest of the paper is organized as follows: |
---|
210 | Section~\ref{sec:hwqueries} discusses hardware supported occlusion |
---|
211 | queries and a basic application of these queries using a kD-tree. |
---|
212 | Section~\ref{sec:hoq} presents our new algorithm and |
---|
213 | Section~\ref{sec:optimizations} describes several additional |
---|
214 | optimizations. Section~\ref{sec:results} presents results obtained by |
---|
215 | experimental evaluation of the method and discusses its |
---|
216 | behavior. Finally Section~\ref{sec:conclusion} concludes the paper. |
---|
217 | |
---|
218 | |
---|
219 | |
---|
220 | |
---|
\section{Hardware Occlusion Queries}
\label{sec:hwqueries}


Hardware occlusion queries follow a simple pattern: To test visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and it need not be rendered.
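
To make this pattern concrete, the following minimal OpenGL sketch
issues a single query using the \arboq{} extension;
\texttt{renderBoundingBox} is a hypothetical helper that rasterizes
the occludee's bounds. Note that this naive form blocks on the result
immediately, which is exactly the behavior our algorithm avoids later.

\begin{verbatim}
GLuint query;
glGenQueriesARB(1, &query);

// Disable all writes: the bounding box should only be tested
// against the z-buffer, not drawn into the frame buffer.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
renderBoundingBox(node);   // hypothetical helper
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

// Naive: block until the result is available.
GLuint visiblePixels;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &visiblePixels);
if (visiblePixels == 0)
    cull(node);            // occludee is invisible
\end{verbatim}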

\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
  as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
  so the queries automatically account for occluder fusion. Additionally,
  this fusion comes for free since we use the intermediate result of the
  rendering itself.

\item {\em Generality of occludees.} We can use complex
  occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
  the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
  integrated into a rendering algorithm. They provide a powerful tool to
  minimize the implementation effort, especially when compared to
  CPU-based occlusion culling.

\end{itemize}


\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

%% Issuing multiple independent NV queries is crucial for
%% the design of algorithms that strive to exploit power of the GPU.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in a long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
the CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).

\begin{figure}[htb]
\centerline{
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
}

\caption{(top) Illustration of CPU stalls and GPU starvation.
  Qn, Rn, and Cn denote querying, rendering, and culling of object
  n, respectively. Note that object 5 is found invisible by Q5 and
  thus not rendered. (bottom) More efficient query scheduling. The
  scheduling assumes that objects 4 and 6 will be visible in the
  current frame and renders them without waiting for the result of
  the corresponding queries.}
\label{fig:latency}
\end{figure}



\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and thus
to exploit spatial coherence of visibility, and it provides a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proven to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.
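
A minimal recursive sketch of this strategy follows;
\texttt{insideViewFrustum}, \texttt{issueQuery},
\texttt{getQueryResult} (which blocks), \texttt{render}, and
\texttt{childrenFrontToBack} are hypothetical helpers:

\begin{verbatim}
// Hierarchical stop-and-wait (sketch): the blocking call to
// getQueryResult() is precisely what stalls the CPU.
void traverseStopAndWait(Node* n, const Vec3& viewpoint)
{
    if (!insideViewFrustum(n))
        return;                        // view-frustum culled

    Query q = issueQuery(n->boundingBox);
    if (getQueryResult(q) == 0)        // blocks until available
        return;                        // occluded: cull whole subtree

    if (n->isLeaf())
        render(n->objects);
    else
        for (Node* child : n->childrenFrontToBack(viewpoint))
            traverseStopAndWait(child, viewpoint);
}
\end{verbatim}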

The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and exploiting temporal coherence of
visibility classifications.

\section{Coherent Hierarchical Culling}

\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

%and present a
%generalization of the method to other spatial data structures.

\subsection{Algorithm Overview}

\label{sec:overview}


Our method is based on exploiting temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
\item We initiate occlusion queries on nodes of the hierarchy where
  the traversal terminated in the last frame. Thus we avoid queries
  on all previously visible interior nodes~\cite{bittner01:jvca}.
\item We assume that a previously visible leaf node remains visible
  and render the associated geometry without waiting for the result
  of the corresponding occlusion query.
\item Issued occlusion queries are stored in a query queue until they
  are known to have been processed by the GPU. This allows
  interleaving the queries with the rendering of visible geometry.
\end{itemize}


The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$ and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
  the occlusion query and store it in the query queue. Then we
  immediately render the associated geometry without waiting for the
  result of the query.
\item For a previously invisible node, we issue the query and store
  it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
\input{code/pseudocode2}
}
\caption{Pseudo-code of coherent hierarchical culling.}
\label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.
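
The following condensed C++-style sketch mirrors the main loop of the
pseudo-code in Figure~\ref{fig:pseudocode}. The types and helpers
(\texttt{TraversalStack}, \texttt{QueryItem}, \texttt{issueQuery},
\texttt{visiblePixels}, \texttt{traverseNode},
\texttt{pullUpVisibility}) are hypothetical; \texttt{resultAvailable}
is sketched in a later section. \texttt{traverseNode} renders a leaf
or pushes the children in front-to-back order.

\begin{verbatim}
// Coherent hierarchical culling: condensed sketch of the main loop.
stack.push(root);
while (!stack.empty() || !queries.empty()) {
    // Part 1: consume finished queries; block only when there
    // is nothing left to traverse.
    while (!queries.empty() &&
           (resultAvailable(queries.front().id) || stack.empty())) {
        QueryItem q = queries.front(); queries.pop();
        if (visiblePixels(q.id) > 0) {
            pullUpVisibility(q.node);         // pull-up
            if (!q.wasVisible)
                traverseNode(q.node, stack);  // pull-down
        }                                     // else: subtree culled
    }
    // Part 2: front-to-back traversal interleaved with queries.
    if (!stack.empty()) {
        Node* n = stack.pop();
        if (!insideViewFrustum(n)) continue;
        bool wasVisible = n->visible && n->lastVisited == frameId - 1;
        bool opened     = wasVisible && !n->isLeaf();
        n->visible = false;            // reset; pull-up restores it
        n->lastVisited = frameId;
        if (!opened)                   // termination node: query it
            queries.push({ issueQuery(n), n, wasVisible });
        if (wasVisible)                // render leaf / push children
            traverseNode(n, stack);
    }
}
\end{verbatim}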


In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.

\begin{figure*}[htb]
\centerline{
\includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
\caption{(left) Visibility classification of a node of the kD-tree and
  the termination nodes. (right) Visibility classification after the
  application of the occlusion test and the new set of termination nodes.
  Nodes on which occlusion queries were applied are depicted with a
  solid outline. Note the pull-up and pull-down due to visibility changes.}
\label{fig:cut}
\end{figure*}



The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node information about its visibility and the
id of the last frame in which it was visited. A node is an opened node
if it is a visible interior node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.
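
In code, this amounts to two extra fields per node plus a short
pull-up helper (the \texttt{pullUpVisibility} used in the sketch
above); the layout below is one possible sketch, not the only one:

\begin{verbatim}
struct Node {
    // ... geometry, bounding box, children ...
    bool  visible;      // visibility classification
    int   lastVisited;  // id of the last frame this node was visited
    Node* parent;
};

// A node is "opened" if it is a visible interior node
// that was visited in the previous frame.
bool isOpened(const Node* n, int frameId) {
    return n->visible && !n->isLeaf() && n->lastVisited == frameId - 1;
}

// Pull-up: propagate a change from invisible to visible towards
// the root; stops at the first already-visible ancestor.
void pullUpVisibility(Node* n) {
    while (n && !n->visible) {
        n->visible = true;
        n = n->parent;
    }
}
\end{verbatim}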



%% \begin{figure*}[htb]
%% \centerline{\includegraphics[width=0.8\textwidth,draft=\DRAFTFIGS]{figs/basic}}
%% \caption{Illustration of the hierarchy traversal. Initially the algorithm
%% starts at the root of the hierarchy (left). In the second frame the
%% opened nodes $\mathcal{O}_0$ are skipped and the occlusion queries
%% are first applied on the termination nodes $\mathcal{T}_0$. Visibility changes are propagated upwards in the hierarchy and a
%% new set of termination nodes $\mathcal{T}_1$ is established.}
%% \label{fig:basic}
%% \end{figure*}


\subsection{Reduction of the number of queries}

%Skipping the opened nodes and issuing occlusion queries for the
%termination nodes assists in exploiting two main characteristics of
%scene visibility:

%Identifying the set of termination nodes assists in finding a subdivision

Our method reduces the number of visibility queries in two ways:
Firstly, like other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Secondly, by avoiding queries on opened nodes we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
  approximate the occluded part of the scene with the smallest number of
  nodes with respect to the given hierarchy, i.e., each invisible
  termination node has a visible parent. This induces an adaptive
  spatial subdivision that reflects spatial coherence of visibility,
  more precisely the coherence of occluded regions. The adaptive nature
  of the subdivision makes it possible to minimize the number of subsequent
  occlusion queries by applying the queries on the largest spatial
  regions that are expected to remain occluded.

%, and visible termination
%nodes correspond to the smallest unoccluded regions (visible
%leafs)

\item {\em Temporal coherence.} If visibility remains constant, the set
  of termination nodes needs no adaptation. If an occluded node becomes
  visible, we recursively process its children (pull-down). If a visible
  node becomes occluded, we propagate the change higher in the hierarchy
  (pull-up). A pull-down reflects a spatial growing of visible
  regions. Similarly, a pull-up reflects a spatial growing of occluded
  regions.

%The first case corresponds to pull-down of the cut, the latter to pull-up.

\end{itemize}

By avoiding queries on the opened nodes, we can save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would be required to rasterize the boxes. In particular,
assuming that the sum of the screen-space projected areas of nodes at
each level of the kD-tree is equal and the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.
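
The factor $(d+2)$ follows from a short calculation under the stated
assumptions. Let $A$ be the total screen-space projected area of the
nodes at a single level of the subtree (equal for all levels by
assumption). A method querying every node of the subtree rasterizes
boxes on all $d+2$ levels, i.e., the opened nodes on levels
$0,\dots,d$ plus the termination nodes on level $d+1$, whereas our
method rasterizes only the termination nodes:

\begin{displaymath}
  \frac{\mathit{fill}_{\mathrm{all}}}{\mathit{fill}_{\mathrm{CHC}}}
  \;=\; \frac{\sum_{l=0}^{d+1} A}{A}
  \;=\; \frac{(d+2)\,A}{A}
  \;=\; d+2.
\end{displaymath}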



\subsection{Reduction of CPU stalls and GPU starvation}

\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminates the need to wait
for the query result during the processing of the initial
depth layers containing previously visible nodes. In the optimal case,
new query results become available in between and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.

If visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture a
possible occlusion of $A$ by $B$ since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

%% It is possible that we have already traversed all previously
%% visible nodes and we must stall the CPU by waiting for the result
%% of the oldest occlusion query. A technique completely eliminating
%% CPU stalls at the cost of an increase of the number of rendered
%% objects will be discussed in Section~\ref{sec:full}.

\subsection{Front-to-back scene traversal}

For kD-trees the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the viewpoint
and the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
  scene in a breadth-first manner, we process the scene in depth
  layers. Within each layer, the node processing order is practically
  independent, which minimizes the problem of occlusion query
  dependence. The breadth-first traversal interleaves
  occlusion-independent nodes, which can provide a more accurate
  visibility classification if visibility changes quickly. In
  particular, it reduces the problem of false classifications due to
  missed occlusion by nodes waiting in the query queue (discussed in
  Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
  breadth-first traversal, we are no longer restricted to the
  kD-tree. Instead we can use an arbitrary spatial data structure
  such as a bounding volume hierarchy, octree, grid, hierarchical grid,
  etc. Once we compute the distance from a node to the viewpoint, the
  node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which maintenance of a spatial classification of moving objects can be
costly.
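
A minimal sketch of such a distance-based traversal queue, assuming a
hypothetical \texttt{Node} type with an axis-aligned bounding box and
a 3-vector type \texttt{Vec3}:

\begin{verbatim}
#include <queue>
#include <vector>

// Order nodes nearest-first by the minimal distance between
// the viewpoint and the node's bounding box.
struct NearestFirst {
    Vec3 viewpoint;
    bool operator()(const Node* a, const Node* b) const {
        // "greater" comparison so the nearest node is on top
        return a->box.minDistance(viewpoint)
             > b->box.minDistance(viewpoint);
    }
};

using TraversalQueue =
    std::priority_queue<Node*, std::vector<Node*>, NearestFirst>;

// Usage: TraversalQueue q(NearestFirst{viewpoint}); q.push(root);
\end{verbatim}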

\subsection{Checking the query result}

The presented algorithm repeatedly checks if the result of the
occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node to match the expected
availability of the query result as precisely as possible.
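
With the ARB extension, this non-blocking check maps to
\texttt{GL\_QUERY\_RESULT\_AVAILABLE\_ARB}; a minimal sketch of the
\texttt{resultAvailable} helper used in the earlier main-loop sketch:

\begin{verbatim}
// Non-blocking check: returns true iff the query result can be
// fetched without stalling the CPU.
bool resultAvailable(GLuint query)
{
    GLuint available = GL_FALSE;
    glGetQueryObjectuivARB(query,
                           GL_QUERY_RESULT_AVAILABLE_ARB,
                           &available);
    return available == GL_TRUE;
}
\end{verbatim}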


\section{Further Optimizations}

\label{sec:optimizations}

This section discusses a couple of optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on some user-specified parameters that should be tuned for a
particular scene and hardware configuration.


\subsection{Conservative visibility testing}

The first optimization addresses the reduction of the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. This optimization is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within the
given number of frames we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied on visible nodes of the hierarchy. Especially in the case of
sparsely occluded scenes, there is a large number of visible nodes
being tested, which does not provide any benefit since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have already become invisible since the moment the
last occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more complicated scenario this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.
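
The simplest variant amounts to a per-node countdown; the sketch below
assumes a hypothetical user-tuned constant
\texttt{assumedVisibleFrames} and an extra \texttt{nextTestFrame}
field per node:

\begin{verbatim}
// Skip queries on a node that was recently proven visible.
bool needsQuery(const Node* n, int frameId)
{
    if (n->visible && frameId < n->nextTestFrame)
        return false;               // still assumed visible
    return true;
}

void onQueryProvedVisible(Node* n, int frameId)
{
    n->visible = true;
    n->nextTestFrame = frameId + assumedVisibleFrames;
}
\end{verbatim}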


\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively in
cases when only a small part of the node is visible. We compare the
number of visible pixels returned by the occlusion query with a
user-specified constant and cull the node if this number drops
below this constant.
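
In code this is a one-line change to the visibility test, with
\texttt{visibilityThreshold} as the user-specified pixel count:

\begin{verbatim}
// Conservative: visible iff any fragment passed the depth test.
bool visible = visiblePixels > 0;

// Approximate: cull nodes whose visible area is negligible.
bool visibleApprox = visiblePixels > visibilityThreshold;
\end{verbatim}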


\subsection{Complete elimination of CPU stalls}

\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, it stalls the CPU by waiting for the query result. To
completely eliminate the CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
the whole subtree if it is an interior node). The node is marked as
rendered but the associated occlusion query is kept in the queue to
fetch its result later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query result arrives in an optimal manner.

To avoid the problem of spending more time on rendering invisible
nodes than would be spent by waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.
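
A sketch of this decision, with \texttt{maxSpeculativeCost} as the
user-specified constant and \texttt{estimatedCost} a hypothetical
per-node estimate (e.g., the triangle count of the subtree):

\begin{verbatim}
// Called when the traversal stack is empty but the oldest
// query result is not yet available.
void avoidStall(QueryQueue& queries)
{
    QueryItem* q = cheapestUnrendered(queries);  // lowest cost
    if (q && estimatedCost(q->node) < maxSpeculativeCost) {
        render(q->node);             // speculative: may be wasted
        q->alreadyRendered = true;   // result is still fetched later
    } else {
        waitForResult(queries.front()); // stall is the lesser evil
    }
}
\end{verbatim}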

\begin{figure}%[htb]
{\footnotesize
\input{code/pseudocode_batching}
}
\caption{Pseudo-code of coherent hierarchical culling using multiple queries.}
\label{fig:pseudocode_batching}
\end{figure}

\subsection{Initial depth pass}
\label{sec:initial}

\begin{figure}
\centering
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_wrong}
\includegraphics[width=0.49\textwidth,draft=\DRAFTIMAGES]{images/transp_delayed}
\caption{(left) All passes are rendered with CHC. Note that the soldiers are
visible through the tree. (right) Only the solid passes are rendered using CHC; the transparent passes are rendered afterwards.}
\label{fig:online_transparency}
\end{figure}


To achieve maximal performance on modern GPUs, one has to take care of a number of issues.
First, it is very important to reduce material switching. Modern rendering engines therefore sort the
objects (or patches) by material in order to minimize material switching.

Next, materials can be very costly; sometimes complicated shaders have to be evaluated several times per batch. Hence, rendering the full material for fragments which eventually turn out to be occluded should be avoided. This can be achieved
by rendering an initial depth pass (i.e., enabling only depth writes to fill the depth
buffer). Afterwards the geometry is rendered again, this time with full shading. Because the
depth buffer is already established, invisible fragments are discarded before any shading is calculated.
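
In OpenGL terms, a depth-only pass followed by a shaded pass might
look like the following sketch (\texttt{renderGeometry} being a
hypothetical helper):

\begin{verbatim}
// Pass 1: depth only -- no color writes, no shading.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
renderGeometry();        // fills the depth buffer

// Pass 2: full shading against the established depth buffer.
// GL_LEQUAL lets fragments at the stored depth pass, while
// occluded fragments fail the early z-test before shading.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);   // depth is already correct
glDepthFunc(GL_LEQUAL);
renderGeometry();        // sorted by material
\end{verbatim}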

This approach can be naturally adapted for use with the CHC algorithm. Only the initial depth pass is rendered
in front-to-back order using the CHC algorithm. The initial pass is sufficient to fill the depth buffer and determine the visible geometry. Then only the visible geometry is rendered again, exploiting the full optimization and material sorting capability of the rendering engine.

If a material requires several rendering passes, we can use a variant of the depth pass method. We render
only the first passes using the algorithm (e.g., the solid passes), determining the visibility of the patches, and
render all the other passes afterwards. This approach can be used when there are passes which require a special
kind of sorting to be rendered correctly (e.g., transparent passes, shadow passes). In Figure~\ref{fig:online_transparency},
we can see that artifacts occur in the left image if the transparent passes are not rendered in the correct order after
applying the hierarchical algorithm (right image). In a similar fashion, we are able to handle shadows
(see Figure~\ref{fig:chc_shadows}).

\begin{figure}
\centering
\includegraphics[width=0.35\textwidth,draft=\DRAFTIMAGES]{images/chc_shadows}
\caption{We can correctly handle shadow volumes together with CHC.}
\label{fig:chc_shadows}
\end{figure}

\subsection{Batching multiple queries}
\label{sec:batching}

When occlusion queries are rendered interleaved with geometry, there is always a state change involved. To
reduce state changes, it is beneficial not to execute one query at a time, but multiple queries at once~\cite{Staneker:2004:EMO}.
Instead of immediately executing a query for a node when we fetch it from the traversal stack, we add it to the {\em pending queue}. Once {\em n} of these queries are accumulated in the queue, we execute them at once.
To obtain an optimal value for {\em n}, several heuristics can be applied, e.g., a fraction of the number of queries issued in the last frame. The pseudo-code of the algorithm including the batching is given in Figure~\ref{fig:pseudocode_batching}.
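
A sketch of this batching step; \texttt{issueQuery} performs the
actual bounding-box rendering, and \texttt{beginQueryMode} /
\texttt{endQueryMode} are hypothetical helpers that bracket the state
change (e.g., disabling color and depth writes) once per batch:

\begin{verbatim}
std::vector<Node*> pendingQueue;   // nodes awaiting their query

void enqueueQuery(Node* n, QueryQueue& queries, size_t batchSize)
{
    pendingQueue.push_back(n);
    if (pendingQueue.size() >= batchSize) {
        // One state change for the whole batch.
        beginQueryMode();
        for (Node* node : pendingQueue)
            queries.push(issueQuery(node));
        endQueryMode();
        pendingQueue.clear();
    }
}
\end{verbatim}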


\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4, 1GB of memory, and a
GeForce FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.


Although the teapot scene would intuitively offer good occlusion, it
is a complicated case to handle for occlusion culling. Firstly, the
teapots consist of small triangles, and so only the effect of fused
occlusion due to a large number of visible triangles can bring a
culling benefit. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, and so we can see almost
half of the teapots along the longer side of the arrangement.

%Although the teapot scene would intuitively seem to offer good
%occlusion, it is one of the more complicated cases for occlusion
%culling. Due to the high number of very dense objects, a very fine
%kD-tree subdivision was necessary, which in turn leads to higher costs
%for testing many nodes in the hierarchy. Additionally, misclassifying
%a single teapot can lead to a significant increase in frame time due
%to the high polygon density (2,304 polygons per teapot). There are also
%many locations where it is possible to see very far into the block of
%teapots. This can be seen in the kD-tree node classification which is
%shown in the accompanying video.

The complete power plant model is quite challenging even to load into memory,
but on the other hand it offers good
occlusion. This scene is an interesting candidate for testing not only
due to its size, but also due to significant changes in visibility and depth complexity in
its different parts.

%as can also be seen in the
%walkthrough. A relatively coarse subdivision of the model triangles
%into about 4,600 nodes was sufficient to capture most occlusion
%interdependencies.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

%% However, we still included a section in the walkthrough where the
%% viewpoint is elevated above the building roofs in order to show that
%% the algorithm continues to work without modification, albeit at some
%% expense. It will be an interesting topic of research to automatically
%% recognize such situations and decrease the frequency of visibility
%% queries or turn the occlusion culling algorithm off completely.



\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
  The orange nodes were found visible; all the other depicted nodes are invisible.
  Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}


\subsection{Basic tests}


We have measured the frame times for rendering with only view-frustum culling
(VFC), with the hierarchical stop-and-wait method (S\&W), and with our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the time for an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, which require user-specified
parameters.

%% In order to obtain repeatable and comparable timings, the process
%% priority of our application was set to high, timings were obtained
%% using the CPU cycle counter instruction, and the OpenGL {\tt Finish()}
%% instruction was executed before each invocation of the timer at the
%% start and the end of a frame. Furthermore, we did not include clear
%% and buffer swapping times in our results, since they depend strongly
%% on the windowing mode used.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
  the spike around frame 1600, where the viewpoint was elevated above the roofs,
  practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
  The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
  The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
  (e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement of teapots along its longer
side (frames 25--90). In the city we elevated the
viewpoint above the roofs and gained sight over most of the city
(frames 1200--1800). The power plant walkthrough contains several
viewpoints from which a large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity where occlusion culling produces a
significant speedup over VFC (e.g., frame 3800).

As we can see, for a number of frames in the walkthroughs the CHC method
can produce a speedup of more than one order of magnitude compared to
VFC. The maximum speedup for the teapots, the city, and the power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W and in many cases it
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline\hline
scene & method & \#queries & wait time [ms] & rendered triangles & frame time [ms] & speedup \\\hline\hline

Teapots & VFC & --- & --- & 11,139,928 & 310.42 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles & S\&W & 4704 & 83.19 & 2,617,801 & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes & {\bf CHC} &{\bf 2827} & {\bf 1.31} & {\bf 2,852,514} & {\bf 81.18} & {\bf 4.6} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 2,617,801 & 72.19 & 5.2 \\
\hline\hline
City & VFC & --- & --- & 156,521 & 19.79 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles & S\&W & 663 & 9.49 & 30,594 & 19.9 & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes &{\bf CHC} & {\bf 345} & {\bf 0.18} & {\bf 31,034} & {\bf 8.47} & {\bf 4.0} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 30,594 & 4.55 & 6.6 \\
\hline\hline
Power Plant & VFC & --- & --- & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles & S\&W & 485 & 16.16 & 392,962 & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes &{\bf CHC} & {\bf 263} & {\bf 0.70} & {\bf 397,920} & {\bf 38.73} & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 392,962 & 36.34 & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
  hierarchical stop-and-wait method, CHC is our new method, and Ideal
  is a perfect method with respect to the given hierarchy. All values are averages over
  all frames (including the speedup).}
\label{tab:averages}
\end{table*}
921 | |
---|
---|
934 | |
---|
935 | |
---|
Table~\ref{tab:averages} summarizes the scene statistics and the
average values per frame. The table shows the number of issued
occlusion queries, the wait time representing the CPU stalls, the
number of rendered triangles, the total frame time, and the speedup
over VFC.
---|
941 | |
---|
942 | |
---|
We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0,
and 4.7 over view-frustum culling and average speedups of 2.0, 2.6,
and 1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning
accuracy, the increase in the average number of rendered triangles
for the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. For
the tested walkthroughs, this increase was always outweighed by the
reduction of CPU stalls.
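
To illustrate why the wait time drops so sharply, the following
fragment sketches the non-blocking way in which query results can be
consumed: instead of stalling on a just-issued query, the traversal
keeps processing other nodes and only reads a result once it is
available. This is a minimal C++ sketch using the \nvoq{}; the
\texttt{Node} structure, the queue, and the surrounding helpers are
simplifications assumed for illustration, not the exact data
structures of our implementation.

\begin{verbatim}
// Non-blocking handling of occlusion query results (simplified).
#include <GL/glew.h>
#include <queue>

struct Node {
    GLuint query;    // id created with glGenOcclusionQueriesNV()
    bool   visible;  // visibility classification of the node
};

static std::queue<Node*> queryQueue;  // queries in flight, oldest first

static bool resultAvailable(GLuint query)
{
    GLuint available = GL_FALSE;
    // Non-blocking availability check -- this avoids the CPU stall.
    glGetOcclusionQueryuivNV(query, GL_PIXEL_COUNT_AVAILABLE_NV,
                             &available);
    return available == GL_TRUE;
}

// Called repeatedly during traversal; it only waits for the oldest
// query when the traversal has no other work left.
static void processFinishedQueries(bool traversalFinished)
{
    while (!queryQueue.empty()) {
        Node* node = queryQueue.front();
        if (!traversalFinished && !resultAvailable(node->query))
            break;  // keep traversing/rendering instead of waiting
        queryQueue.pop();
        GLuint visiblePixels = 0;
        glGetOcclusionQueryuivNV(node->query, GL_PIXEL_COUNT_NV,
                                 &visiblePixels);
        node->visible = (visiblePixels > 0);
        // ...pull up visibility and traverse children here...
    }
}
\end{verbatim}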
---|
954 | |
---|
955 | %% While this number may be very low, the algorithm also incurs a |
---|
956 | %% non-negligible overhead which is caused by the graphics card |
---|
957 | %% reconfiguration required for the queries and rasterization of the |
---|
958 | %% bounding volume geometries. |
---|
959 | |
---|
---|
975 | |
---|
976 | |
---|
977 | |
---|
978 | \subsection{Optimizations} |
---|
979 | |
---|
First of all, we observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope: in all our tests, the basic algorithm already
eliminated the stalls almost completely (see the wait times in
Table~\ref{tab:averages}). We did not find constants for this
technique that would produce an additional speedup.
---|
986 | |
---|
The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We measured the average number of issued
queries and the average frame time as functions of the number of
frames for which a node is assumed visible and of the pixel threshold
of approximate visibility. We observed that the effectiveness of the
optimizations depends strongly on the scene. If the hierarchy is deep
and the geometry associated with a leaf node is not too complex,
conservative visibility testing produces a significant speedup (city
and power plant). For the teapot scene, the penalty for falsely
rendering actually occluded objects became larger than the savings
achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by errors in the image
proportional to the pixel threshold used.
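
The two optimization parameters act at well-defined points of the
traversal. The fragment below indicates where, using the names
\texttt{assumedVisibleFrames} for $t_{av}$ and \texttt{pixelThreshold}
for $n_{vp}$ from Table~\ref{tab:assumed}; the structure extends the
\texttt{Node} sketch above by a \texttt{lastTestedFrame} member and
is again only an illustrative assumption, not our exact
implementation.

\begin{verbatim}
// Where the two optimizations take effect (simplified).
const int          assumedVisibleFrames = 2;   // t_av
const unsigned int pixelThreshold       = 25;  // n_vp

struct NodeOpt {
    bool visible;          // last visibility classification
    int  lastTestedFrame;  // frame of the last actual occlusion test
};

// Conservative visibility testing: a node found visible is assumed
// to stay visible for t_av frames, so no query is issued for it.
bool needsQuery(const NodeOpt& node, int frameId)
{
    return !(node.visible &&
             frameId - node.lastTestedFrame <= assumedVisibleFrames);
}

// Approximate visibility: nodes projecting to at most n_vp pixels
// are treated as occluded, trading small image errors for fewer
// rendered objects.
bool interpretResult(unsigned int visiblePixels)
{
    return visiblePixels > pixelThreshold;
}
\end{verbatim}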
---|
1002 | |
---|
1003 | %Good parameters for these |
---|
1004 | %optimizations have to be estimated on a scene-by-scene basis. |
---|
1005 | |
---|
1006 | |
---|
1007 | |
---|
1008 | |
---|
---|
1040 | |
---|
1041 | |
---|
1042 | \begin{table} |
---|
1043 | \centering \footnotesize |
---|
1044 | \begin{tabular}{|l|c|c|c|c|} |
---|
1045 | \hline\hline |
---|
scene & $t_{av}$ & $n_{vp}$ & \#queries & frame time [ms] \\ \hline\hline
---|
1047 | \multirow{3}{*}{Teapots} & 0 & 0 & 2827 & 81.18 \\ \hhline{~----} |
---|
1048 | & 2 & 0 & 1769 & 86.31 \\ \hhline{~----} |
---|
1049 | & 2 & 25 & 1468 & 55.90 \\ \hline\hline |
---|
1050 | \multirow{3}{*}{City} & 0 & 0 & 345 & 8.47 \\ \hhline{~----} |
---|
1051 | & 2 & 0 & 192 & 6.70 \\ \hhline{~----} |
---|
1052 | & 2 & 25 & 181 & 6.11 \\ \hline\hline |
---|
1053 | \multirow{3}{*}{Power Plant} & 0 & 0 & 263 & 38.73 \\ \hhline{~----} |
---|
1054 | & 2 & 0 & 126 & 31.17 \\ \hhline{~----} |
---|
1055 | & 2 & 25 & 120 & 36.62 \\ \hline\hline |
---|
1056 | \end{tabular} |
---|
1057 | |
---|
\caption{Influence of the optimizations on the CHC method.
$t_{av}$ is the number of frames for which a node is assumed visible
in conservative visibility testing, $n_{vp}$ is the pixel threshold
of approximate visibility.}\label{tab:assumed}
---|
1062 | \end{table} |
---|
1063 | |
---|
1064 | |
---|
1065 | \subsection{Comparison to PVS-based rendering} |
---|
1066 | |
---|
We also compared the CHC method against precalculated visibility. In
particular, we used the PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26~ms faster per frame on average, our method requires no costly
precomputation and can be used from an arbitrary 3D position in the
model, not only within a predefined view space.
---|
1073 | |
---|
1074 | |
---|
1075 | \section{Conclusion} |
---|
1076 | \label{sec:conclusion} |
---|
1077 | |
---|
We have presented a method for the optimized scheduling of hardware
accelerated occlusion queries. The method schedules occlusion queries
so as to minimize both their number and their latency. This is
achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation are
almost completely eliminated, at the cost of a slight increase in the
number of rendered objects.
---|
1085 | |
---|
---|
1089 | |
---|
Our technique can be used with practically arbitrary scene
partitioning data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The method is straightforward to
implement, as it uses a simple OpenGL interface to the hardware
occlusion queries; in particular, it requires no complicated
geometric operations or data structures. The algorithm is suitable
for scenes of arbitrary structure and requires no preprocessing or
scene-dependent tuning.
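
The core of this interface is sketched below: testing a node amounts
to rasterizing its bounding box against the current depth buffer with
color and depth writes disabled. The helper
\texttt{renderBoundingBox()} is a placeholder for drawing the box of
a node; the rest uses only the \nvoq{} calls.

\begin{verbatim}
// Issue one occlusion query for the bounding box of a node.
void renderBoundingBox(const Node& node);  // assumed helper

void issueQuery(Node& node)
{
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);

    glBeginOcclusionQueryNV(node.query);
    renderBoundingBox(node);  // rasterized, but nothing is written
    glEndOcclusionQueryNV();

    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
}
\end{verbatim}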
---|
1098 | |
---|
We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
obtained an average speedup of 4.0--4.7 compared to pure view-frustum
culling and of 1.6--2.6 compared to the hierarchical stop-and-wait
application of occlusion queries.
---|
1104 | |
---|
1105 | |
---|
The major potential for improving the method lies in a better
estimation of changes in the visibility classification of hierarchy
nodes. If nodes tend to be mostly visible, we could automatically
decrease the frequency of occlusion tests and thus better adapt the
method to the actual occlusion in the scene. Another possibility for
improvement is better tuning for particular graphics hardware by
means of a more accurate rendering cost estimation: skipping the
occlusion test for simple geometry can be faster than issuing a
comparably expensive occlusion query.
---|
1115 | |
---|
1116 | |
---|
1117 | \begin{figure*}[htb] |
---|
1118 | \centerline{ |
---|
1119 | \hfill |
---|
1120 | \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots} |
---|
1121 | \hfill |
---|
1122 | \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city} |
---|
1123 | \hfill |
---|
1124 | \includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant} |
---|
1125 | \hfill |
---|
1126 | } |
---|
1127 | \caption{The test scenes: the teapots, the city, and the power plant.} |
---|
1128 | \label{fig:scenes} |
---|
1129 | \end{figure*} |
---|
1130 | |
---|
---|
1154 | |
---|
1155 | |
---|