\chapter{Online Visibility Culling}
%*****************************************************************************

\section{Introduction}

Visibility culling is one of the major acceleration techniques for
the real-time rendering of complex scenes. The ultimate goal of
visibility culling techniques is to prevent invisible objects from
being sent to the rendering pipeline. A standard visibility-culling
technique is {\em view-frustum culling}, which eliminates objects
outside of the current view frustum. View-frustum culling is a fast
and simple technique, but it does not eliminate objects in the view
frustum that are occluded by other objects. This can lead to significant
{\em overdraw}, i.e., the same image area gets covered more than
once. Overdraw wastes computational effort in both the
pixel and the vertex processing stages of modern graphics hardware.
%that is, we render each pixel in the image several times. For many complex
%scenes with high occlusion, the effort wasted on overdraw cannot simply be
%eliminated by increasing the raw computational power of the hardware.
%This problem becomes even more apparent on recent graphics hardware with
%programmable vertex and pixel shaders: all complex per-vertex or per-pixel
%computations on an object get completely wasted if the object is occluded.
The elimination of occluded objects is addressed by {\em occlusion
culling}. In an optimized rendering pipeline, occlusion culling complements
other rendering acceleration techniques such as levels of detail or
impostors.

Occlusion culling can either be applied offline or online. When
applied offline as a preprocess, we compute a potentially visible set
(PVS) for cells of a fixed subdivision of the scene. At runtime, we
can quickly identify the PVS for the given viewpoint. However, this
approach suffers from four major problems: (1) the PVS is valid only
for the original static scene configuration, (2) for a given
viewpoint, the corresponding cell-based PVS can be overly
conservative, (3) computing all PVSs is computationally expensive, and (4)
an accurate PVS computation is difficult to implement for general
scenes. Online occlusion culling solves these problems at the cost of
additional computations at each frame. To make these
computations efficient, most online occlusion culling methods rely on a
number of assumptions about the scene structure and its occlusion
characteristics (e.g., presence of large occluders, occluder
connectivity, occlusion by a few closest depth layers).

Recent graphics hardware natively supports an \emph{occlusion query}
that detects the visibility of an object against the current contents of the
z-buffer. Although the query itself is processed quickly using the
raw power of the graphics processing unit (GPU), its result is not
available immediately due to the delay between issuing the query and
its actual processing in the graphics pipeline. As a result, a naive
application of occlusion queries can even decrease the overall
application performance due to the associated CPU stalls and GPU
starvation. In this chapter, we present an algorithm that aims to
overcome these problems by reducing the number of issued queries and
eliminating CPU stalls and GPU starvation. To schedule the
queries, the algorithm makes use of both the spatial and the temporal
coherence of visibility. A major strength of our technique is its
simplicity and versatility: the method can be easily integrated into
existing real-time rendering packages on architectures supporting the
underlying occlusion query.

%Using spatial and assuming temporal coherence

\section{Related Work}

With the demand for rendering scenes of ever increasing size, a number
of visibility culling methods have been developed over the last
decade. A comprehensive survey of visibility culling methods
was presented by Cohen-Or et al.~\cite{Cohen:2002:survey}. Another
recent survey by Bittner and Wonka~\cite{bittner03:jep} discusses
visibility culling in the broader context of other visibility problems.

According to the domain of visibility computation, we distinguish
between {\em from-point} and {\em from-region} visibility algorithms.
From-region algorithms compute a PVS and are applied offline in a
preprocessing phase~\cite{Airey90,Teller91a,Leyvand:2003:RSF}.
From-point algorithms are applied online for each particular
viewpoint~\cite{Greene93a,Hudson97,EVL-1997-163,bittner98b,Wonka:1999:OSF,Klosowski:2001:ECV}.
In the following discussion we focus on online occlusion culling methods
that exploit graphics hardware.

A conceptually important online occlusion culling method is the
hierarchical z-buffer introduced by Greene et al.~\cite{Greene93a}. It
organizes the z-buffer as a pyramid, where the standard z-buffer is
the finest level. At all other levels, each z-value is the farthest in
the window corresponding to the adjacent finer level. The hierarchical
z-buffer makes it possible to quickly determine whether the geometry in
question is occluded. To a certain extent this idea is used in the current
generation of graphics hardware by applying early z-tests of
fragments in the graphics pipeline (e.g., the Hyper-Z technology of ATI or
the Z-cull of NVIDIA). However, the geometry still needs to be sent to the
GPU, transformed, and coarsely rasterized even if it is later
determined invisible.

Zhang~\cite{EVL-1997-163} proposed hierarchical occlusion maps, which
do not rely on hardware support for the z-pyramid, but instead
make use of hardware texturing. The hierarchical occlusion map is
computed on the GPU by rasterizing and downsampling a given set of
occluders. The occlusion map is used for overlap tests, whereas the
depths are compared using a coarse depth estimation buffer. Wonka and
Schmalstieg~\cite{Wonka:1999:OSF} use occluder shadows to compute
from-point visibility in \m25d scenes with the help of the GPU. This
method has been further extended to the online computation of from-region
visibility executed on a server~\cite{wonka:01:eg}.

%% In
%% parallel to rendering, the visibility server calculates visibility for
%% the neighborhood of the given viewpoint and sends them back to the
%% display host.

%% A similar method designed by Aila and
%% Miettinen~\cite{aila:00:msc}. The incremental occlusion maps are
%% created on the CPU, which takes the CPU computational time but it does
%% not suffer the problem of slow read-back from the GPU.

%Klosowski:2001:ECV

Bartz et al.~\cite{Bartz98} proposed an OpenGL extension for occlusion
queries along with a discussion concerning a potential realization in
hardware. A first hardware implementation of occlusion queries came
with the VISUALIZE fx graphics hardware~\cite{Scott:1998:OVF}. The
corresponding OpenGL extension is called \hpot{}. A more recent OpenGL
extension, \nvoq{}, was introduced by NVIDIA with the GeForce 3 graphics
card and is now also available as an official ARB extension.

%Both
%extensions have been used in several occlusion culling
%algorithms~\cite{Meissner01,Hillesland02,Staneker:2004:OCO}.
%\cite{Meissner01}

Hillesland et al.~\cite{Hillesland02} have proposed an algorithm
that employs the \nvoq. They subdivide the scene using a uniform
grid. The cubes are then traversed in slabs roughly perpendicular to
the viewport. The queries are issued for all cubes of a slab at once,
after the visible geometry of this slab has been rendered. The method
can also use nested grids: a cell of the grid contains another grid
that is traversed if the cell is proven visible. This method, however,
does not exploit the temporal and spatial coherence of visibility, and it
is restricted to regular subdivision data structures. Our new method
addresses both of these problems and provides natural extensions for
balancing the accuracy of visibility classification against the associated
computational costs.

Recently, Staneker et al.~\cite{Staneker:2004:OCO} developed a method
integrating occlusion culling into the OpenSG scene graph
framework. Their technique uses occupancy maps maintained in software
to avoid queries on visible scene graph nodes, and temporal coherence
to reduce the number of occlusion queries. The drawback of the method
is that it performs the queries in a serial fashion and thus
suffers from CPU stalls and GPU starvation.

%% The methods proposed in this paper can be used to make use of temporal
%% and spatial coherence in the scope of existing visibility algorithms,
%% that utilise a spatial hierarchy. Examples of these are algorithms
%% based on hierarchical occlusion maps~\cite{Zhang97}, coverage
%% masks~\cite{Greene:1996:HPT}, shadow frusta~\cite{Hudson97}, and
%% occlusion trees~\cite{bittner98b_long}.

On a theoretical level, our work is related to methods aiming to
exploit the temporal coherence of visibility. Greene et
al.~\cite{Greene93a} used the set of visible objects from one frame to
initialize the z-pyramid in the next frame in order to reduce the
overdraw of the hierarchical z-buffer. The algorithm of Coorg and
Teller~\cite{Coorg96b} restricts the hierarchical traversal to nodes
associated with visual events that were crossed between successive
viewpoint positions. Another method of Coorg and Teller~\cite{Coorg97}
exploits temporal coherence by caching occlusion relationships.
Chrysanthou and Slater have proposed a probabilistic scheme for
view-frustum culling~\cite{Chrysanthou:97}.

The above-mentioned methods for exploiting temporal coherence are
tightly interwoven with the particular culling algorithm. In
contrast, Bittner et al.~\cite{bittner01:jvca} presented a general
acceleration technique for exploiting spatial and temporal coherence
in hierarchical visibility algorithms. The central idea, which is also
vital for this chapter, is to avoid repeated visibility tests of
interior nodes of the hierarchy. The problem with directly adopting
this method is that it is designed for use with instantaneous CPU-based
occlusion queries, whereas hardware occlusion queries exhibit
significant latency. The method presented herein efficiently overcomes
the latency problem while keeping the benefits of the generality and
simplicity of the original hierarchical technique. As a result we
obtain a simple and efficient occlusion culling algorithm utilizing
hardware occlusion queries.


%In this paper we show how to overcome this problem while
%keeping the benefits of exploiting.


The rest of the chapter is organized as follows:
Section~\ref{sec:hwqueries} discusses hardware supported occlusion
queries and a basic application of these queries using a kD-tree.
Section~\ref{sec:hoq} presents our new algorithm and
Section~\ref{sec:optimizations} describes several additional
optimizations. Section~\ref{sec:results} presents results obtained by
an experimental evaluation of the method and discusses its
behavior. Finally, Section~\ref{sec:conclusion} concludes the chapter.


\section{Hardware Occlusion Queries}
\label{sec:hwqueries}

Hardware occlusion queries follow a simple pattern: To test the visibility
of an occludee, we send its bounding volume to the GPU. The volume is
rasterized and its fragments are compared to the current contents of
the z-buffer. The GPU then returns the number of visible
fragments. If there is no visible fragment, the occludee is invisible
and need not be rendered.
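To make this pattern concrete, the following toy model mimics the query in software (a sketch with made-up numbers and a single constant depth per volume; a real application would issue the query through the graphics API and the rasterization would happen on the GPU):

```python
def occlusion_query(zbuffer, bbox, bbox_depth):
    """Count 'visible fragments' of an axis-aligned screen-space box.

    zbuffer    -- 2D list of current depth values (smaller = closer)
    bbox       -- (x0, y0, x1, y1) pixel rectangle covered by the volume
    bbox_depth -- nearest depth of the bounding volume (simplified to a constant)
    """
    x0, y0, x1, y1 = bbox
    visible = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            if bbox_depth < zbuffer[y][x]:   # fragment passes the z-test
                visible += 1
    return visible

# 4x4 z-buffer: the left half is already covered by a close occluder (0.2).
zb = [[0.2, 0.2, 1.0, 1.0] for _ in range(4)]

# Occludee at depth 0.5 behind the occluder -> zero fragments: cull it.
assert occlusion_query(zb, (0, 0, 2, 4), 0.5) == 0
# The same occludee over the uncovered right half -> visible fragments.
assert occlusion_query(zb, (2, 0, 4, 4), 0.5) == 8
```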

\subsection{Advantages of hardware occlusion queries}

There are several advantages of hardware occlusion queries:

\begin{itemize}

\item {\em Generality of occluders.} We can use the original scene geometry
as occluders, since the queries use the current contents of the z-buffer.

\item {\em Occluder fusion.} The occluders are merged in the z-buffer,
so the queries automatically account for occluder fusion. Additionally,
this fusion comes for free since we use the intermediate result of the
rendering itself.

\item {\em Generality of occludees.} We can use complex
occludees. Anything that can be rasterized quickly is suitable.

\item {\em Exploiting the GPU power.} The queries take full advantage of
the high fill rates and internal parallelism provided by modern GPUs.

\item {\em Simple use.} Hardware occlusion queries can be easily
integrated into a rendering algorithm. They provide a powerful tool to
minimize the implementation effort, especially when compared to
CPU-based occlusion culling.

\end{itemize}

\subsection{Problems of hardware occlusion queries}

Currently there are two main hardware supported variants of occlusion
queries: the HP test (\hpot) and the more recent NV query (\nvoq, now
also available as \arboq). The most important difference between the
HP test and the NV query is that multiple NV queries can be
issued before asking for their results, while only one HP test is
allowed at a time, which severely limits its possible algorithmic
usage. Additionally, the NV query returns the number of visible pixels,
whereas the HP test returns only a binary visibility
classification.

%% Issuing multiple independent NV queries is crucial for
%% the design of algorithms that strive to exploit the power of the GPU.

The main problem of both the HP test and the NV query is the latency
between issuing the query and the availability of the result. The
latency occurs due to the delayed processing of the query in the long
graphics pipeline, the cost of processing the query itself, and the
cost of transferring the result back to the CPU. The latency causes
two major problems: CPU stalls and GPU starvation. After issuing the
query, the CPU waits for its result and does not feed the GPU with new
data. When the result finally becomes available, the GPU pipeline can
already be empty. Thus the GPU needs to wait for the CPU to process
the result of the query and to feed the GPU with new data.

A major challenge when using hardware occlusion queries is to avoid
CPU stalls by filling the latency time with other tasks, such as
rendering visible scene objects or issuing other, independent
occlusion queries (see Figure~\ref{fig:latency}).
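The effect of filling the latency with useful work can be illustrated with a toy timeline model (all timing constants are made up for illustration): each query result arrives \texttt{LATENCY} time units after the query is issued, and rendering one object takes \texttt{RENDER} time units.

```python
LATENCY, RENDER = 3.0, 1.0

def stop_and_wait_time(num_objects):
    """Issue a query, stall the CPU until its result, then render."""
    time = 0.0
    for _ in range(num_objects):
        time += LATENCY                 # CPU stalls; GPU starves
        time += RENDER
    return time

def interleaved_time(num_objects):
    """Issue queries up front and fill the latency with rendering work."""
    return max(num_objects * RENDER,    # GPU kept busy the whole time
               LATENCY)                 # only the first result is waited for

assert stop_and_wait_time(8) == 32.0
assert interleaved_time(8) == 8.0       # the latency is completely hidden
```

The numbers are arbitrary, but the structure of the argument is the one depicted in Figure~\ref{fig:latency}: serial querying pays the full latency per object, while interleaved scheduling pays it at most once.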


\begin{figure}[htb]
\centerline{
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{figs/latency2}
}

\caption{(top) Illustration of CPU stalls and GPU starvation.
  Qn, Rn, and Cn denote querying, rendering, and culling of object
  n, respectively. Note that object 5 is found invisible by Q5 and
  thus not rendered. (bottom) More efficient query scheduling. The
  scheduling assumes that objects 4 and 6 will be visible in the
  current frame and renders them without waiting for the result of
  the corresponding queries.}
\label{fig:latency}
\end{figure}

\subsection{Hierarchical stop-and-wait method}
\label{sec:basic}

Many rendering algorithms rely on hierarchical structures in order to
deal with complex scenes. In the context of occlusion culling, such a
data structure allows us to efficiently cull large scene blocks, and thus
to exploit the spatial coherence of visibility; it also provides a key to
achieving output sensitivity.

This section outlines a naive application of occlusion queries in the
scope of a hierarchical algorithm. We refer to this approach as the
{\em hierarchical stop-and-wait} method. Our discussion is based on
kD-trees, which have proved to be efficient for point location, ray
tracing, and visibility
culling~\cite{MacDonald90,Hudson97,Coorg97,bittner01:jvca}. The
concept applies to general hierarchical data structures as well, though.

The hierarchical stop-and-wait method proceeds as follows: Once a
kD-tree node passes view-frustum culling, it is tested for
occlusion by issuing the occlusion query and waiting for its
result. If the node is found visible, we continue by recursively
testing its children in a front-to-back order. If the node is a
leaf, we render its associated objects.
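The traversal can be sketched as follows (nodes are reduced to dictionaries, children are assumed to be given in front-to-back order, and `query` is a stand-in for the hardware occlusion query that delivers its result synchronously, which is exactly what stalls the CPU):

```python
def stop_and_wait(node, query, rendered):
    if not query(node):                  # issue the query and WAIT for it
        return                           # occluded: cull the whole subtree
    if 'children' in node:               # visible interior node: recurse,
        for child in node['children']:   # children assumed front-to-back
            stop_and_wait(child, query, rendered)
    else:                                # visible leaf: render its objects
        rendered.append(node['name'])

tree = {'name': 'root', 'children': [
    {'name': 'near_leaf'},
    {'name': 'far', 'children': [{'name': 'far_leaf'}]},
]}

# Assume everything except the 'far' subtree passes the occlusion test.
rendered = []
stop_and_wait(tree, lambda n: n['name'] != 'far', rendered)
assert rendered == ['near_leaf']         # 'far_leaf' culled with one query
```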

The problem with this approach is that we can continue the tree
traversal only when the result of the last occlusion query becomes
available. If the result is not available, we have to stall the CPU,
which causes significant performance penalties. As we document in
Section~\ref{sec:results}, these penalties together with the overhead
of the queries themselves can even decrease the overall application
performance compared to pure view-frustum culling. Our new method aims
to eliminate this problem by issuing multiple occlusion queries for
independent scene parts and by exploiting the temporal coherence of
visibility classifications.


\section{Coherent Hierarchical Culling}
\label{sec:hoq}

In this section we first present an overview of our new
algorithm. Then we discuss its steps in more detail.

%and present a
%generalization of the method to other spatial data structures.

\subsection{Algorithm Overview}
\label{sec:overview}

Our method is based on exploiting the temporal coherence of visibility
classification. In particular, it is centered on the following
three ideas:

\begin{itemize}
\item We initiate occlusion queries on nodes of the hierarchy where
  the traversal terminated in the last frame. Thus we avoid queries
  on all previously visible interior nodes~\cite{bittner01:jvca}.
\item We assume that a previously visible leaf node remains visible
  and render the associated geometry without waiting for the result
  of the corresponding occlusion query.
\item Issued occlusion queries are stored in a query queue until their
  results become available. This allows us to interleave the queries
  with the rendering of visible geometry.
\end{itemize}


The algorithm performs a traversal of the hierarchy that is
terminated either at leaf nodes or at nodes that are classified as
invisible. Let us call such nodes the {\em termination nodes}, and
interior nodes that have been classified visible the {\em opened
nodes}. We denote the sets of termination and opened nodes in the $i$-th
frame by $\mathcal{T}_i$ and $\mathcal{O}_i$, respectively. In the
$i$-th frame, we traverse the kD-tree in a front-to-back order, skip
all nodes of $\mathcal{O}_{i-1}$, and apply occlusion queries first on
the termination nodes $\mathcal{T}_{i-1}$. When reaching a
termination node, the algorithm proceeds as follows:

\begin{itemize}
\item For a previously visible node (this must be a leaf), we issue
  the occlusion query and store it in the query queue. Then we
  immediately render the associated geometry without waiting for the
  result of the query.
\item For a previously invisible node, we issue the query and store
  it in the query queue.
\end{itemize}

\begin{figure}[htb]
{\footnotesize
\input{code/pseudocode2}
}
\caption{Pseudo-code of coherent hierarchical culling.}
\label{fig:pseudocode}
\end{figure}

When the query queue is not empty, we check if the result of the
oldest query in the queue is already available. If the query result is
not available, we continue by recursively processing other nodes of
the kD-tree as described above. If the query result is available, we
fetch the result and remove the node from the query queue. If the node
is visible, we process its children recursively. Otherwise, the whole
subtree of the node is invisible and thus it is culled.


In order to propagate changes in visibility upwards in the hierarchy,
the visibility classification is \emph{pulled up} according to the
following rule: An interior node is invisible only if all its children
have been classified invisible. Otherwise, it remains visible and thus
opened. The pseudo-code of the complete algorithm is given in
Figure~\ref{fig:pseudocode}. An example of the behavior of the method
on a small kD-tree for two subsequent frames is depicted in
Figure~\ref{fig:cut}.
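The per-frame traversal rules can be condensed into the following simplified, synchronous sketch (the actual algorithm interleaves the query queue with rendering as described above; here query results are simply assumed to have arrived by the time they are needed, and all names are illustrative):

```python
class Node:
    def __init__(self, name, children=(), visible=True):
        self.name, self.children = name, list(children)
        self.visible = visible          # classification from the previous frame

def chc_frame(node, is_visible_now, queries, rendered):
    was_visible = node.visible
    if was_visible and node.children:               # opened node: skip query
        for c in node.children:                     # assumed front-to-back order
            chc_frame(c, is_visible_now, queries, rendered)
        node.visible = any(c.visible for c in node.children)   # pull-up
        return
    queries.append(node.name)                       # termination node: query it
    if was_visible:                                 # previously visible leaf:
        rendered.append(node.name)                  # render without waiting
    node.visible = is_visible_now(node)             # fetch the query result
    if node.visible and not was_visible:            # newly visible: pull-down
        if node.children:
            for c in node.children:
                chc_frame(c, is_visible_now, queries, rendered)
        else:
            rendered.append(node.name)              # render once proven visible

# Previous frame: A visible, B occluded; this frame everything is visible.
root = Node('root', [Node('A'), Node('B', visible=False)])
queries, rendered = [], []
chc_frame(root, lambda n: True, queries, rendered)

assert queries == ['A', 'B']       # no query issued on the opened node 'root'
assert rendered == ['A', 'B']      # 'A' is rendered before its result is known
assert root.visible                # pull-up keeps the root opened
```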

\begin{figure*}[htb]
\centerline{
\includegraphics[width=0.73\textwidth,draft=\DRAFTFIGS]{figs/cut} }
\caption{(left) Visibility classification of a node of the kD-tree and
  the termination nodes. (right) Visibility classification after the
  application of the occlusion test and the new set of termination nodes.
  Nodes on which occlusion queries were applied are depicted with a
  solid outline. Note the pull-up and pull-down due to visibility changes.}
\label{fig:cut}
\end{figure*}



The sets of opened nodes and termination nodes need not be maintained
explicitly. Instead, these sets can be easily identified by
associating with each node information about its visibility and the
id of the last frame in which it was visited. A node is an opened node
if it is an interior visible node that was visited in the last frame
(line 23 in the pseudocode). Note that in the actual implementation of
the pull-up we can set all visited nodes to invisible by default and
then pull up any changes from invisible to visible (lines 25 and
12 in Figure~\ref{fig:pseudocode}). This modification eliminates
checking children for invisibility during the pull-up.
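This bookkeeping can be sketched as follows (the class and function names are illustrative, not the paper's API; the default-invisible pull-up walks only the ancestor chain of a node proven visible):

```python
class Node:
    def __init__(self, parent=None, is_leaf=False):
        self.parent, self.is_leaf = parent, is_leaf
        self.visible = False
        self.last_visited = -1       # id of the last frame the node was visited

def is_opened(node, frame_id):
    """Interior node classified visible and visited in the previous frame."""
    return (node.visible and not node.is_leaf
            and node.last_visited == frame_id - 1)

def pull_up(node):
    """All visited nodes start as invisible; a visible descendant flips its
    ancestors back to visible, so siblings never need to be checked."""
    while node is not None and not node.visible:
        node.visible = True
        node = node.parent

root = Node()
leaf = Node(parent=root, is_leaf=True)
pull_up(leaf)                        # leaf found visible by its query
assert root.visible and leaf.visible

root.last_visited = 9
assert is_opened(root, 10)           # interior, visible, visited last frame
assert not is_opened(leaf, 10)       # leaves are never opened nodes
```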



%% \begin{figure*}[htb]
%% \centerline{\includegraphics[width=0.8\textwidth,draft=\DRAFTFIGS]{figs/basic}}
%% \caption{Illustration of the hierarchy traversal. Initially the algorithm
%% starts at the root of the hierarchy (left). In the second frame the
%% opened nodes $\mathcal{O}_0$ are skipped and the occlusion queries
%% are first applied on the termination nodes $\mathcal{T}_0$. Visibility changes are propagated upwards in the hierarchy and a
%% new set of termination nodes $\mathcal{T}_1$ is established.}
%% \label{fig:basic}
%% \end{figure*}

\subsection{Reduction of the number of queries}

%Skipping the opened nodes and issuing occlusion queries for the
%termination nodes assists in exploiting two main characteristics of
%scene visibility:

%Identifying the set of termination nodes assists in finding a subdivision

Our method reduces the number of visibility queries in two ways:
Firstly, as in other hierarchical culling methods, we consider only a
subtree of the whole hierarchy (opened nodes + termination
nodes). Secondly, by avoiding queries on opened nodes we eliminate
part of the overhead of identifying this subtree. These
reductions reflect the following coherence properties of scene
visibility:

\begin{itemize}

\item {\em Spatial coherence.} The invisible termination nodes
  approximate the occluded part of the scene with the smallest number of
  nodes with respect to the given hierarchy, i.e., each invisible
  termination node has a visible parent. This induces an adaptive
  spatial subdivision that reflects the spatial coherence of visibility,
  more precisely the coherence of occluded regions. The adaptive nature
  of the subdivision allows us to minimize the number of subsequent
  occlusion queries by applying the queries on the largest spatial
  regions that are expected to remain occluded.

%, and visible termination
%nodes correspond to the smallest unoccluded regions (visible
%leafs)

\item {\em Temporal coherence.} If visibility remains constant, the set
  of termination nodes needs no adaptation. If an occluded node becomes
  visible, we recursively process its children (pull-down). If a visible
  node becomes occluded, we propagate the change higher in the hierarchy
  (pull-up). A pull-down reflects the spatial growth of visible
  regions; similarly, a pull-up reflects the spatial growth of occluded
  regions.

%The first case corresponds to pull-down of the cut, the latter to pull-up.

\end{itemize}

By avoiding queries on the opened nodes, we save $1/k$ of the queries
for a hierarchy with branching factor $k$ (assuming visibility remains
constant). Thus for the kD-tree, up to half of the queries can be
saved. The actual savings in the total query time are even larger: the
higher we are in the hierarchy, the larger the boxes we would have to
check for occlusion, and consequently the higher the fill rate that
would have been required to rasterize them. In particular,
assuming that the sum of the screen-space projected areas of the nodes at
each level of the kD-tree is equal and the opened nodes form a
complete binary subtree of depth $d$, the fill rate is reduced $(d+2)$
times.
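The $(d+2)$ factor follows from a simple counting argument (a back-of-the-envelope derivation under the stated assumptions):

```latex
% Let A denote the (assumed equal) sum of screen-space projected areas of the
% nodes at any single level of the kD-tree. The stop-and-wait method issues
% queries on all opened levels 0, 1, ..., d as well as on the termination
% nodes below them, i.e., on d+2 levels in total, whereas our method queries
% only the termination nodes:
\[
  \frac{\mathit{fill}_{\mathrm{stop\text{-}and\text{-}wait}}}
       {\mathit{fill}_{\mathrm{ours}}}
  \;=\; \frac{(d+2)\,A}{A}
  \;=\; d+2 .
\]
```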


\subsection{Reduction of CPU stalls and GPU starvation}
\label{sec:latency}

The reduction of CPU stalls and GPU starvation is achieved by
interleaving occlusion queries with the rendering of visible geometry. The
immediate rendering of previously visible termination nodes and the
subsequent issuing of occlusion queries eliminates the need to
wait for query results while the initial depth layers containing
previously visible nodes are processed. In the optimal case,
new query results become available in the meantime and thus we completely
eliminate CPU stalls. In a static scenario, we achieve exactly the
same visibility classification as the hierarchical stop-and-wait
method.
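The interleaving described above can be illustrated by a small, self-contained simulation. The `Query` class with a fixed tick-based latency and the list-based scene are illustrative stand-ins for the hardware occlusion-query machinery, not the actual implementation:

```python
from collections import deque

class Query:
    """Simulated asynchronous occlusion query whose result becomes
    available a fixed number of ticks after it was issued (a stand-in
    for a hardware occlusion query)."""
    def __init__(self, node, issue_tick, latency):
        self.node = node
        self.ready_tick = issue_tick + latency

    def result_available(self, tick):
        return tick >= self.ready_tick

def traverse_interleaved(nodes, visible_last_frame, latency=2):
    """Interleave issuing queries with rendering: previously visible
    nodes are rendered immediately and their query results are fetched
    later, so the CPU only stalls once it runs out of other work."""
    query_queue = deque()
    stack = list(reversed(nodes))          # nodes in front-to-back order
    rendered, tick, stalls = [], 0, 0
    while stack or query_queue:
        # Fetch finished results; block only when no work remains.
        while query_queue and (query_queue[0].result_available(tick)
                               or not stack):
            query = query_queue.popleft()
            if not query.result_available(tick):
                stalls += 1                # forced wait: a CPU stall
                tick = query.ready_tick
            # (the visibility result would be processed here)
        if stack:
            node = stack.pop()
            query_queue.append(Query(node, tick, latency))
            if node in visible_last_frame:
                rendered.append(node)      # render without waiting
        tick += 1
    return rendered, stalls
```

In this toy model the previously visible nodes are rendered without any waiting, and the only stall occurs once the traversal stack is exhausted, mirroring the behavior described above.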

If the visibility is changing, the situation can be different: if the
results of the queries arrive too late, it is possible that we
initiated an occlusion query on a previously occluded node $A$ that is
in fact occluded by another previously occluded node $B$ that became
visible. If $B$ is still in the query queue, we do not capture the
possible occlusion of $A$ by $B$, since the geometry associated with
$B$ has not yet been rendered. In Section~\ref{sec:results} we show
that the increase in the number of rendered objects compared to the
stop-and-wait method is usually very small.

%% It is possible that we have already traversed all previously
%% visible nodes and we must stall the CPU by waiting for the result
%% of the oldest occlusion query. A technique completely eliminating
%% CPU stalls at the cost of an increase of the number of rendered
%% objects will be discussed in Section~\ref{sec:full}.

\subsection{Front-to-back scene traversal}

For kD-trees the front-to-back scene traversal can be easily
implemented using a depth-first
traversal~\cite{bittner01:jvca}. However, at a modest increase in
computational cost we can also use a more general breadth-first
traversal based on a priority queue. The priority of a node then
corresponds to the inverse of the minimal distance between the viewpoint
and the bounding box associated with the given node of the
kD-tree~\cite{Klosowski:2001:ECV,Staneker:2004:OCO}.
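Such a priority-queue traversal can be sketched with a standard binary heap. The dictionary-based node representation with a `"box"` entry and the `children` callback are illustrative assumptions, not the paper's data structures:

```python
import heapq

def min_distance(viewpoint, box):
    """Minimal distance from a point to an axis-aligned bounding box,
    given as a tuple of per-axis (lo, hi) intervals."""
    return sum(max(lo - v, 0.0, v - hi) ** 2
               for v, (lo, hi) in zip(viewpoint, box)) ** 0.5

def front_to_back(root, viewpoint, children):
    """Breadth-first front-to-back traversal keyed on the
    viewpoint-to-box distance; works for any spatial hierarchy,
    not just kD-trees. `children(node)` returns the child nodes
    (an empty sequence for leaves)."""
    counter = 0                       # tie-breaker for equal distances
    heap = [(0.0, counter, root)]
    order = []
    while heap:
        _, _, node = heapq.heappop(heap)
        order.append(node)            # queries/rendering would go here
        for child in children(node):
            counter += 1
            heapq.heappush(
                heap, (min_distance(viewpoint, child["box"]), counter, child))
    return order
```

The integer counter keeps heap entries comparable when two nodes are equidistant, which also makes the order deterministic.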

In the context of our culling algorithm, there are two main advantages
of the breadth-first front-to-back traversal:

\begin{itemize}
\item {\em Better query scheduling.} By spreading the traversal of the
  scene in a breadth-first manner, we process the scene in depth
  layers. Within each layer, the node processing order is practically
  independent, which minimizes the problem of occlusion query
  dependence. The breadth-first traversal interleaves
  occlusion-independent nodes, which can provide a more accurate
  visibility classification if visibility changes quickly. In
  particular, it reduces the problem of false classifications due to
  missed occlusion by nodes waiting in the query queue (discussed in
  Section~\ref{sec:latency}).

\item {\em Using other spatial data structures.} By using a
  breadth-first traversal, we are no longer restricted to the
  kD-tree. Instead we can use an arbitrary spatial data structure
  such as a bounding volume hierarchy, octree, grid, hierarchical grid,
  etc. Once we compute the distance from a node to the viewpoint, the
  node processing order is established by the priority queue.
\end{itemize}

When using the priority queue, our culling algorithm can also be
applied directly to the scene graph hierarchy, thus avoiding the
construction of any auxiliary data structure for spatial
partitioning. This is especially important for dynamic scenes, in
which maintaining a spatial classification of moving objects can be
costly.

\subsection{Checking the query result}

The presented algorithm repeatedly checks whether the result of the
oldest occlusion query is available before fetching any node from the
traversal stack (line 6 in Figure~\ref{fig:pseudocode}). Our
practical experiments have shown that the cost of this check is
negligible, and thus it can be used frequently without any performance
penalty. If the cost of this check were significantly higher, we could
delay asking for the query result by a time established by empirical
measurements for the particular hardware. This delay should also
reflect the size of the queried node in order to match the expected
availability of the query result as precisely as possible.
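A size-dependent polling delay of this kind could look as follows. This is a minimal sketch: the constants, the linear cost model, and the `result_available` callback are illustrative assumptions that would have to be calibrated per GPU by measurement:

```python
import time

def estimate_poll_delay(node_area, base_delay=1e-4, area_scale=1e-7):
    """Heuristic delay (in seconds) before first asking for a query
    result: larger nodes rasterize more pixels, so their results tend
    to arrive later. Both constants are illustrative placeholders."""
    return base_delay + area_scale * node_area

def fetch_result(result_available, node_area, poll_interval=1e-5):
    """Sleep through the estimated latency, then poll the (assumed)
    cheap non-blocking availability check until it succeeds."""
    time.sleep(estimate_poll_delay(node_area))
    while not result_available():
        time.sleep(poll_interval)
```

The first sleep matches the expected latency of the query, so the cheap availability check is typically issued only once or twice per query.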

\section{Further Optimizations}
\label{sec:optimizations}

This section discusses a couple of optimizations of our method that
can further improve the overall rendering performance. In contrast to
the basic algorithm from the previous section, these optimizations
rely on user-specified parameters that should be tuned for a
particular scene and hardware configuration.

\subsection{Conservative visibility testing}

The first optimization reduces the number of
visibility tests at the cost of a possible increase in the number of
rendered objects. It is based on the idea of skipping
some occlusion tests of visible nodes. We assume that whenever a node
becomes visible, it remains visible for a number of frames. Within the
given number of frames we avoid issuing occlusion queries and simply
assume the node remains visible~\cite{bittner01:jvca}.

This technique can significantly reduce the number of visibility tests
applied to visible nodes of the hierarchy. Especially in
sparsely occluded scenes, a large number of visible nodes is
tested without any benefit, since most of them
remain visible. On the other hand, we do not immediately capture all
changes from visibility to invisibility, and thus we may render
objects that have already become invisible since the moment the
last occlusion test was issued.

In the simplest case, the number of frames a node is assumed visible
can be a predefined constant. In a more elaborate scenario, this
number should be influenced by the history of the success of occlusion
queries and/or the current speed of camera movement.
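In the simplest constant-based variant, the skipping rule reduces to a per-node frame counter. The field names are illustrative:

```python
def needs_query(node, frame, assumed_visible_frames=2):
    """Return True when the node must be tested this frame. A node
    found visible is assumed to stay visible for
    `assumed_visible_frames` frames, so queries on it are skipped
    during that period; invisible nodes are always tested."""
    if not node["visible"]:
        return True
    return frame - node["last_query_frame"] >= assumed_visible_frames
```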

\subsection{Approximate visibility}

The algorithm as presented computes a conservative visibility
classification with respect to the resolution of the z-buffer. We
can easily modify the algorithm to cull nodes more aggressively
when only a small part of a node is visible: we compare the
number of visible pixels returned by the occlusion query with a
user-specified constant and cull the node if the number drops
below this constant.
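As a sketch, the modified classification is a single threshold test; a threshold of zero recovers the conservative behavior:

```python
def is_visible(visible_pixel_count, pixel_threshold=0):
    """Approximate visibility: classify a node as occluded when the
    occlusion query reports no more visible pixels than the
    user-specified threshold."""
    return visible_pixel_count > pixel_threshold
```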

\subsection{Complete elimination of CPU stalls}
\label{sec:full}

The basic algorithm eliminates CPU stalls unless the traversal stack
is empty. If there is no node to traverse in the traversal stack and
the result of the oldest query in the query queue is still not
available, the algorithm stalls the CPU by waiting for the query result. To
completely eliminate CPU stalls, we can speculatively render some
nodes with undecided visibility. In particular, we select a node from
the query queue and render the geometry associated with the node (or
with the whole subtree if it is an interior node). The node is marked as
rendered, but the associated occlusion query is kept in the queue so that
its result can be fetched later. If we are unlucky and the node remains
invisible, the effort of rendering the node's geometry is wasted. On
the other hand, if the node has become visible, we have used the time
slot before the next query result arrives in an optimal manner.

To avoid spending more time on rendering invisible
nodes than would be spent waiting for the result of the query, we
select the node with the lowest estimated rendering cost and compare
this cost with a user-specified constant. If the cost is larger than
the constant, we conclude that it is too risky to render the node and
wait until the result of the query becomes available.
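The selection rule can be sketched as follows; the `estimated_cost` function and the cost limit are illustrative, since the paper leaves the cost model and the constant to per-scene tuning:

```python
def choose_speculative_node(query_queue, estimated_cost, max_cost):
    """Pick the pending node with the lowest estimated rendering cost
    for speculative rendering. Return None when even the cheapest
    node exceeds the user-specified cost limit, i.e. when waiting
    for the query result is the safer choice."""
    if not query_queue:
        return None
    cheapest = min(query_queue, key=estimated_cost)
    return cheapest if estimated_cost(cheapest) <= max_cost else None
```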

\section{Results}
\label{sec:results}

We have incorporated our method into an OpenGL-based scene graph
library and tested it on three scenes of different types. All tests
were conducted on a PC with a 3.2GHz P4, 1GB of memory, and a
GeForce FX5950 graphics card.

\subsection{Test scenes}

The three test scenes comprise a synthetic arrangement of 5000
randomly positioned teapots (11.6M polygons); an urban environment (1M
polygons); and the UNC power plant model (13M polygons). The test
scenes are depicted in Figure~\ref{fig:scenes}. All scenes were
partitioned using a kD-tree constructed according to the surface-area
heuristic~\cite{MacDonald90}.

Although the teapot scene would intuitively offer good occlusion, it
is a complicated case for occlusion culling to handle. Firstly, the
teapots consist of small triangles, so a culling benefit can only come
from fused occlusion caused by a large number of visible
triangles. Secondly, there are many thin holes through which it
is possible to see quite far into the arrangement of teapots. Thirdly,
the arrangement is long and thin, so along its longer side we can see
almost half of the teapots.

%Although the teapot scene would intuitively seem to offer good
%occlusion, it is one of the more complicated cases for occlusion
%culling. Due to the high number of very dense objects, a very fine
%kD-tree subdivision was necessary, which in turn leads to higher costs
%for testing many nodes in the hierarchy. Additionally, misclassifying
%a single teapot can lead to a significant increase in frame time due
%to the high polygon density (2,304 polygons per teapot). There are also
%many locations where it is possible to see very far into the block of
%teapots. This can be seen in the kD-tree node classification which is
%shown in the accompanying video.

The complete power plant model is quite challenging even to load into
memory, but on the other hand it offers good
occlusion. This scene is an interesting candidate for testing not only
due to its size, but also due to significant changes in visibility and
depth complexity in its different parts.

%as can also be seen in the
%walkthrough. A relatively coarse subdivision of the model triangles
%into about 4,600 nodes was sufficient to capture most occlusion
%interdependencies.

The city scene is a classical target for occlusion culling
algorithms. Due to the urban structure consisting of buildings and
streets, most of the model is occluded when viewed from the
streets. Note that the scene does not contain any detailed geometry
inside the buildings. See Figure~\ref{fig:city_vis} for a
visualization of the visibility classification of the kD-tree nodes
for the city scene.

%% However, we still included a section in the walkthrough where the
%% viewpoint is elevated above the building roofs in order to show that
%% the algorithm continues to work without modification, albeit at some
%% expense. It will be an interesting topic of research to automatically
%% recognize such situations and decrease the frequency of visibility
%% queries or turn the occlusion culling algorithm off completely.

\begin{figure}
\centering \includegraphics[width=0.32\textwidth,draft=\DRAFTIMAGES]{images/city_vis}
\caption{Visibility classification of the kD-tree nodes in the city scene.
The orange nodes were found visible; all the other depicted nodes are invisible.
Note the increasing size of the occluded nodes with increasing distance from the visible set.}
\label{fig:city_vis}
\end{figure}

\subsection{Basic tests}

We have measured the frame times for rendering with only view-frustum culling
(VFC), for the hierarchical stop-and-wait method (S\&W), and for our new
coherent hierarchical culling method (CHC). Additionally, we have
evaluated the frame time of an ``ideal'' algorithm. The ideal algorithm
renders the visible objects found by the S\&W algorithm without
performing any visibility tests. This is an optimal solution with
respect to the given hierarchy, i.e., no occlusion culling algorithm
operating on the same hierarchy can be faster. For the basic tests we
did not apply any of the optimizations discussed in
Section~\ref{sec:optimizations}, since they require user-specified
parameters.

%% In order to obtain repeatable and comparable timings, the process
%% priority of our application was set to high, timings were obtained
%% using the CPU cycle counter instruction, and the OpenGL {\tt Finish()}
%% instruction was executed before each invocation of the timer at the
%% start and the end of a frame. Furthermore, we did not include clear
%% and buffer swapping times in our results, since they depend strongly
%% on the windowing mode used.

For each test scene, we have constructed a walkthrough which is shown
in full in the accompanying
video. Figures~\ref{fig:teapot_walkthrough},~\ref{fig:city_walkthrough},
and~\ref{fig:plant_walkthrough} depict the frame times measured for
the walkthroughs. Note that Figure~\ref{fig:plant_walkthrough} uses a
logarithmic scale to capture the high variations in frame times during
the power plant walkthrough.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/teapotgraph1}
\caption{Frame times for the teapot scene.}\label{fig:teapot_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/citygraph1}
\caption{Frame times for the city walkthrough. Note
the spike around frame 1600, where the viewpoint was elevated above the roofs,
practically eliminating any occlusion.}\label{fig:city_walkthrough}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth,draft=\DRAFTFIGS]{images/plantgraph1log}
\caption{Frame times for the power plant walkthrough.
The plot shows the weakness of the S\&W method: when there is not much occlusion, it becomes slower than VFC (near frame 2200).
The CHC method can keep up even in these situations, and at the same time it can exploit occlusion when it appears
(e.g., near frame 3700).}
\label{fig:plant_walkthrough}
\end{figure}
To better demonstrate the behavior of our algorithm, all walkthroughs
contain sections with both restricted and unrestricted visibility. For
the teapots, we viewed the arrangement along its longer
side (frames 25--90). In the city, we elevated the
viewpoint above the roofs and gained sight over most of the city
(frames 1200--1800). The power plant walkthrough contains several
viewpoints from which a large part of the model is visible (spikes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are slow),
viewpoints along the border of the model directed outwards with low depth complexity (holes in
Figure~\ref{fig:plant_walkthrough} where all algorithms are fast), and
viewpoints inside the power plant with high depth complexity where occlusion culling produces a
significant speedup over VFC (e.g., frame 3800).

As we can see, for a number of frames in the walkthroughs the CHC method
produces a speedup of more than one order of magnitude compared to
VFC. The maximum speedup for the teapot, city, and power
plant walkthroughs is 8, 20, and 70, respectively. We can also observe
that CHC maintains a significant gain over S\&W and in many cases
almost matches the performance of the ideal algorithm. In complicated
scenarios the S\&W method caused a significant slowdown compared to
VFC (e.g., frames 1200--1800 of
Figure~\ref{fig:city_walkthrough}). Even in these cases, the CHC
method maintained a good speedup over VFC except for a small number of
frames.

\begin{table*}
\centering \footnotesize
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline\hline
scene & method & \#queries & wait time [ms] & rendered triangles & frame time [ms] & speedup \\\hline\hline
Teapots & VFC & --- & --- & 11,139,928 & 310.42 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
11,520,000 triangles & S\&W & 4704 & 83.19 & 2,617,801 & 154.95 & 2.3 \\\hhline{~|-|-|-|-|-|-|}
21,639 kD-tree nodes & {\bf CHC} & {\bf 2827} & {\bf 1.31} & {\bf 2,852,514} & {\bf 81.18} & {\bf 4.6} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 2,617,801 & 72.19 & 5.2 \\
\hline\hline
City & VFC & --- & --- & 156,521 & 19.79 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
1,036,146 triangles & S\&W & 663 & 9.49 & 30,594 & 19.9 & 1.5 \\\hhline{~|-|-|-|-|-|-|}
33,195 kD-tree nodes & {\bf CHC} & {\bf 345} & {\bf 0.18} & {\bf 31,034} & {\bf 8.47} & {\bf 4.0} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 30,594 & 4.55 & 6.6 \\
\hline\hline
Power Plant & VFC & --- & --- & 1,556,300 & 138.76 & 1.0 \\\hhline{~|-|-|-|-|-|-|}
12,748,510 triangles & S\&W & 485 & 16.16 & 392,962 & 52.29 & 3.2 \\\hhline{~|-|-|-|-|-|-|}
18,719 kD-tree nodes & {\bf CHC} & {\bf 263} & {\bf 0.70} & {\bf 397,920} & {\bf 38.73} & {\bf 4.7} \\\hhline{~|-|-|-|-|-|-|}
 & Ideal & --- & --- & 392,962 & 36.34 & 5.8 \\\hline\hline
\end{tabular}
\caption{Statistics for the three test scenes. VFC is rendering with only view-frustum culling, S\&W is the
hierarchical stop-and-wait method, CHC is our new method, and Ideal
is a perfect method with respect to the given hierarchy. All values are averages over
all frames (including the speedup).}
\label{tab:averages}
\end{table*}
Next, we summarize the scene statistics and the average values per
frame in Table~\ref{tab:averages}. The table shows the number of
issued occlusion queries, the wait time representing the CPU stalls,
the number of rendered triangles, the total frame time, and the
speedup over VFC.

We can see that the CHC method practically eliminates the CPU stalls
(wait time) compared to the S\&W method. This is paid for by a slight
increase in the number of rendered triangles. For the three
walkthroughs, the CHC method produces average speedups of 4.6, 4.0, and
4.7 over view-frustum culling and average speedups of 2.0, 2.6, and
1.6 over the S\&W method. CHC is only 1.1, 1.7, and 1.2 times
slower than the ideal occlusion culling algorithm. Concerning
accuracy, the increase in the average number of rendered triangles for
the CHC method compared to S\&W was 9\%, 1.4\%, and 1.3\%. This increase
was always outweighed by the reduction of CPU stalls in the tested
walkthroughs.

%% While this number may be very low, the algorithm also incurs a
%% non-negligible overhead which is caused by the graphics card
%% reconfiguration required for the queries and rasterization of the
%% bounding volume geometries.


\subsection{Optimizations}

First of all, we have observed that the technique for the complete
elimination of CPU stalls discussed in Section~\ref{sec:full} has a
very limited scope. In fact, for all our tests the stalls were almost
completely eliminated by the basic algorithm already (see the wait times in
Table~\ref{tab:averages}), and we did not find constants that could
produce an additional speedup using this technique.

The measurements for the other optimizations discussed in
Section~\ref{sec:optimizations} are summarized in
Table~\ref{tab:assumed}. We have measured the average number of issued queries
and the average frame time as functions of the number of frames a node is
assumed visible and of the pixel threshold of approximate visibility. We
have observed that the effectiveness of the optimizations depends strongly on the scene. If
the hierarchy is deep and the geometry associated with a leaf node is
not too complex, conservative visibility testing produces a
significant speedup (city and power plant). For the teapot scene, the
penalty for falsely rendering actually occluded objects became larger
than the savings achieved by the reduction of the number of queries. On the other
hand, since the teapot scene contains complex visible geometry, the
approximate visibility optimization produced a significant
speedup. This is, however, paid for by errors in the image
proportional to the pixel threshold used.

%Good parameters for these
%optimizations have to be estimated on a scene-by-scene basis.

%% , where we compare the reference frame time and
%% number of queries already shown in the previous table to the
%% approximate visibility optimization using a higher pixel threshold,
%% and to the conservative visibility testing optimization assuming a
%% node remains visible for 2 frames, both for the teapot and the city
%% scene. These tests clearly show that a high visibility threshold
%% improves frame times, albeit at the expense of image quality. The
%% benefit of the conservative visibility optimization, is however
%% limited. While the number of queries can be reduced significantly, the
%% number of misclassified objects rises and increases the frame time
%% again.

%% \begin{table}
%% \centering \footnotesize
%% \begin{tabular}{|l|c|c|}
%% \hline\hline
%% & Teapots & City \\\hline\hline
%% frame time CHC & 130.42 & 13.91\\\hline
%% \#queries & 2267 & 94.04\\\hline\hline
%% \multicolumn{3}{|l|}{assume 2 frames visible}\\\hline
%% frame time & 123.84 & 13.27 \\\hline
%% \#queries & 1342 & 42 \\\hline\hline
%% \multicolumn{3}{|l|}{pixel threshold 50 pixels and assume 2 frames visible}\\\hline
%% frame time & 73.22 & 11.47 \\\hline
%% \#queries & 1132 & 50 \\\hline\hline
%% \end{tabular}
%% \caption{Influence of optimizations on two test scenes. All values are averages
%% over all frames, frame times are in [ms].}\label{tab:assumed}
%% \end{table}

\begin{table}
\centering \footnotesize
\begin{tabular}{|l|c|c|c|c|}
\hline\hline
scene & $t_{av}$ & $n_{vp}$ & \#queries & frame time [ms] \\ \hline\hline
\multirow{3}{*}{Teapots} & 0 & 0 & 2827 & 81.18 \\ \hhline{~----}
 & 2 & 0 & 1769 & 86.31 \\ \hhline{~----}
 & 2 & 25 & 1468 & 55.90 \\ \hline\hline
\multirow{3}{*}{City} & 0 & 0 & 345 & 8.47 \\ \hhline{~----}
 & 2 & 0 & 192 & 6.70 \\ \hhline{~----}
 & 2 & 25 & 181 & 6.11 \\ \hline\hline
\multirow{3}{*}{Power Plant} & 0 & 0 & 263 & 38.73 \\ \hhline{~----}
 & 2 & 0 & 126 & 31.17 \\ \hhline{~----}
 & 2 & 25 & 120 & 36.62 \\ \hline\hline
\end{tabular}
\caption{Influence of the optimizations on the CHC method.
$t_{av}$ is the number of frames a node is assumed visible for conservative
visibility testing, and $n_{vp}$ is the pixel threshold for
approximate visibility.}\label{tab:assumed}
\end{table}

\subsection{Comparison to PVS-based rendering}

We also compared the CHC method against precalculated visibility. In
particular, we used a PVS computed by an offline visibility
algorithm~\cite{wonka00}. While the walkthrough using the PVS was
1.26ms faster per frame on average, our method does not require costly
precomputation and can be used at an arbitrary 3D position in the
model, not only within a predefined view space.

\section{Conclusion}
\label{sec:conclusion}

We have presented a method for the optimized scheduling of hardware-accelerated
occlusion queries. The method schedules occlusion queries
so as to minimize both the number of queries and their latency.
This is achieved by exploiting the spatial and temporal coherence of
visibility. Our results show that CPU stalls and GPU starvation
are almost completely eliminated, at the cost of a slight increase in
the number of rendered objects.

%Additionally it schedules the queries in an order
%minimizing the overhead due to the latency of availability of query
%results.

Our technique can be used with practically arbitrary scene-partitioning
data structures such as kD-trees, bounding volume
hierarchies, or hierarchical grids. The implementation of the method
is straightforward, as it uses a simple OpenGL interface to the
hardware occlusion queries. In particular, the method requires no
complicated geometric operations or data structures. The algorithm
is suitable for scenes of arbitrary structure and
requires no preprocessing or scene-dependent tuning.

We have experimentally verified that the method is well suited to the
\nvoq{} supported on current consumer-grade graphics hardware. We
have obtained average speedups of 4.0--4.7 compared to pure view-frustum
culling and of 1.6--2.6 compared to the hierarchical
stop-and-wait application of occlusion queries.

The greatest potential for improving the method lies in a better estimation
of changes in the visibility classification of hierarchy nodes. If
nodes tend to be mostly visible, we could automatically decrease the
frequency of occlusion tests and thus better adapt the method to the
actual occlusion in the scene. Another possibility for improvement is
better tuning for particular graphics hardware by means of a more
accurate rendering cost estimation: skipping occlusion tests for
simple geometry can be faster than issuing comparably expensive
occlusion queries.


\begin{figure*}[htb]
\centerline{
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/teapots}
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
\hfill
\includegraphics[height=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
\hfill
}
\caption{The test scenes: the teapots, the city, and the power plant.}
\label{fig:scenes}
\end{figure*}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/city}
%% }
%% \caption{The city scene.}
%% \label{fig:images1}
%% \end{figure}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.25\textwidth,draft=\DRAFTFIGS]{images/teapots}
%% }
%% \caption{The teapot scene.}
%% \label{fig:images2}
%% \end{figure}

%% \begin{figure}[htb]
%% \centerline{
%% \includegraphics[width=0.2\textwidth,draft=\DRAFTFIGS]{images/pplant}
%% }
%% \caption{The power plant scene.}
%% \label{fig:images3}
%% \end{figure}
