Some parallelization aspects of volume rendering algorithms
-----------------------------------------------------------

J. Kraak
December 8, 1994 (preliminary version)


Introduction
------------

Medical imaging techniques such as MRI, PET and CT produce a pile of 2D images measured in parallel planes, forming a volume of data defined on a grid in 3D space. The data are scalar values S(i,j,k) of some physical quantity. Grids are mostly rectilinear, with gridpoints (xi,yj,zk) defined by xi = i*dx, yj = j*dy, zk = k*dz, with i = 1..Ni, j = 1..Nj, k = 1..Nk. For most medical applications dx = dy. Typically dz = 4*dx, Ni = 256, Nj = 256 and Nk = 64 for current MRI applications, but with the ever improving technology, values of 1024, 1024 and 256 for respectively Ni, Nj and Nk will soon no longer be an exception. Some visualization algorithms require cubic grids (dx = dy = dz), which can be computed by linear interpolation of the rectilinear grid data, e.g. in the z-direction.

The data values S(i,j,k) are either point values at the nodes of the computational gridcells, or voxels representing a constant value for the gridcell. Computational gridcells differ from voxels in that they are defined as the volume contained within the rectilinear box bounded by the eight corner nodes and, importantly, that the scalar function varies through the cell. Often, trilinear interpolation of the form

   S(x,y,z) = a1 + a2*x + a3*y + a4*z + a5*x*y + a6*x*z + a7*y*z + a8*x*y*z

is applied to approximate S within a gridcell.

3D reconstruction

Radiotherapists and other medical specialists are highly skilled at mentally building a 3D image out of a pile of parallel slices. Nevertheless, 3D visualization techniques, the topic of this paper, are important tools to support this 3D reconstruction. Before we consider two 'real' 3D visualization methods, we briefly describe two 2D visualization methods for planar slices through the 3D data volume. These relatively simple and efficient techniques serve as an introduction to the 3D methods.

Slices through a 3D Volume
--------------------------

Slices can be orthogonal to one of the three coordinate axes or have an arbitrary orientation. Orthogonal slices often use the same data values and gridpoints as the intersected data volume. In a slice with an arbitrary orientation, one defines a square grid. Gridvalues are acquired by either taking the data value of the nearest gridpoint of the original 3D data volume, or by performing a trilinear interpolation using the eight corners of the enclosing gridcell.

Contourlines

Contour lines, which connect points with the same contour level, are one of the oldest methods to display scalar data on a regular grid. Height lines on a topographical map are examples of contour lines. A 2D gridcell contains one or more contour lines if some of the values at the four corners are above the contour level and some are below. The intersection of a contour line with a cell edge is computed by linear interpolation. Contour lines inside a gridcell are formed by connecting the intersections with a straight line (sometimes higher order interpolations are used to get smooth contour lines, but this can cause subjective interpretation of the data). Two intersections define a single contour line segment, but four intersections introduce ambiguities - see Fig. 1 - for which the algorithm must find some solution.

Fig. 1   < figure: ambiguity caused by four intersection points >
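As an illustration of the computation per gridcell, the following sketch classifies one 2D gridcell and computes the edge intersections by linear interpolation. It is a minimal sketch in standard Fortran 90, not the code used elsewhere in this paper; the routine name cell_contour and the corner ordering are assumptions, and the resolution of the ambiguous four-intersection case is left to the caller.

   ! Sketch: one marching-squares step for a single 2D gridcell.
   ! f(1:4) are the corner values in counter-clockwise order, starting at the
   ! local corner (0,0); level is the contour level.  On return, npts is the
   ! number of edge intersections (0, 2 or 4) and pts(1:2,1:npts) holds their
   ! local (x,y) coordinates within the unit cell.
   subroutine cell_contour(f, level, npts, pts)
     implicit none
     real,    intent(in)  :: f(4), level
     integer, intent(out) :: npts
     real,    intent(out) :: pts(2,4)
     real, parameter :: xc(4) = (/ 0.0, 1.0, 1.0, 0.0 /)   ! local corner x
     real, parameter :: yc(4) = (/ 0.0, 0.0, 1.0, 1.0 /)   ! local corner y
     integer :: i, j
     real    :: t
     npts = 0
     do i = 1, 4
        j = mod(i, 4) + 1                       ! next corner along the edge
        if ((f(i) - level) * (f(j) - level) < 0.0) then
           t = (level - f(i)) / (f(j) - f(i))   ! linear interpolation
           npts = npts + 1
           pts(1,npts) = xc(i) + t * (xc(j) - xc(i))
           pts(2,npts) = yc(i) + t * (yc(j) - yc(i))
        end if
     end do
   end subroutine cell_contour

Pairs of intersections are then connected by straight line segments, exactly as described above.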
A simple implementation of this algorithm performs independent computations for all gridcells, which makes the algorithm suitable for parallelization. In the MIMD approach the gridcells are partitioned into contiguous blocks, each of which is assigned to a processor performing the contour algorithm, with the results combined at the end. Often, the density of contour lines is not the same for each block, which causes load balancing problems. How to minimize these problems is discussed in the description of the Marching Cubes algorithm below.

Images

A coloured image of the data is formed by assigning (R,G,B,alpha) values to the gridvalues according to some colour look-up table. R, G and B are the red, green and blue colour components in the range 0..1, while alpha is the opacity, which also assumes values between 0 (transparent) and 1 (opaque). The opacity is only used for blending two or more images. Assume that R1 and a1 are the red and alpha values of a pixel of image 1, and that R2 and a2 are the similar values for image 2. Blending the two images, with image 1 lying in front of image 2, results in a red pixel value a1*R1 + (1-a1)*a2*R2. The colouring algorithm is suitable for MIMD parallelization. Because all computations are the same, and not dependent upon the data values as in the case of contour lines, there are no load balance problems.

Isosurfaces
-----------

One standard method of visualizing 3D volume data is to generate and display isosurfaces - surfaces where the data value is constant. Isosurfaces are useful in situations where data values can be related to discrete classifications of the material represented, for example different tissue types in medical MRI data.

The isosurface algorithm is the 3D analogue of the contour algorithm. For successive gridcells, the intersections of the isosurface with the 12 edges of the 3D gridcell are computed using linear interpolation. These intersections are connected to form polygons, which are in general non-planar, but which can be converted into triangles for rendering. In addition, the algorithm estimates the gradient of the scalar field at each intersection and uses this as the surface normal at that point of the surface. These normals are used to compute the shading. Interpolating the normals over the triangles results in a smoothly shaded surface (Gouraud shading).

The computation of the intersections is the most expensive part of the algorithm, as there are 256 different cases to be distinguished, i.e. the data value at each of the 8 corners of the gridcell can be either greater than (or equal to) the isovalue, or below it. The so-called Marching Cubes algorithm reduces this computation by the observation that the 256 cases can be reduced to 14 cases by applying symmetry operations (Lorensen et al. 1987). However, some ambiguities remain, as in the case of the contour algorithm. Some improvements to the original Marching Cubes algorithm have been suggested, e.g. by Mackerras (1992).

MIMD parallelization of the Marching Cubes algorithm

Parallelizing the Marching Cubes algorithm on a MIMD parallel machine is essentially quite straightforward: simply partition the gridcells, or cubes, between the cell processors, execute the serial Marching Cubes code on each cell processor and combine the results.
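The serial kernel executed on each cell processor classifies every cube by an 8-bit case index before looking up its triangulation. The fragment below is a minimal Fortran 90 sketch of this classification step only; the function name cube_case and the corner ordering are assumptions, and the 256-entry case table that maps the index to triangles is omitted.

   ! Sketch: compute the Marching Cubes case index of one gridcell.
   ! s(1:8) are the data values at the 8 cube corners in a fixed order and
   ! isoval is the isovalue.  The index (0..255) selects one of the 256 cube
   ! configurations, which symmetry operations reduce to 14 basic patterns.
   integer function cube_case(s, isoval)
     implicit none
     real, intent(in) :: s(8), isoval
     integer :: i
     cube_case = 0
     do i = 1, 8
        ! set bit i-1 when the corner value is greater than or equal to isoval
        if (s(i) >= isoval) cube_case = cube_case + 2**(i-1)
     end do
   end function cube_case

Cubes with index 0 or 255 are not intersected by the isosurface and produce no triangles, which is one source of the load imbalance discussed below.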
In the implementation of Mackerras (1992), each X-Y plane of data of dimension Ni*Nj is considered to consist of (Ni-1)*(Nj-1) faces of cubes. These faces are divided into rectilinear blocks, each of which is assigned to one processor. Each processor processes the same block on each X-Y plane; thus the blocks are extended in the Z-direction to form 'pillars' with Nk-1 layers in the Z-direction. On the boundaries of the pillars, isosurface intersections and gradient values are computed more than once. It would be possible to reduce these redundant computations with message passing, but the communication cost would likely be greater than the computational cost for any reasonably simple implementation.

Since cubes which are not intersected by the isosurface can be processed much faster than those which are, there can be a considerable variation in processing time per pillar. This is particularly so if the volume contains a region of interest near the centre, with very little of the isosurface occurring near the boundaries of the volume. If each processor only handles one pillar, the load imbalance can typically be as much as 50%, where 'load imbalance' is defined as the proportion of the total execution time which processors spend, on average, waiting for the slowest cell processor.

A solution to this problem is to divide the volume into several times as many pillars as there are processors and to assign the pillars to the processors in an interleaved way. However, a consequence of a finer subdivision is that the amount of redundant boundary computation increases. Thus a trade-off arises between load balance and the amount of computation performed: subdividing more finely improves the load balance but increases the total computational load. Measurements are needed to find the degree of interleaving with a minimal computation time. It is the experience of Mackerras that the Marching Cubes algorithm parallelizes well with the MIMD approach; in fact the speedup obtained was actually greater than the number of processors (presumably due to cache effects).

Ray casting
-----------

The ray casting method is the 3D analogue of the image method for 2D data. An image is generated directly from the volume data, without the intermediate geometrical model that is used in the isosurface method. Typically this is done by mapping the volume data to (R,G,B,alpha) values of an imaginary semi-transparent material, and then rendering an image of the volume filled with this material. Data values of interest can be assigned high opacity values and a specific colour to highlight their location within the volume, while other data can be assigned low opacity values to reduce their importance (Upson et al. 1988, Drebin et al. 1988). It is also possible to render geometric and volumetric primitives together, allowing the inclusion of geometric primitives such as coordinate axes and other reference objects (Hin 1994).

The ray casting algorithm is conceptually straightforward. Parallel rays are cast through the pixels of the image plane - one ray for each pixel - into the volume data. As each ray passes through the volume, the data are sampled at points along the ray, e.g. at evenly spaced intervals. The sampled values can either be obtained by taking the constant voxel value of the gridcell through which the ray passes, or by performing trilinear interpolation. Using a look-up table, these sampled values are translated into colours and opacities. Each of the three colour components is integrated along the ray, thereby blending the successive samples using the opacity values. This blending process can be adequately modelled by the same formula as given above for the blending of two images.
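A minimal Fortran 90 sketch of this per-ray integration is given below, assuming front-to-back traversal and samples that have already been mapped to colour and opacity through the look-up table. The routine name composite_ray and the argument layout are assumptions; this is not the CM Fortran code discussed later.

   ! Sketch: front-to-back compositing of nsamp samples along one ray.
   ! rgb(1:3,k) and alpha(k) are the colour and opacity of sample k, already
   ! obtained from the look-up table.  The accumulation applies the blending
   ! rule for two images, c = a1*c1 + (1-a1)*a2*c2, repeatedly.
   subroutine composite_ray(nsamp, rgb, alpha, pixel)
     implicit none
     integer, intent(in)  :: nsamp
     real,    intent(in)  :: rgb(3,nsamp), alpha(nsamp)
     real,    intent(out) :: pixel(3)
     real    :: trans                   ! transparency accumulated so far
     integer :: k
     pixel = 0.0
     trans = 1.0
     do k = 1, nsamp                    ! k = 1 is nearest to the image plane
        pixel = pixel + trans * alpha(k) * rgb(:,k)
        trans = trans * (1.0 - alpha(k))
        if (trans < 0.001) exit         ! early ray termination
     end do
   end subroutine composite_ray

With back-to-front traversal the same image is obtained by repeatedly applying c = a*c_sample + (1-a)*c_accumulated, but the accumulated opacity is then not available for an early exit.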
How the sampling and integration of the sampled values are performed, and whether the rays are cast from back (volume) to front (image) or from front to back, are the key differences between the various ray casting techniques reported in the literature. The front-to-back technique has the advantage of early ray termination when the accumulated opacity along the ray reaches one (Wilhelms et al. 1991).

Volume ray casting generates high quality images. It is therefore an important, but also a very compute intensive, visualization technique. To be useful for medical applications, it should be possible to compute at interactive speed, to steer the exploration of unknown 3D data volumes. The size of the volume datasets that are produced today, e.g. 256*256*64, and the sizes that will soon be possible, like 1024*1024*256, present an enormous challenge to current computer architectures and techniques. One solution is to build special purpose hardware (Yoo et al. 1992). Another solution is to use a general purpose scalable computer architecture, whereby increases in the size and the number of volume datasets can be dealt with by scaling the architecture with the problem size.

SIMD parallelization

The number of parallel ray casting algorithms for massively parallel computers in the literature, either for MIMD parallelization (Sakas 1993, Corrie et al. 1992) or for SIMD parallelization (Schroeder et al. 1993), is still relatively small. Here we give a description of a SIMD implementation of a simplified form of ray casting for a CM5 massively parallel computer. A more extensive description of this so-called object space algorithm is given in the paper of Schroeder et al. (1993).

An essential aspect of the design of algorithms for the data parallel programming model is to avoid communication overhead between (virtual) processors. If communication cannot be avoided, regular communication is to be preferred over irregular communication. Therefore algorithms should attempt to exploit all regularity in the access patterns of the data, e.g. in the sampling of data along the rays using trilinear interpolation.

Before a more comprehensive algorithm is given, we consider the most regular form of ray casting, where the rays are parallel to one of the coordinate axes, here the z-axis. Additionally, the following assumptions are made:
- do not blend the colours, but simply add the sampled data along the ray;
- let the rays start at the centres of the gridcell faces in the plane z = 0;
- sample one point per cell and assume that the data value is constant within the gridcell.

Then the implementation of ray casting of the data in the three-dimensional array S(N,N,N) in the direction of the z-axis is extremely simple and can be expressed by the following statements in Fortran 90/CM Fortran:

      IM1 = S(:,:,1)
      DO k = 2, N
         IM1 = IM1 + S(:,:,k)
      END DO

IM1 is a two-dimensional array with dimensions (N,N), which can be visualized as an image as described before. The above summation is performed in parallel when each (x,y) position in the x-y plane is assigned to a separate virtual processor. The speed is optimal if all data with the same (x,y) but with different z-values are laid out in the same virtual processor.
This optimal data layout is realized by the following CM Fortran compiler directives:

CMF$ LAYOUT IM1(:NEWS, :NEWS)
CMF$ LAYOUT S(:NEWS, :NEWS, :SERIAL)

Note: the most concise implementation is IM1 = SUM(S, DIM=3), but this is more compute intensive than the implementation given above. The part 'del.eq.0' of the program in Appendix A shows the implementation in CM Fortran for a CM5 massively parallel computer.

Note that the addition of voxel values along array dimensions is one of the options of the module X-ray of the visualization system AVS. Another option is finding the maximum voxel value along the ray. X-ray is useful for the quick exploration of a 3D data volume.

If the rays traverse the volume in an arbitrary direction, then all rays must traverse the voxels in the same regular pattern, with no overlap of voxels, to avoid irregular communication. This can be realized using the idiom of line drawing on discrete grids, the so-called digital differential analyzer (DDA) (Foley et al. 1990). A careful choice of the ray alignment in the image plane leads to a sequence of ray steps through the volume such that the volume is perfectly tiled by the rays. Each ray is then a translated version of a prototype ray with exactly the same stepping pattern through the volume. This idea is essential to all parallel ray casting algorithms.

In order to simplify the discussion, we consider rays rotated around one coordinate axis, here the x-axis. As before, only addition of voxel values is considered. However, all arguments used readily generalize to an arbitrary ray direction and to blending and sampling with trilinear interpolation.

Fig. 2 shows a 2D cross-section of the volume in a plane perpendicular to the x-axis. All rays start from the image plane, but not from the pixels as in the original algorithm: the mutual distances of the rays are chosen so that they intersect the outermost column of voxels at the centres (which causes a distortion in the final image). An example of two non-adjacent rays is given by the shaded voxels. All rays move in lock step along the major axis, i.e. the axis with the greatest step size per voxel traversal.

Fig. 2   < figure: 2D cross-section perpendicular to the X-axis with two example rays indicated by shaded voxels >

If we only consider rays entering through the left face and leaving through the right face, then the addition of the voxel values along the rays from successive voxel planes in the Z-direction can be accomplished by:
- a shift in the Y-direction of the rows of the voxel plane; this shift requires only regular communication between the processors;
- adding (accumulating) the voxel values to a conformable 2D matrix.

The speed of the accumulation can be improved if the conformable 2D matrix is shifted instead of the successive voxel planes. This requires only a shift with a constant step size of +1, -1 or 0 (no shift), instead of a variable shift, which is more time consuming.

Fig. 3   < figure: rays entering at the top of the volume correspond one-to-one to rays leaving at the bottom >

What about rays entering from above, or leaving the volume before the last Z-plane? The answer is based on the observation that rays entering at the top (see Fig. 3) correspond in a one-to-one fashion to rays leaving at the bottom. Conceptually, we tile space with copies of the volume. However, the actual code does not create any copy; instead the same effect is achieved by simply taking advantage of the toroidal connectivity of the processor interconnect. With this concept in mind, we see that there are only rays that start on the front-most face and continue until they exit at the far face.
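The shift schedule per voxel plane follows the DDA idiom mentioned above. The following minimal Fortran 90 sketch, which corresponds to the ishft array computed in Appendix A (the names shift_schedule and kshift are assumptions), shows why the incremental shift between successive voxel planes is always 0 or +/-1 as long as abs(del) <= 1:

   ! Sketch: shift schedule for rays rotated around the X-axis over an angle
   ! arctan(del), with abs(del) <= 1 so that Z is the major axis.  kshift(k)
   ! is the cumulative shift in Y (in voxels) of voxel plane k relative to
   ! plane 1; successive values differ by at most 1, so the accumulation
   ! arrays can be moved with circular shifts of constant step size.
   program shift_schedule
     implicit none
     integer, parameter :: n = 64
     real    :: del, y
     integer :: kshift(n), k
     del = 0.4                     ! example rotation: angle = arctan(0.4)
     y = 0.0
     kshift(1) = 0
     do k = 2, n
        y = y + del                ! exact Y position of the prototype ray
        kshift(k) = nint(y)        ! nearest voxel column, as in a DDA
     end do
     print *, 'incremental shifts:', (kshift(k) - kshift(k-1), k = 2, n)
   end program shift_schedule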
In the 2D example any ray consists of at most two parts: an initial segment and then, if and when the ray leaves the volume, a segment corresponding to a new ray entry. This can be implemented by the following steps:
- circular shifts in the Y-direction, with a fixed step size, of the two 2D arrays IM1 and IM2 (conformable to a voxel plane) that are used to accumulate the voxel values;
- partitioning of each voxel plane into two parts according to the two parts of the rays; one part must be added to the corresponding part of IM1 and the other part to the corresponding part of IM2.

The final image must be scaled in the Y-direction by a factor 1/sqrt(1 + tan(angle)*tan(angle)), i.e. cos(angle), where angle is the angle of the rays with the Z-axis, to correct for the distortion caused by the way the initial ray starting points were chosen. If Y is the major axis of the projection instead of Z, the voxel planes must be transposed in order to use the same program code. Appendix A shows the source code of an implementation of this ray casting algorithm in CM Fortran for a CM5 massively parallel computer.

Rays entering the volume in an arbitrary direction can enter or leave the data volume from three (imaginary) neighbouring copies of the volume, so four accumulation images are needed. The following steps are required (apart from a possible transposition of the initial volume and a scaling of the final image):
- perform stepwise circular shifts in the X- and Y-direction of the four accumulation arrays that are conformable to a voxel plane;
- divide the voxel plane into four parts, each of which has to be added to the corresponding part of one of the four accumulation 2D arrays.

As a final remark: in the SIMD data-parallel programming model it is difficult to incorporate optimizations such as early ray termination without requiring a large communication overhead to perform the desirable load balancing (Schroeder et al. 1993).

References (still incomplete)

Corrie, B., Mackerras, P., Parallel Volume Rendering and Data Coherence on the Fujitsu AP1000, 1992, report ...
Drebin, R.A., Carpenter, L., Hanrahan, P., Volume Rendering, SIGGRAPH '88, ...
Foley, J.D., Van Dam, A., Feiner, S.K., Hughes, J.F., Computer Graphics - Principles and Practice (second edition), Addison-Wesley Publishing Company, 1990.
Hin, A.J.S., Visualization of Turbulent Flow, dissertation, Delft, 1994.
Lorensen, W.E., Cline, H.E., Marching Cubes: a high resolution 3D surface construction algorithm, SIGGRAPH '87, ...
Mackerras, P., A Fast Parallel Marching-Cubes Implementation on the Fujitsu AP1000, 1992, report ...
Sakas, G., Interactive volume rendering of large fields, The Visual Computer, 1993, 9:425-438.
Schroeder, P., Krueger, W., Data parallel volume rendering algorithms for interactive visualization, The Visual Computer, 1993, 9:405-416.
Upson, C., Keeler, M., V-buffer: Visible Volume Rendering, SIGGRAPH '88, ...
Wilhelms, J., Van Gelder, A., A Coherent Projection Approach for Direct Volume Rendering, Computer Graphics, 1991, 25(4):275-283.
Yoo, T., Neumann, U., Fuchs, H., Pizer, S., Cullip, T., Rhoades, J., Whitaker, R., Direct Visualization of Volume Data, IEEE Computer Graphics and Applications, 1992, 12(4):63-71.

Appendix A

      PROGRAM raycast
c This program demonstrates a simple ray casting algorithm:
c add voxels along rays rotated around the X-axis.
c Programmed in CM Fortran for a CM5 massively parallel
c computer in the data parallel programming model.
c Use the next commands to compile at cms.rc.rug.nl
c and to execute the program at cm5.rc.rug.nl:
c     cmf .fcm
c     rsh cm5 time `pwd`/a.out
c This program only prints the execution times; it
c does NOT transpose the data matrix, scale the final image or
c visualize the final image.  Use e.g. the AVS interface routine
c AVSI_2D_scalar_real_uniform to visualize the final image.

      INTEGER, PARAMETER :: n = 256
      REAL, array(n,n,n) :: s
CMF$ LAYOUT s(:NEWS,:NEWS,:SERIAL)
c accumulation arrays:
      REAL, array(n,n) :: im1, im2
CMF$ LAYOUT im1(:NEWS,:NEWS)
CMF$ LAYOUT im2(:NEWS,:NEWS)
      INTEGER, array(n) :: ishft
CMF$ LAYOUT ishft(:SERIAL)

c generate test data:
      FORALL (i = 1:n, j = 1:n, k = 1:n) s(i,j,k) = j

  100 continue
c rotate the rays around the x-axis; the angle with the z-axis is
c arctan(del), with -1 <= del <= 1;
c transpose the volume for other directions ....
      print *
      print *, 'Ray casting with rays rotated around the X-axis,'
      print *, 'rotation angle = arctan(del).'
      print *, 'Enter del'
      print *, '*** abs(del) > 1: STOP'
      read *, del
      if (abs(del).gt.1.0) then
        print *, 'abs(del) > 1, STOP'
        STOP
      end if

c initialize the timer:
      call CM_timer_clear(1)
      call CM_timer_start(1)

      if (del.eq.0) then
c >>>>>>>>>>>> del = 0
        im1 = s(:,:,1)
        DO k = 2, n
          im1 = im1 + s(:,:,k)
        END DO
c <<<<<<<<<<<< del = 0
      else
c >>>>>>>>>>>> del ne 0
c compute the absolute shifts in the y-direction, similar to
c the digital differential analyzer:
        idel = sign(1.0, del)
        ys = 0
        DO k = 2, n
          ys = ys + del
c idel*0.5 causes rounding:
          ishft(k) = ys + idel*0.5
        ENDDO
c initialize the accumulation arrays im1 and im2:
        if (del.gt.0.0) then
          im1 = s(:,:,1)
          im2 = 0
        else
          im1 = 0
          im2 = s(:,:,1)
        end if
c initialize the current shift:
        ish = 0
c DO loop over the voxel planes in the z-direction:
        DO k = 2, n
          if (del.gt.0) then
c stepwise circular shift?
            if (ishft(k).ne.ish) then
              im1 = CSHIFT(im1, DIM=2, SHIFT=1)
              im2 = CSHIFT(im2, DIM=2, SHIFT=1)
              ish = ishft(k)
            end if
            im1(:,1:n-ishft(k)) = im1(:,1:n-ishft(k)) +
     +        s(:,1:n-ishft(k),k)
            im2(:,n-ishft(k)+1:n) = im2(:,n-ishft(k)+1:n) +
     +        s(:,n-ishft(k)+1:n,k)
          else
c stepwise circular shift?
            if (ishft(k).ne.ish) then
              im1 = CSHIFT(im1, DIM=2, SHIFT=-1)
              im2 = CSHIFT(im2, DIM=2, SHIFT=-1)
              ish = ishft(k)
            end if
            im1(:,1:-ishft(k)) = im1(:,1:-ishft(k)) +
     +        s(:,1:-ishft(k),k)
            im2(:,-ishft(k)+1:n) = im2(:,-ishft(k)+1:n) +
     +        s(:,-ishft(k)+1:n,k)
          end if
        ENDDO
c shift the images back to their original position:
        im1 = CSHIFT(im1, DIM=2, SHIFT=idel*(n-abs(ish)))
        im2 = CSHIFT(im2, DIM=2, SHIFT=idel*(n-abs(ish)))
c <<<<<<<<<<<< del ne 0
      end if

c add here:
c - combination of im1 and im2;
c - scaling;
c - visualization.

c print timing information:
      call CM_timer_stop(1)
      call CM_timer_print(1)

c repeat:
      go to 100
      END


Concluding remarks:

1) Experiment with the programs, e.g.:
   - what is the effect of various values of del in program raycast2?
   - change the LAYOUT directives;
   - replace the DO loop in raycast1 by SUM(s, DIM=3) and study the effect of the layout.

2) Copy the data of Kamman as test data ....
   MIMD version of the simple ray casting ?????