Some parallelization aspects of volume rendering algorithms
-----------------------------------------------------------

J. Kraak
December 8, 1994 (preliminary version)


Introduction
------------

Medical imaging techniques such as MRI, PET and CT produce a pile of 2D images measured in parallel planes, forming a volume of data defined on a grid in 3D space. The data are scalar values S(i,j,k) of some physical quantity. Grids are mostly rectilinear, with gridpoints (xi,yj,zk) defined by xi = i*dx, yj = j*dy, zk = k*dz, with i = 1..Ni, j = 1..Nj, k = 1..Nk. For most medical applications dx = dy. Typically dz = 4*dx, Ni = 256, Nj = 256 and Nk = 64 for current MRI applications, but with the ever improving technology, values of 1024, 1024 and 256 for respectively Ni, Nj and Nk will soon no longer be an exception. Some visualization algorithms require cubic grids (dx = dy = dz), which can be computed by linear interpolation of the rectilinear grid data, e.g. in the z-direction.

The data values S(i,j,k) are either point values at the nodes of the computational gridcells, or voxels representing a constant value for the gridcell. Computational gridcells differ from voxels in that they are defined as the volume contained within the rectilinear box bounded by the eight corner nodes and, importantly, that the scalar function varies through the cell. Often, trilinear interpolation of the form

   S(x,y,z) = a1 + a2*x + a3*y + a4*z + a5*x*y + a6*x*z + a7*y*z + a8*x*y*z

is applied to approximate S within a gridcell.

3D reconstruction

Radiotherapists and other medical specialists are highly skilled at mentally building a 3D image out of a pile of parallel slices. Nevertheless, 3D visualization techniques, the topic of this paper, are important tools to support this 3D reconstruction. Before we consider two 'real' 3D visualization methods, we briefly describe two 2D visualization methods for planar slices through the 3D data volume. These relatively simple and efficient techniques serve as an introduction to the 3D methods.

Slices through a 3D Volume
--------------------------

Slices can be orthogonal to one of the three coordinate axes or have an arbitrary orientation. Orthogonal slices often use the same data values and gridpoints as the intersected data volume. In a slice with an arbitrary orientation, one defines a square grid. Gridvalues are acquired by either taking the data value of the nearest gridpoint of the original 3D data volume, or by performing a trilinear interpolation using the eight corners of the enclosing gridcell.

Contourlines

Contour lines, which connect points with the same contour level, are one of the oldest methods to display scalar data on a regular grid. Height lines on a topographical map are examples of contour lines. A 2D gridcell contains one or more contour lines if some of the values at the four corners are above the contour level and some are below. The intersection of a contour line with a cell edge is computed by linear interpolation. Contour lines inside a gridcell are formed by connecting the intersections with a straight line (sometimes higher order interpolations are used to get smooth contour lines, but this can cause subjective interpretation of the data). Two intersections define a single contour line segment, but four intersections introduce ambiguities - see Fig. 1 - for which the algorithm must find some solution.

Fig. 1   < figure: ambiguity caused by four intersection points >
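As an illustration of the computation per gridcell, the following sketch classifies one 2D gridcell and computes the edge intersections by linear interpolation. It is a minimal sketch in standard Fortran 90, not the code used elsewhere in this paper; the routine name cell_contour and the corner ordering are assumptions, and the resolution of the ambiguous four-intersection case is left to the caller.

   ! Sketch: one marching-squares step for a single 2D gridcell.
   ! f(1:4) are the corner values in counter-clockwise order, starting at the
   ! local corner (0,0); level is the contour level.  On return, npts is the
   ! number of edge intersections (0, 2 or 4) and pts(1:2,1:npts) holds their
   ! local (x,y) coordinates within the unit cell.
   subroutine cell_contour(f, level, npts, pts)
     implicit none
     real,    intent(in)  :: f(4), level
     integer, intent(out) :: npts
     real,    intent(out) :: pts(2,4)
     real, parameter :: xc(4) = (/ 0.0, 1.0, 1.0, 0.0 /)   ! local corner x
     real, parameter :: yc(4) = (/ 0.0, 0.0, 1.0, 1.0 /)   ! local corner y
     integer :: i, j
     real    :: t
     npts = 0
     do i = 1, 4
        j = mod(i, 4) + 1                       ! next corner along the edge
        if ((f(i) - level) * (f(j) - level) < 0.0) then
           t = (level - f(i)) / (f(j) - f(i))   ! linear interpolation
           npts = npts + 1
           pts(1,npts) = xc(i) + t * (xc(j) - xc(i))
           pts(2,npts) = yc(i) + t * (yc(j) - yc(i))
        end if
     end do
   end subroutine cell_contour

Pairs of intersections are then connected by straight line segments, exactly as described above.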
A simple implementation of this algorithm performs independent computations for all gridcells, which makes the algorithm suitable for parallelization. In the MIMD approach the gridcells are partitioned into contiguous blocks, each of which is assigned to a processor performing the contour algorithm, with the results combined at the end. Often, the density of contour lines is not the same for each block, which causes load balancing problems. How to minimize these problems is discussed in the description of the Marching Cubes algorithm below.

Images

A coloured image of the data is formed by assigning (R,G,B,alpha) values to the gridvalues according to some colour look-up table. R, G and B are the red, green and blue colour components in the range 0..1, while alpha is the opacity, which also assumes values between 0 (transparent) and 1 (opaque). The opacity is only used for blending two or more images. Assume that R1 and a1 are the red and alpha values of a pixel of image 1, and that R2 and a2 are the similar values for image 2. Blending the two images, with image 1 lying in front of image 2, results in a red pixel value a1*R1 + (1-a1)*a2*R2. The colouring algorithm is suitable for MIMD parallelization. Because all computations are the same, and not dependent upon the data values as in the case of contour lines, there are no load balance problems.

Isosurfaces
-----------

One standard method of visualizing 3D volume data is to generate and display isosurfaces - surfaces where the data value is constant. Isosurfaces are useful in situations where data values can be related to discrete classifications of the material represented, for example different tissue types in medical MRI data.

The isosurface algorithm is the 3D analogue of the contour algorithm. For successive gridcells, the intersections of the isosurface with the 12 edges of the 3D gridcell are computed using linear interpolation. These intersections are connected to form polygons, which are in general non-planar, but which can be converted into triangles for rendering. In addition, the algorithm estimates the gradient of the scalar field at each intersection and uses this as the surface normal at that point of the surface. These normals are used to compute the shading. Interpolating the normals over the triangles results in a smoothly shaded surface (Gouraud shading).

The computation of the intersections is the most expensive part of the algorithm, as there are 256 different cases to be distinguished, i.e. the data value at each of the 8 corners of the gridcell can be either greater than (or equal to) the isovalue, or below it. The so-called Marching Cubes algorithm reduces this computation by the observation that the 256 cases can be reduced to 14 cases by applying symmetry operations (Lorensen et al. 1987). However, some ambiguities remain, as in the case of the contour algorithm. Some improvements to the original Marching Cubes algorithm have been suggested, e.g. by Mackerras (1992).

MIMD parallelization of the Marching Cubes algorithm

Parallelizing the Marching Cubes algorithm on a MIMD parallel machine is essentially quite straightforward: simply partition the gridcells, or cubes, between the cell processors, execute the serial Marching Cubes code on each cell processor and combine the results.
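The serial kernel executed on each cell processor classifies every cube by an 8-bit case index before looking up its triangulation. The fragment below is a minimal Fortran 90 sketch of this classification step only; the function name cube_case and the corner ordering are assumptions, and the 256-entry case table that maps the index to triangles is omitted.

   ! Sketch: compute the Marching Cubes case index of one gridcell.
   ! s(1:8) are the data values at the 8 cube corners in a fixed order and
   ! isoval is the isovalue.  The index (0..255) selects one of the 256 cube
   ! configurations, which symmetry operations reduce to 14 basic patterns.
   integer function cube_case(s, isoval)
     implicit none
     real, intent(in) :: s(8), isoval
     integer :: i
     cube_case = 0
     do i = 1, 8
        ! set bit i-1 when the corner value is greater than or equal to isoval
        if (s(i) >= isoval) cube_case = cube_case + 2**(i-1)
     end do
   end function cube_case

Cubes with index 0 or 255 are not intersected by the isosurface and produce no triangles, which is one source of the load imbalance discussed below.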
In the implementation of Mackerras (1992), each X-Y plane of data of dimension Ni*Nj is considered to consist of (Ni-1)*(Nj-1) faces of cubes. These faces are divided into rectilinear blocks, each of which is assigned to one processor. Each processor processes the same block on each X-Y plane; thus the blocks are extended in the Z-direction to form 'pillars' with Nk-1 layers in the Z-direction. On the boundaries of the pillars, isosurface intersections and gradient values are computed more than once. It would be possible to reduce these redundant computations with message passing, but the communication cost would likely be greater than the computational cost for any reasonably simple implementation.

Since cubes which are not intersected by the isosurface can be processed much faster than those which are, there can be a considerable variation in processing time per pillar. This is particularly so if the volume contains a region of interest near the centre, with very little of the isosurface occurring near the boundaries of the volume. If each processor only handles one pillar, the load imbalance can typically be as much as 50%, where 'load imbalance' is defined as the proportion of the total execution time which processors spend, on average, waiting for the slowest cell processor.

A solution to this problem is to divide the volume into several times as many pillars as there are processors and to assign the pillars to the processors in an interleaved way. However, a consequence of a finer subdivision is that the amount of redundant boundary computation increases. Thus a trade-off arises between load balance and the amount of computation performed: subdividing more finely improves the load balance but increases the total computational load. Measurements are needed to find the degree of interleaving with a minimal computation time. It is the experience of Mackerras that the Marching Cubes algorithm parallelizes well with the MIMD approach; in fact the speedup obtained was actually greater than the number of processors (presumably due to cache effects).

Ray casting
-----------

The ray casting method is the 3D analogue of the image method for 2D data. An image is generated directly from the volume data, without the intermediate geometrical model that is used in the isosurface method. Typically this is done by mapping the volume data to (R,G,B,alpha) values of an imaginary semi-transparent material, and then rendering an image of the volume filled with this material. Data values of interest can be assigned high opacity values and a specific colour to highlight their location within the volume, while other data can be assigned low opacity values to reduce their importance (Upson et al. 1988, Drebin et al. 1988). It is also possible to render geometric and volumetric primitives together, allowing the inclusion of geometric primitives such as coordinate axes and other reference objects (Hin 1994).

The ray casting algorithm is conceptually straightforward. Parallel rays are cast through the pixels of the image plane - one ray for each pixel - into the volume data. As each ray passes through the volume, the data are sampled at points along the ray, e.g. at evenly spaced intervals. The sampled values can either be obtained by taking the constant voxel value of the gridcell through which the ray passes, or by performing trilinear interpolation. Using a look-up table, these sampled values are translated into colours and opacities. Each of the three colour components is integrated along the ray, thereby blending the successive samples using the opacity values. This blending process can be adequately modelled by the same formula as given above for the blending of two images.
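A minimal Fortran 90 sketch of this per-ray integration is given below, assuming front-to-back traversal and samples that have already been mapped to colour and opacity through the look-up table. The routine name composite_ray and the argument layout are assumptions; this is not the CM Fortran code discussed later.

   ! Sketch: front-to-back compositing of nsamp samples along one ray.
   ! rgb(1:3,k) and alpha(k) are the colour and opacity of sample k, already
   ! obtained from the look-up table.  The accumulation applies the blending
   ! rule for two images, c = a1*c1 + (1-a1)*a2*c2, repeatedly.
   subroutine composite_ray(nsamp, rgb, alpha, pixel)
     implicit none
     integer, intent(in)  :: nsamp
     real,    intent(in)  :: rgb(3,nsamp), alpha(nsamp)
     real,    intent(out) :: pixel(3)
     real    :: trans                   ! transparency accumulated so far
     integer :: k
     pixel = 0.0
     trans = 1.0
     do k = 1, nsamp                    ! k = 1 is nearest to the image plane
        pixel = pixel + trans * alpha(k) * rgb(:,k)
        trans = trans * (1.0 - alpha(k))
        if (trans < 0.001) exit         ! early ray termination
     end do
   end subroutine composite_ray

With back-to-front traversal the same image is obtained by repeatedly applying c = a*c_sample + (1-a)*c_accumulated, but the accumulated opacity is then not available for an early exit.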
How the sampling and integration of the sampled values are performed, and whether the rays are cast from back (volume) to front (image) or from front to back, are the key differences between the various ray casting techniques reported in the literature. The front-to-back technique has the advantage of early ray termination when the accumulated opacity along the ray reaches one (Wilhelms et al. 1991).

Volume ray casting generates high quality images. It is therefore an important, but also a very compute intensive, visualization technique. To be useful for medical applications, it should be possible to compute at interactive speed, to steer the exploration of unknown 3D data volumes. The size of the volume datasets that are produced today, e.g. 256*256*64, and the sizes that will soon be possible, like 1024*1024*256, present an enormous challenge to current computer architectures and techniques. One solution is to build special purpose hardware (Yoo et al. 1992). Another solution is to use a general purpose scalable computer architecture, whereby increases in the size and the number of volume datasets can be dealt with by scaling the architecture with the problem size.

SIMD parallelization

The number of parallel ray casting algorithms for massively parallel computers in the literature, either for MIMD parallelization (Sakas 1993, Corrie et al. 1992) or for SIMD parallelization (Schroeder et al. 1993), is still relatively small. Here we give a description of a SIMD implementation of a simplified form of ray casting for a CM5 massively parallel computer. A more extensive description of this so-called object space algorithm is given in the paper of Schroeder et al. (1993).

An essential aspect of the design of algorithms for the data parallel programming model is to avoid communication overhead between (virtual) processors. If communication cannot be avoided, regular communication is to be preferred over irregular communication. Therefore algorithms should attempt to exploit all regularity in the access patterns of the data, e.g. in the sampling of data along the rays using trilinear interpolation.

Before a more comprehensive algorithm is given, we consider the most regular form of ray casting, where the rays are parallel to one of the coordinate axes, here the z-axis. Additionally, the following assumptions are made:
- do not blend the colours, but simply add the sampled data along the ray;
- let the rays start at the centres of the gridcell faces in the plane z = 0;
- sample one point per cell and assume that the data value is constant within the gridcell.

Then the implementation of ray casting of the data in the three-dimensional array S(N,N,N) in the direction of the z-axis is extremely simple and can be expressed by the following statements in Fortran 90/CM Fortran:

      IM1 = S(:,:,1)
      DO k = 2, N
         IM1 = IM1 + S(:,:,k)
      END DO

IM1 is a two-dimensional array with dimensions (N,N), which can be visualized as an image as described before. The above summation is performed in parallel when each (x,y) position in the x-y plane is assigned to a separate virtual processor. The speed is optimal if all data with the same (x,y) but with different z-values are laid out in the same virtual processor.
This optimal data layout is realized by the following CM Fortran compiler directives:

CMF$ LAYOUT IM1(:NEWS, :NEWS)
CMF$ LAYOUT S(:NEWS, :NEWS, :SERIAL)

Note: the most concise implementation is IM1 = SUM(S, DIM=3), but this is more compute intensive than the implementation given above. The part 'del.eq.0' of the program in Appendix A shows the implementation in CM Fortran for a CM5 massively parallel computer.

Note that the addition of voxel values along array dimensions is one of the options of the module X-ray of the visualization system AVS. Another option is finding the maximum voxel value along the ray. X-ray is useful for the quick exploration of a 3D data volume.

If the rays traverse the volume in an arbitrary direction, then all rays must traverse the voxels in the same regular pattern, with no overlap of voxels, to avoid irregular communication. This can be realized using the idiom of line drawing on discrete grids, the so-called digital differential analyzer (DDA) (Foley et al. 1990). A careful choice of the ray alignment in the image plane leads to a sequence of ray steps through the volume such that the volume is perfectly tiled by the rays. Each ray is then a translated version of a prototype ray with exactly the same stepping pattern through the volume. This idea is essential to all parallel ray casting algorithms.

In order to simplify the discussion, we consider rays rotated around one coordinate axis, here the x-axis. As before, only addition of voxel values is considered. However, all arguments used readily generalize to an arbitrary ray direction and to blending and sampling with trilinear interpolation.

Fig. 2 shows a 2D cross-section of the volume in a plane perpendicular to the x-axis. All rays start from the image plane, but not from the pixels as in the original algorithm: the mutual distances of the rays are chosen so that they intersect the outermost column of voxels at the centres (which causes a distortion in the final image). An example of two non-adjacent rays is given by the shaded voxels. All rays move in lock step along the major axis, i.e. the axis with the greatest step size per voxel traversal.

Fig. 2   < figure: 2D cross-section perpendicular to the X-axis with two example rays indicated by shaded voxels >

If we only consider rays entering through the left face and leaving through the right face, then the addition of the voxel values along the rays from successive voxel planes in the Z-direction can be accomplished by:
- a shift in the Y-direction of the rows of the voxel plane; this shift requires only regular communication between the processors;
- adding (accumulating) the voxel values to a conformable 2D matrix.

The speed of the accumulation can be improved if the conformable 2D matrix is shifted instead of the successive voxel planes. This requires only a shift with a constant step size of +1, -1 or 0 (no shift), instead of a variable shift, which is more time consuming.

Fig. 3   < figure: rays entering at the top of the volume correspond one-to-one to rays leaving at the bottom >

What about rays entering from above, or leaving the volume before the last Z-plane? The answer is based on the observation that rays entering at the top (see Fig. 3) correspond in a one-to-one fashion to rays leaving at the bottom. Conceptually, we tile space with copies of the volume. However, the actual code does not create any copy; instead the same effect is achieved by simply taking advantage of the toroidal connectivity of the processor interconnect. With this concept in mind, we see that there are only rays that start on the front-most face and continue until they exit at the far face.
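The shift schedule per voxel plane follows the DDA idiom mentioned above. The following minimal Fortran 90 sketch, which corresponds to the ishft array computed in Appendix A (the names shift_schedule and kshift are assumptions), shows why the incremental shift between successive voxel planes is always 0 or +/-1 as long as abs(del) <= 1:

   ! Sketch: shift schedule for rays rotated around the X-axis over an angle
   ! arctan(del), with abs(del) <= 1 so that Z is the major axis.  kshift(k)
   ! is the cumulative shift in Y (in voxels) of voxel plane k relative to
   ! plane 1; successive values differ by at most 1, so the accumulation
   ! arrays can be moved with circular shifts of constant step size.
   program shift_schedule
     implicit none
     integer, parameter :: n = 64
     real    :: del, y
     integer :: kshift(n), k
     del = 0.4                     ! example rotation: angle = arctan(0.4)
     y = 0.0
     kshift(1) = 0
     do k = 2, n
        y = y + del                ! exact Y position of the prototype ray
        kshift(k) = nint(y)        ! nearest voxel column, as in a DDA
     end do
     print *, 'incremental shifts:', (kshift(k) - kshift(k-1), k = 2, n)
   end program shift_schedule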
In the 2D example any ray consists of at most two parts: an initial segment and then, if and when the ray leaves the volume, a segment corresponding to a new ray entry. This can be implemented by the following steps:
- circular shifts in the Y-direction, with a fixed step size, of the two 2D arrays IM1 and IM2 (conformable to a voxel plane) that are used to accumulate the voxel values;
- partitioning of each voxel plane into two parts according to the two parts of the rays; one part must be added to the corresponding part of IM1 and the other part to the corresponding part of IM2.

The final image must be scaled in the Y-direction by a factor 1/sqrt(1 + tan(angle)*tan(angle)), i.e. cos(angle), where angle is the angle of the rays with the Z-axis, to correct for the distortion caused by the way the initial ray starting points were chosen. If Y is the major axis of the projection instead of Z, the voxel planes must be transposed in order to use the same program code. Appendix A shows the source code of an implementation of this ray casting algorithm in CM Fortran for a CM5 massively parallel computer.

Rays entering the volume in an arbitrary direction can enter or leave the data volume from three (imaginary) neighbouring copies of the volume, so four accumulation images are needed. The following steps are required (apart from a possible transposition of the initial volume and a scaling of the final image):
- perform stepwise circular shifts in the X- and Y-direction of the four accumulation arrays that are conformable to a voxel plane;
- divide the voxel plane into four parts, each of which has to be added to the corresponding part of one of the four accumulation 2D arrays.

As a final remark: in the SIMD data-parallel programming model it is difficult to incorporate optimizations such as early ray termination without requiring a large communication overhead to perform the desirable load balancing (Schroeder et al. 1993).

References (still incomplete)

Corrie, B., Mackerras, P., Parallel Volume Rendering and Data Coherence on the Fujitsu AP1000, 1992, report ...
Drebin, R.A., Carpenter, L., Hanrahan, P., Volume Rendering, SIGGRAPH '88, ...
Foley, J.D., Van Dam, A., Feiner, S.K., Hughes, J.F., Computer Graphics - Principles and Practice (second edition), Addison-Wesley Publishing Company, 1990.
Hin, A.J.S., Visualization of Turbulent Flow, dissertation, Delft, 1994.
Lorensen, W.E., Cline, H.E., Marching Cubes: a high resolution 3D surface construction algorithm, SIGGRAPH '87, ...
Mackerras, P., A Fast Parallel Marching-Cubes Implementation on the Fujitsu AP1000, 1992, report ...
Sakas, G., Interactive volume rendering of large fields, The Visual Computer, 1993, 9:425-438.
Schroeder, P., Krueger, W., Data parallel volume rendering algorithms for interactive visualization, The Visual Computer, 1993, 9:405-416.
Upson, C., Keeler, M., V-buffer: Visible Volume Rendering, SIGGRAPH '88, ...
Wilhelms, J., Van Gelder, A., A Coherent Projection Approach for Direct Volume Rendering, Computer Graphics, 1991, 25(4):275-283.
Yoo, T., Neumann, U., Fuchs, H., Pizer, S., Cullip, T., Rhoades, J., Whitaker, R., Direct Visualization of Volume Data, IEEE Computer Graphics and Applications, 1992, 12(4):63-71.

Appendix A

      PROGRAM raycast
c This program demonstrates a simple ray casting algorithm:
c add voxels along rays rotated around the X-axis.
c Programmed in CM Fortran for a CM5 massively parallel
c computer in the data parallel programming model.
c Use the next commands to compile at cms.rc.rug.nl
c and to execute the program at cm5.rc.rug.nl:
c     cmf .fcm
c     rsh cm5 time `pwd`/a.out
c This program only prints the execution times; it
c does NOT transpose the data matrix, scale the final image or
c visualize the final image.  Use e.g. the AVS interface routine
c AVSI_2D_scalar_real_uniform to visualize the final image.

      INTEGER, PARAMETER :: n = 256
      REAL, array(n,n,n) :: s
CMF$ LAYOUT s(:NEWS,:NEWS,:SERIAL)
c accumulation arrays:
      REAL, array(n,n) :: im1, im2
CMF$ LAYOUT im1(:NEWS,:NEWS)
CMF$ LAYOUT im2(:NEWS,:NEWS)
      INTEGER, array(n) :: ishft
CMF$ LAYOUT ishft(:SERIAL)

c generate test data:
      FORALL (i = 1:n, j = 1:n, k = 1:n) s(i,j,k) = j

  100 continue
c rotate the rays around the x-axis; the angle with the z-axis is
c arctan(del), with -1 <= del <= 1;
c transpose the volume for other directions ....
      print *
      print *, 'Ray casting with rays rotated around the X-axis,'
      print *, 'rotation angle = arctan(del).'
      print *, 'Enter del'
      print *, '*** abs(del) > 1: STOP'
      read *, del
      if (abs(del).gt.1.0) then
        print *, 'abs(del) > 1, STOP'
        STOP
      end if

c initialize the timer:
      call CM_timer_clear(1)
      call CM_timer_start(1)

      if (del.eq.0) then
c >>>>>>>>>>>> del = 0
        im1 = s(:,:,1)
        DO k = 2, n
          im1 = im1 + s(:,:,k)
        END DO
c <<<<<<<<<<<< del = 0
      else
c >>>>>>>>>>>> del ne 0
c compute the absolute shifts in the y-direction, similar to
c the digital differential analyzer:
        idel = sign(1.0, del)
        ys = 0
        DO k = 2, n
          ys = ys + del
c idel*0.5 causes rounding:
          ishft(k) = ys + idel*0.5
        ENDDO
c initialize the accumulation arrays im1 and im2:
        if (del.gt.0.0) then
          im1 = s(:,:,1)
          im2 = 0
        else
          im1 = 0
          im2 = s(:,:,1)
        end if
c initialize the current shift:
        ish = 0
c DO loop over the voxel planes in the z-direction:
        DO k = 2, n
          if (del.gt.0) then
c stepwise circular shift?
            if (ishft(k).ne.ish) then
              im1 = CSHIFT(im1, DIM=2, SHIFT=1)
              im2 = CSHIFT(im2, DIM=2, SHIFT=1)
              ish = ishft(k)
            end if
            im1(:,1:n-ishft(k)) = im1(:,1:n-ishft(k)) +
     +        s(:,1:n-ishft(k),k)
            im2(:,n-ishft(k)+1:n) = im2(:,n-ishft(k)+1:n) +
     +        s(:,n-ishft(k)+1:n,k)
          else
c stepwise circular shift?
            if (ishft(k).ne.ish) then
              im1 = CSHIFT(im1, DIM=2, SHIFT=-1)
              im2 = CSHIFT(im2, DIM=2, SHIFT=-1)
              ish = ishft(k)
            end if
            im1(:,1:-ishft(k)) = im1(:,1:-ishft(k)) +
     +        s(:,1:-ishft(k),k)
            im2(:,-ishft(k)+1:n) = im2(:,-ishft(k)+1:n) +
     +        s(:,-ishft(k)+1:n,k)
          end if
        ENDDO
c shift the images back to their original position:
        im1 = CSHIFT(im1, DIM=2, SHIFT=idel*(n-abs(ish)))
        im2 = CSHIFT(im2, DIM=2, SHIFT=idel*(n-abs(ish)))
c <<<<<<<<<<<< del ne 0
      end if

c add here:
c - combination of im1 and im2;
c - scaling;
c - visualization.

c print timing information:
      call CM_timer_stop(1)
      call CM_timer_print(1)

c repeat:
      go to 100
      END


Concluding remarks:

1) Experiment with the programs, e.g.:
   - what is the effect of various values of del in program raycast2?
   - change the LAYOUT directives;
   - replace the DO loop in raycast1 by SUM(s, DIM=3) and study the effect of the layout.

2) Copy the data of Kamman as test data ....
   MIMD version of the simple ray casting ?????