DevMaster.net Forums

Old 09-11-2004, 11:08 AM   #1
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

It's quite surprising that the most elementary rendering operations, like rasterization, are so poorly documented. Graphics cards take care of most of this, so people have forgotten how it actually works. But there are still many reasons why understanding the inner workings of the graphics pipeline is useful: not only for software rendering, but for any situation where this type of operation is required. Knowledge of these algorithms simply makes you a better graphics programmer.

Today I present the theory and implementation of an advanced rasterizer. It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it makes it hard to process pixels in parallel: it identifies filled scanlines, which is only suited for processing one pixel at a time. A much more efficient approach is to process blocks of 2x2 pixels together, called quads. By sharing some setup cost per quad, and using parallel instructions, this results in a significant speedup. Some current graphics hardware also uses quad pixel pipelines.

Our starting point is the half-space function approach. Before I go into the details, all you have to know is that a half-space function is positive on one side of a line and negative on the other. So it splits the screen in half, hence the name. Here's a small illustration, where the edges of a triangle each split the screen into a positive and a negative part: half-space.png. Locations where all three half-space functions are positive define the inside of the triangle. When a half-space function is zero, we're on an edge. When any of them is negative, we are outside the triangle. So this can be used for rasterization: if we can find formulas for the half-space functions and evaluate them for each pixel location, we know which pixels need to be filled.

So what formula is positive on one side of a line and negative on the other? Surprisingly, the equation of a line is exactly what we're looking for. For a segment with starting coordinates (x1, y1) and end coordinates (x2, y2), the equation is:

(x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) = 0
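To make the sign behaviour concrete, here's a tiny standalone sketch (the halfSpace helper name and the sample segment are my own illustration, not part of the renderer) that evaluates the left-hand side for a few points around a horizontal segment:
Code:
// Left-hand side of the line equation above: positive on one side of the
// line, negative on the other, and zero on the line itself.
float halfSpace(float x1, float y1, float x2, float y2, float x, float y)
{
    return (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1);
}

// For the segment (0,0)-(4,0):
//   halfSpace(0,0, 4,0, 2, 1)  returns  4   (one side)
//   halfSpace(0,0, 4,0, 2,-1)  returns -4   (other side)
//   halfSpace(0,0, 4,0, 4, 0)  returns  0   (on the line)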

You can easily verify that this equation is true for any point (x, y) on this line (for example the start and end points themselves). But the left-hand side also behaves perfectly as a half-space function: as the small sketch above illustrates, it is positive for (x, y) coordinates on one side, and negative on the other. It's zero on the line itself, which effectively defines the line in the above equation. Verifying this behaviour is left as an exercise for the reader. Let's write some code!
Code:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        for(int x = minx; x < maxx; x++)
        {
            // When all half-space functions are positive, the pixel is inside the triangle
            if((x1 - x2) * (y - y1) - (y1 - y2) * (x - x1) > 0 &&
               (x2 - x3) * (y - y2) - (y2 - y3) * (x - x2) > 0 &&
               (x3 - x1) * (y - y3) - (y3 - y1) * (x - x3) > 0)
            {
                colorBuffer[x] = 0x00FFFFFF;   // White
            }
        }

        (char*&)colorBuffer += stride;
    }
}
This is the simplest working implementation of the half-space functions algorithm. Before I continue explaining how it can be improved, there are a few things to note. It would be possible to scan the whole screen to see which pixels are inside the triangle, but that is of course a waste of time, so only the smallest rectangle surrounding the triangle is scanned. You might also have noticed that I swapped some components in the formula; this is because in screen coordinates, the y-axis points downward. It is also very important that the triangle's vertices are in counter-clockwise order, which makes sure that the positive side of every half-space points inside the triangle.
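The code above simply assumes the caller supplies the vertices in counter-clockwise order. As a hedged illustration of how that could be enforced (this pre-pass helper is my own addition, not part of the original code), you can evaluate the first edge's half-space function at the third vertex and flip the winding when it comes out negative:
Code:
// Hypothetical pre-pass: ensure counter-clockwise winding in screen space
// (y pointing down) before rasterizing. For a correctly wound triangle, the
// third vertex lies on the positive side of the edge from v1 to v2.
void ensureCounterClockwise(Vertex &v1, Vertex &v2, Vertex &v3)
{
    // First edge's half-space function evaluated at the third vertex
    float e = (v1.x - v2.x) * (v3.y - v1.y) - (v1.y - v2.y) * (v3.x - v1.x);

    if(e < 0)
    {
        // Wrong winding: swapping any two vertices flips it
        Vertex t = v2;
        v2 = v3;
        v3 = t;
    }
}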

Unfortunately, this implementation isn't fast at all. For every pixel, six multiplications and no less than fifteen subtractions are required. But don't worry, this can be optimized greatly. Per horizontal line that we scan, only the x component in the formula changes. For the first edge, this means the half-space function value changes by exactly (y1 - y2) from one pixel to the next. That's just a single addition or subtraction per edge, per pixel! Furthermore, the vertex coordinates are 'constant' per triangle, so (y1 - y2) and all the other pairs like it only have to be computed once per triangle. Stepping in the y direction is likewise just the addition of a constant. Let's implement this:
Code:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    float y1 = v1.y;
    float y2 = v2.y;
    float y3 = v3.y;

    float x1 = v1.x;
    float x2 = v2.x;
    float x3 = v3.x;

    // Deltas
    float Dx12 = x1 - x2;
    float Dx23 = x2 - x3;
    float Dx31 = x3 - x1;

    float Dy12 = y1 - y2;
    float Dy23 = y2 - y3;
    float Dy31 = y3 - y1;

    // Bounding rectangle
    int minx = (int)min(x1, x2, x3);
    int maxx = (int)max(x1, x2, x3);
    int miny = (int)min(y1, y2, y3);
    int maxy = (int)max(y1, y2, y3);

    (char*&)colorBuffer += miny * stride;

    // Constant part of half-edge functions
    float C1 = Dy12 * x1 - Dx12 * y1;
    float C2 = Dy23 * x2 - Dx23 * y2;
    float C3 = Dy31 * x3 - Dx31 * y3;

    float Cy1 = C1 + Dx12 * miny - Dy12 * minx;
    float Cy2 = C2 + Dx23 * miny - Dy23 * minx;
    float Cy3 = C3 + Dx31 * miny - Dy31 * minx;

    // Scan through bounding rectangle
    for(int y = miny; y < maxy; y++)
    {
        // Start value for horizontal scan
        float Cx1 = Cy1;
        float Cx2 = Cy2;
        float Cx3 = Cy3;

        for(int x = minx; x < maxx; x++)
        {
            if(Cx1 > 0 && Cx2 > 0 && Cx3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF;   // White
            }

            Cx1 -= Dy12;
            Cx2 -= Dy23;
            Cx3 -= Dy31;
        }

        Cy1 += Dx12;
        Cy2 += Dx23;
        Cy3 += Dx31;

        (char*&)colorBuffer += stride;
    }
}
The C1-C3 variables are the constant part of the half-edge equations and are not recomputed per pixel. The Cy1-Cy3 variables hold the 'starting value' of the half-space functions for the current row, beginning at the top of the bounding rectangle. And finally the Cx1-Cx3 variables are the values during the horizontal scan; only these are updated (actually decremented) with the delta values every pixel. So testing whether a pixel is inside the triangle has become really cheap: aside from some setup per horizontal line and per triangle, which is negligible, it's just three subtractions and three comparisons against zero.

Ok, so this algorithm is starting to show its usefulness. But there still are some serious issues to solve. Here's an actual image created using this code: first_try.png. If it isn't clear that this is actually a car, here's the shaded reference image: reference.png. As you can see, there are many gaps between the polygons. The cause is twofold: precision, and the fill convention.

There are precision issues because floating-point numbers are quite limited: they have only 24 bits of mantissa precision. If you look at the half-space function, you see that we do several subtractions and multiplications, which is the perfect recipe for precision problems. What most people would do in such a situation is use double-precision floating-point numbers. While that makes the issue unnoticeable, it isn't exactly correct either, and it's also slower. The real solution might sound crazy: use integers! Integers have 32 bits of precision and do not suffer from precision loss when subtracting. Instead of floating-point, we'll use fixed-point arithmetic to ensure we still get sub-pixel accuracy.
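To make this concrete, here is the 28.4 fixed-point conversion that the code below relies on. The iround helper isn't defined in the post, so the definition shown here is just one plausible round-to-nearest implementation, an assumption on my part:
Code:
// 28.4 fixed point: 28 integer bits, 4 fractional bits. Multiplying a pixel
// coordinate by 16 and rounding keeps 1/16th-of-a-pixel precision in a plain
// 32-bit int.
inline int iround(float x)
{
    return (int)(x < 0.0f ? x - 0.5f : x + 0.5f);   // round to nearest
}

// Example: a vertex at x = 3.3 pixels becomes iround(16.0f * 3.3f) == 53,
// which represents 53 / 16 = 3.3125 pixels.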

The second cause of the gaps is the fill convention. The half-space functions can be positive or negative, but also zero; in that case the pixel is exactly on an edge. Until now, we've ignored this case and treated such pixels as outside. We could treat them as inside, using >= comparisons, but this isn't correct either: pixels on the edge shared by two triangles would then be drawn twice. While that may sound more acceptable than gaps, it causes serious artifacts for things like transparency and stenciling.

What we'll use here is a top-left fill convention, just like DirectX and OpenGL. This means that pixels on edges that are on the left side of the triangle, or on a horizontal top edge, are treated as inside the triangle. This convention ensures that no gaps occur and that no pixel is drawn twice. Let's first see how we can detect whether an edge is 'top-left'. For humans it's really easy to recognise them. Here are a few examples: top-left.png. I'm sure nobody really had a problem identifying the top-left edges; it's really simple. But give yourself a minute to think about how you would detect this in code... Don't look before you've tried it, but here's the answer: detect.png. The arrows indicate the counter-clockwise order of the vertices. Note how for left edges they all point downward! To detect this in code, we merely have to check the Dy12-Dy31 values. The only exception is the top edge: there the Dy delta is zero, but the Dx delta is positive.
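In other words, the test for one edge boils down to a couple of comparisons on its deltas. Here's a minimal sketch (the helper name is mine; the deltas follow the same convention as the code below, DX = Xa - Xb and DY = Ya - Yb for an edge from vertex a to vertex b):
Code:
// True for a top or left edge of a counter-clockwise triangle in screen space
// (y pointing down). Left edges run downward, so DY = Ya - Yb is negative;
// the flat top edge has DY == 0 and DX = Xa - Xb positive.
bool isTopLeft(int DX, int DY)
{
    return DY < 0 || (DY == 0 && DX > 0);
}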

Now that we know how to detect top-left edges, we still have to deal with them correctly so pixels on these edges are treated as inside. What we actually want to do is change the Cx# > 0 tests into Cx# >= 0 tests for those edges. Testing for top-left edges per pixel is way too slow, and handling all the cases in separate loops is way too complex. But look at what Cx# > 0 fundamentally is: it is 'true' when Cx# is 1, 2, 3, etc., and 'false' when Cx# is 0, -1, -2, etc. All we want is for it to also be true when Cx# is zero. Since the fixed-point values are integers, the solution is too simple for words: all we have to do is add 1 to Cx# beforehand for the top-left edges. This can even be done sooner, with the C1-C3 variables! Ok, let's sum this all up:
Code:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    int CY1 = C1 + DX12 * (miny << 4) - DY12 * (minx << 4);
    int CY2 = C2 + DX23 * (miny << 4) - DY23 * (minx << 4);
    int CY3 = C3 + DX31 * (miny << 4) - DY31 * (minx << 4);

    for(int y = miny; y < maxy; y++)
    {
        int CX1 = CY1;
        int CX2 = CY2;
        int CX3 = CY3;

        for(int x = minx; x < maxx; x++)
        {
            if(CX1 > 0 && CX2 > 0 && CX3 > 0)
            {
                colorBuffer[x] = 0x00FFFFFF;   // White
            }

            CX1 -= FDY12;
            CX2 -= FDY23;
            CX3 -= FDY31;
        }

        CY1 += FDX12;
        CY2 += FDX23;
        CY3 += FDX31;

        (char*&)colorBuffer += stride;
    }
}
Here's the end result: correct.png. Not only is it now entirely flawless, it's also faster than the floating-point version! The range of the fixed-point integers is big enough for a 2048x2048 color buffer, with 4 bits of sub-pixel precision. Now that this is fixed, let's focus on performance again...

This implementation is perfect for small triangles. The setup cost per triangle is really low compared to the scanline conversion algorithm. Unfortunately, it's not optimal for big triangles: about half of the pixels in the bounding rectangle will be outside the triangle, yet we still pay the price of evaluating the half-space functions for them. Furthermore, until now I've only talked about drawing one pixel at a time, while the real benefit of this approach is the possibility of parallelism. I'll show you how to take advantage of this, and what other benefits we get from it.

What we really want to do is quickly skip pixels outside the triangle; we don't want to waste any time evaluating half-space functions there. We also don't want to spend too much time inside the triangle; the really interesting part is around the edges. So what we'll do is quickly detect whether 8x8 blocks are not covered, partially covered, or fully covered. Not-covered and fully covered blocks can quickly be rejected or accepted, respectively; partially covered blocks are scanned completely. There's another reason for this block-based approach: it is very useful for visibility determination algorithms. An extra benefit is that memory accesses are more localized.

So how do we detect coverage as fast as possible? With the half-space functions it's really easy: all we have to do is evaluate them at the four corners of each block. A non-covered block has negative (or zero) half-space values at all four corners for at least one edge. A completely covered block has positive values at all four corners for all three edges. Everything else is a partially covered block. So the final implementation is:
Code:
void Rasterizer::triangle(const Vertex &v1, const Vertex &v2, const Vertex &v3)
{
    // 28.4 fixed-point coordinates
    const int Y1 = iround(16.0f * v1.y);
    const int Y2 = iround(16.0f * v2.y);
    const int Y3 = iround(16.0f * v3.y);

    const int X1 = iround(16.0f * v1.x);
    const int X2 = iround(16.0f * v2.x);
    const int X3 = iround(16.0f * v3.x);

    // Deltas
    const int DX12 = X1 - X2;
    const int DX23 = X2 - X3;
    const int DX31 = X3 - X1;

    const int DY12 = Y1 - Y2;
    const int DY23 = Y2 - Y3;
    const int DY31 = Y3 - Y1;

    // Fixed-point deltas
    const int FDX12 = DX12 << 4;
    const int FDX23 = DX23 << 4;
    const int FDX31 = DX31 << 4;

    const int FDY12 = DY12 << 4;
    const int FDY23 = DY23 << 4;
    const int FDY31 = DY31 << 4;

    // Bounding rectangle
    int minx = (min(X1, X2, X3) + 0xF) >> 4;
    int maxx = (max(X1, X2, X3) + 0xF) >> 4;
    int miny = (min(Y1, Y2, Y3) + 0xF) >> 4;
    int maxy = (max(Y1, Y2, Y3) + 0xF) >> 4;

    // Block size, standard 8x8 (must be power of two)
    const int q = 8;

    // Start in corner of 8x8 block
    minx &= ~(q - 1);
    miny &= ~(q - 1);

    (char*&)colorBuffer += miny * stride;

    // Half-edge constants
    int C1 = DY12 * X1 - DX12 * Y1;
    int C2 = DY23 * X2 - DX23 * Y2;
    int C3 = DY31 * X3 - DX31 * Y3;

    // Correct for fill convention
    if(DY12 < 0 || (DY12 == 0 && DX12 > 0)) C1++;
    if(DY23 < 0 || (DY23 == 0 && DX23 > 0)) C2++;
    if(DY31 < 0 || (DY31 == 0 && DX31 > 0)) C3++;

    // Loop through blocks
    for(int y = miny; y < maxy; y += q)
    {
        for(int x = minx; x < maxx; x += q)
        {
            // Corners of block
            int x0 = x << 4;
            int x1 = (x + q - 1) << 4;
            int y0 = y << 4;
            int y1 = (y + q - 1) << 4;

            // Evaluate half-space functions
            bool a00 = C1 + DX12 * y0 - DY12 * x0 > 0;
            bool a10 = C1 + DX12 * y0 - DY12 * x1 > 0;
            bool a01 = C1 + DX12 * y1 - DY12 * x0 > 0;
            bool a11 = C1 + DX12 * y1 - DY12 * x1 > 0;
            int a = (a00 << 0) | (a10 << 1) | (a01 << 2) | (a11 << 3);

            bool b00 = C2 + DX23 * y0 - DY23 * x0 > 0;
            bool b10 = C2 + DX23 * y0 - DY23 * x1 > 0;
            bool b01 = C2 + DX23 * y1 - DY23 * x0 > 0;
            bool b11 = C2 + DX23 * y1 - DY23 * x1 > 0;
            int b = (b00 << 0) | (b10 << 1) | (b01 << 2) | (b11 << 3);

            bool c00 = C3 + DX31 * y0 - DY31 * x0 > 0;
            bool c10 = C3 + DX31 * y0 - DY31 * x1 > 0;
            bool c01 = C3 + DX31 * y1 - DY31 * x0 > 0;
            bool c11 = C3 + DX31 * y1 - DY31 * x1 > 0;
            int c = (c00 << 0) | (c10 << 1) | (c01 << 2) | (c11 << 3);

            // Skip block when outside an edge
            if(a == 0x0 || b == 0x0 || c == 0x0) continue;

            unsigned int *buffer = colorBuffer;

            // Accept whole block when totally covered
            if(a == 0xF && b == 0xF && c == 0xF)
            {
                for(int iy = 0; iy < q; iy++)
                {
                    for(int ix = x; ix < x + q; ix++)
                    {
                        buffer[ix] = 0x00007F00;   // Green
                    }

                    (char*&)buffer += stride;
                }
            }
            else   // Partially covered block
            {
                int CY1 = C1 + DX12 * y0 - DY12 * x0;
                int CY2 = C2 + DX23 * y0 - DY23 * x0;
                int CY3 = C3 + DX31 * y0 - DY31 * x0;

                for(int iy = y; iy < y + q; iy++)
                {
                    int CX1 = CY1;
                    int CX2 = CY2;
                    int CX3 = CY3;

                    for(int ix = x; ix < x + q; ix++)
                    {
                        if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                        {
                            buffer[ix] = 0x0000007F;   // Blue
                        }

                        CX1 -= FDY12;
                        CX2 -= FDY23;
                        CX3 -= FDY31;
                    }

                    CY1 += FDX12;
                    CY2 += FDX23;
                    CY3 += FDX31;

                    (char*&)buffer += stride;
                }
            }
        }

        (char*&)colorBuffer += q * stride;
    }
}

Note that I scan partially covered blocks completely, all 8x8 pixels. This is for consistency with the other blocks, so they can be processed by the same visibility algorithm. Also, this is extremely fast when done in an unrolled loop using assembly instructions; all further optimizations to this algorithm are best done in assembly anyway. So now the rasterizer can output coverage masks for 8x8 pixel blocks, which can then easily be processed by the pixel pipeline(s). It's easy to process them as 2x2 quads, and many calculations can even be done per block. Everything taken together, there is no reason left to use the old scanline conversion algorithm.
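To make the coverage mask idea a bit more tangible, here's a hedged sketch (my own, not from swShader) of how the partially-covered-block loop could be turned into a routine that returns a 64-bit mask for one 8x8 block instead of writing pixels directly:
Code:
#include <cstdint>

// Scans one 8x8 block, starting from the half-space values CY1..CY3 at the
// block's top-left pixel, and returns a coverage mask: bit (iy * 8 + ix) is
// set when pixel (ix, iy) within the block is inside the triangle.
uint64_t blockCoverage(int CY1, int CY2, int CY3,
                       int FDX12, int FDX23, int FDX31,
                       int FDY12, int FDY23, int FDY31)
{
    uint64_t mask = 0;

    for(int iy = 0; iy < 8; iy++)
    {
        int CX1 = CY1, CX2 = CY2, CX3 = CY3;

        for(int ix = 0; ix < 8; ix++)
        {
            if(CX1 > 0 && CX2 > 0 && CX3 > 0)
                mask |= 1ull << (iy * 8 + ix);

            CX1 -= FDY12; CX2 -= FDY23; CX3 -= FDY31;
        }

        CY1 += FDX12; CY2 += FDX23; CY3 += FDX31;
    }

    return mask;   // ~0ull means the block is fully covered
}
A fully covered block then simply reports an all-ones mask, and the pixel pipeline can consume the mask one quad at a time.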

Enjoy!

Nicolas "Nick" Capens
Nick is offline   Reply With Quote
Old 09-11-2004, 12:21 PM   #2
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 738
Default

Agreed with most of it, but:

As far as I know, the Pentium and better CPUs actually work faster with 80-bit floats than with 32- and 64-bit ones. Of course, I know of few languages which make use of the extended format.

There's some other stuff I'm not yet sure about.. but the article is really good! Like everything in the _nick_ style
Mihail121 is offline   Reply With Quote
Old 09-11-2004, 12:45 PM   #3
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

Quote:
Originally Posted by Mihail121
As far as I know, the Pentium and better CPUs actually work faster with 80-bit floats than with 32- and 64-bit ones. Of course, I know of few languages which make use of the extended format.
I'm really happy with the performance of my current version, and I don't think 80-bit floating-point can beat it. Anyway, I'm going to optimize all of it with MMX, which will probably already make it twice as fast...
Nick is offline   Reply With Quote
Old 09-11-2004, 12:52 PM   #4
Noor
Senior Member
 
Join Date: Jan 2003
Location: ON, Canada
Posts: 524
Default

This could be turned into an article.
___________________________________________
"What ever happened to happily ever after?"
Noor is offline   Reply With Quote
Old 09-11-2004, 12:54 PM   #5
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 738
Default

Bah... too bad there ain't MMX/3DNow! on those damn nokias....
Mihail121 is offline   Reply With Quote
Old 09-11-2004, 01:42 PM   #6
z80
Member
 
Join Date: Sep 2004
Location: Most likely close to a bed or a keyboard
Posts: 72
Default

Good work, Nick! Very interesting!
z80 is offline   Reply With Quote
Old 09-11-2004, 02:16 PM   #7
john
Member
 
Join Date: Jan 2003
Posts: 85
Default

Impressive work Nick!
Could you provide some figures that reveal the performance (and compare them with traditional rasterizers)?
How do graphics chips currently do this? I wonder if graphics chip vendors would benefit if your algorithm was implemented in hardware.
Have you read the book by André LaMothe, Tricks of the 3D Game Programming Gurus: Advanced 3D Graphics and Rasterization? What is your recommendation for using it as a learning reference for software rendering?
john is offline   Reply With Quote
Old 09-12-2004, 12:50 AM   #8
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

Quote:
Originally Posted by john
Impressive work Nick!
I hope I haven't upped the standard too much.
Quote:
Could you provide some figures that reveals the performance (and compare them with traditional rasterizers)?
For this car model, which has only small polygons, performance is equivalent when using the 8x8 block C++ version. For rendering a skybox (big polygons), performance is a bit lower (it does beat the traditional rasterizer when using 16x16 blocks, though). But let's put these mediocre results in the right perspective. This algorithm enables parallel pixel processing. Take stencil shadows, for example: when an 8x8 block is completely inside the polygon, we can process that block with 64-bit MMX instructions in a flash. So although rasterization itself is sometimes a bit slower, the classical rasterizer cannot achieve this fillrate when processing one pixel at a time. Also, the C++ version presented here has sub-optimal code for detecting block coverage. I am very sure the MMX version will be able to do it many times faster (it has a pmaddwd instruction which does four multiplications in parallel and adds the pairs; merely two of these instructions are needed to test one edge). Either way, even if it needs some tweaking, the advantages far outweigh the disadvantages.

If you want numbers, I suggest using swShader, and plugging this code into it.
Quote:
How do graphic chips currently do this? I wonder if graphic chip vendors would benifit if your algorithm was implemented in hardware.
A lot of hardware has quad pixel pipelines. It's quite probable that they already use this algorithm, or a variant that is more suitable for hardware implementation.
Quote:
Have you read the book by André LaMothe, Tricks of the 3D Game Programming Gurus: Advanced 3D Graphics and Rasterization? What is your recommendation for using it as a learning reference for software rendering?
I bought it. It was a waste of my money. It didn't teach me anything new, and on every page I was able to either spot an error or think of a fundamentally better approach. And it's a big book, but that's only because the whole code is listed in it.

Anyway, it isn't all bad. I'm sure that for someone who has no experience with software rendering it's quite good. He writes in a 'popular' way, so it's easy to read and understand. It nicely bundles all the primary problems someone has to solve to write a software renderer. Since there are no other books like it, it is automatically the best you can get. I would have scratched the 'Advanced' from the subtitle, though...
Nick is offline   Reply With Quote
Old 09-12-2004, 03:10 AM   #9
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 738
Default

Agreed with Nick on LaMothe's book: 75% of it is code, a wrong story about lightmaps, no correct organization, no nothing!
Mihail121 is offline   Reply With Quote
Old 09-12-2004, 06:02 AM   #10
anubis
DevMaster Staff
 
anubis's Avatar
 
Join Date: Apr 2003
Location: Germany
Posts: 2,231
Default

Quote:
I bought it. It was a waste of my money. It didn't teach me anything new and on every page I was able to either spot an error or I know a fundamentally better approach. And it's a big book but that's only because the whole code is listed in it.

i was surprised how many errors (i mean real mathematical errors, not typos) it contains. this has to be confusing for somebody without any experience.
___________________________________________
Heisenberg was probably here !
anubis is offline   Reply With Quote
Old 09-12-2004, 12:00 PM   #11
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,328
Default

very nicely done, nick. i think this shows once again that premature optimisation isn't everything; optimising the algorithm is what counts.

now rasterizing works on a completely different level somehow: it's not about the individual pixel anymore, but really just the triangle.

this algorithm reminds me a lot of raytracing. the logic is the same, just with tons of restrictions on what rays you trace, but it's the same, with the same optimisation possibilities: process full blocks of rays in parallel, skip blocks that are useless, etc..

it all sounds familiar.. can't wait to see another performance boost in sw-shader
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 09-12-2004, 12:57 PM   #12
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,328
Default

hehe.. you know what? your rasterizer would rock quite a bit on this system:



i mean.. yeah! 16 parallel processors (actually 4 processors with 2 cores and 2 hyperthreads each). there, your parallelisation will get onto the next, higher level..

yehah!
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 09-12-2004, 03:51 PM   #13
SnprBoB86
Valued Member
 
Join Date: Aug 2004
Posts: 120
Default

Very interesting article. It confirmed how I thought this worked, and then extended it with optimizations that are very nice to know. Thanks.

@davepermen:
Where did you get that screen? I would be interested to know who has the cash for that many processors, but not for a monitor large enough to show all the CPU usage graphs
___________________________________________
Brandon Bloom
http://brandonbloom.name
SnprBoB86 is offline   Reply With Quote
Old 09-12-2004, 09:48 PM   #14
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,328
Default

Intel Developer Forum report at anandtech (just right-click and check the file properties to verify).

they aren't that expensive, actually. both amd and intel are starting to show off their new multicore cpus, for home and for servers. this one is actually an ITANIUM dual core at 1.7ghz per core, i think. each core has 2 hyperthreads, just as the newer p4s have, and the board is a 4x itanium board.. hence 4x2x2 hyperthreads, 4x2 cores, 4 cpus.

right now it's just a "we have it" proof demo, but it's planned to start shipping next year. ITANIUMs, Xeons, and Pentiums, all with dual cores.

AMD on its side will show off Opterons with multicore (2x, 4x), and athlon 64 too.. should be available next year as well (they showed off a first sample, too).

so yes, it'll be some years till every home end user has one. but it's important to start thinking today about designing for multi-threading: support hyperthreading today, multicore tomorrow, and multi-cpu for the highest-end systems in the future.

hehe
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 09-13-2004, 11:32 AM   #15
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

I can't wait for a 4 GHz dual-core 64-bit (double the number of registers), Hyper-Threading CPU with a 1200 MHz dual-channel FSB and a built-in memory controller.
Nick is offline   Reply With Quote
Old 09-13-2004, 11:56 AM   #16
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 738
Default

And when I think that only 12 years have passed since the graphical 'wonders' of Wolf3D...
Mihail121 is offline   Reply With Quote
Old 09-13-2004, 01:50 PM   #17
davepermen
Senior Member
 
davepermen's Avatar
 
Join Date: Jan 2003
Location: Switzerland
Posts: 1,328
Default

Quote:
Originally Posted by Nick
I can't wait for a 4 GHz dual-core 64-bit (double amount of registers), Hyper-Threading CPU with 1200 MHz dual-channel FSB and build-in memory controller.
[snapback]11373[/snapback]

hehe..

will you one day consider writing some raytracing stuff? i mean, your renderer is moving more and more in the right direction anyway

you'll at least have to write a protocol for how each and every 'tile' gets submitted to the right cpu.


btw, if you're quick, you'll be the first one with an api that wraps well onto the ps3!
___________________________________________
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
davepermen is offline   Reply With Quote
Old 09-13-2004, 02:09 PM   #18
john
Member
 
Join Date: Jan 2003
Posts: 85
Default

Not only does Nick's swShader beat quake's software rendering and others, it beats the crap out of DirectX's REF rasterizer. There's no point in comparing against quake's s/w anyway. Good job Nick! You're going to be a millionaire someday. You're pretty much guaranteed a job at any company you apply to. What's the point of life after that? hehe (j/k) As far as I know, there isn't a single commercial or free product which emulates shaders except for swShader (isn't that right?).

I guess there wasn't any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.
john is offline   Reply With Quote
Old 09-13-2004, 02:29 PM   #19
anubis
DevMaster Staff
 
anubis's Avatar
 
Join Date: Apr 2003
Location: Germany
Posts: 2,231
Default

Quote:
You're pretty much guaranteed a job at any company you apply to. What's the point of life after that?

american ?
___________________________________________
Heisenberg was probably here !
anubis is offline   Reply With Quote
Old 09-13-2004, 04:37 PM   #20
Dia Kharrat
DevMaster Staff
 
Join Date: Jan 2003
Posts: 1,185
Default

For those who haven't noticed, the code spotlight submissions are categorized now
Dia Kharrat is offline   Reply With Quote
Old 09-13-2004, 04:56 PM   #21
anubis
DevMaster Staff
 
anubis's Avatar
 
Join Date: Apr 2003
Location: Germany
Posts: 2,231
Default

nice... maybe the categories should be separated more clearly?
___________________________________________
Heisenberg was probably here !
anubis is offline   Reply With Quote
Old 09-14-2004, 01:20 AM   #22
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

Quote:
Originally Posted by john
Not only does Nick's swShader beat quake's software rendering and others, it beats the crap out of DirectX's REF rasterizer. There's no point in comparing against quake's s/w anyway. Good job Nick!
Thanks for the compliment, but I would nuance this a little. It doesn't beat the Quake I renderer at all. In its context (a Pentium 100 MHz), the Quake I renderer is brilliant. Before it, there was nothing that came close. It did everything right: perspective correction, fill convention, prestepping, mipmapping, lightmapping. And all that at 4-6 clock cycles per pixel (inner loop)! My renderer has ten times more features, but even with GHz processors, MMX and SSE I'm still stuck at 20 FPS. Well, in its own context that ain't bad at all... But anyway, it's Quake I's renderer that really inspired me, and at some points it still does.
Quote:
You're going to be a millionare someday. You're pretty much garenteed a job at any company you apply for. What's the point of life after that? hehe (j/k)
Money is not everything. Although I could certainly use some more to keep paying for my education, there's more to life than that. In the end it's all about living a happy life, and making the people around you happy is very important to feeling good yourself. So I would never do a job just for the money; I would do it if I feel good about it and my career doesn't make me neglect other people. Combining a good career and happiness is hard, but I'm trying...
Quote:
As far as I know, there isn't a single commercial or free product which emulates shaders except for swShader (isn't that right?).
Mesa 3D, the OpenGL 'reference rasterizer', supports them too. I haven't tested it, but I've heard it's not too bad when using only a few simple shaders. But swShader is probably the only product whose shaders can be called real-time and efficient. Last week I was able to speed up my Per-Pixel Lighting demo from 34 to 42 FPS on my Pentium M 1.4 GHz laptop; now I'm aiming for 50 FPS. Although I'm very excited about every improvement, I do realize that the actual usefulness of it all is very low. You can't really run a semi-modern game with swShader. Even the cheapest integrated graphics chip from three years ago beats me at fillrate. It's an uphill battle I'll never win.
Quote:
I guess there wasn't any incentive for Microsoft to optimize their software renderer since they expect hardware vendors to implement everything in hardware, and the software part is only used as a reference model to compare against.
They did have some attempts at software rendering, like the RGB Emulation and Ramp Emulation. I don't know if they ever evolved further, but I think they quickly realized that the days of 2D-only graphics cards are totally over. A computer with a graphics card that can't draw a triangle with a single texture is really hard to find, and that's almost all RGB and Ramp could do.
Nick is offline   Reply With Quote
Old 09-14-2004, 06:39 AM   #23
Mihail121
Senior Member
 
Mihail121's Avatar
 
Join Date: Jan 2003
Posts: 738
Default

Hey John, check out Muli3D, it also supports shaders in an easy and understandable way!
Mihail121 is offline   Reply With Quote
Old 09-14-2004, 09:27 AM   #24
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

Quote:
Originally Posted by Mihail121
Hey John, check out Muli3D, it also supports shaders in an easy and understandable way!
It's nice to see that Muli3D has evolved! Just a couple of months ago it had sub-pixel precision issues, and now it shows off the most advanced features.

I wouldn't really say it has shader support, though. You can derive a class from a PixelShader class and implement the execute() function, which allows you to simply write the 'shader' in C++. In that sense, every software renderer would have shader support even before a single line of it is implemented.

Anyway, it seems like it could become an open-source 'reference rasterizer'...
Nick is offline   Reply With Quote
Old 09-16-2004, 07:28 AM   #25
Nick
Senior Member
 
Join Date: Aug 2004
Location: Ghent, Belgium
Posts: 837
Default

The article is now also available on the swShader site:

swShader Documentation

If there's any part that can be improved, like things that are unclear and need more explanation, please let me know!
Nick is offline   Reply With Quote