1. Can I create/destroy a vertex in a vertex shader?
No, you cannot. You can modify the components of a vertex (position, color, texcoords, etc.), but you cannot destroy or create vertices. However, note that you can play tricks like moving vertices off screen, or modifying the per-vertex alpha value, to effectively remove vertices.
2. Can I access other vertices from a vertex shader?
No, for the most part, you cannot. Vertex shaders work on one vertex at a time, and only one vertex. Vertices cannot communicate with each other in any way and cannot write to constant memory. So, effectively, you can think of it as having access to the vertex buffer, but not the index buffer in DX8. Note that since you have up to 16 vertex attributes you can send to the vertex shader, you could send the positions of other vertices along with every vertex.
3. How can I implement branching?
There are no branching instructions in the vertex shader instruction set, but you can approximate branching by computing both outcomes of an if-then-else and selecting between them using the output of an SGE instruction. For example:
; compute r0 = (r1 >= r2) ? r3 : r4
SGE r0, r1, r2 ; one if (r1 >= r2) holds, zero otherwise
ADD r1, r3, -r4 ; r1 = r3 - r4
MAD r0, r0, r1, r4 ; r0 = r0*(r3-r4) + r4 = r0*r3 + (1-r0)*r4
See our whitepaper “Where Is That Instruction?”, which covers this vertex shader technique and more.
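The arithmetic behind this trick can be sketched in C. This is a hypothetical scalar stand-in for the per-component register operations above, not shader code:

```c
/* Scalar sketch of the SGE/MAD selection trick: compute both outcomes,
   then blend with a 0/1 selector. In the shader this runs per component. */
float select_sge(float r1, float r2, float r3, float r4)
{
    float s = (r1 >= r2) ? 1.0f : 0.0f;  /* SGE r0, r1, r2 */
    float d = r3 - r4;                   /* ADD r1, r3, -r4 */
    return s * d + r4;                   /* MAD r0, r0, r1, r4
                                            = s*r3 + (1-s)*r4 */
}
```

For example, select_sge(2, 1, 10, 20) yields 10 (the condition holds), while select_sge(0, 1, 10, 20) yields 20.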
4. How do I do matrix palette skinning with 200 bones with only 96 constants?
The constant memory has enough space for approximately 20-30 bones, depending on how much of it you have available for bone data. If your models have more bones than this, simply split up the mesh into parts with fewer bone influences and render each separately. The D3DX function ConvertToIndexedBlendedMesh can do this for you. If your bone matrices are affine, you can just store 4x3 matrices in constant memory, thereby saving 25% of the space. Also, if the upper-left 3x3 portion of your matrices is orthonormal, you don’t need to store the inverse-transpose of each matrix for transforming your normals, thus doubling the number of matrices that can fit in the constant memory.
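As a back-of-the-envelope check on those numbers, here is a small C sketch. The 96-register budget comes from the question; the 8 registers reserved for non-bone data (view-projection matrix, light parameters, etc.) are an assumed figure for illustration:

```c
/* Each vertex shader constant register holds one float4; 96 are available.
   A full 4x4 bone matrix takes 4 registers; an affine bone stored as
   4x3 (three float4 rows) takes only 3 -- the 25% saving mentioned above. */
enum { VS_CONSTANT_REGISTERS = 96 };

int max_bones(int registers_reserved_for_other_data, int registers_per_bone)
{
    return (VS_CONSTANT_REGISTERS - registers_reserved_for_other_data)
           / registers_per_bone;
}
```

With 8 registers reserved, this gives max_bones(8, 4) == 22 full 4x4 bones versus max_bones(8, 3) == 29 bones stored as 4x3, consistent with the 20-30 range quoted above.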
5. How come w-buffering isn’t working for me?
When using vertex shaders with w-buffering, make sure you set your projection matrix in the traditional way (using SetTransform()), otherwise w-buffering won’t work correctly.
6. Can I combine vertex shaders with the fixed function pipeline?
Yes, but each DrawPrimitive() call can use either a vertex shader or the fixed function pipeline. For example, you can’t have the vertex shader do the lighting while the fixed function pipe does the transform. If you choose to use vertex shaders, the vertex shader must complete all stages of transform/lighting, including transformation to homogeneous clip space, output of color from lighting calculation, generation of texture coordinates, and setting of the fog factor. Between calls to DrawPrimitive() you can switch to and from the fixed function pipeline using SetVertexShader(). Note that the output of the fixed function pipeline and the vertex shader pipeline are not guaranteed to be bit-for-bit identical given identical inputs, so if this is necessary (for z-values when doing multi-pass, for example) you should stick to one pipeline or the other.
7. Does GeForce3 support fixed-function matrix palette skinning?
No, it does not. We don’t recommend the use of this feature as the software path in DX8 is sub-optimal in its use of vertex buffers and performs more poorly than software emulation of vertex shaders. Vertex shaders are the preferred route for DirectX8 matrix palette skinning.
8. Do I have to write a vertex shader for every combination of light/skinning/texgen that I use?
Probably, yes. Strictly speaking, you don’t have to write multiple vertex shaders: you could, for example, write one shader that handles four lights per vertex and zero out the contribution from unused lights. The instructions for the extra lights are still executed, though, even when they have no effect, and wasting instructions this way can hurt performance in geometry-limited situations. To help alleviate this problem, we’ve developed the NVLink tool, which automatically combines any number of vertex shader fragments for you at runtime or as a pre-process. This tool is available on our developer website.
9. How many pixel shader operations does GeForce3 support?
GeForce3 supports four texture addressing operations followed by up to eight color/alpha blending operations. Note that you can co-issue color and alpha instructions using a ‘+’, for a total of 16 blending instructions max.
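A co-issued pair looks like this in ps.1.1 assembly (an illustrative fragment, not a complete shader; the ‘+’-prefixed instruction executes in the alpha pipe during the same cycle as the preceding color instruction):

mul r0.rgb, t0, v0 ; color pipe: modulate texture 0 by diffuse
+mov r0.a, t1 ; alpha pipe: co-issued in the same cycle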
10. Do I have to add in my own fog in the pixel shader? What about specular?
Fog is automatically added after the pixel shader and is handled by the familiar renderstates. Specular is NOT added for you after the pixel shader. You must add the specular value within the pixel shader and output only r0 as the color result.
11. Can I do a blending operation followed by a texture addressing operation?
No, the texture addressing operations must occur before any color blending operations, due to the structure of the pipeline: texture coordinates are computed and the texture fetches performed first, and the resulting colors then feed the blending stage.
12. What’s the difference between texture addressing operations and color blending operations?
Texture addressing operations (tex, texcoord, texm3x2tex, etc.) control how a color is looked up from a texture. They allow modification of the texture coordinates and are performed in 32-bit floating point. The color blending operations (dp3, mul, mad, lrp, etc., called register combiners in OpenGL) blend together the colors looked up by the texture addressing operations, along with iterated colors (v0 and v1) and constant colors (c0-c7). They operate in a signed 9-bit format (1 sign bit plus 8 bits of magnitude for each of R, G, B, and A).
13. Does the GeForce3 support DX6-style EMBM?
Yes. However, DX6-style EMBM is not a general robust solution for bump-mapping and only works well for planar models. It can be useful in certain cases and can be used for effects other than bump-mapping. For example, you can simulate a heat haze effect by applying EMBM to a scene which you have previously rendered to a texture. For general reflective bump-mapping, use the true reflective bump-mapping feature in GeForce3 (the texm3x3vspec pixel shader addressing instruction).
14. Are there any limitations with EMBM on GeForce3?
GeForce3 doesn’t do the divide for projection after doing the ds/dt offset and rotation for EMBM. You can, however, project before the ds/dt offset. In other words, the “bump map” can be projective, but the environment map cannot.
15. Does GeForce2 support pixel shaders?
GeForce 256 and GeForce2 don’t have hardware support for pixel shaders; only GeForce3 does. With GeForce2 and earlier video cards, you need to use the traditional SetTextureStageState() interface in Direct3D. Note that GeForce 256 and GeForce2 can take advantage of the new triadic operations (like D3DTOP_MULTIPLYADD) and the additional temporary register exposed in the DirectX8 fixed function pipeline.
Full scene anti-aliasing
16. How is multi-sampling different than super-sampling?
Multi-sampling as supported by GeForce3 has a number of benefits over traditional super-sampling: it generates the extra samples right before writing to the frame-buffer, thus eliminating the fillrate cost; it only fetches texels once per fragment, as opposed to once per sample, thus somewhat reducing memory bandwidth cost; and it supports more optimal placement of samples, resulting in better quality anti-aliasing.
17. What is the mapping between DX8 multi-sample modes and the GeForce3 hardware?
D3DMULTISAMPLE_2_SAMPLES == 2 sample AA
D3DMULTISAMPLE_3_SAMPLES == Quincunx AA
D3DMULTISAMPLE_4_SAMPLES == 4 sample AA
18. What is Quincunx anti-aliasing?
Quincunx AA is a special mode in GeForce3 where two samples are generated per fragment, but 5 samples are averaged together when filtering down, thus increasing the number of effective samples per pixel and giving higher quality than 2 sample AA at only slightly greater cost. The term “Quincunx” comes from the five-spot pattern on the side of a die: four points in a square with one at the center.
Higher Order Surfaces
19. What types of higher order surfaces are supported by GeForce3?
B-Splines, Bezier patches, and Catmull-Rom splines are all supported in hardware.
20. Are these tessellated by the driver or by the hardware?
GeForce3 has hardware support for curved surface tessellation. It is not done in the driver.
21. Is it possible to modify control points using vertex shaders?
No. Surfaces are tessellated before getting to the vertex shader, so control points are not available at that point in the pipeline. However, you can use vertex shaders to operate on the vertices generated by the hardware tessellator.
22. Does GeForce3 support 32-bit color/16-bit Z? How about a 32-bit Z-buffer (no stencil), or 16-bit color/24-bit Z?
GeForce3 does support mixed-mode rendering (32-bit color/16-bit Z, 16-bit color/24-bit Z + 8-bit stencil) but does not support 32-bit depth values. See render-to-texture under the performance section for a performance pitfall with mixed-mode rendering.
23. How many textures does GeForce3 support?
GeForce3 supports four simultaneous textures with four independent texture coordinates, corresponding to the four available texture addressing operations.
Performance
24. How does performance scale with the number of instructions?
Vertex Shader performance scales linearly with the number of instructions. The general rule of thumb is to assume one cycle per instruction.
25. What are the costs associated with individual instructions?
All instructions cost the same on GeForce3, one cycle. Negations, swizzling, and write masking are completely free and should be used as much as possible to reduce the total number of instructions.
26. How does vertex shader performance compare with the fixed-function pipeline?
The vertex shader pipeline runs at comparable speed to the fixed-function pipeline. The fixed-function pipeline is slightly faster in some cases, but only for apps that are geometry limited or do a ton of TnL mode switches. Switching to the fixed pipe is free, but uploading vertex programs has a small cost. Note that with vertex shaders, there is the opportunity to “cut corners”. For example, you can light in model space instead of eye space, or skip certain vector normalizations, thus saving instructions. If you can make optimizations such as these, vertex shaders can potentially be much faster than the fixed function pipeline.
27. Do vertex shaders run in hardware on GeForce2?
No, GeForce2 doesn’t support vertex shaders in hardware. The DirectX8 runtime and our OpenGL driver have fast optimized software paths for vertex shaders, so performance in software may be acceptable in many cases.
28. Do additional textures cost performance?
Yes. Two textures (1D, 2D, or cubemap) run at full fillrate on GeForce3, three or four textures run at half fillrate.
29. What’s the performance of the color blending operations (dp3, mul, etc.)?
Number of instructions    Percent of max fillrate (minimum)
1-2                       100%
3-4                       50%
5-6                       33%
7-8                       25%
Note that these figures are minimums for a given number of instructions; actual performance may well be better.
30. Wow! Additional instructions are costly! Are they usable?
Yes. Note that these only influence fill-rate, which is not often the bottleneck in today’s games. The bottleneck is usually memory bandwidth (or CPU), so dropping to 25% fill is not going to drop your frame rate to 25%. Also, consider that multi-sampling saves fillrate, so with 2-sample AA you effectively double the numbers above. Using many instructions or textures may also allow you to reduce the number of passes, which is always a performance win, regardless of the fillrate cost. Finally, note that if you’re using 4 textures and 8 instructions, the performance numbers are not cumulative. Since texture fetching and color blending are arranged in a pipeline, your fillrate will only be as slow as the slowest of the two.
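The “slowest of the two” point can be sketched as follows (hypothetical helper; rates are expressed as fractions of max fillrate):

```c
/* Texture fetch and color blending run as pipeline stages, so overall
   throughput is limited by the slower stage, not by their product. */
float effective_fillrate(float texture_stage_rate, float blend_stage_rate)
{
    return texture_stage_rate < blend_stage_rate ? texture_stage_rate
                                                 : blend_stage_rate;
}
```

So four textures (half fillrate) combined with eight combiner instructions (quarter fillrate) give effective_fillrate(0.5f, 0.25f) == 0.25f, not 0.125f.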
Frame Buffer Optimizations
31. Does it matter what order I draw my polygons on screen?
Yes, GeForce3 includes z-buffer optimizations whose efficiency can be improved by sorting from front to back. This should only be a rough sort at the object level, not a per-polygon exact sort.
32. Should I clear the frame buffer even if I don’t have to?
Yes and no. You should always call Clear() on the z-buffer and stencil, even if you don’t have to, as it improves the effectiveness of the GeForce3 z-buffer optimizations and is extremely fast. However, you should always avoid clearing the color buffer if it’s unnecessary as this wastes memory bandwidth. See the next question about clearing stencil.
33. Are there any performance issues with stencil I should be aware of?
Yes (why else would this be a question? ;-). If you’ve requested a backbuffer format with stencil (D24S8) and you’re not using the stencil, make sure you clear it anyway when you clear the z-buffer. Otherwise our driver must assume there’s something in stencil that you need and the clear will not be as fast. If you request a format without stencil (any 16-bit format, or D24X8) this is not necessary.
34. Are triangle strips faster than triangle lists? What about fans?
Indexed triangle strips are faster than triangle lists on GeForce3. You can use our NVTriStrip tool to generate vertex-cache-friendly strips for you. Fans are as fast as strips, but suffer from the fact that batching is impossible and you must spread triangles over many DrawPrimitive() calls. The overhead introduced by such small batches far outweighs any possible speed gains due to using fans, so fans are not recommended. If your data is not strippable for some reason, properly batched indexed triangle lists are faster than fans.
35. Is there a vertex cache on the GeForce3?
Yes, the GeForce3 has a 24-entry vertex cache (as opposed to the 16-entry cache on the GeForce and GeForce2). Note that the effective size of the vertex cache is closer to 18 entries, because up to 6 vertices can be in flight at one time. A hit in this vertex cache saves both TnL and vertex fetch bandwidth.
36. What vertex size is optimal? Should vertices in a vertex buffer be padded to a specific size?
In general, the smallest vertex size possible should be used, as this helps AGP bandwidth and your memory footprint. Padding vertices to the next multiple of 32-bytes (so, a 40 byte vertex gets padded to 64 bytes) can slightly improve performance in the case of scattered memory accesses, but roughly sequential access will provide better performance and eliminate the need to pad. In general, padding is not recommended. Note that if padding allows you to use fewer vertex buffers by coalescing VBs with differing FVFs into one larger VB, padding may improve performance by reducing VB switching.
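The padding rule described above can be sketched as a small C helper (hypothetical function name):

```c
/* Round a vertex size up to the next multiple of 32 bytes,
   e.g. a 40-byte vertex pads to 64 bytes. */
unsigned pad_vertex_size(unsigned size_in_bytes)
{
    return (size_in_bytes + 31u) & ~31u;
}
```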
37. Does it matter how vertices are arranged in a vertex buffer?
Yes, storing your vertices in your vertex buffer in roughly the order in which they’re accessed can improve performance, especially for vertices that are not a multiple of 32 bytes in size. And wildly scattered accesses will cause serious slowdowns due to having to constantly open and close DRAM pages. The NvTriStrip program on our website includes code to do this sorting optimization.
38. Are there any performance issues with curved surfaces that I should be aware of?
Yes. First, it is important to note that surfaces with static control points are much faster than surfaces with dynamic control points. Next, you need to achieve a certain minimum level of tessellation (at least 10-15) to get maximum efficiency out of the hardware. Finally, there is a performance “cliff” to be aware of: if your (TessellationLevel modulo 17) < 10 or so, performance will not be optimal.
39. Can I use point sprites in the fixed-function pipeline?
Yes, but GeForce3 doesn’t support specifying a per-vertex point size when using the fixed-function pipeline. However, you can specify the global point size. If you wish to use per-vertex point sizes, you must use vertex shaders.
40. What factors affect the performance of render-to-texture operations?
Rendering to mixed-mode render targets used as textures (32-bit color/16-bit Z or vice-versa) is a slow operation on GeForce3 (non mixed-mode render targets run at full speed). Make sure your color and z-buffers are the same format for all your texture render targets if render-to-texture performance is significant to your application. This means the same size (both horizontally and vertically) and the same bit depth. Also, you should avoid subrect clears on render targets.