A Tale of a Bug

As we march towards our 4.0 release with a strong focus on next-gen materials, we are in an incredibly busy period, and we were all pretty confident in our new PBR implementation. Nevertheless, this story started a couple of weeks ago with this PBR bug report from our beloved community on the forum. In a nutshell, it looked like our new PBR Material (into which I poured a big chunk of love and time during this release) was wrong.

Above, you can see the differences between the engine used by the reporter of the bug (the Reference Render) and Babylon.js. There are plenty of differences which can all be explained easily, but one in particular stood out: the tops of the rough spheres look way too different.

To make it even more interesting, this all started on a Friday evening, and thanks to Murphy’s law, it was my daughter’s birthday over this weekend. I guess it was at that precise moment when I started to experience the 7 stages of grief.

1. Shock: Oh no!!! Not now, pretty please…
2. Denial: This cannot be wrong! I am sure of what we are doing! The other engine is obviously wrong.
3. Anger: I spent a huge amount of energy ranting to David Catuhe about the issue. Many thanks to him for bearing with me. ;-) He knows me well enough by now to know that this helps me keep a bit of sanity.
4. Bargaining: I need to do my due diligence to investigate this.

I first tried to render the worst case scenario in Unity both with and without HDRP:

The diffuse environment lighting is definitely more dynamic here than in Babylon.js, but fortunately not as far off as the reference viewer would make us think. The only thing at play here is the diffuse part of the Image Based Lighting (the sphere is fully dielectric and rough). This seemed to prove that there is indeed an issue in this area.

Just to confirm the issue, I went on with Filament and did another render of the model. Fortunately I am a JavaScript developer, so C++ is easy (troll level of this article +1000). But jokes aside, Filament is an amazing piece of code with great documentation that I often rely on when in doubt. Thanks to Romain Guy for this. It ended up reproducing the issue as well: like in Unity, the spheres appear less flat and brighter on the top.

With the issue confirmed by three different targets, I asked my friend and colleague @PatrickCRyan to create a ray traced version of the render in order to define the ground truth. As with the other engines, notice the lack of a bright area at the top of the rough spheres in Babylon.js.

5. Depression: Whyyyyyyyyyyy? oh whyyyyyyyyyyy?
6. Testing: Time to dig in!

And so I began to review all of our IBL math and setup. I spent all day Saturday learning and reviewing the spherical harmonics theory used for our environment lighting. This is an incredibly rich area and the math behind it is amazing. In case you would like to know more about it, here is the list of articles that helped me a lot in understanding this space:

Turning to our code, I found what looked like a compilation of magic numbers:

I started by replacing all of them with their source computations to ensure nothing could be resulting from a typo in there:
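As a minimal sketch of that idea (not the actual Babylon.js source), the usual constants of the nine-term real spherical harmonics basis can be written as the formulas they come from. The names `shBasisConstants` and `evalShBasis` are hypothetical, and the sign and ordering conventions shown here are only one common choice; engines differ.

```typescript
// Normalization constants of the first nine real spherical harmonics basis
// functions (bands l = 0, 1, 2), written as formulas instead of literals.
// Names are hypothetical and only meant for illustration.
const shBasisConstants = {
    l0: Math.sqrt(1 / (4 * Math.PI)),        // Y(0,0), about 0.282095
    l1: Math.sqrt(3 / (4 * Math.PI)),        // Y(1,-1), Y(1,0), Y(1,1), about 0.488603
    l2Cross: Math.sqrt(15 / (4 * Math.PI)),  // Y(2,-2), Y(2,-1), Y(2,1), about 1.092548
    l2Zonal: Math.sqrt(5 / (16 * Math.PI)),  // Y(2,0), about 0.315392
    l2Diff: Math.sqrt(15 / (16 * Math.PI)),  // Y(2,2), about 0.546274
};

type Direction = { x: number; y: number; z: number };

// Evaluates the nine basis functions for a unit direction, ordered by band
// and then by m from -l to +l (an assumed ordering).
function evalShBasis(d: Direction): number[] {
    const c = shBasisConstants;
    return [
        c.l0,
        c.l1 * d.y,
        c.l1 * d.z,
        c.l1 * d.x,
        c.l2Cross * d.x * d.y,
        c.l2Cross * d.y * d.z,
        c.l2Zonal * (3 * d.z * d.z - 1),
        c.l2Cross * d.x * d.z,
        c.l2Diff * (d.x * d.x - d.y * d.y),
    ];
}
```

Spelling the constants out this way makes a mistyped digit impossible to hide: each value can be checked against its closed form rather than against another copy of the same literal.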

Unfortunately, this did not impact the visual result; it simply proved that our code was OK, and yet there was still an issue. Armed with a stronger knowledge of this area, I did more testing. I ended up in a part of the code which is not defined in the graphics papers; we had developed it internally to speed up the shader computation. It relied on spherical polynomials instead of harmonics and I had always taken this part for granted.

I did a simple test relying directly on our spherical harmonics implementation, bypassing our polynomial optimization trick.
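To give an idea of what evaluating the harmonics directly means, here is a rough single-channel sketch, assuming the environment radiance has already been projected onto nine spherical harmonics coefficients. The function names, the coefficient ordering, and the sign convention are assumptions for illustration and not the Babylon.js implementation; a real shader would do this per RGB channel.

```typescript
type Vec3 = { x: number; y: number; z: number };

// Real spherical harmonics basis for bands 0..2 at a unit direction
// (assumed ordering: band by band, m from -l to +l).
function shBasis(n: Vec3): number[] {
    const k0 = Math.sqrt(1 / (4 * Math.PI));
    const k1 = Math.sqrt(3 / (4 * Math.PI));
    const k2a = Math.sqrt(15 / (4 * Math.PI));
    const k2b = Math.sqrt(5 / (16 * Math.PI));
    const k2c = Math.sqrt(15 / (16 * Math.PI));
    return [
        k0,
        k1 * n.y, k1 * n.z, k1 * n.x,
        k2a * n.x * n.y, k2a * n.y * n.z,
        k2b * (3 * n.z * n.z - 1),
        k2a * n.x * n.z,
        k2c * (n.x * n.x - n.y * n.y),
    ];
}

// Convolution with the clamped cosine lobe scales each band by a fixed
// factor: pi for band 0, 2*pi/3 for band 1, pi/4 for band 2.
const cosineLobe = [Math.PI, (2 * Math.PI) / 3, Math.PI / 4];
const bandOf = [0, 1, 1, 1, 2, 2, 2, 2, 2];

// Diffuse irradiance arriving at a surface with the given unit normal.
function irradianceFromSH(radianceSH: number[], normal: Vec3): number {
    const basis = shBasis(normal);
    let irradiance = 0;
    for (let i = 0; i < 9; i++) {
        irradiance += cosineLobe[bandOf[i]] * radianceSH[i] * basis[i];
    }
    return irradiance;
}

// Usage: an environment that is brighter towards +z lights an
// upward-facing normal more than a downward-facing one.
const skyLikeSH = [Math.sqrt(4 * Math.PI), 0, 1, 0, 0, 0, 0, 0, 0];
const topOfSphere = irradianceFromSH(skyLikeSH, { x: 0, y: 0, z: 1 });
const bottomOfSphere = irradianceFromSH(skyLikeSH, { x: 0, y: 0, z: -1 });
```

A handy sanity check for this kind of code is a uniform environment of radiance 1: only the first coefficient is non-zero (it equals the square root of 4π), and the irradiance comes out as π for every normal.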

That was it: it all works!!!

So we now have two modes in Babylon to allow either an accurate representation or a fast rendering. You can try them on your own in the inspector by toggling on and off the “Spherical Harmonics” button.

To be honest, none of our automated visual tests were impacted by that change, confirming the pragmatism of using the faster compute in most cases.

The bug was closed and I felt awesome about it.

BUT…
Wait…

A new bug report arose. This basically stated that the values we submitted to the GPU were the wrong ones. YUP, A TINY TYPO and two of the 9 terms were always wrong in our previous fast compute mode.

This might have been the issue since the beginning...

Despite being a bit more knowledgeable on the subject, I felt two combined feelings after reading the forum post: an insane amount of stupidity and a big pinch of shame. I leveled up my internal WTF stats to +1000. At this point, I was feeling so guilty for not having spent as much time with my daughters that weekend as I wanted to.

I happily fixed the issue and, for my own sanity, ran a new set of tests. The results showed improvement from the first bug but were still badly off compared to our ground truth target, so all my effort wasn’t in vain and I felt a lot better about it. It simply happens that our polynomial approximation is less precise than the harmonics model…but this explains the speed increase.

Conclusion:

  • The community is the most precious part of development. Without them we would have been rendering incorrectly for a long time.
  • I enjoyed every part of the emotional roller coaster and learned a lot by doing a proper in-depth analysis.
  • Every bug is a nice learning opportunity.
  • Our PBR can now live happily ever after!!!
  • I gave my daughters a big hug and celebrated the birthday for days!!!