DISCLAIMER: This is the first blog post on this website, so don't expect anything too coherent. Also, most of the data on past builds comes from digging through old commits.
Around a year ago I started work on VSC, a random little software rendering engine that I had no idea would become as substantial as it is now. Over the course of the year, working on and off, I added many features -- both rasterization and ray tracing, subdivision methods, model loading, and much more. But the one thing that bogged it down the most was optimization, and my endless chasing of micro-optimizations and other seemingly trite matters.
You see, VSC is intended to run on multiple platforms, requiring the user (aside from programmatically setting up the scenes) to connect the backend to a graphics API for display. I personally use this with two front-ends: one is a simple Python script I wrote to render buffer data output by the engine, the other is an ESP32 connected to a couple of 64x32 LED panels. While other people might use things like Raspberry Pis and Teensy boards (and there have been other rendering engines written for those systems), the ESP32 was originally chosen for its price and its functionality. Plus, the LAN support also lured me in (unfortunately). What's not so good about it is its relatively weak computational power compared to the aforementioned boards. I think it was around that time that I was also reading about micro-optimizations, so I just kinda milked the hell out of it.
Most of the tests were run on a rasterized test scene (AnimShader) with several thousand triangles, morphed meshes, and (later down the line) camera animations and simple fragment shaders. The scene consisted of two icospheres with 5120 triangles each, a morphed mesh with 12 triangles, a ground plane with two (or at most four) triangles, and a custom humanoid mesh sitting on top. All of this was defined programmatically--fragment shaders, for instance, are function pointers that transform fragments, and the custom models were loaded into std::vectors of data that could be parsed into Mesh objects. At the start, the entire animation of 512 by 512 pixels and 48 distinct frames took around 2.5 seconds to render (around three with -pg). This was a good starting point. By the end of all this, that would come down to around 1.6 seconds (without -pg; around 2 seconds with -pg).
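To make the "shaders are function pointers" idea concrete, here's a minimal sketch of the pattern. The names (Fragment, FragShader, depth_fade, shade_all) are illustrative stand-ins, not VSC's actual types:

```cpp
#include <cassert>
#include <vector>

// Hypothetical fragment type; a real engine's fragment carries more data.
struct Fragment {
    float r, g, b;   // color
    float z;         // depth
};

// A fragment shader is just a function pointer that transforms a fragment.
using FragShader = Fragment (*)(const Fragment&);

// Example shader: fade color with depth (a simple fog-like effect).
Fragment depth_fade(const Fragment& in) {
    float f = 1.0f / (1.0f + in.z);
    return Fragment{in.r * f, in.g * f, in.b * f, in.z};
}

// The rasterizer would invoke the shader per fragment before writing the buffer.
void shade_all(std::vector<Fragment>& frags, FragShader shader) {
    for (auto& f : frags) f = shader(f);
}
```

The appeal of this design is that a scene can swap shaders at runtime with zero allocation and no virtual dispatch, which matters on something as constrained as an ESP32.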
Many optimizations and tests were done; each had its benefits, and some were more trivial than others. The first thing I did (according to the commit history) was to change the 3x3 linear system solver from plain Gaussian elimination to trying Cramer's rule first for invertible matrices (bringing the runtime down to around 2.1s). The next, and possibly most impactful, change was the removal of many copy-constructor calls inside methods and elsewhere. But I did not stop there. After all, our goal was 1.6s.
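For the curious, here's roughly what a Cramer's-rule 3x3 solve looks like. This is a minimal sketch under my own naming and with an arbitrary singularity threshold -- VSC's actual solver differs, and in the degenerate case it would fall back to Gaussian elimination:

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // row-major

// Determinant by cofactor expansion along the first row.
double det3(const Mat3& m) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// Solve A x = b by Cramer's rule: x[i] = det(A with column i replaced by b) / det(A).
// Returns false when A is (near) singular, where the caller would fall back
// to Gaussian elimination instead.
bool solve_cramer(const Mat3& A, const Vec3& b, Vec3& x) {
    double d = det3(A);
    if (std::fabs(d) < 1e-12) return false;
    for (int col = 0; col < 3; ++col) {
        Mat3 Ai = A;
        for (int row = 0; row < 3; ++row) Ai[row][col] = b[row];
        x[col] = det3(Ai) / d;
    }
    return true;
}
```

For a fixed 3x3 system this is a short, branch-light sequence of multiplies and adds, which is why it can beat a general elimination loop with pivoting.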
Most of the early-game optimizations -- aside from the copy-constructor removals, a small scanline algorithm for filling triangles, and a change to the file output that reduced the number of file writes -- consisted of mitigating repetitive work. According to gprof at the time, a huge bottleneck was DrawTriFrag, the function that draws a triangle fragment to the buffer. That method has a number of places where we need to interpolate on barycentric coordinates. Normally you would use TriangleF::interp, which recomputed the barycentric coordinates of the pixel position on every call. So one thing I did to reduce the repetitive work was to add a method, interp_given_bary, to the various Triangle classes, which takes a precomputed barycentric position and skips the solve each time. The other thing I did (in the same method) was to precompute certain reciprocals, such as those of z positions. Similar changes landed in other methods, mostly converting interp calls to interp_given_bary and precomputing positions once. After all this work, the runtime of the test scene was down to 1.9 seconds, a significant improvement over the original 2.5.
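The interp_given_bary idea boils down to this: solve for the barycentric weights once per pixel, then reuse them for every attribute. A minimal sketch (the struct layout and names here are my own simplification, not VSC's real TriangleF):

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Simplified stand-in for a triangle class; the real one holds much more.
struct TriangleF {
    std::array<Vec3, 3> v;

    // Original path: interp(px, py) would solve for the barycentric
    // coordinates of (px, py) on EVERY call, once per attribute.
    // Optimized path: the caller solves for (b0, b1, b2) once per pixel
    // and reuses the weights for positions, normals, UVs, colors, etc.
    Vec3 interp_given_bary(const std::array<float, 3>& b) const {
        return Vec3{
            b[0] * v[0].x + b[1] * v[1].x + b[2] * v[2].x,
            b[0] * v[0].y + b[1] * v[1].y + b[2] * v[2].y,
            b[0] * v[0].z + b[1] * v[1].z + b[2] * v[2].z,
        };
    }
};

// The reciprocal trick is the same spirit: divide once, multiply many times.
//   float inv_z = 1.0f / z;   // computed once per pixel
//   attr_a *= inv_z; attr_b *= inv_z; ...
```

If a fragment interpolates, say, five attributes, this replaces five linear solves per pixel with one solve and five weighted sums.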
Most of the optimizations past this early-game stage brought rather small improvements, but I did them anyway to see what would happen. One that I had somehow overlooked was moving the backface cull to the start of the FillTriangle methods (originally it sat after the triangle clipping step; the clipping itself also got a small tweak that made it a bit more performant). Another was using an edge-function system for the scanlines. These alone brought the runtime down to 1.2s. However, the new edge functions caused significant visual glitches on the ESP32, so they had to be reverted to a cruder form, raising the runtime back up to 1.4s. Unfortunately, it would later be revealed that the vector projection issue was not actually fixed, and that also had to be addressed (for the physics system that would be implemented later on). Then the Great Migration happened around December 2025, splitting the library from header-only into separate headers and implementation files. These two changes would unfortunately push the runtime back up to 1.7s -- still an improvement over the earlier 1.9s, though no longer the 1.4s we had with the flawed logic.
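For readers who haven't met edge functions: an edge function is a signed-area test that tells you which side of a triangle edge a point is on, and a point is inside the triangle exactly when all three agree. A minimal sketch (standard textbook form, not VSC's exact code, which also exploits the fact that the values step by a constant along a scanline):

```cpp
#include <cassert>

// Edge function: twice the signed area of triangle (a, b, p).
// Positive when p is to the left of the directed edge a -> b.
float edge(float ax, float ay, float bx, float by, float px, float py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// A point is inside a counter-clockwise triangle iff all three
// edge functions are non-negative. Along a scanline each value
// changes by a constant per pixel, so an inner loop can increment
// the three values instead of re-evaluating them.
bool inside(float x0, float y0, float x1, float y1,
            float x2, float y2, float px, float py) {
    return edge(x0, y0, x1, y1, px, py) >= 0.0f
        && edge(x1, y1, x2, y2, px, py) >= 0.0f
        && edge(x2, y2, x0, y0, px, py) >= 0.0f;
}
```

The same signed area also gives you the backface cull almost for free: a clockwise (back-facing) triangle has negative total area, so it can be rejected before any clipping or filling happens -- which is why moving that test to the front of FillTriangle paid off.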
On a side note, after doing all this for the rasterizer, I decided to do much the same for the raytracer (although the actual motivation was discovering that, since BaseMaterial and ImageTexture had uninitialized attributes, the raytracer produced very bizarre glaring effects). The raytracer test scene (RTexBVH, aka 9.3 Wheezie) was similar to the rasterizer test scene, except that it was not animated, most materials were reflective (with at most 2 reflections), one of the spheres was a simple cube, and the other sphere was reduced in complexity. The raytraced scene (after the material fixes) took 2.1 seconds to render (and yes, there was already a bounding volume hierarchy in place to speed things up). From there, the optimizations were largely analogous to the rasterizer ones: adding special methods to Mesh that accept precomputed barycentric coordinates, and adopting the Möller-Trumbore algorithm for ray-triangle intersections. I also precomputed the reciprocals of the screen dimensions to further milk this notion of reducing repetitive work. After all this, the raytracer took 1.4 seconds to render our test scene. That's a lot better!
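Möller-Trumbore is worth showing because it pairs so well with the precomputed-barycentric methods: the intersection test produces the barycentric coordinates (u, v) as a by-product, so they can be handed straight to the interpolation code with no second solve. A sketch of the standard algorithm (my own helper names and epsilon, not VSC's exact code):

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 cross(V3 a, V3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Möller-Trumbore ray/triangle intersection. On a hit, fills t (distance
// along the ray) and the barycentric coordinates (u, v), which can be fed
// directly into interp_given_bary-style methods.
bool moller_trumbore(V3 orig, V3 dir, V3 v0, V3 v1, V3 v2,
                     float& t, float& u, float& v) {
    const float EPS = 1e-7f;
    V3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    V3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < EPS) return false;  // ray parallel to triangle
    float inv_det = 1.0f / det;              // one reciprocal, reused thrice
    V3 s = sub(orig, v0);
    u = dot(s, p) * inv_det;
    if (u < 0.0f || u > 1.0f) return false;
    V3 q = cross(s, e1);
    v = dot(dir, q) * inv_det;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * inv_det;
    return t > EPS;  // hit in front of the ray origin
}
```

Note the early-out structure: most misses bail before the final dot product, which is a big part of why it beats solving the full 3x3 system per triangle.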
Much of the work since then (up to the present day, January 10, 2026) has been more milking and redundancy reduction. I also messed around with the matrix solvers a bit more (as directed by gprof), made a few constants constexpr, and changed a few mallocs/frees into static arrays. The average runtime of the animated scene with shaders now sits around 1.6 seconds. All in all, this was a very rough and chaotic journey, but I feel like I learned a lot, and my eldritch abomination is now much healthier than it was before.
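The malloc-to-static-array change is simple but worth spelling out, since per-frame heap traffic adds up over 48 frames. A sketch of the pattern (the buffer name and dimensions are illustrative, not VSC's actual variables):

```cpp
#include <cassert>
#include <cstddef>

// Before: per-frame heap traffic (illustrative, not VSC's real code).
//   float* scratch = (float*)malloc(kW * kH * sizeof(float));
//   ... render one frame into scratch ...
//   free(scratch);

// After: compile-time sizes and a static buffer reused across frames.
constexpr std::size_t kW = 512;
constexpr std::size_t kH = 512;

float* frame_scratch() {
    static float scratch[kW * kH];  // allocated once, lives for the program
    return scratch;
}
```

The trade-off is that the memory is held for the whole run and the function isn't reentrant, but for a renderer that reuses one fixed-size buffer every frame, that's exactly the behavior you want.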