Software rendering is better than DirectX or OpenGL


I published this article two years ago, but it got lost from the internet. I got a request to re-upload it. The illustrations in this article were lost as well, so I tried to gather them again. Parts of the article were behind a paywall, but I am now making them public for free. So, here is the article!

Nowadays, the overwhelming majority of games and applications use hardware rendering through DirectX or OpenGL. The OpenGL and DirectX APIs allow programmers to use the graphics card to generate the 3D image on your screen. The DirectX and OpenGL interfaces are supplied by the drivers of your graphics card. This method of producing graphics is called hardware acceleration, or hardware rendering. When the graphics card is not involved in the creation of the 3D picture, the procedure is called software rendering.


How the era of 3D graphics began

In the 90s, we had no 3D acceleration. On 386, 486, and Pentium computers, games produced their graphics by software rendering: the graphics were computed by short algorithms running on the CPU.

Early software-rendered Tomb Raider running under DOS

(Screenshot by MobyGames)

As CPUs became faster, the quality of software rendering increased. Once we reached the 166 MHz Pentium 1, 640x480 became playable with a few thousand polygons. This was enough for most games at the time; however, to keep the rendering playable, they had no texture filtering and very few effects. Later, the S3 ViRGE graphics card and the 3dfx Voodoo 1 were introduced, allowing filtered textures and faster 3D performance than software rendering. Of course, to access these abilities, we had to rewrite our code to support these chips. At first every manufacturer supported its own proprietary API, but later the interfaces were more or less standardized: OpenGL and DirectX (Direct3D) compatible drivers became widely available. In 1997, we were able to get as much as double the performance of software rendering with these early graphics chips. They offered more fps and better graphics quality, so the software manufacturers switched to hardware rendering.

What is the situation now?

Currently we have three DirectX variations in use on Windows. DirectX 9 uses fixed function rendering with support for shaders, DirectX 11 supports only the programmable pipeline, and DirectX 12 is designed for highly parallelized rendering. These three APIs are totally incompatible with each other, so graphics card developers have to write three separate implementations, one for each API. If a programmer decides to write a game for DirectX 12, he must also write a separate rendering engine for DirectX 9, as older graphics chips do not support the newest DirectX 12 API. Initializing DirectX 12 and rendering a few textured triangles requires roughly 1000 lines of initialization code. A separate DirectX 9 rendering engine for compatibility is another 500-ish lines. The story doesn't end here – so far we have only covered Windows compatibility.

It's getting more complicated

DirectX does not exist on Linux, Apple, or Android based devices. These platforms have OpenGL. To support them, you have to implement these APIs as well. Porting your graphics engine to OpenGL 1.1 will take about 500 more lines (we are still talking about a triangle based renderer that can do texturing). However, this only covers desktop OpenGL, which does not exist on mobile phones. Mobiles have a different variant of OpenGL, called OpenGL ES, and there are two separate OpenGL ES APIs. The newer one is OpenGL ES 3, which is similar to DirectX 11: you have to write a shader based programmable pipeline to drive it. That is about 1000 lines, and it's not backward compatible with older phones. Older phones have OpenGL ES 2 or OpenGL ES 1.x. To create a program that can work on OpenGL ES 1, you must once again write a new renderer for it.

It's REALLY getting more complicated

Ok, now you have your renderer with separate code paths for DirectX 11, DirectX 9, OpenGL 1.1, OpenGL ES 3, and OpenGL ES 1. The API-facing part of the rendering code alone has grown beyond 5000 lines, and in theory it is now capable of running on PCs, tablets, and phones. In theory. In reality, it will only run on your own devices, because the implementations of the 3D APIs are broken. Code that works well on an nVidia chip may not work so well on an AMD, Vivante, Samsung, or Mali chip. Some chips have no problem rendering non-power-of-two textures, some only work well with power-of-two textures. Some may crash if you allocate more than a few thousand textures, some will just give a white picture because you forgot to set a bit somewhere. On some configurations it may simply crash the phone, so you have to check every single graphics vendor. You have to buy a couple of nVidia and AMD cards, VIA based laptops, everything from Mediatek, HiSilicon, Amlogic, and Samsung, through tens of less common manufacturers, and you will spend the rest of the year testing your engine on hundreds of chips, ensuring your CRAP runs well on every chip, so it's good for production.

What, seriously?

Some programmers just give up at this point and use an existing rendering engine. In the hope of saving the time spent on writing the renderer, they start to use bloatware such as the Unity game engine, or various other over-complicated solutions. The end result is usually slow, the licensing is problematic, and hardware support is still bad, as these engines will crash on some hardware just like before. Even if you have something that runs relatively well, you are not future-proof: sooner or later you will have to find a new engine that works properly on the hardware and software environments released later (while maintaining compatibility with existing and past hardware). Another problem is the trade war between the USA and China, which may result in the birth of totally new computer platforms if China simply bans everything from Microsoft, Intel, AMD, and nVidia, and then half of the users will be on a different infrastructure.

The problem is the 3D hardware rendering itself

We can easily see that these problems would not exist without hardware acceleration. As we discussed previously, 3D acceleration was only born because the Pentium 1 was too weak to produce nice graphics at suitable resolutions. That was, however, 25 years ago. Since then, our CPUs have become more than 100x faster, so this is not an issue any more.

My older game engines used hardware 3D rendering as well, starting from the early 2000s. To have good compatibility with a wide range of hardware, I collected various nVidia, ATI, 3dfx, 3Dlabs, and S3 graphics cards. After the 2010s, I decided to discontinue that code-base altogether. The maintenance was too time-consuming. For example, I never properly fixed a crash bug with Intel IGPs: despite fixing the bug on various Intel based IGPs, others still continued to crash. The code used OpenGL 1.x with extensions for frame buffer objects and shadow maps. Rewriting the code for OpenGL ES 1.x would have been too ugly, resulting in unmaintainable code.

Instead, I rewrote the whole 3D engine to use software rendering. My first attempt was crap; my second attempt was fine, and it only took a few days to finish, plus a few more weeks to optimize it and fix all the bugs.

The end-result is used in my new 3D RPG Maker software, Maker4D.

Picture: Maker4D using software renderer


What is Maker4D? Maker4D is a FREE 3D RPG game maker engine with pre-rendered backgrounds and real-time 3D battles (similar to Final Fantasy 7 or 8, or Alone In The Dark). It is also capable of making Visual Novels, and it can dynamically switch between 2D and 3D mode. Maker4D automatically generates your playable 3D characters from a 2D picture of your hero. Download: http://maker4d.uw.hu/index.html

Using a software renderer in Maker4D instantly allowed me to run the code on Windows, Linux, and Android. Luckily, the code was fast enough to run on ARM v7 processors as well.

The performance

As I already mentioned, modern processors are fast enough to run software renderers. I have benchmarked the renderer on several pieces of hardware. I have optimized the renderer further since this benchmark was done.

(edit: since these results were gathered, I have optimized the code, so currently it's about two times faster than back then, so you can multiply every number below by two...)

As we can see, the software renderer can achieve around 300 fps on a modern i7 CPU when there are about 10000 textured and animated polygons (triangles) in the scene. For some reason this renderer does not scale well beyond 2 cores; it is very rare to see more than 2 cores being utilized, so the results are pretty much the same on an i3 CPU. The dual core second generation Atom CPU falls below 25 fps, but the Pentium N3710 and equivalent Celeron CPUs reach around 50 fps. These numbers are the speed of the whole Maker4D engine displaying a room with characters, so this isn't a synthetic test: it is the real-world performance of the whole game engine under real load, running physics, background music, and the game logic itself.

If we resize the window to HD resolution (1920x1080), the speed usually still stays above 25 fps, or around 100 fps on the i5/i7 machines. There is room to increase the polygon count as well; the software renderer can handle about half a million polygons before the speed starts to fall too much, though of course this also depends on the CPU. Maker4D eats about 50 MByte of RAM. It also runs more or less playably on ARM based phones.

As we can see, software rendering is quite enough for a game even in HD, unless you want extremely modern graphics with a couple million polygons and special effects.

How to write a software renderer like this?

Writing a software renderer is easy. The software renderer I use in Maker4D is only about 1000 lines, and it's written in C. Of course, you can't write a fast renderer in a corporate script-shit language like Java or C#, so you will have to use C with a compiler that can generate decent binary code, such as GCC.

(edit: since I wrote this article, I have optimized the code and now it's probably more than 2000 lines)

This little explanation will show you how to write a decent software renderer, including how to structure the code itself. It focuses on the ideas rather than on complete code listings; only a few short sketches appear below. Please note that writing a software renderer does not need more advanced math than a 9th grader's homework. It assumes you already have some clue what a vertex or a texture is.

Rendering is triangle based

Modern rendering is triangle based. Each triangle has 3 points, each with x, y, z coordinates, where z is the distance. Your triangle may be moved in the scene and have an additional rotation; you can calculate that beforehand, or directly in the renderer before doing anything else. To rotate your model, you can use the sin and cos functions, but you don't want to call them for every triangle: do it only once per model. If your model is resized, just multiply the x, y, z values. Then you add the location of the object to these x, y, z coordinates, and you have the model in place.
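
For illustration, here is a minimal sketch in C of what placing one vertex could look like (the names and the single-axis rotation are just an example, not the actual Maker4D code):

```c
#include <math.h>

typedef struct { float x, y, z; } vec3;

/* Rotate around the Y axis with sin/cos values computed once per model,
   then scale the point and translate it to the object's position. */
static vec3 place_vertex(vec3 v, float s, float c, float scale, vec3 pos)
{
    vec3 r;
    r.x = (v.x * c - v.z * s) * scale + pos.x;
    r.y =  v.y                * scale + pos.y;
    r.z = (v.x * s + v.z * c) * scale + pos.z;
    return r;
}

/* Usage: compute sin/cos once per model, then apply them to every vertex:
   float s = sinf(angle), c = cosf(angle);
   world[i] = place_vertex(model[i], s, c, 1.0f, object_position);        */
```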

To render a triangle, first subtract the location of the camera (x, y, z) from the x, y, z points of the triangle: x from x, y from y, z from z. Then you can use the sin and cos functions to apply the rotation of your camera (again, you don't want to call cos and sin for every object; it's enough to call them once). If you don't know the formulas for rotation, you can google pseudocode, for example https://stackoverflow.com/questions/13275719/rotate-a-3d-point-around-another-one – the third code snippet is fine; you don't have to reinvent the wheel.

Then just divide the coordinates by the width (the y coordinates by the height) – in practice, multiply with the reciprocal instead of dividing – divide the coordinates by Z (the depth), add half of your screen size to the resulting coordinates, and voila: you have all the coordinates on the screen, in pixels.
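
Putting the camera transform and the projection together, a rough sketch could look like this (the focal-length factor, the single-axis camera rotation, and the function names are my own assumptions for the example):

```c
typedef struct { float x, y, z; } vec3;

/* Transform a world-space point into camera space (subtract the camera
   position, rotate by the camera's yaw using precomputed sin/cos), then
   project it to pixel coordinates. Returns 0 if the point is behind the
   camera. The focal factor here gives roughly a 90 degree field of view. */
static int project(vec3 p, vec3 cam, float s, float c,
                   int screen_w, int screen_h, int *px, int *py)
{
    float x = p.x - cam.x, y = p.y - cam.y, z = p.z - cam.z;
    float rx =  x * c + z * s;              /* rotation around the Y axis  */
    float rz = -x * s + z * c;

    if (rz < 0.01f) return 0;               /* behind the camera, skip it  */

    float recp  = 1.0f / rz;                /* divide once, multiply after */
    float focal = screen_w * 0.5f;
    *px = (int)( rx * focal * recp) + screen_w / 2;
    *py = (int)(-y  * focal * recp) + screen_h / 2;
    return 1;
}
```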

Let's fill the triangles

Filling is a bit trickier, because you can't use too much math: that would make the rendering too slow. What I do is basically calculate a step vector (the x, y direction divided by the length) for the two sides of the triangle, and keep adding it for each line. I don't do this just with the coordinates, but also with Z, the UVs, and the vertex colors (if there are any). With this, I can avoid doing too much math per line. Now that I know the coordinates, the UVs, etc. at both ends of each line, I can fill the polygon: I repeat the same trick to get the numbers I must add per pixel.
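
As a rough illustration of the per-line and per-pixel stepping, here is a stripped-down flat-bottom triangle fill in C. It interpolates only one attribute (u), assumes the coordinates are already clipped to the screen, and assumes v1 is the bottom-left and v2 the bottom-right vertex; the real renderer steps x, z, u, v and the colors the same way:

```c
typedef struct { float x, y, u; } vert;

static void fill_flat_bottom(vert v0, vert v1, vert v2,
                             unsigned *pixels, int screen_w)
{
    float height = v1.y - v0.y;
    if (height <= 0.0f) return;

    /* per-line increments for the left and the right edge */
    float lx_step = (v1.x - v0.x) / height, rx_step = (v2.x - v0.x) / height;
    float lu_step = (v1.u - v0.u) / height, ru_step = (v2.u - v0.u) / height;

    float lx = v0.x, rx = v0.x, lu = v0.u, ru = v0.u;

    for (int y = (int)v0.y; y < (int)v1.y; y++) {
        float width  = rx - lx;
        float u      = lu;
        float u_step = (width > 0.0f) ? (ru - lu) / width : 0.0f;

        for (int x = (int)lx; x < (int)rx; x++) {
            pixels[y * screen_w + x] = (unsigned)u; /* real code samples the texture here */
            u += u_step;                            /* per-pixel increment  */
        }
        lx += lx_step; rx += rx_step;               /* per-line increments  */
        lu += lu_step; ru += ru_step;
    }
}
```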

Integers

Integers are faster than floating point numbers. In a modern CPU we have 3 or more integer pipelines in every core, but only 1 or 2 floating point pipelines. Therefore, at this point I multiply every number and convert everything to integers, including the UVs for the textures. This gives a notable performance boost, almost 50% on weaker CPUs. ARM CPUs in particular have weak floating point performance, so please be sure to use integers when filling the triangles. I use 32 bit integers, and I usually multiply by 65536 (sacrificing 16 bits), which also limits the maximal texture size: I will not be able to use more than 32000x32000 as a resolution, which is probably not a big deal… If it is for you, you can use 64 bit numbers as well, which I didn't do, as I wanted decent performance in 32 bit as well (or just use long, which with GCC is 32 bit on 32 bit machines and 64 bit on 64 bit machines).
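
A small sketch of the 16.16 fixed-point idea (a generic example, not the actual Maker4D code; it assumes the texture coordinates stay inside the texture row):

```c
typedef int fixed;                       /* or long, as noted above */

#define FP_SHIFT 16
#define FP_ONE   (1 << FP_SHIFT)

static fixed to_fixed(float f)   { return (fixed)(f * FP_ONE); }
static int   from_fixed(fixed f) { return f >> FP_SHIFT; }

/* Step a texture coordinate across a span using only integer additions
   and shifts in the inner loop; the conversion happens once per span. */
static void textured_span(unsigned *dst, const unsigned *texture_row,
                          int width, float u_start, float u_end)
{
    fixed u      = to_fixed(u_start);
    fixed u_step = to_fixed((u_end - u_start) / (float)width);

    for (int x = 0; x < width; x++) {
        dst[x] = texture_row[from_fixed(u)];  /* integer texel lookup */
        u += u_step;
    }
}
```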

Use IFs if necessary

You can use IFs when you want to skip rarely used parts of the code. An if statement (a branch) only eats a few cycles, so if it lets you skip some complex math somewhere, always use it; do not assume that the CPU will be fast enough to go through the math anyway.
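
For example, a hypothetical per-pixel blend where a cheap branch skips the expensive math in the common cases (my own illustration of the idea):

```c
/* 0xAARRGGBB pixels: skip the per-channel blending math entirely when the
   texel is fully opaque or fully transparent; only the rare case pays. */
static unsigned blend_pixel(unsigned texel, unsigned background)
{
    unsigned a = texel >> 24;
    if (a == 0xFFu) return texel;        /* opaque: just copy, no math   */
    if (a == 0x00u) return background;   /* transparent: nothing to do   */

    unsigned r = (((texel >> 16) & 0xFF) * a + ((background >> 16) & 0xFF) * (255 - a)) / 255;
    unsigned g = (((texel >>  8) & 0xFF) * a + ((background >>  8) & 0xFF) * (255 - a)) / 255;
    unsigned b = (((texel      ) & 0xFF) * a + ((background      ) & 0xFF) * (255 - a)) / 255;
    return (0xFFu << 24) | (r << 16) | (g << 8) | b;
}
```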

Optimize for weak machines

If you keep testing your code on weak hardware while developing it, you can eliminate the dirty little speed demons that might otherwise stay hidden when you develop on a CPU worth multiple 1000 dollars. I pushed this a bit to the extreme when I built a Cyrix 6x86MX based PC just to optimize the code. If you think that's unnecessary, think about it: you probably want ARM based phones to be able to run your stuff too, so it's important to have optimized code.



Comments

The scale and depth of what you do with your software talent is impressive.
Debunking the necessity for proprietary hardware rendering seems like a good path for what you provide.


Thank you, however there are far better software renderers available than mine. For example, the software renderers in the old Tomb Raiders, IndyCar, Unreal Tournament, etc. are at least two times faster than mine (I expect those use inline assembly to achieve those speeds).


I was regarding your effort as a solo effort
instead of some kind of team effort,
which I assume (correctly, or not) others mostly are.
But I am not familiar with your genre of software, so this is just my guess.


Writing a renderer is a very focused, very individualistic task, which is usually done by just one guy (even if the team has multiple members). This was also traditionally true for the early DOS game engines.

Nowadays you sometimes see giant game developer teams, where there are a lot of people on the same task, yet nobody ever does anything, and those engines/games are either never finished, or take ten times longer to finish than they should, and/or they are overcomplicated. (communism at work :D)


Thanks for the insight.
I suspect I may someday very soon find an application for this useful perspective.
It suggests that any superior 'team effort' may
best consist of voluntary contributions that interweave collaboration.
...As opposed to goal and role assignment and delegation.
