NVIDIA’s RTX 30 series launched to a ton of fanfare, with jaw-dropping performance claims and specifications, but somewhere between all the hype and the third-party reviews, the promised doubling in performance vanished without a trace. Today, we are going to investigate a very interesting phenomenon plaguing NVIDIA GPUs and why not everything is as it seems. Nothing here is presented as the gospel truth, and you are encouraged to use your own judgment.
NVIDIA’s RTX 30 series has more than twice the TFLOPs, so where did all the performance go?
The argument is simple: Jensen promised twice the graphics power in Ampere GPUs, so we should see roughly twice the shading performance in most titles (without any bells and whistles like DLSS or RTX). This, most curiously, isn’t happening. In fact, the RTX 3090 is anywhere from 30% to 50% faster in shading performance in gaming titles than the RTX 2080 Ti, even though it has more than twice the number of shading cores. TFLOPs is, after all, simply a function of the number of shading cores multiplied by the clock speed. Somewhere, somehow, performance is being lost.
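As a rough sketch of that arithmetic: theoretical FP32 throughput is usually quoted as shader cores times clock speed times two, since one fused multiply-add per core per clock counts as two floating-point operations. The core counts and boost clocks below are assumed reference specifications that do not appear in this article, and the function name is purely illustrative.

```python
def fp32_tflops(shader_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical FP32 TFLOPs: cores x clock (GHz) x 2 FLOPs per clock (one FMA)."""
    return shader_cores * boost_clock_ghz * 2 / 1000.0


# Assumed reference specs (not taken from the article).
rtx_3090 = fp32_tflops(shader_cores=10496, boost_clock_ghz=1.695)
rtx_2080_ti = fp32_tflops(shader_cores=4352, boost_clock_ghz=1.545)
paper_ratio = rtx_3090 / rtx_2080_ti

print(f"RTX 3090:    {rtx_3090:.2f} TFLOPs")
print(f"RTX 2080 Ti: {rtx_2080_ti:.2f} TFLOPs")
print(f"Ratio on paper: {paper_ratio:.2f}x")
```

Under these assumed specs the ratio on paper comes out well above 2x, which is the gap the rest of this piece tries to account for.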
One of three things is happening:
- An individual shading core of Ampere is somehow inferior to its Turing counterpart and the cards can’t actually deliver that FP32 TFLOPs number (in other words, Jensen lied).
- There is something wrong in the BIOS/microcode or low-level drivers of the card.
- The high-level drivers / gaming engines / software stacks can’t scale up to properly utilize the mass of shading cores present in Ampere cards.
Fortunately for us, this is a problem that we can easily investigate using the scientific method. If the Ampere cards’ shader cores are somehow inferior to Turing’s, then we should not be able to get twice the FP32 performance using *any* application. Simple, right? If, however, we can get the claimed performance in *any* application, then it becomes slightly tricky. While that would absolve the hardware of any blame, we would then need to find out whether the software stack/high-level drivers are at fault or whether it’s a microcode issue. While you can resolve hardware vs software with a very high level of certainty, you cannot pin down the culprit within the software side with the same confidence. You can, however, make a very good guess. Our logic flow diagram is as follows:
Rendering applications are designed to use a ton of graphics horsepower. In other words, their software is coded to scale across far more cores than games are (there have actually been instances in the past where games refused to work on core counts higher than 16). If *a* rendering application can demonstrate the doubling in performance, then the hardware is not to blame. The cores aren’t inferior. If *all* rendering applications can take full advantage, then the low-level driver stack isn’t to blame either. This would point the finger at APIs like DirectX, GameReady drivers, and the actual code of gaming engines. So without any further ado, let’s take a look.
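The reasoning above can be written down as a small decision function. This is only a sketch of the logic as described in the text; the function name, its parameters and the wording of the verdicts are illustrative, not part of any real tool.

```python
def diagnose(any_app_doubles: bool, all_render_apps_double: bool) -> str:
    """Narrow down where the missing performance goes, following the article's logic."""
    if not any_app_doubles:
        # No application reaches ~2x: the shader cores themselves are suspect.
        return "hardware: the cores cannot deliver the claimed FP32 throughput"
    if all_render_apps_double:
        # Every renderer scales, so the low-level stack is doing its job.
        return "high-level software: APIs, game drivers or game engines"
    # Some renderers scale and some do not: hardware is cleared,
    # but low-level drivers or microcode cannot be ruled out.
    return "software: low-level drivers or microcode may share the blame"


print(diagnose(any_app_doubles=True, all_render_apps_double=False))
```

Note that only the first branch gives a firm answer; the other two are best guesses, which matches the caveat that the software side cannot be resolved with the same certainty.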
VRAY is one of the most shading-intensive benchmarks for GPUs. It is essentially the Cinebench of GPUs. It also helps that the program is optimized for the CUDA architecture, so it represents a “best case” scenario for NVIDIA cards. If the Ampere series can’t deliver the doubling in performance here, it will not do so anywhere else. The RTX 3090 in VRAY achieves more than twice the shading performance of the RTX 2080 Ti quite easily. Remember our flow diagram?
Since we have a program that can actually output double the performance in a ‘real world’ workload, it obviously means that Jensen wasn’t lying and the RTX 30 series is actually capable of the claimed performance figures, at least as far as the hardware goes. So we now know that performance is being lost somewhere on the software side. Interestingly, Octane scaled a little worse than VRAY, which is slight evidence that the low-level drivers are not fully there yet. Generally, however, rendering applications scaled a lot more smoothly than gaming applications.
We took a panel of 11 games. We wanted to test games on shading performance only: no DLSS and no RTX. There wasn’t a particular methodology to picking the titles; we just benched the games we had lying around. We found that the RTX 3090 was on average 33% faster than the RTX 2080 Ti. This means that, for the most part, the card is acting like a 23.5 TFLOPs GPU. Performance is obviously taking a major hit as we move from rendering applications to games. There is a vast gap between the performance targets the RTX series should be hitting and the ones it’s actually hitting. Here, however, we can only guess. Since there is a lot of fluctuation between various games, game engine scaling is obviously a factor, and the drivers don’t appear to be capable of fully taking advantage of the 10,000+ cores that the RTX 3090 possesses.
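To put the 33% figure in perspective, here is a minimal sketch comparing the observed uplift with the promised doubling. The two input numbers come from the text above; the function name and the two derived metrics are illustrative choices, not the method used to arrive at the TFLOPs figure quoted in this article.

```python
def scaling_efficiency(observed_speedup: float, expected_speedup: float) -> float:
    """Fraction of the expected speedup that actually showed up."""
    return observed_speedup / expected_speedup


observed = 1.33  # RTX 3090 vs RTX 2080 Ti, average over the 11 games
expected = 2.0   # the promised doubling

efficiency = scaling_efficiency(observed, expected)
# Share of the promised *gain* (the extra 1.0x) that was realized.
gain_realized = (observed - 1.0) / (expected - 1.0)

print(f"Speedup delivered vs promised: {efficiency:.1%}")
print(f"Share of the promised gain realized: {gain_realized:.1%}")
```

Either way you slice it, games are seeing only a fraction of the uplift that the rendering benchmarks show.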
So what does this mean? Software bottleneck, fine wine and the amazing world of no negative performance scaling in lineups
Because the problem with the RTX 30 series is very obviously one that is based in software (NVIDIA quite literally rolled out a GPU so powerful that current software cannot take advantage of it), it is a very good problem to have. AMD GPUs have always been praised for being “fine wine”. We posit that NVIDIA’s RTX 30 series is going to be the mother of all fine wines. The level of performance enhancement we expect to come for these cards through software in the year to come will be phenomenal. As game drivers, APIs, and game engines catch up in scaling and learn how to deal with the metric butt-ton (pardon my language) of shading cores present in these cards, and DLSS matures as a technology, you are not only going to get close to the 2x performance levels – but eventually, exceed them.
While it is unfortunate that all this performance isn’t usable on day one, this might not be entirely NVIDIA’s fault (remember, we only know that the problem is on the software side; we don’t know for sure whether the drivers, the game engines or the API is to blame for the performance loss). One thing is for sure: you will see chunks of this performance get unlocked in the months to come as the software side matures. In other words, you are looking at the first NVIDIA Fine Wine. While previous generations usually had their full performance unlocked on day one, the NVIDIA RTX 30 series does not, and you would do well to remember that when making any purchasing decisions.
Fine wine aside, this also has another very interesting side effect. I expect next to no negative performance scaling as we move down the roster. Because the performance of the RTX 30 series is essentially being software-bottlenecked, and the parameter around which the bottleneck revolves appears to be the number of cores, less powerful cards should experience significantly less bottlenecking (and therefore better scaling). In fact, I am going to make a prediction: the RTX 3060 Ti, for example (with 512 more cores than the RTX 2080 Ti), should experience much better scaling than its elder brothers and still beat the RTX 2080 Ti! Essentially, the lower the core count, the better the scaling.
While this situation represents uncharted territory for NVIDIA, we think this is a good problem to have. Just like AMD’s introduction of high-core-count CPUs forced game engines to support more than 16 cores, NVIDIA’s aggressive approach to core count should force the software side to catch up with scaling as well. So over the next year, I expect RTX 30 owners will get software updates that drastically increase performance.