Progress Report May 2023
May we offer you a progress report this fine month?
Tears were shed, monarchies were restored, and we’ve seen a fully functional, multi-stage detachable dropship, complete with cruise missile systems and meat grill in a Zelda game. Truly, nature is healing. If you hadn’t already guessed, this month has been almost entirely monopolized by that pesky blonde princess, but fear not, we did find time between play sessions to work on an avalanche of changes, fixes and improvements.
We’ll save a little something special for the end, but for now, let’s get to it.
We begin this month, not on a certain AAA release, but on the recently released Demon Slayer - Kimetsu no Yaiba. While the art style of the show is very faithfully recreated in videogame format, fans quickly noticed that many models and textures appeared fuzzy, with an almost TV static-like effect.
As seen above, this haze is caused by an incredibly small floating point error on a shader operation. Within the compiler-optimized output, there was a redundant multiplication operation occurring between `gl_FragCoord.w` and its reciprocal (1/gl_FragCoord.w). While on paper, these should cancel, computers are unfortunately not as simple as pure mathematics. With this multiplication step removed, the floating point error is swept away with it.
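To see why the two factors do not always cancel, here is a small Python sketch. It uses Python's 64-bit floats rather than the 32-bit values a shader works with, so the inputs that misbehave are only illustrative and are not the ones the game hit.

```python
def round_trip(x: float) -> float:
    """Multiply x by its own reciprocal, like the redundant shader operation."""
    return x * (1.0 / x)

# Integers in 1..100 whose round trip does not land exactly on 1.0.
inexact = [n for n in range(1, 101) if round_trip(float(n)) != 1.0]

print(inexact)
print(round_trip(49.0))
```

The error is tiny in every case, but a shader that feeds the result into further maths can amplify it into visible noise, which is why dropping the multiplication entirely is the cleaner fix.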
How ‘bout we talk about Zelda now? But for the moment, it’s gonna have to be Breath of the Wild. As mentioned last month, the bulk of the performance optimizations for this title came in May with a whole slew of changes that actually did hit the ground running with a couple of other games as well.
- Rendered textures without any pool reference are now kept alive to avoid recreation. Version 1.6.0 of BotW began to clear/write textures while never actually sampling them, causing a lot of headache for Ryujinx’s texture caching mechanisms.
- Vertex buffer updates are now batched in Vulkan. Avoids individual vertex buffer updates to reduce draw calls. Minor improvement that will vary heavily depending on the game and GPU driver.
- Granular buffer updates from constant buffer updates are now allowed. Constant buffer updates used to be uploaded as a full 4096-byte chunk; by tracking the offset since the last update, we can upload only the range that represents the newest buffer update. This dramatically reduces the uploaded bytes per frame, by almost 50% in Korok Forest.
- Vulkan fence manager and MultiFenceHolder were simplified. There were a number of bottlenecks and slow data structures in use within the MultiFenceHolder that were dramatically cut down. Improved performance in any backend-bottlenecked titles (this is very system and game specific).
- CPU region handle containers were removed. A rather stupid speed-up across the board here as we were accessing CpuRegionHandle objects via an intermediate instead of simply referencing them directly. A very small and simple change, yet due to its huge usage per frame, we saw a non-trivial 5% performance increase in Super Mario Odyssey (a good benchmark for raw drawcount performance).
- Textures that are flushed often are now preemptively flushed to host-imported memory (when available). Breath of the Wild is a huge offender in wastefully reading back data from linear textures: foot placements on terrain, water level at an object's/Link's position, even some information used to color underwater terrain and to populate grass. The game basically breaks if you don't do it properly. Performing a preemptive and direct flush of such textures to CPU-accessible memory skips a number of waits that the GPU would otherwise perform while the data was being copied around. In an ideal scenario, when the texture layout is linear, it can even skip another step and copy directly to the GPU by importing memory directly. The Legend of Zelda: Skyward Sword HD is also a fiend for reading back texture data to check if the sun is obscured and also sees significant gains here.
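The granular constant buffer update idea from the list above can be sketched as a dirty-range tracker. This is a minimal illustration in Python, not Ryujinx's implementation; the class and method names are hypothetical, and only the 4096-byte chunk size comes from the text.

```python
class ConstantBufferUploader:
    """Tracks the modified byte range so only that slice is uploaded."""

    SIZE = 4096  # full constant buffer chunk size mentioned above

    def __init__(self) -> None:
        self.data = bytearray(self.SIZE)
        self.dirty_start = None  # inclusive start of the modified range
        self.dirty_end = None    # exclusive end of the modified range

    def write(self, offset: int, payload: bytes) -> None:
        end = offset + len(payload)
        if offset < 0 or end > self.SIZE:
            raise ValueError("write falls outside the buffer")
        self.data[offset:end] = payload
        if self.dirty_start is None:
            self.dirty_start, self.dirty_end = offset, end
        else:
            # Grow the tracked range to cover every write since the last flush.
            self.dirty_start = min(self.dirty_start, offset)
            self.dirty_end = max(self.dirty_end, end)

    def flush(self) -> bytes:
        """Return only the dirty slice and reset the tracking."""
        if self.dirty_start is None:
            return b""
        chunk = bytes(self.data[self.dirty_start:self.dirty_end])
        self.dirty_start = self.dirty_end = None
        return chunk


uploader = ConstantBufferUploader()
uploader.write(256, b"\x01" * 64)
uploader.write(512, b"\x02" * 16)
print(len(uploader.flush()))  # 272 bytes instead of the full 4096
```

A single merged range keeps the bookkeeping cheap; a real backend might track several disjoint ranges instead, at the cost of more work per write.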
After all that, let’s look at the outcome for a couple of titles:
As the test title for most of the performance related changes of the last 2 months, it isn’t surprising to see a very healthy 30% uplift on our test systems in Breath of the Wild. As mentioned prior, Skyward Sword HD also sees a very disproportionate uplift of 25% from the preemptive flush change alone. Remember that texture it uses to read back data on sun occlusion? That thing is a full fat 1920x1080 RGBA8! Xenoblade Chronicles DE, as usual, somehow sneaks a small improvement from just about anything we do, but XC2 also manages a 28% uplift in the main town hub.
On the topic of Xenoblade, DE and 2 managed to become the catalyst to fixing some annoying, yet simple graphical bugs. On Nvidia drivers starting from 522.XX and pretty much all AMD drivers, XC:DE, XC2 and Bayonetta 3 exhibited major graphical artifacting. This was usually limited to UI elements in the Xenoblade titles, but was expressed as full blown god rays in Bayonetta, which obscured most of the screen space.
Due to a very specific render scenario, there were some cases where the required barriers were not being set because of the order in which the checks take place. By adjusting when the barrier check happens, the barrier can be inserted correctly.
Alright enough with the formalities. Tears of the Kingdom… *drum roll*... Ladies and Gentlemen, we did it again. Our trophy cabinet of Day 1 playable titles receives possibly its largest accolade to date. The team is extremely proud of this one as it continues to reassure us that we’re doing this whole emulation thing properly! That’s quite enough of the self-flattery though, the experience certainly was not perfect so let’s discuss what went wrong, and how we fixed it. We will try to keep this section as spoiler-free as possible, but as comes with the territory of using screenshots, proceed at your own risk.
The most immediate graphical issue could be spotted almost instantly. Rock and wall textures were littered with white square-shaped artifacts as shown below:
These splotches were being caused by a bug in the ASTC decoder which was setting an incorrect endpoint in LuminanceDelta mode. This affected all users who did not own a GPU with native ASTC support; basically everyone except for, ironically enough, Mac users.
Eagle-eyed readers may have already spotted the next issue in the two screenshots above. Those textboxes sure didn't look like that in Breath of the Wild, and it appeared that in Tears of the Kingdom, they were setting their texture swizzle incorrectly. By explicitly failing an exact match condition, we can force the creation of a correctly swizzled D32 texture. This change also resolves a very old bug in Mario Kart 8 Deluxe, where returning to the character select screen could sometimes break the character model cubemaps, causing them to appear a solid silvery color.
Moving down the list of most obvious bugs, the game at launch would experience seemingly random crashes citing an invalid memory region error after an hour or so of playtime. This one was actually caused by the shader cache matching a current shader use with the address of a different shader. The issue arose when that different shader had been partially unmapped, causing a crash. By only reading the mapped portion of the shader, this will instantly fail the compare condition and compile/lookup the correct shader instead.
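A rough sketch of that compare logic, using a toy address space in Python. The function names and the dictionary-based memory model are hypothetical and exist only to illustrate the idea of reading just the mapped portion.

```python
def read_mapped_prefix(memory: dict, address: int, size: int) -> bytes:
    """Read up to `size` bytes, stopping at the first unmapped byte.

    `memory` is a toy address space: a dict of address -> byte value.
    """
    out = bytearray()
    for addr in range(address, address + size):
        if addr not in memory:
            break  # hit an unmapped region: return what was readable
        out.append(memory[addr])
    return bytes(out)


def cache_entry_matches(memory: dict, address: int, cached_code: bytes) -> bool:
    """A shorter read can never equal the cached code, so the compare fails safely."""
    return read_mapped_prefix(memory, address, len(cached_code)) == cached_code


cached = bytes(range(16))
fully_mapped = {0x1000 + i: b for i, b in enumerate(cached)}
partially_mapped = {0x1000 + i: b for i, b in enumerate(cached[:8])}

print(cache_entry_matches(fully_mapped, 0x1000, cached))      # True
print(cache_entry_matches(partially_mapped, 0x1000, cached))  # False
```

The mismatch on a partially unmapped shader is the desired outcome: instead of crashing on the read, the cache falls through to compiling or looking up the correct shader.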
Back to stuff you can actually see, the Vulkan backend was disabling explicit LoD when using depth comparison with array textures. Interiors and a lot of shadow-based lighting were heavily affected. This seems to be a bit of legacy leftover code from when Vulkan used to cross-compile from GLSL as the extension is unsupported there. By removing this redundant blockage, the problems vanish!
As far as vendor-specific bugs go, Tears of the Kingdom came with plenty of baggage. For Nvidia users, the Vulkan backend was performing much worse than it should have, and in some scenarios up to 57% slower than OpenGL. The gap only widened when scaling to higher resolutions. The problem here was maddeningly ironic; a couple of months ago we implemented a system to migrate data around between device and host memory, which helped almost every game across the board. Unfortunately, Breath of the Wild benefits from the exact opposite buffer locations to Tears of the Kingdom, and the implementation was obviously tuned for the former, not the latter. By device mapping any buffers that are written more than they’re flushed, Nvidia GPUs no longer get kneecapped here.
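The placement rule in that last sentence boils down to a comparison of two counters. Here is a minimal Python sketch; the `BufferStats` type and `choose_location` function are hypothetical names, and a real implementation would also need to decide when counters reset and when a migration is worth its cost.

```python
from dataclasses import dataclass


@dataclass
class BufferStats:
    writes: int = 0   # times the GPU wrote to the buffer
    flushes: int = 0  # times the data was read back for the CPU


def choose_location(stats: BufferStats) -> str:
    """Prefer device memory when writes outnumber flushes, host memory otherwise."""
    return "device" if stats.writes > stats.flushes else "host"


print(choose_location(BufferStats(writes=120, flushes=3)))  # device
print(choose_location(BufferStats(writes=2, flushes=40)))   # host
```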
AMD are not spared the spotlight either. The gloom in the depths seems to function by emitting LoD texture sampling instructions via compute. While this is somewhat valid behavior with Nvidia-specific Vulkan extensions, on AMD Windows this caused texture sampling from compute to fail and cause “gloom damage” even when, visibly, Link was not stood anywhere near it.
It seems that other drivers were simply ignoring this invalid behavior and sampling LoD as 0. By checking the instruction is being used exclusively on fragment, the phantom gloom is a thing of the past.
The depths causing problems seems to be a theme because many users reported significant performance fluctuations after random amounts of time exploring. When down there, Tears of the Kingdom uses a global memory access with an address on constant buffer slot 6. This isn't standard and thus isn’t the size we expect; this caused us to read back a garbage size that ended up very large, which would synchronize a large amount of data per frame. Adjusting how we calculate the buffer size should bind it to a reasonable size and stop it crossing into other memory.
Onto something everyone could enjoy, Z-fighting. Z-fighting is a phenomenon in 3D scenes where two ‘objects’ are so close together that they end up with almost identical depth values on the z-axis. When this happens, the camera can effectively see a random assortment of geometry from both as a flickering effect as both “fight” to be “on top”.
In TotK, this effect could be seen on distant geometry which has less precise depth values than objects close to the camera.
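The loss of precision at a distance can be shown with a little arithmetic. The Python sketch below assumes a standard perspective depth mapping to the [0, 1] range and a 24-bit fixed-point depth buffer; the near/far planes and distances are made-up values, and the game's actual depth format and projection are not stated here.

```python
def depth_value(z: float, near: float, far: float) -> float:
    """Perspective depth in [0, 1] for a view-space distance z."""
    return (far / (far - near)) * (1.0 - near / z)


def quantize(depth: float, bits: int = 24) -> int:
    """Store depth as an unsigned normalized integer of `bits` bits."""
    return round(depth * ((1 << bits) - 1))


NEAR, FAR = 0.1, 10000.0  # assumed clip planes for the demonstration

# Two surfaces half a unit apart, close to the camera.
close_a = quantize(depth_value(1.0, NEAR, FAR))
close_b = quantize(depth_value(1.5, NEAR, FAR))

# Two surfaces half a unit apart, far from the camera.
far_a = quantize(depth_value(5000.0, NEAR, FAR))
far_b = quantize(depth_value(5000.5, NEAR, FAR))

print(close_a != close_b)  # True: easily told apart
print(far_a == far_b)      # True: same stored depth, so they can fight
```

Because the mapping spends most of its range on geometry near the camera, the same half-unit gap that is trivially resolved up close collapses to a single stored value in the distance.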
Adding support for `VK_EXT_depth_clip_control` significantly reduces the bulk of the larger geometry fighting. There is still work to do on the remaining fights, but they should be isolated to zoomed-in shots now.
Alas, all of the improvements to rendering and performance above mean nothing if the game refuses to run any faster than 20FPS though. Both Breath of the Wild and Tears of the Kingdom make use of a double-buffered VSync implementation that can dynamically switch between a 30FPS and 20FPS target, depending on the performance of the Switch. If it starts to thermal throttle or drop frames, the game can simply swap to its 20FPS mode and maintain its speed. How does it know if it’s performing badly? By using the timestamp of the GPU at the current frame. It seems that Ryujinx was incorrectly reporting its timestamp because by simply forcing the first timestamp on game boot to be 0, thus making all future timestamps an offset from 0, TotK finally seems to realize it isn’t running on a potato!
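The timestamp change amounts to reporting every reading relative to the first one. A minimal Python sketch, with a hypothetical `GpuTimestampSource` wrapper standing in for whatever the emulator actually uses:

```python
class GpuTimestampSource:
    """Reports timestamps as offsets from the first one requested."""

    def __init__(self, raw_clock):
        self._raw_clock = raw_clock  # callable returning raw ticks
        self._base = None

    def now(self) -> int:
        raw = self._raw_clock()
        if self._base is None:
            self._base = raw  # the first reading becomes the zero point
        return raw - self._base


# Fake raw clock values for the demonstration.
ticks = iter([1_000_000, 1_000_500, 1_001_700])
source = GpuTimestampSource(lambda: next(ticks))

print([source.now() for _ in range(3)])  # [0, 500, 1700]
```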
To finish off this extensive Zelda segment, some new shader formats were implemented in the form of p2rc, p2ri, p2rr and r2p.cc. We aren’t actually sure what they’re used for, but the logging console seemed to spit out an “unsupported format” warning occasionally, so it does use them somewhere! Find out where and we’ll give you a prize*.
*we allow you to be smug for no longer than a period of 3 seconds.
macOS development:
The upstreaming work for macOS continues at a rapid pace this month, including some pretty massive and vital changes being merged.
As we mentioned last month, universal macOS packages are now part of our master build pipeline. This allows users on macOS to download a fully up to date, bleeding edge version of Ryujinx from our release pages. Be aware that not every Apple Silicon-specific optimization we worked on for the `macos1` release has been merged yet, so many games may perform worse/render differently, and this is the reason we are still linking to the original build on our website as it has a larger compatibility profile at the moment. This should change soon.
A few, different, varied and fun MoltenVK bugs were given workarounds in May, with the end result being that titles like Xenoblade Chronicles 3 can now render extremely respectably on macOS master.
And we hope you aren’t bored of Tears of the Kingdom, but all roads do seem to lead there. For some unknown and honestly miraculous reason, the game not only ran on day 1, but more interestingly unlike its predecessor, didn’t instantly kill the hypervisor. This is good news for everyone because to this day, Breath of the Wild still needs to use the much slower JIT, whereas Tears of the Kingdom can natively execute all its code. We give our thanks to whichever game developer changed that!
This is not to say that Apple users escaped the bug blast. Huge vertex explosions plagued the game when Link wore specific clothing, or none at all. By truncating any vertex attribute format that exceeded the stride, we stop MoltenVK from providing Metal with incorrect vertex values.
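One way to picture the truncation is as a calculation of how many components of an attribute actually fit before the stride ends. The Python helper below is a hypothetical sketch of that idea, not the code that was merged; treating a stride of 0 as "leave unchanged" is also an assumption.

```python
def truncate_components(offset: int, component_size: int,
                        components: int, stride: int) -> int:
    """Return how many components of an attribute fit inside the stride."""
    if stride == 0:
        return components  # assumed to mean "no stride limit" in this sketch
    available = stride - offset
    fitting = max(0, available // component_size)
    return min(components, fitting)


# A 4 x 32-bit attribute at offset 8 in a 16-byte stride only has room for 2.
print(truncate_components(offset=8, component_size=4, components=4, stride=16))
```

With the component count reduced, a narrower format can be selected so that nothing is read past the end of the vertex.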
The second issue was once again related to clothing. Sensing some prejudice here… Shining bright white spots would appear on certain outfits and world geometry which, while looking actually rather cool, was clearly a garbage value in a shader somewhere.
Sometimes games may add a very small offset to a value in order to make completely sure that it will never be used in an operation that could result in a division by zero. This makes sense; division by zero is certainly quite bad for computers to deal with. Computers are also very smart these days, and compilers for shaders will usually try to optimize away anything they deem incorrect, wasteful or inefficient. Randomly adding tiny values to stuff is prime territory for the compiler to ruin your day, as has happened here. Luckily, values can be qualified as “precise” in SPIR-V and this allows them to be left well alone in the optimization stage.
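As an illustration of how a mathematically harmless rewrite can defeat such a guard, consider reordering additions. This Python sketch is not the specific transformation that affected the game, which is not described here; the epsilon and inputs are chosen purely for demonstration.

```python
EPSILON = 1e-20  # tiny guard value, chosen only for this demonstration


def guarded_as_written(a: float, b: float) -> float:
    """Evaluate in the order the author intended: (a + b) + EPSILON."""
    return (a + b) + EPSILON


def guarded_reordered(a: float, b: float) -> float:
    """An algebraically equal reordering: (a + EPSILON) + b."""
    return (a + EPSILON) + b


print(guarded_as_written(1.0, -1.0))  # 1e-20: safe to divide by
print(guarded_reordered(1.0, -1.0))   # 0.0: the guard has been absorbed
```

In the reordered version the epsilon is swallowed when added to 1.0, so the denominator ends up as exactly zero. Marking a value as precise tells the compiler not to apply this kind of rewrite.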
While rendering is fairly sorted on the latest builds, plenty of more intensive games (TotK among them) still need buffer mirrors implemented in order to perform well. We covered this in our initial blog post but as a TL;DR, they attempt to bridge the gap between how a desktop type GPU, such as that found in the Switch, and how a mobile-type GPU, such as that found in M1/M2 chips, render graphics. Work is currently progressing on getting this merged without negatively impacting Windows and Linux users, something we didn’t really need to worry about for macos1.
While these changes are certainly flashy, gdkchan has been busy effectively reworking a massive portion of the shader backend across multiple pull requests. The end goal of this work is to implement emulation for transform feedback and geometry shaders in a much cleaner and more maintainable way than that used in macos1.
May brought the final groundwork for this undertaking in three parts:
- Replacement of constant buffer access on shader with new `Load` instruction. Condenses the `ConstantBuffer` operand and `LoadConstant` instruction into a single `Load` instruction, reducing backend complexity and improving flexibility. This change also fixes the vertex explosions faced by AMD GPU users on Windows in Super Mario Galaxy (3D All-Stars).
- Generate scaling helper functions on IR. This change moves all of the resolution scaling code out of the SPIR-V and GLSL backends and into a single homogeneous helper function. This ultimately means less code, less work to implement more backends in future, and a lower likelihood of differences between any current or future backends.
- Replacement of ShaderBindings with new ResourceLayout structure for Vulkan. The ResourceLayout is used to create the PipelineLayout on Vulkan, rather than it having 2 hard coded layouts (one that was used for game shaders, and another that was used for helper shaders from the backend called "minimal layout"). Since we need to reserve additional storage buffers for transform feedback and geometry shader emulation, the PipelineLayout also needs to be different. This change allows this to be done in a simple way.
With those now in place, the first pull request for transform feedback emulation is open (and merged by now, but shh, save the surprise for next report), with geo shaders to follow. We’re aware this has taken a fair while, but the team are far happier with the implementations that the reshape allows, compared to the rather complex solutions employed for our first macOS release.
And back to our usual schedule...
When emulating an operating system that is still in active development, it’s easy to get swept away and forget that maybe stuff has changed since it was originally reverse engineered. As such it was time to return an eye to the… time services. With a full RE of these from firmware 15.0.0 we were not only able to ensure our accuracy for at least another few versions, but also stumble upon some old mistakes. This change finally fixes the completely static timed PokéJobs in Sword/Shield, and likely other games that make use of timed events. It was about time if you ask us.
Alongside that RE work, our resident audio maestro, marysaka, went back over to the audio renderer, fixing some audio bugs in Tears of the Kingdom and implementing support for full 5.1 surround sound when using the SDL2 backend with a compatible game and speaker system.
Most first-party Switch games actually do support surround sound, so this change is a welcome one for those with the space!
To finish us up let’s rattle through a quick-fire round for some of those more niche, yet oddly helpful improvements:
- BuildIDs for games are now exposed via the Cheat manager. This should make it far easier to make your own cheats/name your cheat files accordingly without needing to resort to homebrew like JKSV.
- `vm.max_map_count` is now automatically increased on Linux if it is deemed too low. Some titles, including lots of UE4 games, allocate huge chunks of memory at launch and need a higher limit here.
- DelegateHelper has been replaced with pre-generated delegates in the CPU recompiler. This removes all uses of System.Reflection from the ARMeilleure project; a step closer to being NativeAOT compatible.
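For the curious, checking the `vm.max_map_count` limit mentioned above is straightforward on Linux, since the kernel exposes it as a file. The Python sketch below only reads and compares the value; the threshold constant is a hypothetical figure for illustration and not necessarily the one Ryujinx uses, and actually raising the limit requires elevated privileges.

```python
from pathlib import Path

# Hypothetical threshold for illustration only.
RECOMMENDED_MAX_MAP_COUNT = 524288


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the current limit from the given sysctl file."""
    return int(Path(path).read_text().strip())


def needs_increase(current: int,
                   recommended: int = RECOMMENDED_MAX_MAP_COUNT) -> bool:
    """True when the current limit is below the recommended value."""
    return current < recommended


print(needs_increase(65530))    # True
print(needs_increase(1048576))  # False
```

On a Linux machine, `needs_increase(read_max_map_count())` reports whether the running system falls below the chosen threshold.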
Last but not least, ‘Stop emulation’, a button which has felt somewhere between useless and a Russian roulette for a fair old while now, should finally work… in most cases. Four more causes of deadlock were isolated and resolved this month from GUI, ServerBase, CPU and mostly GPU code. The nightmare should, mostly, be over.
Closing words
Did we say we had something cool? Have a gander at this.
That’s all we’ve got for you this month! If you like what we do and want to give us a helping hand, you can check out our Patreon, have a wander through our GitHub, or join us on Discord. As per our usual pitch, open-source software is driven by folk around the world who find something annoying, and fix it. Are we annoying? Well, you know what to do.
Thank you all for reading and we’ll be back in a month! Au revoir.