Progress Report August 2021
August was a busy month, bringing a flurry of bug fixes, HLE improvements, and GPU improvements, not to mention multithreaded shader compilation! Vulkan has also seen an incredible number of improvements in the past month.
But before plunging into the deep end, let’s review the state of Ryujinx’s Patreon goals and deliverables:
Amiibo Emulation - merged into the main build in March 2021.
While compatibility is now almost perfect, there are still some improvements to come for Amiibo, which can be tracked on the associated GitHub issue here: https://github.com/Ryujinx/Ryujinx/issues/2122
Custom User Profiles - merged into the main build in April 2021.
Vulkan GPU Backend - still a work in progress, working to make it stable enough to make it into master. For now a test build is available here. Be sure to report any issues on the GitHub page.
ARB Shaders - goal reached in April 2021. Work on ARB shaders will begin as soon as Vulkan is finished.
ARB shaders will further reduce stuttering on first-run by improving the shader compilation speed on NVIDIA GPUs using the OpenGL API.
$2000/month - Texture Packs / Replacement Capabilities - Almost there!
This will facilitate the replacement of in-game graphics textures which enables custom texture enhancements, alternate controller button graphics, and more.
ETA once goal is reached: ~3-4 weeks
$2500/month - One full-time developer - Not yet met
This amount of monthly donations will allow the project's founder, gdkchan, to work full-time on developing Ryujinx.
$5000/month - Additional full-time developer - Not yet met
This amount of monthly donations will allow an additional Ryujinx team developer to work full-time on the project.
As promised, there is now a Pull Request open with the Vulkan implementation. What that means is that the code is now public, and anyone can test it if they wish. It is not complete yet, so please do not interpret this as a feature announcement; there is still quite a lot of work to be done before it can be considered better than the OpenGL backend overall. This month saw incredible progress for Vulkan.
When the Vulkan pull request was opened at the end of July, it still had many issues affecting AMD users, mainly "vertex explosions" in quite a lot of games, crashes in games doing depth-stencil texture copies, and crashes when using resolution scaling. Those specific issues are now fixed, allowing the games to render properly. There are also other issues that we are slowly debugging and fixing. Some of those issues affect not only Vulkan but also OpenGL, so they are not being fixed as part of the Vulkan work; instead, a separate fix is submitted to the master branch of the emulator. This is the case for the solid blue eyes on Hatsune Miku and the broken shadows on Zelda Link's Awakening, an issue that only affected Intel and AMD and is now fixed. We'll be talking about that one later in the progress report.
Threaded-GAL (which stands for Graphics Abstraction Layer here, in case you were wondering) and HLE macros were merged, and are now also part of the Vulkan branch, improving performance and making it closer to OpenGL on Nvidia, and a lot faster than OpenGL on AMD (on Windows).
Below you will see a lot of what the Vulkan updates fixed. Many games that did not render well, or outright crashed are now rendering much better and are much more playable.
Hyrule Warriors: Age of Calamity
The Legend of Zelda: Skyward Sword HD
The Legend of Zelda Link's Awakening
Rendering on AMD also improved on this title. It looks similar to Intel now (with a better frame rate). Before, the shadows were broken on AMD, but it was rendering better than Intel still, as Intel had another bug making the colors look "dead". It is worth noting that this game only renders black on Intel OpenGL (on Windows), so this is another game that will be made playable on Intel iGPUs thanks to Vulkan.
Add a multithreading layer for the GAL, multi-thread shader compilation at runtime
Nvidia’s driver has its own multithreading, but it can cause significant and unavoidable stutters depending on the game, and it can disable itself whenever it pleases, so it is not very reliable. The custom threading layer gives us much more control over exactly how and in what form these commands are sent to the background thread, which can be optimized further than the Nvidia driver could manage with its more general approach. It also nets us more consistency between all the vendors, since they all use the same code rather than a driver implementation, and it gives a nice performance boost for all GPU vendors. AMD users should notice a massive improvement in performance and Intel users should see some improvement as well, as their Windows drivers do not have any sort of multithreading. Another bonus that comes with this is that Vulkan sees a massive boost in performance across all vendors! Vulkan does not have any kind of driver threading of its own in any implementation; it is always assumed that the user is building and submitting commands in the most efficient way possible, such as in a standalone dispatcher thread. People are used to the performance they get from a GL multithreaded implementation, where the GL commands essentially cost nothing on our FIFO processing thread, so without any multithreading of our own a worse experience is guaranteed.
One of the main benefits of doing our own multithreading is being able to continue throwing draws at the backend without waiting for programs to compile, which allows us to begin multiple shader compilations in parallel at runtime without skipping draws.
If you've been using our shader cache for a while, you might have forgotten how bad the first run is. Please take a look at the videos for a refresher.
This does not eliminate stuttering, due to GLSL compilation being very slow, but it does reduce it significantly when multiple new shaders appear in the same frame (4-5x in best-case scenarios).
Notably, this improvement could extend to SPIR-V (Vulkan) and potentially GP5/GLASM (ARB, throw as many abbreviations as you want at it) shaders with little additional effort, reducing their low compilation times even further.
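To make the idea more concrete, here is a minimal sketch of a threaded command queue with parallel shader compilation. It is written in Python purely for illustration (Ryujinx itself is written in C#), and every name in it (compile_shader, ThreadedBackend, create_program, draw) is hypothetical rather than taken from the emulator. The emulation side only enqueues work; a single background thread executes it, and compilations run on a worker pool so several can be in flight at once.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def compile_shader(source):
    # Stand-in for a slow driver compile; purely illustrative.
    return "compiled:" + source


class ThreadedBackend:
    """Hypothetical command queue consumed by one background thread."""

    def __init__(self, workers=4):
        self._commands = queue.Queue()
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self.executed = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def create_program(self, source):
        # Compilation starts right away on the pool; the caller gets a future.
        return self._pool.submit(compile_shader, source)

    def draw(self, program, name):
        # Queueing is cheap; the draw is deferred, never skipped.
        self._commands.put((program, name))

    def _run(self):
        while True:
            item = self._commands.get()
            if item is None:
                break
            program, name = item
            # Only the background thread waits for the compile to finish.
            self.executed.append((name, program.result()))

    def finish(self):
        self._commands.put(None)
        self._thread.join()
        self._pool.shutdown()
```

In this sketch the thread calling draw never blocks on a compile, which is the property described above: several programs can compile in parallel while draws keep being queued in order.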
Here are some examples of popular games with an empty shader cache (but with PPTC enabled, to remove its influence on results).
This game essentially stops at the moment the camera pans down to the plaza. Very long pauses when loading game levels or in-game effects result in disconnections with LDN, which will happen much less often thanks to the much faster shader compilation.
Monster Hunter Rise Demo
This game has an incredibly long first draw, resulting in the entire title screen song playing before it draws any frames, which could make people think the emulator was entirely broken. A similar pause happens when loading levels for the first time, or attack effects, which tends to cause disconnections on LDN. These should be improved.
Mario Kart 8 Deluxe
This game's shader compile stutters are greatly reduced, in the menu and at the start of the race. The wait at the end of the loading screen is much shorter.
As you can see from the comparisons the stutter is greatly reduced and will be even better once ARB is implemented.
Implemented in #2501 by riperiperi.
Support non-contiguous copies on I2M and DMA engines
Makes more copies respect non-contiguous GPU virtual memory regions, instead of assuming they are contiguous. This also fixes a bug on I2M (Inline-to-Memory, used to send buffer or texture data within a command buffer), where the copy would be incorrect for block linear textures if the destination X coordinate was not a multiple of 16. This was fixed by aligning the start of the vectorized copy, which should start and end at a multiple of 16. We are not aware of any games that use it with an X value that is not a multiple of 16, but it was worth correcting.
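The alignment idea can be sketched as follows. This is an illustrative Python helper, not the emulator's code: the function name split_copy is hypothetical, and it assumes the X coordinate is expressed as a byte offset within a row. It splits a copy into an unaligned head, a body that starts and ends on a multiple of 16, and an unaligned tail.

```python
def split_copy(dst_x, width, vector_size=16):
    """Split [dst_x, dst_x + width) into (head, body, tail) half-open ranges.

    The body starts and ends at a multiple of vector_size; head and tail
    are the leftover bytes that would be copied one at a time.
    """
    end = dst_x + width
    # Round the start up to the next multiple, but never past the end.
    body_start = min(end, -(-dst_x // vector_size) * vector_size)
    # Round the end down to a multiple, but never before body_start.
    body_end = max(body_start, (end // vector_size) * vector_size)
    return (dst_x, body_start), (body_start, body_end), (body_end, end)
```

For example, a 40-byte copy starting at X = 5 would use the vectorized path only for bytes 16 to 32, with the remaining bytes on either side handled individually.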
Implemented in #2473 by gdkchan.
Use a new approach for shader BRX targets
BRX is an indirect branch shader instruction. The branch target offset comes from a register. So the problem here is that we have no way to know where the branch may land, as the offset is unknown at compile time. Before, it would make assumptions about the possible target offsets from the program layout. It assumed that everything after the BRX itself was a potential target, until the target branch of the first branch (the "merge" point). This works reasonably well for simple programs but falls apart for more complex ones.
The new approach makes assumptions about the code the compiler is going to generate for those instructions. It assumes that the BRX is preceded by an LDC instruction (used to load the target offset from the constant buffer), which is preceded by an SHL instruction (used to shift the index left by 2 to get the byte offset), which is in turn preceded by an IMNMX instruction (an integer minimum or maximum instruction, used to enforce that the index is in range). Being aware of that, the shader translator can use pattern matching to find the constant buffer from which the branch target offset is read, as well as the number of possible offsets. This way it can find all the possible targets of the BRX instruction, and then generate the appropriate GLSL code.
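The pattern matching described above might look roughly like this. The sketch is in Python and uses a made-up instruction representation (plain dicts with "op", "limit", "shift", "slot", and "offset" keys); it also assumes the four instructions are directly adjacent and that IMNMX clamps the index to an inclusive upper limit. None of these details come from the actual shader translator.

```python
def find_brx_table(instructions, brx_index):
    """Return (cbuf_slot, cbuf_offset, entry_count), or None if no match.

    Instructions are hypothetical dicts such as
    {"op": "IMNMX", "limit": 3} or {"op": "LDC", "slot": 1, "offset": 0x40}.
    """
    if brx_index < 3 or instructions[brx_index]["op"] != "BRX":
        return None
    imnmx, shl, ldc = instructions[brx_index - 3:brx_index]
    if (imnmx["op"], shl["op"], ldc["op"]) != ("IMNMX", "SHL", "LDC"):
        return None
    if shl.get("shift") != 2:
        return None
    # Assumption: the index is clamped to [0, limit], giving limit + 1 entries.
    return ldc["slot"], ldc["offset"], imnmx["limit"] + 1


def brx_targets(cbuf_words, table):
    """Read every possible branch offset from a constant buffer (list of words)."""
    _slot, offset, count = table
    first = offset // 4  # byte offset to 32-bit word index
    return cbuf_words[first:first + count]
```

Once the table is known, each entry becomes one known branch target, so the translator is no longer guessing from the program layout.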
This fixes issues in many games, as can be seen below.
Hatsune Miku Project DIVA MEGA 39's
We can see multiple issues fixed here. The lights coming from the window are rendered on the Wowaka clip, lights were fixed on Piano Forte Scandal, and the car windshield is no longer opaque, and the visualizer on the Remote Control clip now works. The rendering of several other clips was also improved thanks to this change.
The Legend of Zelda: Link’s Awakening
This one should be pretty easy to notice, the chain chomp is no longer a white ball of light.
Cadence of Hyrule: Crypt of the NecroDancer
The confusing reflections were fixed.
Implemented in #2532 by gdkchan.
Make sure attributes used on subsequent shader stages are initialized
Sometimes, a shader attribute is consumed at a later stage, but the previous stage does not write it. On Nvidia, all attributes that are not written to are initialized with a default value of (0, 0, 0, 1). This is not the case for other vendors, however, and according to the spec, the values on them are "undefined". This causes problems for AMD and Intel as the attributes have different values. The previous approach assumed that all of the first 16 attributes were used, and so initialized the first 16. That's because on desktop, usually only 16 attributes are supported on OpenGL. This causes shaders to fail to compile if the previous stage does not write the attribute at all. This change solves some rendering issues in Hatsune Miku Project DIVA MEGA 39's on Intel and AMD GPUs.
Other games, such as The Legend of Zelda: Link's Awakening also improved on AMD as a result of this change.
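One way to picture the fix is as a set difference between what the next stage reads and what the current stage writes. The Python sketch below is only an illustration under that assumption; the helper names and the out_attrN variable naming are hypothetical, not the translator's real output.

```python
DEFAULT_ATTRIBUTE = (0.0, 0.0, 0.0, 1.0)  # value Nvidia uses for unwritten attributes


def attributes_to_initialize(written_by_stage, read_by_next_stage):
    """Locations the next stage reads that this stage never writes."""
    return sorted(set(read_by_next_stage) - set(written_by_stage))


def emit_initializers(written_by_stage, read_by_next_stage):
    """Produce GLSL-like lines giving those attributes the default value."""
    value = ", ".join(str(c) for c in DEFAULT_ATTRIBUTE)
    return [
        "out_attr%d = vec4(%s);" % (location, value)
        for location in attributes_to_initialize(written_by_stage, read_by_next_stage)
    ]
```

Only attributes that are actually consumed get an initializer, rather than a fixed block of the first 16.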
Implemented in #2538 by gdkchan.
Workaround for Intel FrontFacing built-in variable bug
Intel has a bug, where the gl_FrontFacing built-in shader variable will sometimes have the incorrect value depending on how it is accessed. It was found while we tried to find why Zelda Link's Awakening was not rendering properly on Intel Vulkan. It turns out that the bug also exists on OpenGL, and was affecting the Super Mario Odyssey title screen.
Mario's mustache was not being rendered properly on Intel (Windows), a long-standing issue that is now fixed thanks to our commitment to improving compatibility with Intel and AMD GPUs as part of the Vulkan implementation, which also benefits OpenGL in some cases (such as this one).
Fixed in #2540 by gdkchan.
Use "Undesired" scale mode for certain textures rather than blacklisting
There are many textures where it is possible to scale them, but we don't scale them due to the potential for wasted work (depth of field need not be high resolution) and undesired effects (blur shaders do not behave correctly when scaled, scaling texture atlases results in ugly linear blending blur). This new method gives these types of textures a new scale mode, rather than blacklisting them entirely. This mode is "Undesired", as we don't want to scale the texture, but if it's bound alongside another texture that is scaled, we'd rather scale them both than blacklist and potentially lose resolution scale entirely.
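The decision can be sketched as a small rule over the textures that are bound together. This Python example is a simplification with hypothetical names (ScaleMode, resolve_scale); it assumes that a truly blacklisted texture still forces everything back to native resolution, while an "Undesired" texture simply follows whatever the others do.

```python
from enum import Enum


class ScaleMode(Enum):
    ELIGIBLE = "eligible"        # fine to scale
    UNDESIRED = "undesired"      # prefer not to scale, but can if needed
    BLACKLISTED = "blacklisted"  # cannot be scaled at all


def resolve_scale(modes, requested_scale):
    """Pick one scale factor for a group of textures bound together."""
    if any(m is ScaleMode.BLACKLISTED for m in modes):
        return 1.0  # the whole group loses resolution scaling
    if all(m is ScaleMode.UNDESIRED for m in modes):
        return 1.0  # nobody wants scaling, so skip the wasted work
    return requested_scale  # undesired textures follow the scaled ones
```

Under this rule, an undesired texture bound next to a scaled one is scaled too, instead of dragging the whole group down to 1x.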
As you can see from this screenshot, Bayonetta 2 now scales properly!
Fixed in #2537 by riperiperi.
Implement Shader Instructions SUATOM and SURED
Fixes morph target animation (used for facial animations, eyes), lighting issues on some UE4 games, and anything else using atomic image store.
Below you can see what Bravely Default 2 looked like before this implementation:
The water, lamps, etc, are darker than they should be.
Now, see how it looks with those instructions implemented:
Note that another fix (discussed below) is required to make this scene render properly. The screenshot above includes said fix.
This might fix several other issues UE4 games had.
Implemented in #2090 by riperiperi.
Fix out-of-bounds shader thread shuffle
Fixes some issues with thread shuffles, mainly shuffle up, which was not taking negative thread IDs into account. If the source thread ID is negative after the subtraction by the index, it is not valid (but the generated shader code would previously consider those values valid as well, because the comparison was unsigned, and 0xFFFFFFFF (-1), for example, is >= 0). Ryujinx now always reads the value from the source thread, rather than doing it conditionally, as the latter also causes graphical glitches. This fixes vertex explosions happening in Marvel Ultimate Alliance 3 and flickering lighting in Bravely Default 2.
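The signed versus unsigned difference is easy to reproduce with a few lines of arithmetic. The Python sketch below emulates 32-bit wrap-around by masking; the function name and return layout are made up for illustration and do not mirror the generated shader code.

```python
MASK32 = 0xFFFFFFFF


def shuffle_up_source(thread_id, delta, min_thread_id=0):
    """Return (source_thread, valid_unsigned, valid_signed) for a shuffle up."""
    raw = (thread_id - delta) & MASK32  # 32-bit wrap-around
    # Reinterpret the 32-bit pattern as a signed integer.
    signed = raw - (1 << 32) if raw & 0x80000000 else raw
    valid_unsigned = raw >= min_thread_id   # old check: true even for -1
    valid_signed = signed >= min_thread_id  # fixed check
    return signed, valid_unsigned, valid_signed
```

For thread 1 shuffling up by 2, the source is -1: the unsigned comparison sees 0xFFFFFFFF and accepts it, while the signed comparison correctly rejects it.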
Marvel Ultimate Alliance 3
See how Groot is stretching out at the left. It was usually even worse than this, this is a "lucky" screenshot.
Now rendering as it should. The weird black triangles on the characters were also fixed.
Fixed in #2605 by gdkchan.
Ensure that all threads wait for a read tracking action to complete
This fixes a regression that occurred when allowing tracking actions to release the tracking lock. If multiple threads were to trigger a read action on the same handle at the same time, then one of the threads could continue almost immediately without flushing, as the action would have been consumed on the other. The thread that skipped the action would then read or write old data, after which the action would complete and overwrite it, causing weird issues. This fixes crashes in Catherine Full Body, which started happening after the multithreading changes discussed before.
Fixed in #2597 by riperiperi.
Initial support for shader attribute indexing
Missing shader instructions were causing graphical issues in a few games. Namely, the AL2P and ALD.P instructions, used to perform indexed attribute access, were not implemented. These instructions allow indexing into an array input attribute.
This improves the rendering of some levels on Donkey Kong Country: Tropical Freeze.
Note that everything being rendered as a silhouette on this level is an artistic choice and how the game is supposed to look. We're not aware of any other graphical issue on this game, so it should be fully playable now.
Another game improved by this change was DC Super Hero Girls, where the textures of some buildings are no longer solid black.
Thanks to kakasita for testing this one.
Fixed in #2546 by gdkchan.
Enable transform feedback buffer flush
Transform feedback buffers can be modified by the GPU, so they must be written back to guest memory if they have been modified and are being accessed from the CPU. This fixes vertex explosions happening in SNK Heroines Tag Team Frenzy.
Fixed in #2552 by gdkchan.
Fix GetHandleInformation for mipmapped 3D textures
While trying to fix some OpenGL errors, contributor mpnico discovered that the "GetHandleInformation" function for mipmapped 3D textures was done the wrong way around, which was causing games to try to access data from nonexistent mipmap levels. This might have caused minor issues in various games.
Fixed in #2569 by riperiperi.
Remove pool cache entries for incompatible overlapping textures
Xenoblade, UE3, and UE4/Unity games had been known to use an excessive amount of memory due to them not deleting old textures. If a texture being overlapped by a new sample has been modified by CPU, then its modified data is likely destined for the new texture being sampled, and the old texture should be deleted as it contains stale data and likely won't be used for some time. This change does exactly that, removes old textures with stale data, greatly reducing memory usage on those games (mainly Xenoblade), and giving overall more stable performance due to memory usage being much lower.
Fixed in #2568 by riperiperi.
Change disabled vertex attribute value to (0, 0, 0, 1)
The way you define the format and location of each "value" on the vertex buffer is with vertex attributes. A vertex attribute can be constant. Those constant attributes do not exist on the vertex buffer, the shader just reads the same value every time, over and over, for every vertex. The original value of the constant attributes was (0, 0, 0, 0) and has been changed to (0, 0, 0, 1) as that is what the Switch’s GPU does.
This fixes a regression on Super Mario Odyssey, causing some plants to render black on the Wooded Kingdom.
Fixed in #2573 by riperiperi.
Avoid deleting textures when their data do not overlap
Textures are removed from the cache when their data overlap in memory, to avoid having multiple versions of the same texture in the cache (which causes issues and also wastes VRAM). There was an issue where textures were assumed to overlap when they don't because there are gaps in the data. The gaps are caused by the fact that only a sub-range of the texture mipmap levels are used. The incorrect removal of textures that don't overlap caused data loss, which would in turn cause it to load garbage from memory when the new texture was created. This fixes UE4 games that had incorrect lighting or just had a white screen.
This also fixed shadows in Yoshi's Crafted World, which were too dark before and now render the same as on the Switch.
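The difference between the coarse overlap test and one that respects gaps can be shown in a few lines. In this Python sketch each texture is described by a list of (start, size) byte ranges that it really uses, which is an assumed representation for illustration only; the function names are hypothetical.

```python
def bounding_overlap(a, b):
    """Coarse test: compare only the first start and the last end."""
    a_lo, a_hi = a[0][0], a[-1][0] + a[-1][1]
    b_lo, b_hi = b[0][0], b[-1][0] + b[-1][1]
    return a_lo < b_hi and b_lo < a_hi


def ranges_overlap(a, b):
    """Precise test: true only if some used range of a touches one of b."""
    for a_start, a_size in a:
        for b_start, b_size in b:
            if a_start < b_start + b_size and b_start < a_start + a_size:
                return True
    return False
```

A texture that uses two mip levels with a gap between them, and another texture living entirely inside that gap, overlap according to the coarse test but not according to the precise one, so the second texture should not be deleted.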
Fixed in #2601 by riperiperi.
Swap BGR components for 16-bit BGR texture formats
OpenGL does not support BGRA formats, so on OpenGL, we need to use an RGBA format and then swap the components when the texture is written or copied. This was already being done for some BGRA formats, but not for some BGRA and BGR formats (in particular, the packed 16-bit ones).
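For a packed 16-bit format, swapping the components means exchanging bit fields inside each pixel. The Python sketch below assumes a 5-6-5 layout as one example of such a format; the function name is hypothetical, and the formats actually handled by the change may be laid out differently.

```python
def swap_r_b_565(pixel):
    """Swap the two 5-bit outer fields of a packed 5-6-5 16-bit pixel."""
    high = (pixel >> 11) & 0x1F   # top 5 bits
    green = (pixel >> 5) & 0x3F   # middle 6 bits, left untouched
    low = pixel & 0x1F            # bottom 5 bits
    return (low << 11) | (green << 5) | high
```

Applying the swap twice returns the original pixel, and a pure green pixel is unaffected, which makes the helper easy to sanity-check.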
This fixes the red and blue components being swapped in a few games, such as Pokkén Tournament DX, which used to look like this:
Fixed in #2567 by gdkchan.
Reduce JIT GC allocations
The "POWER" update that we released a few months ago brought several improvements, and one of them was a reduction in PPTC compilation time. This change reduces it even further, with some of our testers observing the compilation time being halved in some games.
This change modifies some data structures that the emulator JIT compiler uses internally to store the game code intermediate representation (IR). Before, every operation and operand would need to create an object, which would incur a memory allocation and all the object creation cost. On top of that, the object has to be tracked by the garbage collector (GC) and eventually "collected", to allow the memory to be reused by something else when the object is no longer needed. All that has a cost and was slowing down the compilation process. Now, both operations and operands use a struct, which has the advantage of being faster to allocate and not being tracked by the garbage collector. Those structs only contain a pointer to the actual data, which is allocated using a new arena allocator. An arena allocator allocates all (or most) of the memory that will be required up-front and then hands out smaller chunks of this memory when needed. When the compilation process ends, all memory is freed at once. This has the advantage of being very fast, as all the required memory was pre-allocated, instead of allocating small chunks all the time, and it also improves memory locality, as all the data is on the same memory chunk, which further improves the performance as it reduces CPU cache misses.
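A minimal arena (or "bump") allocator can be sketched as below. This is a Python illustration of the general technique, not the JIT's implementation: the Arena class, its method names, and the 8-byte default alignment are all assumptions made for the example.

```python
class Arena:
    """Bump allocator over one pre-allocated buffer (illustrative only)."""

    def __init__(self, capacity):
        self._buffer = bytearray(capacity)  # all memory reserved up-front
        self._offset = 0

    def allocate(self, size, align=8):
        # Round the current offset up to the requested alignment.
        start = (self._offset + align - 1) // align * align
        if start + size > len(self._buffer):
            raise MemoryError("arena exhausted")
        self._offset = start + size
        # Hand out a view into the shared buffer; no new allocation happens.
        return memoryview(self._buffer)[start:start + size]

    def reset(self):
        # Frees everything at once; there is no per-object bookkeeping.
        self._offset = 0

    @property
    def used(self):
        return self._offset
```

Allocation is just an addition and a bounds check, and releasing everything at the end of a compilation is a single assignment, which is where the speed-up over many small tracked objects comes from.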
Implemented in #2515 by FICTURE7.
Implement MSR instruction for A32
This implements the MSR apsr_nzcvq Arm32 instruction, required by the Pocket Rumble game, which is now playable.
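In effect, this instruction copies the N, Z, C, V, and Q flag bits (bits 31 to 27) from a general-purpose register into the status register and leaves the other bits alone. The Python sketch below illustrates that bit manipulation only; the function name is hypothetical and it does not model how the emulator's JIT emits code for it.

```python
NZCVQ_MASK = 0xF8000000  # N = bit 31, Z = 30, C = 29, V = 28, Q = 27


def msr_apsr_nzcvq(apsr, rn):
    """Return the new APSR after writing the flag bits from register rn."""
    # Keep every non-flag bit of the old value, take the flag bits from rn.
    return ((apsr & ~NZCVQ_MASK) | (rn & NZCVQ_MASK)) & 0xFFFFFFFF
```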
Implemented in #2585 by thog.
Update TamperMachine and disable write-to-code prevention
This improves how Ryujinx reports cheats and also updates cheat instructions to match Atmosphere’s implementation.
Implemented in #2506 by Caian.
While debugging a separate issue, jduncanator noticed that there was a need to pass in whether the command is a TIPC command or a HIPC command to the exception constructor. The code was refactored to no longer make this necessary.
Fixed in #2535 by jduncanator.
Update to LibHac 0.13.1
LibHac is a .NET library that reimplements some parts of the Nintendo Switch operating system, also known as Horizon OS. Ryujinx uses LibHac for its file system. This updates the LibHac dependency to version 0.13.1, which brings many improvements to Ryujinx’s file system. It makes the emulator all the more accurate while also allowing some games to boot that didn’t before.
Implemented in #2475 by Thealexbarney.
Rumble emulation was a long-requested feature. A few attempts had been made before, but there were too many hiccups with each implementation. After a lot of work and adapting it for SDL2, rumble now works on controllers that support it. (It should be noted that HD rumble is not yet implemented, because SDL2 does not support it.)
Implemented in #2468 by mpnico.
Seeing if there are any other spelling errors to correct
Sometimes there are grammatical or spelling errors in someone’s code. Contributor Mou-ikkai has been cleaning up these errors, making the code easier to read.
Fixed in #2572 by Mou-ikkai.
Hide UI rework/arrow key fix
Users had reported an issue where pressing the right and left arrow keys would bring up a file menu in-game. This fixes that issue and also reworks the entire flow of how the "Show UI" hotkey works. Since we no longer need menu bar access for this, the key is now configurable through the config.json.
Fixed in #2504 by ooa113y.
Preliminary work on ARB has begun. Right now very few games work with ARB, but rest assured we are working hard to make ARB work in all games as well as it can. Below you’ll see some examples of ARB in its infancy.
As you can see, it’s not perfect right now, but we’re working hard to make sure everything works correctly.
We would like to thank everyone who has contributed to the emulator so far whether it was through Patreon, reporting bugs, or code contributions. You all have made this project what it is today!