Progress Report October 2021

The spooky month of October brought some amazing releases like Metroid Dread, Mario Party Superstars, and Fatal Frame: Maiden of Black Water, all of which worked on day one thanks to the absolute avalanche of graphical bug fixes for these wonderful new games and some incredible kernel improvements across the board!

Patreon Goals

Amiibo Emulation - merged into the main build in March 2021.

Some new amiibo were added this month! Check below for more details. While compatibility is now almost perfect, there are still some improvements to come for Amiibo, which can be tracked on the associated GitHub issue here: https://github.com/Ryujinx/Ryujinx/issues/2122

Custom User Profiles - merged into the main build in April 2021.

Vulkan GPU Backend - still in progress, a public test build is delivered. A lot is being worked on.

ARB Shaders - Goal reached in April 2021. As seen from August's progress report, preliminary work on ARB shaders has begun.

ARB shaders will further reduce stuttering on the first run by improving the shader compilation speed on NVIDIA GPUs using the OpenGL API.

$2000/month - Texture Packs / Replacement Capabilities - Almost there!

This will facilitate the replacement of in-game graphics textures which enables custom texture enhancements, alternate controller button graphics, and more.

ETA once the goal is reached: ~3-4 weeks

$2500/month - One full-time developer - Not yet met

This amount of monthly donations will allow the project's founder, gdkchan, to work full-time on developing Ryujinx.

$5000/month - Additional full-time developer - Not yet met

This amount of monthly donations will allow an additional Ryujinx team developer to work full-time on the project.

So now we’re done with that, let’s get started with this month's progress:

Vulkan progress

Work on the shader tester mentioned in the previous progress report has begun. Currently, it is being used to implement and test a few missing shader instructions (such as the double-precision instructions) and to fix a few existing bugs. For example, one bug that prevented the game World War Z from progressing into menus was fixed, allowing the title to boot further. Soon, the shader tester will also be used to test SPIR-V and ARB shaders, once we confirm that they all pass testing in the GLSL backend. The tester and those fixes will be discussed in more detail in the next progress report.

As usual, the Vulkan branch was rebased to include the latest fixes on the master branch, which includes several fixes and performance improvements, allowing more games to be played. Some regressions have also been fixed. For more information, refer to our changelog.

One of the bullet points in the "pending work" list was implemented: support for image format aliasing. This is required by some games that use image load/store in their shaders, as they might require the image to be bound with a format that is different from the one the base image has. This fixes the ground rendering in Xenoblade Chronicles Definitive Edition, for example.
Before:

After:

On some games using BGRA textures, when using the OpenGL backend on the Vulkan build, they would have incorrect colors. This has also been fixed.

GPU

Rewrite shader decoding stage

This changes the way shaders are decoded on the emulator. The new method is not only more efficient, but also less error-prone. The shader decoding process consists of reading a value from memory (known as an opcode) and then working out which operation must be performed from the information encoded in this value. Doing that requires knowing which values correspond to which instructions, and initially, we gathered this information from the disassembler in Nouveau (the open-source Linux driver) or from NVIDIA's disassembler, nvdisasm, which is available as part of the CUDA development kit.

The main problem with the NVIDIA disassembler is that it is not open-source, so to view the instruction from its encoded value, we need to have a valid shader and pass it to the tool to view the disassembly output. This is pretty time-consuming and not very efficient to do manually, so we created a script that automatically creates shaders with several different values, passes them to the tool, and sees which instruction comes out on the disassembly. From this information, it can auto-generate tables and structures that can be used to decode shaders on the emulator.
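
To give a rough idea of what table-driven decoding looks like, here is a minimal sketch. The 16-bit opcodes, masks, and instruction names are made up for illustration (real GPU instructions are wider and the real tables are auto-generated), and `decode` is a hypothetical helper, not the emulator's code.

```python
# Each entry is (mask, pattern, name): an opcode matches when the bits
# selected by the mask equal the pattern. All encodings here are invented.
DECODE_TABLE = [
    (0xFF00, 0x1200, "OP_A"),
    (0xFFF0, 0x1230, "OP_A_VARIANT"),
    (0xF000, 0x4000, "OP_B"),
]


def decode(opcode):
    """Return the name of the most specific matching entry, or None."""
    best = None
    for mask, pattern, name in DECODE_TABLE:
        if (opcode & mask) == pattern:
            # Prefer the entry that constrains the most bits.
            if best is None or bin(mask).count("1") > bin(best[0]).count("1"):
                best = (mask, name)
    return best[1] if best else None


if __name__ == "__main__":
    for value in (0x1234, 0x12FF, 0x4ABC, 0x9000):
        print(hex(value), decode(value))
```

With tables like this, supporting a new instruction is a matter of adding an entry rather than hand-writing bit checks, which is where the "less error-prone" part comes from.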

So you might be wondering which benefits this change brought. First, now we can decode all the shader instructions that the Switch GPU supports. That means when those instructions are implemented on the emulator, we will have less work to do as the decoding part is already done. Second, a few oversights of the old decoder have been corrected. One of them was the wrong bit being read for one of the bindless textures with offset instructions, which was causing issues on some Unreal Engine 4 games like JUMP FORCE Deluxe Edition. See below for a comparison.

Before:

After:

Notice how the character's skin and hair have the correct tone now. It also fixed other issues not visible in the screenshot, like "shaky" pixels.

Implemented by gdkchan in #2698.

Smaller initial size for BufferModifiedRangeList & directly inherit backing array

This fixed a potential regression with the old range list changes, where the cost for creating new ones would be rather large due to creating a 1024 size array. It also reduces the cost for range list inheritance by using the first existing range list as a base, rather than creating a new one then adding both lists to it. The growth size for the RangeList is now identical to its initial size. The Unmapped and SyncMethod methods have also been changed to ensure that they behave properly if the range list is set to null. This improves performance in a few games.

Implemented by riperiperi in #2663.

Relax sampler pool requirement

Before, the emulator printed an error and exited early if an attempt was made to use textures without having a sampler pool that was currently bound. That was because accessing a texture without a sampler is usually not valid. But there is one case where the sampler is not needed, which is when textures are accessed with texel fetch, as those are not filtered in any way. This was usually not a problem, because it's not common for a game to only ever use texel fetch, but it turns out the Cotton/Guardian Saturn tribute games compilation does this. This change allows the texture to be bound without a sampler, which improves rendering on this title.

Before:

After:

Better, but it still has issues. The remaining issue is due to a missing shader instruction; we'll talk more about this one later.

Implemented by gdkchan in #2703.

Don't force scaling on 2D copy sources

GameMaker Studio games build texture atlases out of sprites during initialization, using the 2D copy method. These copies are done from textures loaded into memory, not rendered, so they are not scaled to begin with. The source textures in these copies were previously set to force scaling, but a source really only needs to be scaled if the texture already exists and was scaled by rendering or something else. Forced scaling is now set to false, so the copy no longer changes whether the texture is scaled or not. This also avoids the destination being scaled if the source wasn't. The copy can handle mismatching scales just fine. This prevents scaling artifacts in GameMaker Studio games and likely others.

Before:

After:

Implemented by riperiperi in #2701.

Enqueue frame before signaling the frame is ready.

Link's Awakening and Xenoblade DE had their fences reached already when posting framebuffers, so the signal that a frame was ready would go out before the frame was enqueued, and the render loop would fail to dequeue anything and "skip" a frame. This resulted in their performance lowering dramatically after some loading transitions, as a frame signal would be consumed and presentation would be one frame behind. Xenoblade would seem to cap at 60% FIFO, and Link's Awakening would run at 30fps or worse. Reordering this seems to fix both.
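
The ordering problem can be modelled with a tiny single-threaded simulation. This is only a sketch under simplifying assumptions: the `Presenter` class and its methods are hypothetical, and the render loop is assumed to react to the signal immediately, which is the worst-case interleaving described above.

```python
from collections import deque


class Presenter:
    """Toy model of a frame queue plus a render loop woken by a signal."""

    def __init__(self):
        self.queue = deque()
        self.presented = []
        self.skipped = 0

    def on_frame_ready(self):
        # The render loop wakes up and tries to dequeue one frame.
        if self.queue:
            self.presented.append(self.queue.popleft())
        else:
            self.skipped += 1  # the signal was consumed for nothing

    def post_signal_first(self, frame):
        # Old order: the signal goes out before the frame is enqueued.
        self.on_frame_ready()
        self.queue.append(frame)

    def post_enqueue_first(self, frame):
        # New order: the frame is enqueued before the signal goes out.
        self.queue.append(frame)
        self.on_frame_ready()


if __name__ == "__main__":
    old, new = Presenter(), Presenter()
    for frame in (1, 2, 3):
        old.post_signal_first(frame)
        new.post_enqueue_first(frame)
    print(old.presented, old.skipped)
    print(new.presented, new.skipped)
```

In the signal-first order, the first signal is wasted and every later frame is shown one post late, which matches the "one frame behind" behaviour.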

Implemented by riperiperi in #2722.

Force index buffer update for games using Vulkan

Some games that use the Vulkan API on the Nintendo Switch previously had an issue on Ryujinx when the Vulkan draw methods were used. In games that do multiple consecutive draws with different ranges of the index buffer, the emulator was not updating the index buffer range used, so the draw would not render anything, as it would try to access a range of the index buffer that does not exist.

This was fixed by forcing the index buffer range to update on the draw methods used by the Vulkan API on the Switch. On its own, this change has no known visible effects, but when combined with the change below, it allows the game Hades to render correctly.

Implemented by gdkchan in #2726.

Extend bindless elimination to work with masked and shifted handles

Bindless textures were already discussed quite a bit on previous progress reports, so we won't be going into too much detail about what it is this time, and focus more on what changed.

First, Hades uses shaders that perform a bindless access with a handle that comes from a constant buffer. Nothing out of the ordinary here, and this case would be handled by the existing bindless elimination. The difference is that this time, the shader has more operations to ensure that the texture handle value is valid and in range. Supporting this case was not difficult either: we just had to extend our bindless elimination to recognize those extra operations.
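
As a loose illustration of "recognizing those extra operations", the sketch below walks a toy expression tree and peels off mask and shift operations until it reaches a constant buffer load. The tuple-based node format and the `trace_handle` helper are assumptions made up for this example and do not reflect the emulator's real intermediate representation.

```python
# Toy expression nodes (hypothetical):
#   ("cbuf", slot, offset)      load from a constant buffer
#   ("and" | "shr" | "shl", expr, constant)
#   ("reg", name)               a value we know nothing about


def trace_handle(expr):
    """Return (slot, offset, ops) if expr is a masked/shifted cbuf load."""
    ops = []
    while expr[0] in ("and", "shr", "shl"):
        op, inner, constant = expr
        if not isinstance(constant, int):
            return None  # only constant masks and shift amounts are handled
        ops.append((op, constant))
        expr = inner
    if expr[0] != "cbuf":
        return None  # the handle does not come from a constant buffer
    _, slot, offset = expr
    # Operations are returned innermost first, in the order they apply.
    return slot, offset, list(reversed(ops))


if __name__ == "__main__":
    handle = ("and", ("shr", ("cbuf", 2, 0x10), 4), 0xFFFFF)
    print(trace_handle(handle))
    print(trace_handle(("and", ("reg", "r0"), 0xFF)))
```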

This allows Hades to render, instead of being just a black screen.

Another game with bindless texture issues was The Witcher 3. While not the same case as Hades, the case this game uses, which combines the texture and sampler handles differently, was also pretty easy to handle. The change allowed this game to render for the first time; before, it was just a black screen.

Implemented by gdkchan in #2727.

Implement SHF (funnel shift) shader instruction

This implements the SHF (funnel shift) shader instruction, required by the Cotton Saturn Tribute games compilation. This instruction shifts a 64-bit value composed of 2 registers and returns the upper (for the left shift) or the lower (for the right shift) half of the 64-bit result.
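
The description above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the shift amount is assumed to be in the range 0 to 31, and any extra modes the real instruction may have are ignored.

```python
MASK32 = 0xFFFFFFFF


def funnel_shift_left(low, high, shift):
    """Shift the 64-bit value high:low left, return the upper 32 bits."""
    value = ((high & MASK32) << 32) | (low & MASK32)
    return ((value << shift) >> 32) & MASK32


def funnel_shift_right(low, high, shift):
    """Shift the 64-bit value high:low right, return the lower 32 bits."""
    value = ((high & MASK32) << 32) | (low & MASK32)
    return (value >> shift) & MASK32


if __name__ == "__main__":
    # The top bit of the low register is carried into the result.
    print(hex(funnel_shift_left(0x80000000, 0x00000001, 1)))
    # The bottom bit of the high register is carried into the result.
    print(hex(funnel_shift_right(0x00000000, 0x00000001, 1)))
```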

As we mentioned earlier, those games were not rendering correctly, even after the sampler pool fix. With this change, they now render as they should.

One interesting note about this game compilation is that it uses a Sega Saturn emulator, so we're effectively doing double emulation here.

Implemented by gdkchan in #2702.

Initial tessellation shader support

Luigi’s Mansion 3 has a sand room that wouldn’t render correctly on Ryujinx due to the emulator missing tessellation shader support. This adds support for tessellation shaders (the control and evaluation stages, also known as hull and domain), which is the only shader type that was not yet supported. Most of the work here was just adding declarations that are specific to those stages, and also improving the implementation of a few other instructions.

Luigi’s Mansion 3’s sand room now renders correctly.

Before:

After:

Implemented by gdkchan in #2534.

Workaround for NVIDIA driver 496.13 shader bug

NVIDIA's recent driver updates had caused some major graphical issues in many games. This happened because there's an issue with assigning variables with the "precise" qualifier to negated expressions on the new driver. So, doing -x does not work on the new driver, while 0.0 - x does (both are supposed to be equivalent). This will be removed once the issue is resolved on NVIDIA’s side.

This fixes a variety of issues in several games.

Before:

After:

It is worth noting that those issues only started happening on this driver version, so it is not an emulator issue or regression.

Fixed by riperiperi in #2750.

Fix shader 8-bit and 16-bit STS/STG

The emulator uses an unsigned integer buffer for the global memory that is accessed on the shaders. That means that the buffer can only be accessed 32-bits at a time. This is a problem when we need to access shorter values, like 8-bit or 16-bit values. To perform a 16-bit store, for example, we have to do a partial update of the 32-bit value and change either the lower or the higher 16-bit half. So basically, we do 3 operations: load the 32-bit value, partially modify this value inserting the new value, and then store the 32-bit value back. The problem is that on the GPU, invocations happen in parallel, so multiple invocations might be trying to modify this value at the same time, which is a problem.

To make this work, the store is performed using an atomic compare and swap operation. Atomic here means "indivisible", which means that it can be considered a single operation that does not have any intermediate result visible by other invocations. First, it loads the current value, inserts the new value into it, and then performs the compare and swap. If the value in memory is equal to the "current value" we loaded earlier, then no modification was made since we loaded the value, and we can safely just store the modified value. Otherwise, we need to start over as the memory has been modified.
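
The retry loop described above can be sketched as follows. This is a CPU-side model, not shader code: `WordMemory` is a hypothetical stand-in for the 32-bit buffer, its lock only exists to make the compare and swap atomic in Python, and little-endian placement of the 16-bit halves is assumed.

```python
import threading


class WordMemory:
    """A buffer that can only be accessed 32 bits at a time."""

    def __init__(self, word_count):
        self._words = [0] * word_count
        self._lock = threading.Lock()

    def load(self, index):
        return self._words[index]

    def compare_and_swap(self, index, expected, new):
        # Atomically: write `new` only if the word still equals `expected`.
        # Always returns the value that was in memory before the call.
        with self._lock:
            old = self._words[index]
            if old == expected:
                self._words[index] = new
            return old


def store16(memory, byte_offset, value):
    """Store a 16-bit value at a 2-byte aligned offset using a CAS loop."""
    index = byte_offset // 4
    shift = (byte_offset % 4) * 8  # 0 for the lower half, 16 for the upper
    mask = 0xFFFF << shift
    while True:
        current = memory.load(index)
        updated = (current & ~mask & 0xFFFFFFFF) | ((value & 0xFFFF) << shift)
        if memory.compare_and_swap(index, current, updated) == current:
            return  # nobody modified the word in between, the store is done
        # Otherwise another invocation got there first: start over.


if __name__ == "__main__":
    memory = WordMemory(1)
    store16(memory, 0, 0x1234)
    store16(memory, 2, 0xABCD)
    print(hex(memory.load(0)))
```

Two invocations writing different halves of the same word can now never overwrite each other's half, because a store only lands if the word is unchanged since it was read.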

This fixes the broken interior lighting in The Witcher 3 making it render much better.

Before:

After:

In the "before" screenshot, notice the weird squares on the character's hair, and that the woman on the bottom left is too dark.

Fixed by gdkchan in #2741.

Preserve image types for shader bindless surface instructions (.D variants)

This fixes a small oversight, where shaders could use the wrong format for bindless image accesses. There are 2 types of image access on the Switch GPU: sized and formatted. A sized access simply loads a given amount of data from the image, like 32 or 64 bits, without caring about the format. A formatted access, on the other hand, loads each component to a separate register and performs the required conversions depending on the format.

The bug affected the sized access. Since the format shouldn't matter here, the correct thing to do is assign to the image a format matching the access size. For example, for 64-bit access, it would assign a rg32ui (32-bit of red, and 32-bit of green) to the image, which is a total of 64. The oversight was that it was replacing this format with the actual image format during the bindless elimination process, which is incorrect in this case.

This was found while debugging other issues on Clubhouse Games 51. We are not sure how the bug impacted this title, but it is worth fixing nonetheless.

Fixed by gdkchan in #2779.

Add support for fragment shader interlock

As mentioned before, GPUs work with several "invocations". On a fragment shader, for example, each one of those invocations runs in parallel and is responsible for computing the color of each pixel on the output image that is eventually presented on the screen. The high parallelism is very good for performance, as you have several operations happening at once, but it also means that there are no guarantees about the order of operations or when they will be complete.

An easier way to see the problem is with tasks. For example, let's say there is a library with a large pile of books. Those books are sorted in alphabetical order, and a group of people is asked to put them on shelves. Without further instructions, they would just place them at random, not knowing that they should be sorted in a particular way on those shelves. If you repeated the task 10 times, most likely they would be in a completely different order each one of those times. Now, if you instructed those people to place the books on the shelves in alphabetical order, they would do so, and even if the task was repeated 10 times, the result would be the same, as they would now coordinate their efforts to ensure the books are properly sorted.

The same problem can happen on the GPU. The invocations are happening in parallel, there are no guarantees about which one will finish first, or the order they will happen at. Usually, this is fine, as the order doesn't matter most of the time. But depending on the operation that is being done on the fragment shader, the order might matter. So how can you ensure that the invocations happen in correct and consistent order? The answer is fragment shader interlock. This is like telling the GPU that you want the invocations inside a given region to be ordered, much like telling the people that the books should be sorted alphabetically on the example above. It ensures that all invocations for overlapping pixels (at the same screen position) are properly ordered.

The lack of fragment shader interlock usually causes tile flickering. If you recall the previous example, the reason should be clear at this point. No coordination means the order is completely random, and the final results change each time, which causes flickering on the image.
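
A small numeric example shows why the order matters. It does not model the interlock itself, only an order-dependent operation (a simple alpha blend) applied to two overlapping fragments. The `blend` and `shade_pixel` helpers and the colour values are made up for illustration.

```python
def blend(destination, source, alpha):
    """Blend a source value over the current destination value."""
    return source * alpha + destination * (1.0 - alpha)


def shade_pixel(fragments, background=0.0):
    """Apply overlapping fragments to one pixel, in the order given."""
    color = background
    for source, alpha in fragments:
        color = blend(color, source, alpha)
    return color


if __name__ == "__main__":
    white = (1.0, 0.5)  # (brightness, alpha)
    black = (0.0, 0.5)
    print(shade_pixel([white, black]))  # white drawn first
    print(shade_pixel([black, white]))  # black drawn first
```

The same two fragments give 0.25 in one order and 0.5 in the other. If the order changes from frame to frame, the pixel alternates between the two values, which is the flickering.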

In the NVIDIA shaders, the interlock begin and end operations are implemented as calls to some NVIDIA-specific functions in the shader. We had to implement pattern recognition to find those functions and replace calls to them with regular calls to the interlock extension's begin and end functions. Implementing it any other way is not possible, since those functions use hardware-specific registers that are not exposed by high-level languages such as GLSL (OpenGL Shading Language).

This fixes flickering lights in the "It's the Pits" mini-game in Super Mario Party. Other parts of the game with a similar glitch could also be affected.

Before:

After:

One thing that should be noted is that the vendor support for the fragment shader interlock extension is hit or miss, with AMD being completely absent. On OpenGL, AMD does support the Intel fragment shader ordering, which does the same thing as the interlock extension, so we use it if available. Most cards do not support it though, and on Vulkan, AMD has no support for it at all. We plan to look at different methods to implement this on the drivers that don't support the extension, but doing so in a performant manner without hardware and driver support is very difficult.

Implemented by gdkchan in #2768.

CPU

Add Operand.Label support to Assembler

This improves the JIT-generated code when PPTC is enabled. Before, all jumps used a 32-bit offset when it was enabled, to make getting the relocation offsets easier, since knowing whether a jump offset can be encoded in 8 bits requires generating the code first. The PR changes how this is handled and enables 8-bit jumps with PPTC enabled too (previously they were only used with PPTC disabled). This makes the code a little more compact, which in turn means slightly lower memory usage and slightly smaller PPTC caches on disk.
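
For a sense of the size difference, here is a minimal sketch of the two x86 encodings of an unconditional jump. The `encode_jmp` helper is hypothetical and takes the displacement relative to the end of the instruction as already known, which sidesteps the very problem described above: in a real assembler the displacement depends on the sizes of the instructions in between.

```python
import struct


def encode_jmp(displacement):
    """Encode an x86 JMP with the shortest form that fits the displacement."""
    if -128 <= displacement <= 127:
        # JMP rel8: opcode 0xEB plus a signed 8-bit offset, 2 bytes total.
        return b"\xEB" + struct.pack("<b", displacement)
    # JMP rel32: opcode 0xE9 plus a signed 32-bit offset, 5 bytes total.
    return b"\xE9" + struct.pack("<i", displacement)


if __name__ == "__main__":
    print(encode_jmp(5).hex(), len(encode_jmp(5)))
    print(encode_jmp(1000).hex(), len(encode_jmp(1000)))
```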

Implemented by FICTURE7 in #2680.

Optimize LSRA

This optimizes the register allocator. LSRA stands for "Linear Scan Register Allocator", which is a type of register allocator commonly used in JITs because it is fast while still producing decent results. Register allocation is the process of assigning an unlimited number of variables to a fixed set of registers on a given CPU architecture. On x86, you have about 16 registers (a bit fewer actually, as some have a fixed purpose and can't be used as general-purpose registers), while Arm64 has about 32 (again, a bit fewer, since that includes registers like the stack pointer which can't be used for other purposes). This process is necessary to "map" the 32 Arm registers to the 16 registers on x86.
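
For readers curious about the general idea, below is a compact sketch of the textbook linear scan algorithm. It is not the emulator's allocator: the `linear_scan` function, the interval format, and the spill policy (spill whichever interval ends last) are simplifying assumptions for illustration.

```python
def linear_scan(intervals, register_count):
    """Assign registers to live intervals given as {name: (start, end)}.

    Returns {name: register index}, with None for spilled variables.
    """
    free = list(range(register_count))
    active = []  # (end, name) pairs currently holding a register
    assignment = {}
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release the registers of intervals that ended before this one starts.
        active.sort()
        while active and active[0][0] < start:
            _, expired = active.pop(0)
            free.append(assignment[expired])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
            continue
        # No register left: spill whichever interval ends last.
        last_end, last_name = active[-1]
        if last_end > end:
            assignment[name] = assignment[last_name]
            assignment[last_name] = None
            active[-1] = (end, name)
        else:
            assignment[name] = None
    return assignment


if __name__ == "__main__":
    print(linear_scan({"a": (0, 4), "b": (1, 2), "c": (3, 6)}, 2))
    print(linear_scan({"a": (0, 10), "b": (1, 3), "c": (2, 4)}, 2))
```

In the first call, "c" reuses the register that "b" no longer needs. In the second, all three variables are live at once with only two registers, so the long-lived "a" is spilled.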

The change makes the register allocation process faster by optimizing the allocator code. The benefits are faster PPTC rebuilds (as all the functions have to be recompiled) and fewer stutters caused by JIT compilation (which occur if the user has no PPTC cache or PPTC is disabled, and in games that load NRO code dynamically at runtime, such as Super Smash Bros Ultimate).

Implemented by FICTURE7 in #2563.

Add an early TailMerge pass

This merges the epilogues and returns in the code generated by the CPU JIT. At every point where the function returns, it needs to generate something called an "epilogue" that restores the CPU registers to the state they were in before the function was called, as mandated by the ABI (Application Binary Interface). This is necessary to meet the expectations of the caller when the code returns.

The change makes the code jump to a single location with the epilogue and return, instead of generating that code on every single return point. The benefit of this is that the JIT-generated code size is smaller, so slightly lower memory usage, and slightly lower disk usage by the PPTC cache.
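
The effect on code size can be shown with a toy model where instructions are just strings. Everything here is a simplified illustration: the instruction names, the `exit` label, and both helper functions are made up, and counting instructions is only a rough proxy for the byte size of real machine code.

```python
def emit_without_merge(body, epilogue):
    """Copy the epilogue in front of every return."""
    output = []
    for instruction in body:
        if instruction == "ret":
            output.extend(epilogue)
        output.append(instruction)
    return output


def emit_with_tail_merge(body, epilogue):
    """Turn every return into a jump to one shared epilogue and return."""
    merged = ["jmp exit" if instruction == "ret" else instruction
              for instruction in body]
    return merged + ["exit:"] + epilogue + ["ret"]


if __name__ == "__main__":
    body = ["test", "je a", "ret", "a:", "test", "je b", "ret", "b:", "ret"]
    epilogue = ["add rsp, 32", "pop rbx", "pop rbp"]
    print(len(emit_without_merge(body, epilogue)))
    print(len(emit_with_tail_merge(body, epilogue)))
```

The saving grows with the number of return points and the length of the epilogue.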

Implemented by FICTURE7 in #2721.

HLE

Amiibo API updates

The new Metroid Dread Amiibo (Samus and E.M.M.I) have been added to the Amiibo API. Use them to your heart's content!

Fix DisplayInfo struct

This fixes a regression that would cause Dragon Ball Xenoverse 2 to no longer boot, as it would pass an invalid size of 0 to surface flinger initialization, which would later cause other failures. The error was caused by the DisplayInfo structure size being incorrect, a regression caused by the recent change to support multiple resolutions on this service, mentioned in the previous progress report.

Fixed by gdkchan in #2708.

Added support for Pixel Format X8B8G8R8

Metroid Dread’s title screen uses a pixel format that Ryujinx did not support, as we don't know of any other game that uses it. The format is now supported, which makes the title screen render correctly.

Before:

After:

Implemented by C1fer in #2716.

Inline software keyboard without input pop up dialog

This adds a new inline software keyboard so that the old pop-up text window is no longer needed. Before, if you were prompted to enter characters through your keyboard a small text window would pop up. This was an annoyance if you played in full-screen mode as the game audio sped up and froze. It confused a lot of users as the pop-up only showed up in windowed mode. This new inline software keyboard makes it so you no longer need to be in windowed mode to see your typed characters.

Note that the new keyboard is only used for games using the "inline" keyboard type. For regular software keyboard launches, it still uses the pop-up window.

Implemented by Caian in #2180.

SPL: Implement IGeneralInterface GetConfig

This implements the GetConfig call of the SPL service. This is currently needed for some homebrew applications, which no longer need the "ignore missing services" option to boot.

Implemented by AcK77 in #2705.

NVDEC: Adding VP8 codec support

This codec was not implemented before as very few games use it. It is a very old codec, so there is little reason to use it when more modern and efficient codecs are supported, but it turns out there are a few Switch titles out there making use of it. After implementing it, Diablo II’s intro now plays correctly, and the cutscenes on TY The Tasmanian Tiger are now properly rendered too.

Implemented by AcK77 in #2707.

HLE: Improve safety

This reduces the use of "unsafe" code, which makes the code a bit more secure and less prone to errors caused by memory corruption, such as code not doing bounds checks properly or not validating input values. It also fixes a bug in the way the code read ASCII strings from memory: it would not stop at the null terminator if the buffer had any non-zero value after it, causing it to load strings with garbage data after the end.
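
The string bug is easy to picture with a short sketch. The `read_ascii_string` helper is hypothetical, and the "trim trailing zeros" comparison is only an assumption about the kind of mistake that produces this symptom, not the emulator's old code.

```python
def read_ascii_string(buffer):
    """Decode a fixed-size buffer up to the first null terminator."""
    terminator = buffer.find(b"\x00")
    if terminator != -1:
        buffer = buffer[:terminator]
    return buffer.decode("ascii")


if __name__ == "__main__":
    raw = b"abc\x00xyz\x00\x00"  # "abc", a terminator, then leftover data
    print(read_ascii_string(raw))
    # Only trimming trailing zeros keeps the garbage after the terminator:
    print(raw.rstrip(b"\x00"))
```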

Fixed by Thog in #2778.

kernel: Fix inverted condition on permission check of SetMemoryPermission syscall, Clear pages allocated with SetHeapSize, Add resource limit related syscalls, Implement SetMemoryPermission syscall, Add missing address space check in SetMemoryAttribute syscall

We saw several improvements to the HLE kernel implementation in Ryujinx. Thog made many changes to bring Ryujinx's kernel implementation further in line with what the original OS does. This fixed some small issues in the kernel that were lurking about but hadn't been hit by any games that we're aware of. One notable improvement is that SetHeapSize now clears the memory allocated for the heap, to avoid leaking information from other processes. Some syscalls that are used by services, but never by games, have also been added. They don't have any user-visible impact right now, but they make our kernel implementation more complete and bring the emulator one step closer to being able to run the services from the Switch firmware (as opposed to providing an HLE implementation on the emulator).

Fixed by Thog in #2771, #2772, #2773, #2776, and #2777.

Fixup channel submit IOCTL sync point parameters

Fixes a bug where the emulator was reading the function parameters from the wrong buffer location. The bug only manifests if more than one fence is submitted to this function, which commercial games never do, so in general, it should have no user-visible effect.

Fixed by bylaws in #2774.

Add support for the Brazilian Portuguese language code

Mario Party Superstars is the first Nintendo game to utilize the new Brazilian Portuguese language option, which was introduced back in firmware 10.1.0. With this now implemented, you can choose Brazilian Portuguese in the system language drop-down menu in the Ryujinx GUI.

Note that if you select the Brazilian Portuguese language and move to an older version of the emulator, the configuration file will reset as the language did not exist on the previous versions and it will fail to load.

Implemented by gdkchan in #2792.

New code contributors October 2021

C1fer

Closing words

We are all incredibly thankful for everyone’s support towards this project so far whether it was through Patreon, reporting bugs, or code contributions. Because of all of you, we’re now able to boot so many games on their release day and have them be playable. We are truly in awe of how far this project has come, so once again thank you!

We have an active Patreon campaign with specific goals and restructured subscriber benefits/tiers, so please consider becoming a patron to help push Ryujinx forward!