Progress Report May 2021

Ryujinx POWER’ed up its CPU emulation in May, making waves with a far-reaching performance update and becoming the first Switch emulator to implement "fastmem". This delivers dramatic FPS increases, boot time reductions, and smoother gameplay on a wide variety of platforms. LDN2.3 was released simultaneously with POWER to make all those multiplayer sessions that much smoother! May 2021 also saw the new multi-level function table arrive in the master build, along with the long-awaited “PPTC meets exeFS” update, which allows PPTC to remain enabled even when exeFS mods are active. Work on Vulkan continues, with some new developments explained in the Vulkan Progress section below.

Before listing all the updates that arrived in May 2021, let’s take a look at the current state of Ryujinx’s Patreon goals and deliverables:

Amiibo Emulation - merged into the main build in March 2021.
While compatibility is now almost perfect, there are still some improvements to come for Amiibo, which can be tracked on the associated GitHub issue here: https://github.com/Ryujinx/Ryujinx/issues/2122

Custom User Profiles - merged into the main build in April 2021.

Vulkan GPU Backend - still in progress, working to deliver a public test build as soon as possible. See below for a more in-depth report on the progress of this feature.

ARB Shaders - Goal reached in April 2021. Work on ARB shaders will begin as soon as Vulkan is finished.

ARB shaders will further reduce stuttering on first-run by improving the shader compilation speed on NVIDIA GPUs using the OpenGL API.

$2000/month - Texture Packs / Replacement Capabilities - Almost there!

This will facilitate the replacement of in-game graphics textures which enables custom texture enhancements, alternate controller button graphics, and more.

ETA once goal is reached: ~3-4 weeks

$2500/month - One full-time developer - Not yet met

This amount of monthly donations will allow the project's founder, gdkchan, to work full-time on developing Ryujinx.

$5000/month - Additional full-time developer - Not yet met

This amount of monthly donations will allow an additional Ryujinx team developer to work full-time on the project.

Vulkan Progress

As many of you have surmised, Vulkan has turned out to be a larger beast than we initially expected. Poor gdkchan has nearly gone mad thrice now, and has been overheard muttering things to himself like “validation layers...just need another pipeline and everything should work...yes, that’s it”. Thog has stepped up to take a more direct role in the Vulkan implementation to help lighten the load and provide some relief in the fight against AMD’s effective driver-ruining campaign. Meanwhile, preliminary internal testing continues while the remaining features are finished. In the interim, and in the interest of full transparency, we plan to release a WIP PR for public testing as soon as possible so that the Vulkan development process can be followed as it is fleshed out. This test release should have arrived in May, but we discovered that newer AMD drivers had caused another regression in our forthcoming Vulkan implementation; we are still working on ironing out these bugs. We of course wish that we could have met our projections, but we very much appreciate your patience while our developers’ heads are continually banged against the wall.

Vulkan FAQ (Updated)

Will Vulkan include a shader cache?

The main goal right now is getting Vulkan delivered in a good working state. If this is done quickly enough and there's still enough time left for a shader cache implementation, then a shader cache could be added. Otherwise, it will be left as a follow-up task and implemented after Vulkan is merged into the master branch of the emulator. To clarify, our ETAs for these Patreon goal features are the estimated time to develop the feature and submit the PR (pull request) for public testing & review. Actual merge dates into the main build may vary and do not have ETAs.

Will Vulkan include async shader compilation?

No, for the same reason it is not implemented in OpenGL. The async shader compilation feature implemented in other emulators comes with compromises, such as visual glitches while shaders compile, some of which persist until emulation is stopped. It also does not fully eliminate stutter, as some shaders cannot be compiled asynchronously. For those reasons, the current Ryujinx GPU emulation developers are not interested in implementing it. We are not opposed to an external contributor implementing this feature, however.

Will Vulkan support resolution scaling?

Yes. The goal is having feature parity with OpenGL upon release.

What kind of improvement can I expect?

AMD and Intel GPU users should expect a significant improvement in compatibility, performance and, especially for Intel GPU users, a major reduction in graphical glitches. NVIDIA users can expect a small performance improvement in a few titles (we will defer saying which ones until after the Vulkan implementation is properly optimized and tested). All GPU vendors should benefit from faster shader compilation and less stuttering on Vulkan.

GPU Improvements

Fix shader buffer write flag on atomic instructions

At the end of April, New Pokemon Snap was released to much fanfare, and at first glance it seemed to be working in the emulator. However, at the end of an expedition no Pokemon were identified, which made it impossible to progress through the game. This bug was caused by buffer memory modified by atomic instructions (in shaders) not being flagged as written, so it was never flushed.

Once buffer flushing was triggered properly, the Pokemon were successfully identified and the game could be played through.
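To illustrate the class of bug, here is a minimal sketch assuming a hypothetical shader instruction list; the opcode names and the `Instruction` type are invented for illustration and are not Ryujinx's actual shader IR. The buffers a shader may write are collected so they can later be flushed, and atomics must count as writes because they modify memory.

```python
from dataclasses import dataclass

# Hypothetical opcode names, for illustration only.
STORE_OPS = {"store_global", "store_storage"}
ATOMIC_OPS = {"atomic_add", "atomic_exchange", "atomic_compare_and_swap"}


@dataclass(frozen=True)
class Instruction:
    opcode: str
    buffer_slot: int  # index of the storage buffer this instruction touches


def written_buffers(instructions, include_atomics=True):
    """Return the set of buffer slots the shader may modify.

    Atomics read *and* write memory; leaving them out of the write set means
    the buffers they modify are never flagged, and therefore never flushed.
    """
    write_ops = (STORE_OPS | ATOMIC_OPS) if include_atomics else STORE_OPS
    return {ins.buffer_slot for ins in instructions if ins.opcode in write_ops}


shader = [
    Instruction("load_storage", 0),
    Instruction("atomic_add", 1),
    Instruction("store_storage", 2),
]

print(sorted(written_buffers(shader, include_atomics=False)))  # [2]
print(sorted(written_buffers(shader)))                         # [1, 2]
```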

Note: while this fixed the game for most users, some with Ryzen 2nd/3rd gen CPUs and Intel i7 10th/11th gen CPUs were experiencing random crashes. More on that later!

Fixed on (#2261) by gdkchan

Makin’ copies...the rip-meister... - Use copy dependencies for the Intel/AMD view format workaround

Early on in the development of the emulator, it was discovered that due to certain shortcomings in Intel’s & AMD’s GPU drivers, the visual output of the emulator would be degraded to a rather extreme degree. In order to mitigate this, a workaround was instituted back then by copying to a more compatible view format from its storage after every draw; this corrected the image output at the cost of a lower framerate.

Fast forward to present day: this update changes the method by which the view format workaround is applied, moving the behaviour out of the backend so that it can be covered by copy dependencies. This is a much faster approach as this solution only copies on read as opposed to copying after every write.
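The difference between copying after every write and copying on read can be shown with a toy model. This is a sketch under assumptions: `ViewFormatWorkaround` and `simulate` are hypothetical names, and a counter stands in for the expensive GPU copy. The lazy variant only records that the compatible view is stale and performs the copy when the view is actually read.

```python
class ViewFormatWorkaround:
    """Hypothetical texture whose contents must be copied into a compatible
    view format before that view can be sampled."""

    def __init__(self, lazy: bool):
        self.lazy = lazy
        self.dirty = False
        self.copies = 0

    def _copy_to_view(self):
        self.copies += 1  # stands in for the expensive GPU copy
        self.dirty = False

    def draw(self):
        if self.lazy:
            self.dirty = True      # copy on read: just mark the view stale
        else:
            self._copy_to_view()   # old approach: copy after every write

    def sample_view(self):
        if self.lazy and self.dirty:
            self._copy_to_view()   # copy only when the view is actually read


def simulate(lazy, draws_per_read=10, reads=5):
    tex = ViewFormatWorkaround(lazy)
    for _ in range(reads):
        for _ in range(draws_per_read):
            tex.draw()
        tex.sample_view()
    return tex.copies


print(simulate(lazy=False))  # 50
print(simulate(lazy=True))   # 5
```

With several draws between reads, the saving grows with the number of writes per read, which is why the gain is largest when a render target is drawn to many times before it is sampled.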

Mario Kart 8 Deluxe, a game whose speed is tied to performance/framerate, highlights the significant improvement this update provides for Intel iGPU and AMD GPU users. Recorded on an Intel i5-6600K with Intel HD 530 graphics.

Implemented on (#2144) by riperiperi

Pass CbufSlot when getting info from the texture descriptor

This update fixed a long-standing bug related to bindless texture handling that could randomly cause a full-screen flickering color overlay in Super Mario Party.

Before (sometimes):

After (every time):

Implemented on (#2144) by riperiperi

Working in parallel - Allow parallel shader compilation when loading a shader cache

Ryujinx’s shader cache has a nifty feature that rebuilds & recovers a shader cache that would otherwise have been destroyed by an invalidating event. Such invalidating events are GPU driver updates/downgrades on the host or changes within the emulator that affect shader handling. While this saves countless hours of headache-inducing shader stutter, it also means that from time to time, this rebuild process must occur at boot time. For some games, the auto-rebuild process only takes a few seconds. But for games with a large number of shaders such as Splatoon 2 or Super Smash Bros. Ultimate, this rebuild process could take upwards of 20 minutes.

Thanks to this update, the boot-time auto-rebuild process is now multithreaded, taking advantage of as many cores as the host GPU driver will allow. On NVIDIA GPUs, this change reduces the rebuild time by 33-75% on average, depending on whether a driver cache is still intact.
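The general shape of a parallel rebuild is a pool of workers, each compiling independent cache entries. Below is a minimal sketch, assuming a hypothetical `compile_shader` function (a hash stands in for the real translate-and-compile step). It shows the structure only: in CPython, pure-Python work is limited by the GIL, whereas the real speedup comes from the host driver compiling natively on several threads.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def compile_shader(source: str) -> bytes:
    """Hypothetical stand-in for translating and compiling one cached shader."""
    return hashlib.sha256(source.encode()).digest()


def rebuild_cache(sources, workers=None):
    # One task per cache entry. pool.map() returns results in input order,
    # so each compiled binary lines up with the entry it was built from.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_shader, sources))


sources = [f"shader_{i}" for i in range(1000)]
binaries = rebuild_cache(sources)
assert binaries == [compile_shader(s) for s in sources]
```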

Implemented on (#2177) by riperiperi

Move ‘em out - Move shader resource descriptor creation out of the backend

With Vulkan on the way and ARB shaders following after, it made sense to move all of the logic used to create buffer and texture descriptors from the GLSL backend to the translator. This will decrease the amount of time & effort necessary to add the new SPIR-V (Vulkan) and ARB backends.

Implemented on (#2290) by gdkchan

Assign _backgroundContext before starting its worker thread

This is a one-line fix that moves the assignment of the background context to before the worker thread is started, rather than after. This resolves the random crashes on launch that could affect embedded games (for example, a game launched from within a game collection).
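Why the order matters can be sketched with a toy renderer. The class and method names below are hypothetical (only the idea of assigning `_backgroundContext` before starting the worker comes from the fix). The buggy variant forces the unlucky interleaving by letting the worker finish before the assignment, so the outcome is deterministic.

```python
import threading


class Renderer:
    """Hypothetical renderer whose worker thread reads a shared context."""

    def __init__(self):
        self.background_context = None
        self.seen_by_worker = "not run"

    def _worker(self):
        # The worker reads the context as soon as it starts running.
        self.seen_by_worker = self.background_context

    def start_buggy(self):
        worker = threading.Thread(target=self._worker)
        worker.start()
        worker.join()  # forces the unlucky ordering: the worker runs first
        self.background_context = "context"

    def start_fixed(self):
        self.background_context = "context"  # assign first...
        worker = threading.Thread(target=self._worker)
        worker.start()                       # ...then start the thread
        worker.join()  # only to make this demo deterministic


buggy, fixed = Renderer(), Renderer()
buggy.start_buggy()
fixed.start_fixed()
print(buggy.seen_by_worker)  # None
print(fixed.seen_by_worker)  # context
```

In the real emulator the worker is long-lived and nothing forces the bad ordering, which is why the crash only happened some of the time.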

Fixed on (#2299) by riperiperi

Fix non-independent blend state not being updated

Some of you may not know that Switch games use various graphics APIs; some games use NVN (NVIDIA’s own proprietary API), others may use OpenGL or Vulkan. Most games use NVN, and as such the emulator’s compatibility is highest with those games. Among games that use OpenGL, a few had long-standing, serious graphical issues due to broken blending. This update ensures the blend state is properly updated by inserting the missing entries in the state table. See the comparison below for the improvements in Code of Princess EX.

Before:

After:

Implemented on (#2303) by gdkchan

There’s gotta be a better way… - Use a different method for out of bounds blit

The guest OpenGL driver performs out of bounds copies when copying textures for layout conversion. The way this is supposed to work is that the copy should wrap to the next line of the texture and continue copying from there. Up until now, this was being emulated on the host with two blits, which worked but had an issue: the second copy required the source texture to have one extra line. This extra line could sometimes result in pixels falling into unmapped memory regions, which would cause a crash with an invalid memory region exception.

This update uses a new approach that only requires a single blit. It simply adds an offset to the texture address; that way, the copy is "in bounds" and no wrap to the next line is necessary. It also doesn't need to read any extra pixels.
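The two approaches can be compared in a toy model where a texture is a flat array of texels with a fixed pitch and the host blit only accepts in-bounds rectangles. All names here are hypothetical, and the model tracks only which texel indices are read. It checks that folding (x, y) into the base address gives a single in-bounds blit that reads exactly the texels of the wrapping guest copy, while the two-blit split needs one more source line than the texture has.

```python
def rect_indices(base, pitch, x, y, w, h):
    """Linear texel indices read by a w*h rectangle at (x, y) in a texture
    that starts at `base` and has `pitch` texels per line. Because addressing
    is linear, a row with x + w > pitch simply wraps onto the next line."""
    return [base + (y + r) * pitch + x + c for r in range(h) for c in range(w)]


def host_blit(base, pitch, lines, x, y, w, h):
    """Toy host blit: only accepts rectangles that fit inside pitch x lines."""
    if x + w > pitch or y + h > lines:
        raise ValueError("rectangle is out of bounds for the host")
    return rect_indices(base, pitch, x, y, w, h)


def two_blits(pitch, lines, x, y, w, h):
    # Tail of each line, plus the wrapped head taken from the following line.
    # The second rectangle ends one line lower, so it needs an extra line.
    tail = host_blit(0, pitch, lines, x, y, pitch - x, h)
    head = host_blit(0, pitch, lines, 0, y + 1, x + w - pitch, h)
    return sorted(tail + head)


def offset_blit(pitch, lines, x, y, w, h):
    # Fold (x, y) into the base address: the same texels now form an
    # in-bounds rectangle at (0, 0) of an h-line view.
    return host_blit(y * pitch + x, pitch, h, 0, 0, w, h)


pitch, lines = 8, 4
x, y, w, h = 5, 1, 6, 3  # x + w > pitch, so every row wraps
guest = rect_indices(0, pitch, x, y, w, h)

assert offset_blit(pitch, lines, x, y, w, h) == guest
try:
    two_blits(pitch, lines, x, y, w, h)
except ValueError:
    print("two-blit approach needs an extra source line")
assert two_blits(pitch, lines + 1, x, y, w, h) == sorted(guest)
```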

This fixes the crash due to invalid memory regions being accessed on Rune Factory 5 and Shantae (the GBC emulator).

Implemented on (#2302) by gdkchan

Add another Depth32F texture format variant

This update implements a missing texture format variant whose absence left graphics mostly missing in Yo-kai Watch 1. Together with the out of bounds blit fix mentioned directly above, the difference this 1-line fix makes is dramatic!

Before:

After:

Fixed on (#2304) by gdkchan

Compare aligned size for largest mip level when considering sampler resize

When selecting a texture that's a view for a sampler resize, we should take care that resizing it doesn't change the aligned size of any larger mip levels.

This PR covers two cases. First, when creating a view of a texture, we now check that the aligned size of the view, shifted up to level 0, still matches the aligned size of the container. If it does not, a copy dependency is created rather than resizing.

Second, when searching for a texture for a sampler, textures whose aligned size does not match ours when both are shifted up by their base levels are no longer considered an exact match, as resizing the found texture would change the aligned size of mip 0. A copy dependency view is created instead.
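The core comparison can be sketched for a single dimension. This is a simplified illustration under stated assumptions: `can_resize_in_place` is a hypothetical name, only width is considered, and the 64-texel alignment is an arbitrary placeholder for the real format- and layout-dependent alignment.

```python
def align_up(value, alignment):
    """Round value up to a multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def can_resize_in_place(container_width, view_width, base_level, alignment=64):
    """Hypothetical check: if the container were resized so that mip level
    `base_level` became `view_width` wide, would level 0 keep its aligned
    width? If not, a copy dependency should be used instead of resizing."""
    level0_width = view_width << base_level  # shift the view size up to level 0
    return align_up(level0_width, alignment) == align_up(container_width, alignment)


print(can_resize_in_place(256, 128, 1))  # True: 128 << 1 == 256
print(can_resize_in_place(250, 127, 1))  # True: 254 and 250 both align up to 256
print(can_resize_in_place(256, 160, 1))  # False: 320 stays 320, not 256
```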

This update fixes graphical errors and crashes (on flush) in various Unity games that use render-to-texture. See below for a comparison on the Moving Out demo.

Before:

After:

This also fixes some graphical glitches in Rune Factory 5 and resolves a specific bug with this game caused by the emulator flushing a texture with the incorrect size; this in turn caused the emulator to crash with a coreclr.dll error (found in the Windows event log).

Fixed on (#2306) by riperiperi

Fix value of constant vertex attributes

A bug caused the constant vertex attribute value to be "1, 1, 1, 1". The emulator never set this value, assuming it would always default to zero, which was not the case. This update explicitly sets the value, which fixes some graphical issues observed in SD GUNDAM G GENERATION CROSS RAYS.

Before:

After:

Fixed on (#2307) by gdkchan

Get those clothes back on - Improve accuracy of reciprocal step instructions

A rather humorous if not unfortunate bug was plaguing the newly released Rune Factory 5, causing the characters’ skirts and other bits of clothing to appear lifted or stretched. This update improves the accuracy of the FRECPS and FRSQRTS ARM instructions by also handling the cases where one of the inputs is infinity and the other is 0, as per the ARM manual. The net result is improved graphics and no more accidental up-skirt shots.

Before:

After:
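The special case is small but easy to miss. FRECPS computes 2 - a*b and FRSQRTS computes (3 - a*b) / 2; in IEEE arithmetic infinity times zero is NaN, but the ARM pseudocode returns +2.0 and +1.5 respectively when one operand is infinite and the other is zero. The sketch below is a simplification: it ignores the fused (single-rounding) multiply-add, NaN propagation, and FPCR modes that the real instructions handle.

```python
import math


def _is_inf_times_zero(a, b):
    return (math.isinf(a) and b == 0.0) or (a == 0.0 and math.isinf(b))


def frecps(a, b):
    """Reciprocal step: 2 - a*b, where inf*0 (either order) yields +2.0."""
    if _is_inf_times_zero(a, b):
        return 2.0
    return 2.0 - a * b


def frsqrts(a, b):
    """Reciprocal square root step: (3 - a*b) / 2, where inf*0 yields +1.5."""
    if _is_inf_times_zero(a, b):
        return 1.5
    return (3.0 - a * b) / 2.0


# Without the special case, IEEE arithmetic turns inf*0 into NaN.
print(2.0 - math.inf * 0.0)   # nan
print(frecps(math.inf, 0.0))  # 2.0

# Typical use: Newton-Raphson refinement of an estimate of 1/d.
d, x = 4.0, 0.3
for _ in range(5):
    x *= frecps(d, x)
print(x)
```

A NaN produced at this step propagates through every later calculation that uses it, which is how one wrong edge case can end up visibly distorting geometry.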

Fixed on (#2305) by gdkchan

CPU & Kernel Improvements

Fix a specific core migration bug on the scheduler

This update fixes a theoretical core migration bug that would require the following scenario:

  • Core 1 must pick a new thread “B” on the `Schedule` method, while already executing a guest thread “A”.
  • Core 2 must pick the thread “A” which is currently executing on core 1.
  • The scheduler must move the thread “A” back to core 1 before thread “A” has a chance to switch to the next thread.

As there are no known triggers for this bug, the fix may resolve some unreported crashes or issues, or prevent ones that could have surfaced in the future.

Fixed on (#2271) by gdkchan

A match made in heaven… - PPTC meets ExeFS Patching

Since PPTC’s introduction roughly one year ago, one meddlesome caveat has loomed over it: if any exefs mods were present in a particular game—commonly used for resolution enhancements or FPS uncapping—then PPTC would automatically disable itself. With this update, all exefs mods may be used while keeping PPTC enabled, enabling those 60FPS mods to really shine. This change also refactored the PTC profiler to use XXHash128 for the PPTC ".info" files (essentially index files for the cache) instead of the previous MD5.

Implemented on (#1865) by LDj3SNuD

POWER - Performance Optimizations With Extensive Ramifications

Check out the special blog post for a more in-depth look at this giant May update, but the tl;dr is that the emulator received a major across-the-board performance upgrade by implementing a new host-mapped memory manager. Performance improvements vary depending on the game and your hardware, but we have confirmed frame rate increases of 10-110%, boot-time reductions of up to 50%, and reductions in boot-time PTC compilation of up to 60%. The net effect is that CPU requirements have been lowered, enabling low-end PCs to reach full performance in many more games, while high-end PCs are now able to leverage 60FPS mods and go well beyond a game’s intended framerate. Below we can see the impact this update had on average FPS on a somewhat low-end CPU (Intel i5-6600K) being put through a torture test using Ryujinx on Windows; you can expect similar or better improvements:

But our Linux brethren also benefit from this update, as evidenced by the numbers put up by an i7-4770 running Pop!_OS with the TKG PDS kernel.

This update also fixes the “AcquireSemaphore” crash on all Ryzen CPUs known to exhibit the issue, as well as the slow character movement in the monastery in Fire Emblem: Three Houses.

Before:

After:

We strongly encourage reading the blog post linked above to see the improvements fully demonstrated on a wider variety of platforms and scenarios.

Implemented on (#2286) by riperiperi, gdkchan

No, there’s no acronym for this - Add multi-level function table

As part of an overall campaign to increase performance in the emulator, this update improves the CPU emulation by increasing the speed of indirect calls and jumps. These are basically calls where the target address is not constant, which means it can change during execution. In the previous approach, the emulator was very slow for these cases because first it had to perform a function call to get the address of the function before then doing a call/jump to that address. The new approach doesn’t need to perform the function call as it can read the target address from the table directly, which significantly increases performance in situations where these types of calls are used heavily.
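The idea of a multi-level table can be sketched as a sparse radix table indexed by slices of the guest address. Everything here is an assumption made for illustration: the class name, the 9/9/9 index split, and the `translate` stand-in are hypothetical and do not reflect Ryujinx's actual level count or bit widths. In the real JIT the lookup is emitted as inline loads in the generated code; Python can show the data structure, not the speed difference.

```python
class MultiLevelTable:
    """Hypothetical sparse table mapping guest code addresses to host callables."""

    def __init__(self, level_bits=(9, 9, 9), align_bits=2):
        self.level_bits = level_bits  # index bits consumed per level, top first
        self.align_bits = align_bits  # ARM64 instructions are 4-byte aligned
        self.root = [None] * (1 << level_bits[0])

    def _indices(self, address):
        value = address >> self.align_bits
        indices = []
        for bits in reversed(self.level_bits):
            indices.append(value & ((1 << bits) - 1))
            value >>= bits
        if value:
            raise ValueError("address is outside the range covered by the table")
        return indices[::-1]

    def set(self, address, func):
        *path, leaf = self._indices(address)
        node = self.root
        for level, index in enumerate(path, start=1):
            if node[index] is None:
                # Lower levels are only allocated for regions that hold code.
                node[index] = [None] * (1 << self.level_bits[level])
            node = node[index]
        node[leaf] = func

    def get(self, address):
        # A hit is a fixed number of array reads; no lookup routine is invoked.
        node = self.root
        for index in self._indices(address):
            if node is None:
                return None
            node = node[index]
        return node


def call_indirect(table, address, translate):
    func = table.get(address)
    if func is None:
        # Miss: translate the target once and publish it in the table.
        func = translate(address)
        table.set(address, func)
    return func()


translated = []


def translate(address):
    """Hypothetical stand-in for the JIT translating guest code at `address`."""
    translated.append(address)
    return lambda: f"host code for {address:#x}"


table = MultiLevelTable()
for _ in range(3):
    result = call_indirect(table, 0x80_1234, translate)
print(result)                          # host code for 0x801234
print([hex(a) for a in translated])   # ['0x801234']
```

Splitting the address across levels keeps the table sparse: only the regions that actually contain translated code get lower-level arrays, instead of one enormous flat array covering the whole address space.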

Though the multi-level function table provides performance benefits in a wide variety of situations, it really shines in Switch games that contain nested emulators. Such emulators use an interpreter to emulate the CPU of the original system. The interpreter reads the ROM instruction by instruction and performs the action that the real CPU would perform for each one. For example, if it’s an ADD instruction, it adds two numbers; if it’s a MUL (multiply) instruction, it multiplies two numbers. Nested emulators usually do this with a huge table of function pointers containing one function for each CPU instruction of the machine being emulated. For each instruction in the ROM, the interpreter reads a pointer from the table and then calls that function. This is an indirect call because the address is not constant: it was loaded from the table, and it changes from one instruction to the next depending on which CPU instruction is being emulated. This is exactly the case improved by the faster indirect calls and jumps, and it is why these emulators gain the most performance. Since they perform indirect calls and jumps constantly, they are a best-case scenario for the workloads the multi-level function table excels at handling.
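A table-driven interpreter of the kind described above can be sketched in a few lines. The three-instruction ISA and the handler names are hypothetical; the point is that every executed guest instruction fetches a handler from a table and calls it, so each dispatch is an indirect call whose target depends on the data being interpreted.

```python
def op_nop(state, a, b):
    pass


def op_add(state, a, b):
    state["acc"] = a + b


def op_mul(state, a, b):
    state["acc"] = a * b


# One handler per opcode of the (hypothetical) emulated CPU.
HANDLERS = [op_nop, op_add, op_mul]


def run(rom):
    state = {"acc": 0}
    for opcode, a, b in rom:
        handler = HANDLERS[opcode]  # function pointer loaded from the table...
        handler(state, a, b)        # ...then called: an indirect call
    return state["acc"]


rom = [(1, 2, 3), (0, 0, 0), (2, 4, 5)]
print(run(rom))  # 20
```

When a game like this is itself JIT-compiled by Ryujinx, the `handler(...)` line becomes an indirect call in the translated code, executed once per emulated instruction.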

Check out the comparisons below to see how a couple of nested emulators (SNES with the Super FX chip first, then the Neo Geo MVS arcade system) perform before & after this multi-level function table update. Most current PCs could already run these games at 60FPS before this update, so all videos were recorded on an Intel i5-6600K with Intel HD 530 graphics to underscore the improvements and the resulting lower CPU requirements for playable games.

Super Nintendo Entertainment System Online - Stunt Race FX:

This game used to be particularly painful for the emulator. As you can see, not only does the addition of the multi-level function table double the FPS from ~26 to ~52, but Stunt Race FX goes from essentially a slideshow to playable on a middle-of-the-road CPU with a low end iGPU.

ACA NEO GEO - Metal Slug X:

Though the frame rates here are not very far apart (~50 vs. a steady 60), there is a clear difference in the speed of the game. Anyone who played Metal Slug games on original arcade cabinets or home Neo Geo consoles knows that these games were prone to slowdowns during heavy action. With this update, however, there are no slowdowns at all and the game runs more smoothly than on original hardware. And all this on a six-year-old i5 CPU!

Implemented on (#2228) by FICTURE7

Fix inverted low/high mask value on GetThreadCoreMask32 syscall

This two-line quick fix corrects inverted values in a particular kernel syscall that was affecting 32-bit games. This bug prevented Game Tengoku CruisinMix Special from booting at all. Now with the update, we can see the game booting to the title screen and beyond (though it is still not playable due to other issues).
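For 32-bit guests, a 64-bit value such as a core mask has to be handed back as two 32-bit halves. The sketch below uses hypothetical helper names and an assumed mask value to show the split and what a guest would see if the low and high halves were swapped.

```python
MASK32 = 0xFFFF_FFFF


def split_u64(value):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join_u64(low, high):
    """Reassemble a 64-bit value from its 32-bit halves."""
    return ((high & MASK32) << 32) | (low & MASK32)


core_mask = 0b1111  # hypothetical affinity: cores 0-3
low, high = split_u64(core_mask)
print(hex(join_u64(low, high)))  # 0xf
# With the halves swapped, the guest sees a mask with none of cores 0-3 set.
print(hex(join_u64(high, low)))  # 0xf00000000
```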

Fixed on (#2325) by gdkchan

HLE Improvements

hid: Rewrite shared memory management

HID, or Human Interface Device, is a crucial aspect of the emulator’s code to ensure usability during gameplay. The HID shared memory management code had basically not been touched since its original implementation years ago. This update completely rewrites the HID shared memory management to be more usable and accurate.

Implemented on (#2257) by Thog

More ways to listen - audio: Implement a SDL2 backend

SDL2 is the gift that keeps on giving. After the release of Miria last month, which fully replaced the input interface backend (removing all usage of OpenTK3) and added native controller/motion support for nearly every controller possible, Thog once again leveraged SDL2 to add a third audio backend, with the intention of making it the default after a period of testing. This audio backend should have the highest compatibility and is suitable for use in all games.

Implemented on (#2258) by Thog

Fix race in SM initialization

A rare edge case, where a particular "SmObjectFactory" property value was still null when the SM server thread started, could cause affected users' games to crash almost immediately on boot. At the time of the fix there was only a single reported instance of this happening, but this four-line change resolves the issue by passing the factory function to the constructor and setting the property before the thread is started.

Implemented on (#2280) by gdkchan

REV it up - amadeus: Update to REV9

As Nintendo often does, the recently released REV9 (on 12.0.0) for the audio renderer makes some minor changes to the way the renderer handles some audio effects and adds the capability to perform dynamic range compression. While these new bits have no known usage in games yet, it is important to ensure such changes are accurately implemented in the emulator.

Implemented on (#2309) by Thog

GUI Improvements

There is more than one way to skin a cat - gtk3: Add base for future Vulkan integration

Most of those following the progress of the emulator are aware Vulkan support will soon be added. This brings some unique challenges since the existing GTK3 UI does not natively support Vulkan; we’ve been planning to replace the UI with Avalonia to help facilitate this. However, constructing a new UI essentially from scratch can be a lengthy if not painstaking task. As such, this update adds in a placeholder foundation to use the existing GTK3 UI for Vulkan integration if the Avalonia UI is not ready in time.

Implemented on (#2260) by Thog

Miscellaneous Improvements

FFFFFUUUUUUU - Update to FFmpeg 4.4.0

The emulator relies on an FFmpeg dependency for H.264 decoding of in-game videos, which had not been updated since the initial release of NVDEC emulation support in May 2020. With potential for video-related bug fixes and decoding improvements, these updates bring the emulator up to date with the current FFmpeg, redirect FFmpeg-related log output to spare users the console spam, and dynamically query the FFmpeg root path on Linux, which ensures that all Linux users, from Arch to Ubuntu, can enjoy games with this type of rich content.

Implemented on (#2259, #2266, #2292) by Thog & AcK77

Clean up your act - Cleanup Discord Presence

One feature that is not exactly part of the emulator but still part of the codebase is the Discord rich presence feature. This update cleans up & refactors the feature so that it is more informative, adding a timer for how long the emulator has been open or how long a game has been played, as well as a link to the Ryujinx website for those curious types. It also replaces some hard-coded title IDs/game icons with a generic game card icon, as Discord limits the number of icons that can be used.

Implemented on (#2262) by AcK77

We need to go deeper... - Extend info printed when guest crashes/breaks execution

Many times the first step in troubleshooting a particular bug is to simply trigger it in order for the emulator to print out an error log. That log needs to be as descriptive as possible so that the developer doing the troubleshooting has a solid base of information to investigate. This update extends the amount of information produced by the logger so that when the guest does happen to crash, the information printed is of more value.

Implemented on (#1845, #2326) by shadowninja108, Thog, AcK77, gdkchan

Input: Implement an SDL2 keyboard

Keyboard input is currently handled via GTK3; in the near future it will be handled via the new Avalonia UI. Separately, this update implements an SDL2 keyboard that, while currently unused, will be leveraged to provide a forthcoming headless mode (a separate, unmerged PR is already open for this: #2310) that allows launching Ryujinx via the command line, with all emulator configuration handled via launch command parameters. This means that there would be no GUI for end users to navigate, but it would allow much more flexibility when integrating Ryujinx into other frontends like Launchbox, for example.

Implemented on (#2277) by Thog

Closing words

Thanks to everyone who took the time to test the POWER update, reporting bugs and improvements alike. Without your valuable feedback, it would have taken much longer for this code to mature enough to be ready for merging into the main build. And to all those that have supported Ryujinx so far, be it via Patreon donations, code contributions, testing games in the emulator, or simply being an active member of our community: you’ve helped make this emulator what it is today!

We now have an active Patreon campaign with specific goals and restructured subscriber benefits/tiers, so head on over if you're interested in becoming a patron to help push Ryujinx forward!