Ryujinx - Infusion of POWER & LDN 2.3 Release

For much of recent memory Ryujinx has been widely known as a highly compatible, feature-rich emulator with one caveat: without a high-end PC, many first-party games simply ran too slowly to be considered playable. We aim to change all that with today's major overhaul of the emulator's memory management, (absurdly but affectionately) nicknamed POWER - Performance Optimizations With Extensive Ramifications. By lowering CPU requirements and increasing headroom for higher-end PCs, this update effectively unlocks full-speed gameplay for a vastly larger audience. Thanks to the joint efforts of gdkchan and riperiperi, users with all kinds of hardware can enjoy performance increases anywhere from 20% to 100% and beyond, depending on the game and the user's PC specifications.

The only known cases where this update does not offer noticeable performance increases are in games/areas within games where the FIFO % is at 95% or above, indicating a high emulated GPU load. Put more simply, as this update is CPU performance-focused, situations where the emulated GPU is the bottleneck do not see much benefit. More on that soon!

POWER is available immediately on both the master/main build and the new LDN2.3 multiplayer-enabled build, also released today! This new build benefits from all updates to master that occurred since the previous LDN release. See the changelog for a full list of master updates since LDN2.2 (which was at parity with master 1.0.6819; LDN2.3 is now at parity with 1.0.6894). Get the new build here!

Changes since LDN2.2:

- LDN2.3 is up-to-date as of master 1.0.6894
- POWER update significantly increases performance in most scenarios.
  - Mario Kart 8 Deluxe is more likely to stay connected for those who are now able to maintain 60FPS during a race.
- Miria (SDL2 input re-implementation) includes native support for nearly all controllers including motion controls.
- Custom User Profiles are now used for your multiplayer username. As such, the Username field has been removed from the multiplayer menu tab.

The following PR was added into LDN2.3:
- Add support for HLE macros and accelerate MultiDrawElementsIndirectCount.
- Performance tweaks for Pokemon Sword & Shield and Monster Hunter Rise are included.

But enough with the announcements and on to the nuts & bolts! What is POWER and how does it work?

Technical overview

Old approach

One area of Ryujinx CPU emulation that has been in dire need of improvements is memory access. Up until now, guest memory access was implemented using inline software translation. This meant that on each memory access, we'd need to do address validation, address translation and software protection for memory tracking. This approach had some benefits, but it also had some issues.

Main benefits:

  • Enhanced control over the memory mapping and address translation process. Allows things like guest shared memory to be easily implemented on all OSes.
  • Memory reprotection is fast. This benefitted GPU emulation as we needed to track CPU writes to memory regions that the GPU can access, to invalidate data on GPU memory.

Notable downsides:

  • Poor performance, as it needed to produce roughly 20 x86 instructions per single Arm memory access instruction.
  • Inflated code size, as all those instructions ended up in memory (and also on disk if PPTC is enabled). This contributed to "JIT Cache Exhausted" crashes.
  • Slow boot times: because more code is thrown into the pipeline, the JIT takes more time, which means longer loads and more stutters without PPTC (first run).
  • Slow PPTC build times (with PPTC), for the same reason mentioned above.
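To make the per-access cost of the old approach concrete, here is a minimal Python sketch of inline software translation. It is an illustration under stated assumptions, not Ryujinx's actual code: the class `SoftwareMemoryManager`, its methods, the 4 KiB page size and the 39-bit address space are all hypothetical. The point is that every single read or write goes through validation, translation and a tracking check before any data is touched.

```python
PAGE_BITS = 12                      # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_BITS
PAGE_MASK = PAGE_SIZE - 1
ADDRESS_SPACE_BITS = 39             # assumed guest address space width


class SoftwareMemoryManager:
    """Hypothetical software-translated guest memory (illustration only)."""

    def __init__(self):
        self.page_table = {}        # guest page number -> backing bytearray
        self.write_tracked = set()  # pages whose writes must be reported
        self.dirty = set()          # pages written since tracking started

    def map(self, guest_addr, size):
        first = guest_addr >> PAGE_BITS
        last = (guest_addr + size - 1) >> PAGE_BITS
        for page in range(first, last + 1):
            self.page_table[page] = bytearray(PAGE_SIZE)

    def track_writes(self, guest_addr):
        # "Reprotection" is just a set update here, which is why it is fast.
        self.write_tracked.add(guest_addr >> PAGE_BITS)

    def _translate(self, guest_addr, is_write):
        # 1. Address validation.
        if guest_addr < 0 or guest_addr >> ADDRESS_SPACE_BITS:
            raise MemoryError("address outside the guest address space")
        # 2. Address translation through the software page table.
        page = guest_addr >> PAGE_BITS
        backing = self.page_table.get(page)
        if backing is None:
            raise MemoryError("unmapped guest address")
        # 3. Software protection check for memory tracking.
        if is_write and page in self.write_tracked:
            self.write_tracked.discard(page)
            self.dirty.add(page)
        return backing, guest_addr & PAGE_MASK

    def read_u8(self, guest_addr):
        backing, offset = self._translate(guest_addr, is_write=False)
        return backing[offset]

    def write_u8(self, guest_addr, value):
        backing, offset = self._translate(guest_addr, is_write=True)
        backing[offset] = value
```

In a JIT, the equivalent of `_translate` has to be emitted inline for each guest memory instruction, which is where the extra host instructions and the inflated code size come from.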

How can we improve this?

We can take advantage of the host CPU MMU (that is, the Memory Management Unit built into the CPU in your PC) to perform the address translation for us, instead of implementing it in software. This can be done by replicating the guest memory mappings on the host. Basically, we reserve a region of memory in the emulator process that can fit the entire guest address space, at a memory location "BASE", and then for each mapping that the guest tries to create at an address "A", we can just map "BASE + A" on the host. Similarly, on the CPU emulator, for each access to address "A", we just need to access "BASE + A".
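The "BASE + A" idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the emulator's implementation: `HostMappedSketch`, `HOST_BASE` and the tiny 1 MiB guest address space are hypothetical, and a single anonymous `mmap` region stands in for the host process address space. In a real implementation the host MMU performs the translation and raises a fault on unmapped pages, so no per-access software check is needed.

```python
import mmap

GUEST_ADDRESS_SPACE_SIZE = 1 << 20  # assumption: tiny 1 MiB guest space for the demo
HOST_BASE = 0x4000                  # assumption: where the reserved region starts ("BASE")


class HostMappedSketch:
    """Hypothetical model of host-mapped guest memory (illustration only)."""

    def __init__(self):
        # Anonymous mapping standing in for the host address space; the
        # guest address space occupies [HOST_BASE, HOST_BASE + size).
        self.host = mmap.mmap(-1, HOST_BASE + GUEST_ADDRESS_SPACE_SIZE)

    def host_address(self, guest_addr):
        # The whole "translation" is a single addition.
        return HOST_BASE + guest_addr

    def read_u8(self, guest_addr):
        return self.host[HOST_BASE + guest_addr]

    def write_u8(self, guest_addr, value):
        self.host[HOST_BASE + guest_addr] = value


memory = HostMappedSketch()
memory.write_u8(0x1234, 0x5A)
value = memory.read_u8(0x1234)
```

Compared with a software page-table walk, each access here reduces to one addition plus the memory operation itself, which is the source of the smaller and faster generated code.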

The changes accomplish this by introducing a new memory manager called "HostMappedMemoryManager", which functions as mentioned above. The old "MemoryManager" is still there as there are some cases where it can be useful, such as situations that require memory aliasing to work (like services that need to access memory shared with other processes, if we ever decide to run services on the emulator).

As mentioned before, one of the advantages of the software translation is fast memory reprotection. With this new approach, we needed to do host memory reprotection instead, which is considerably slower. For this reason, a few GPU emulation changes were required to avoid a large performance penalty associated with the reprotections. Constant buffer updates are now batched together and sent directly to the GPU, in addition to being written to CPU memory. Previously it would just write to CPU memory, which dirties the memory region and forces a reprotection, something we want to avoid now.
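As a rough illustration of the batching idea, the sketch below merges contiguous constant buffer writes and uploads each merged run to the GPU once, while still mirroring the data into CPU memory. The class `ConstantBufferUpdater`, the `gpu_upload` callback and the merge-only-when-contiguous policy are assumptions made for this example; the post does not describe how Ryujinx batches updates internally.

```python
class ConstantBufferUpdater:
    """Hypothetical batcher for constant buffer updates (illustration only)."""

    def __init__(self, cpu_memory, gpu_upload):
        self.cpu_memory = cpu_memory  # bytearray mirroring guest memory
        self.gpu_upload = gpu_upload  # callable(offset, data): one GPU upload
        self._start = None            # offset of the pending batch, if any
        self._pending = bytearray()   # bytes accumulated for the pending batch

    def update(self, offset, data):
        # Keep CPU memory in sync, as before.
        self.cpu_memory[offset:offset + len(data)] = data
        if self._start is not None and offset == self._start + len(self._pending):
            # Contiguous with the pending batch: extend it instead of uploading.
            self._pending += data
        else:
            # Not contiguous: send what we have and start a new batch.
            self.flush()
            self._start = offset
            self._pending = bytearray(data)

    def flush(self):
        # Send the pending batch directly to the GPU in a single upload.
        if self._start is not None:
            self.gpu_upload(self._start, bytes(self._pending))
            self._start = None
            self._pending = bytearray()


uploads = []
cpu = bytearray(64)
updater = ConstantBufferUpdater(cpu, lambda offset, data: uploads.append((offset, data)))
updater.update(0, b"\x01\x02")
updater.update(2, b"\x03")      # merged with the previous update
updater.update(16, b"\x09")     # starts a new batch
updater.flush()
```

Because the GPU copy is updated directly, the written region does not have to be treated as modified and re-read later, which is what avoids the costly host reprotection.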

Improvements

This change brings several improvements, which can be separated into 2 categories: performance and boot time. There are also a few other improvements that will be discussed later but, for direct comparison, these are the categories we will be focusing on.
The tests were performed on low-end and high-end hardware, on Windows and Linux, using all 3 GPU vendors (NVIDIA, AMD and Intel). All tests were performed in the most demanding locations of each game, to simulate a torture test/worst case scenario. As such, you can expect even better performance in other areas of the game.

Performance

The comparison was done by calculating the average FPS on a single run of the game. Note the following configurations:

  • Ryzen 7 5800X and Intel i5-6600K are equipped with NVIDIA GPUs.
  • Intel i5-1135G7 is using Intel Iris Xe graphics on Windows. (typically GPU bottlenecked)
  • Intel i7-4770 is using an AMD RX580 on Linux (Pop OS w/ TKG PDS kernel).

Locations tested:

  • Super Mario Odyssey: Seaside Kingdom. Abnormally low FPS on this kingdom was a common complaint from users, so we made sure to get this one tested.
  • Animal Crossing New Horizons: Heavily loaded island with a lot of buildings and NPCs, among other things. Kindly provided by user Leischii|Yannick. A simpler or empty island should have much better performance.
  • Super Smash Bros Ultimate: Fountain of Dreams with 8 players, level 9. This has the highest load on the CPU. Matches with fewer players will have even better performance.
  • Splatoon 2: Tested on Inkopolis Square. This has the highest number of NPCs and normally the lowest FPS anywhere in the game. Note that this specific area is locked to 30 FPS on the Switch; we had to uncap it for testing purposes as it was already hitting 30 with these changes on all PCs tested.
  • Mario Kart 8 Deluxe: Baby Park track with CPU racers on. This track seems to get the lowest FPS compared to other tracks.

The oldest CPU of the bunch (a stock i7-4770 running Linux) had one of the highest improvements, averaging a 50% FPS improvement (1.5x) across all 5 games tested.

Boot times

One common complaint has been that the boot/load times are too slow. In fact, in some cases they were so slow that users assumed the game was not working and gave up, closing the emulator before the game finished booting. These changes largely address this issue by significantly improving boot times, with both PPTC on and off. Below you can find some measurements.

PPTC Off ("first run/impression")

PPTC On

Not only is boot time greatly reduced but, with this performance update, some games may boot with PPTC disabled in the same amount of time as on the previous main build with PPTC enabled/cached. This means that some users may now wish to simply disable PPTC to avoid the occasional recompilation and other issues associated with it. For those that still wish to keep PPTC enabled, the performance is better than ever.

But that's not all. For PPTC users, having to recompile PPTC is a common annoyance (such as after an emulator update affecting CPU code, or going back and forth between the LDN build and the main build). This update also improves that. Let's check out how long it takes to recompile PPTC now:

As we can see, it takes less than half of the time to recompile the code for this game! On top of that, the PPTC cache size on disk is a lot smaller, consuming less than half the size on average. Thanks to the smaller cache size, the "JIT Cache Exhausted" crash that has been plaguing some games should also be fixed.

First impressions are everything

Something that is harder to show on paper is the difference this update brings in those moments between menus and gameplay, when the game is first “spooling up” and may appear sluggish. This can manifest as low FPS or stutters unrelated to shaders, slow movement speed, and even (temporary) graphical glitches until the emulator catches up to full speed.

Below is a comparison filmed by gdkchan of what this first impression looks like in Super Mario Odyssey, were you to boot a save from the Cascade Kingdom without having any PPTC built. On the left is the old memory manager; on the right is the new host mapped memory manager on the fastest setting (default). The black screen is included to convey the full experience. These were recorded on a laptop with the following specifications:

Intel i7-9750H
NVIDIA GTX 1660
8GB RAM


One more thing...

For Fire Emblem: Three Houses players, one long-standing issue was the very noticeable character movement slowdown in the monastery. During our testing, we discovered that this issue was also fixed with this update. Check it out below.

Before:


After:

Similarly, other games with sluggish performance such as Hyrule Warriors: Age of Calamity have greatly improved. This game is known for skipping frames on the Switch when it can't keep up, but Ryujinx's CPU emulation was too slow to handle the most intense scenes in the game, so the game forcibly slowed time instead. Pay close attention to the movement/game speed in the comparison below. These clips were recorded on a desktop with the following specifications:

AMD Ryzen 9 3900X
16GB 3200MHz RAM
NVIDIA GTX 1070

Before:


After:

Another pleasant side effect of this feature update is one that is sure to please many AMD Ryzen CPU owners: the dreaded "AcquireSemaphore" bug that caused crashes in a variety of games including New Pokemon Snap and Catherine: Full Body is now resolved on all known Ryzen CPUs!

Configuring

  • A new setting called "Memory Manager Mode" was introduced to allow selecting the memory manager that should be used. "Software" is the previous memory manager without any performance enhancements, while "Host" and "Host unchecked" are the new options added in this update.
  • "Host unchecked" has the best performance; however, it does not bounds-check memory accesses, which makes it unsafe. It is recommended that you only use it with code that you trust. So, it should be fine if you are just playing games from your Switch or testing your own homebrew. It is important to note that this option is not any less safe than regularly using other programs & emulators unless they have extra safety measures built-in.
  • "Host" is a little bit slower, but goes above & beyond standard practices and ensures that the memory access is within the guest address space, which is safer. "Software" is still the most accurate, but also the slowest.
  • The key safety difference between "Host unchecked" and "Host" is that the latter option has an extra layer of security that would prevent malicious code from transcending the intended memory address space accessed by the emulator.
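The difference between the two host modes can be pictured with the following Python sketch. It is only a conceptual model: the function names, `HOST_BASE` and the 39-bit guest address space size are hypothetical values chosen for illustration, and the real emulator performs the equivalent work in JIT-generated code rather than in functions like these.

```python
GUEST_ADDRESS_SPACE_SIZE = 1 << 39   # assumption: size of the guest address space
HOST_BASE = 0x7000_0000_0000         # assumption: start of the reserved host region


def translate_unchecked(guest_addr):
    """'Host unchecked' style: trust the guest and just add the base."""
    return HOST_BASE + guest_addr


def translate_checked(guest_addr):
    """'Host' style: refuse addresses outside the guest address space."""
    if not 0 <= guest_addr < GUEST_ADDRESS_SPACE_SIZE:
        raise MemoryError("guest address outside the guest address space")
    return HOST_BASE + guest_addr


inside = 0x1000
outside = GUEST_ADDRESS_SPACE_SIZE + 0x1000

# Both modes agree for well-behaved code.
same = translate_checked(inside) == translate_unchecked(inside)

# A malicious or buggy address escapes the reserved region when unchecked.
escaped = translate_unchecked(outside) >= HOST_BASE + GUEST_ADDRESS_SPACE_SIZE
```

The extra comparison in the checked path is the small per-access cost that makes "Host" slightly slower than "Host unchecked".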

"Host unchecked" is the default option as it provides the best performance and is 100% safe to use with your dumped games & homebrew.

Now that this update has been merged into the main build, please do not use any of the many POWER PR test builds!

References

All tests performed with Ryujinx default settings except where otherwise noted (e.g. vsync off or PPTC off).

Full specifications of PCs tested:

Ryzen 7 5800X @ 4.8GHz
32GB DDR4 3600
NVIDIA GTX 1080
Windows 10 20H2

Intel i5-6600K @4.1GHz
16GB DDR4 2400
NVIDIA GTX 1080
Windows 10 20H2

Intel i7-4770 (stock clocks)
16GB DDR3 1600
AMD RX580
Linux - Pop OS 20.10 with TKG PDS kernel (latest stable)

Lenovo T14 laptop
Intel i5-1135G7
16GB DDR4 3200
Intel Iris Xe graphics
Windows 10 20H2

Thanks to everyone who took the time to test this update since the PR was initially opened, reporting bugs and improvements alike. Without your valuable feedback, it would have taken much longer for this code to mature enough to be ready for merging into the main build. And to all those that have supported Ryujinx so far, be it via Patreon donations, code contributions, testing games in the emulator, or simply being an active member of our community: you’ve helped make this emulator what it is today!

We have an active Patreon campaign with specific goals and structured subscriber benefits/tiers, so head on over if you're interested in becoming a patron to help push Ryujinx forward!