Progress Report September 2021
September brought dozens of improvements, including significant performance gains, bug fixes, and HLE and GPU work. There has also been significant progress on something we teased a few months ago!
Amiibo Emulation - merged into the main build in March 2021.
While compatibility is now almost perfect, there are still some improvements to come for Amiibo, which can be tracked on the associated GitHub issue here: https://github.com/Ryujinx/Ryujinx/issues/2122
Custom User Profiles - merged into the main build in April 2021.
Vulkan GPU Backend - still in progress; a public test build has been delivered, and a lot is still being worked on.
ARB Shaders - Goal reached in April 2021. As seen from the last progress report, preliminary work on ARB shaders has begun.
ARB shaders will further reduce stuttering on the first run by improving the shader compilation speed on NVIDIA GPUs using the OpenGL API.
$2000/month - Texture Packs / Replacement Capabilities - Almost there!
This will facilitate the replacement of in-game graphics textures which enables custom texture enhancements, alternate controller button graphics, and more.
ETA once the goal is reached: ~3-4 weeks
$2500/month - One full-time developer - Not yet met
This amount of monthly donations will allow the project's founder, gdkchan, to work full-time on developing Ryujinx.
$5000/month - Additional full-time developer - Not yet met
This amount of monthly donations will allow an additional Ryujinx team developer to work full-time on the project.
So now that we’re done with that, let’s get started with this month's progress.
First, an update was released by AMD that made games no longer boot on Vulkan, as the device creation would just fail with an out of memory error on Windows. We believe that this is a driver bug, but we have now added a workaround on the emulator to allow it to work again with the newer drivers. The issue was caused by the "index type Uint8" extension. This extension is not supported by AMD hardware, and we do not use or request this extension on Vulkan. However, simply including the struct for this extension when creating the device causes it to fail on the newer driver, even if we do not enable it. As a workaround we have simply removed it on AMD, as the extension is not supported anyway, so it has no use.
The branch has also been rebased, so it is now more up-to-date and contains the latest improvements. We have received reports of a few games that seem to have regressed on AMD since then, and we're looking into it.
Some of the Vulkan changes are now in the main build too, which makes merging it in the future easier, and some also benefit OpenGL (such as the shader subgroup change that will be discussed later).
The plan for October is to work on a shader tester. This will make it easy to catch bugs in the SPIR-V implementation that is required for Vulkan, but that is not the only benefit. It will also help find bugs in our shader translator/decompiler and improve emulation on all backends. It will also help with testing ARB shaders in the future, as that too is a new backend with bugs to iron out.
Fix TXQ for 3D textures
UE4 games assume the texture is 3D if the component mask contains Z. This fixes a bug in UE4 games where parts of the map had garbage pointers to lighting voxels, as the lookup 3D texture was not being initialized. The texture is supposed to be initialized by a compute shader, and the shader was failing to compile before due to this error. The most notable game to see this fix is Tony Hawk’s Pro Skater 1+2.
Fixed by riperiperi in #2613.
Lift textures in the AutoDeleteCache for all modifications
Before, this would only apply to render targets and texture blit. Now it applies to image stores, the fast DMA copy path, and any other type of modification. Image store textures always have at least one reference in the texture pool, so the function of the cache keeping textures alive is not useful, but a very important function has been its use to flush textures in order of modification when they are dereferenced so that their data is not lost. This fixes lighting breaking when switching levels in UE4 games and "rainbow" textures in a few games.
Tony Hawk’s Pro Skater 1+2
Little Nightmares II’s broken "rainbow" textures seem to have been fixed by this as well.
Fixed by riperiperi in #2615.
Account for negative strides on DMA copy
Some Switch games that use the OpenGL API perform DMA copies with negative stride values, which flip the image vertically during the copy. Previously, this would cause the copy to advance backwards. With this change, the stride is always made positive, and if it was negative, the base offset is adjusted to the real start offset of the copy. With these changes, Idol Days no longer crashes if the user tries to load/save the game.
Fixed by gdkchan in #2623.
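To make the negative stride handling more concrete, here is a minimal Python sketch. It is not the emulator's code: the function name copy_lines and its parameters are hypothetical, and it only illustrates the idea of turning a backwards copy into a forward read from the real start offset.

```python
def copy_lines(src: bytes, src_offset: int, src_stride: int,
               dst_stride: int, line_size: int, line_count: int) -> bytes:
    """Copy line_count rows of line_size bytes. src_stride may be negative.

    Hypothetical helper for illustration only.
    """
    if line_count == 0:
        return b""

    # With a negative stride the first row is the highest one in memory,
    # so the real start of the source region is the address of the last row.
    src_start = src_offset
    if src_stride < 0:
        src_start = src_offset + src_stride * (line_count - 1)

    abs_stride = abs(src_stride)
    span = abs_stride * (line_count - 1) + line_size
    region = src[src_start:src_start + span]  # one contiguous, forward range

    dst = bytearray(dst_stride * (line_count - 1) + line_size)
    for y in range(line_count):
        # A negative stride means rows are consumed bottom-up (vertical flip).
        src_row = (line_count - 1 - y) if src_stride < 0 else y
        s = src_row * abs_stride
        d = y * dst_stride
        dst[d:d + line_size] = region[s:s + line_size]
    return bytes(dst)


source = b"AAAABBBBCCCC"                          # three rows of four bytes
flipped = copy_lines(source, 8, -4, 4, 4, 3)      # starts at the last row
straight = copy_lines(source, 0, 4, 4, 4, 3)
```

Here the flipped copy starts at offset 8 with a stride of -4, and the adjusted start offset works out to 0, so the whole region is read with a single forward range.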
Set texture/image bindings in place rather than allocating and passing an array
Ryujinx was allocating multiple arrays per draw or compute invocation. The cost of this was small but still measurable. The functions used to update the texture and image bindings now rent the bindings array for modification instead, so the data is set directly rather than being allocated and copied into the bindings manager. The arrays are pre-allocated with a default size but can grow to fit shaders that bind many more textures, such as those with bindless accesses.
One notable improvement due to this change is in Super Mario Odyssey: the FIFO% has been brought down, which could also mean some systems got improved performance.
Fixed by riperiperi in #2647.
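The "rent and set in place" idea can be sketched roughly as follows. This is a simplified Python illustration under our own assumptions: the class TextureBindingsArray, the rent method, and the default size are hypothetical names and values, not the emulator's actual API.

```python
class TextureBindingsArray:
    """Pre-allocated bindings storage that is reused across draws (hypothetical)."""

    def __init__(self, default_size=8):
        self._bindings = [None] * default_size
        self.grow_count = 0  # how many times the storage had to be enlarged

    def rent(self, count):
        # Grow only when a shader binds more textures than we have room for.
        if count > len(self._bindings):
            self._bindings.extend([None] * (count - len(self._bindings)))
            self.grow_count += 1
        return self._bindings


def update_bindings(array, texture_ids):
    # Write directly into the rented storage instead of building a new list
    # per draw and copying it into the bindings manager.
    bindings = array.rent(len(texture_ids))
    for i, texture_id in enumerate(texture_ids):
        bindings[i] = texture_id
    return bindings


array = TextureBindingsArray(default_size=4)
first = update_bindings(array, [1, 2, 3])
second = update_bindings(array, [7, 8])             # same storage, no allocation
large = update_bindings(array, list(range(10)))     # grows once to fit
```

The trade-off is that callers must treat the rented array as temporary, since the next draw will overwrite it.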
Implement and use an Interval Tree for the MultiRangeList
This implements an augmented interval tree based on the existing tree dictionary and uses it for the texture lookup on the cache. This greatly speeds up texture overlap checks, as they can't use the non-overlapping fast path that buffers and tracking handles can use. Like the tree dictionary, it is based on a red-black tree and is self-balancing.
One game that was improved by this change was Mario Golf Super Rush. If you have tried to play it before on this emulator, you might have noticed that the game would take a long time to load the courses. With this change, the load times are much lower, thanks to the fast texture lookup that makes creating new textures faster. The games that benefit the most from this are the ones with a large number of textures in the cache, because before creating a new texture, the emulator first needs to check whether it already exists in the cache to avoid creating duplicates.
Implemented by riperiperi in #2641.
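For readers unfamiliar with augmented interval trees, the sketch below shows the core idea in Python. It is deliberately simplified and is not the emulator's implementation: it uses a plain, unbalanced binary search tree rather than a red-black tree, and all names (Node, insert, find_overlaps) are hypothetical. Each node stores the largest end address in its subtree, which lets an overlap query skip whole subtrees.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    start: int
    end: int            # exclusive
    value: str
    max_end: int        # largest end address anywhere in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], start: int, end: int, value: str) -> Node:
    # Plain BST insert keyed by start address (no rebalancing in this sketch).
    if root is None:
        return Node(start, end, value, end)
    if start < root.start:
        root.left = insert(root.left, start, end, value)
    else:
        root.right = insert(root.right, start, end, value)
    root.max_end = max(root.max_end, end)
    return root


def find_overlaps(root: Optional[Node], start: int, end: int,
                  out: List[str]) -> List[str]:
    # Nothing in this subtree ends after `start`, so nothing can overlap.
    if root is None or root.max_end <= start:
        return out
    find_overlaps(root.left, start, end, out)
    if root.start < end and start < root.end:
        out.append(root.value)
    # Everything on the right starts at or after root.start.
    if root.start < end:
        find_overlaps(root.right, start, end, out)
    return out


tree = None
tree = insert(tree, 0x1000, 0x3000, "A")
tree = insert(tree, 0x2000, 0x2800, "B")   # overlaps A, which buffers never do
tree = insert(tree, 0x5000, 0x6000, "C")
hits = find_overlaps(tree, 0x2400, 0x2600, [])
```

Unlike the non-overlapping case, a lookup here may return several entries, which is why a sorted list with binary search is not enough for textures.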
Array based RangeList that caches Address/EndAddress
This modifies the RangeList to cache the Address and EndAddress within the list itself rather than accessing them from the object's properties. It also changes the RangeList to be backed by an array containing structs with the above information, back to back. This improves memory locality when binary searching through the array list. This array list is used in a few places: memory tracking, Windows emulated view + placeholder tracking, the buffer modified list, and buffer lookup. It's this last use that is most important - we were losing quite a bit of time looking up buffers by CPU VA when binding buffers (uniforms primarily). Note that these cases are all non-overlapping ranges. A method has been added to the list to update the cached end address, as some users of the RangeList currently modify it dynamically.
This greatly improves performance in Super Mario Odyssey (about 1.25x), Xenoblade and most other GPU-limited games. The improvement will generally depend on how many buffers the game binds and how many draws it does. Give your favourite 98% FIFO game a shot.
Implemented by riperiperi in #2642.
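The idea of caching the address range next to the item can be sketched in Python as follows. This is only an approximation under our own assumptions: Python has no structs laid out back to back, so parallel lists stand in for the array of structs, and the method names (add, find, update_end_address) are hypothetical.

```python
import bisect


class RangeList:
    """Sorted list of non-overlapping ranges with cached start/end addresses."""

    def __init__(self):
        self._addresses = []      # cached start address of each item
        self._end_addresses = []  # cached end address of each item (exclusive)
        self._items = []

    def add(self, address, size, item):
        i = bisect.bisect_left(self._addresses, address)
        self._addresses.insert(i, address)
        self._end_addresses.insert(i, address + size)
        self._items.insert(i, item)

    def find(self, address):
        # Binary search touches only the cached addresses, never the items.
        i = bisect.bisect_right(self._addresses, address) - 1
        if i >= 0 and address < self._end_addresses[i]:
            return self._items[i]
        return None

    def update_end_address(self, item, new_end):
        # Needed when an item changes size after it was added to the list.
        i = self._items.index(item)
        self._end_addresses[i] = new_end


buffers = RangeList()
buffers.add(0x1000, 0x100, "uniforms")
buffers.add(0x4000, 0x800, "vertices")
found = buffers.find(0x1080)
missing = buffers.find(0x2000)
```

Because the ranges never overlap, a single binary search is enough to find the one candidate that can contain an address.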
Use shader subgroup extensions if shader ballot is not supported
Despite a lot of work put into making Intel GPUs work as well as they can on Ryujinx’s OpenGL backend, it’s extremely hard to make them run perfectly, especially since Intel’s proprietary drivers aren’t fun to deal with: they lack support for many things, including a lot of extensions. The ARB_shader_ballot extension is not supported on Intel’s proprietary drivers, but the newer subgroup extensions are.
The two extensions are equivalent, so simply replacing the shader ballot calls with equivalent subgroup calls allows more games to render correctly, most notably Astral Chain.
This also reduces the differences between the master and Vulkan branches, since the new subgroup extensions are used on SPIR-V.
Fixed by gdkchan in #2627.
Share scales array for graphics and compute
Our resolution scaler works incredibly well with many games, especially after recent updates, but some games still don’t scale correctly or have issues with scaling. Ni no Kuni 2 is one of the games that had graphical issues when resolution scaling was used. The issue happened because the backend uses a single array to store both fragment and compute scales, while the GPU emulation was using two. The fix was simply to share the same array for both compute and graphics. This fixes an issue where scales might not be properly updated in games that use compute.
Fixed by gdkchan in #2653.
Fast path for Inline2Memory buffer write that skips write tracking
Many games write SSBOs from compute, notably the Xenoblade games, which flush buffer data on the GPU thread when trying to write compute data. The old method for this was already pretty fast, but a better way of handling it is to add a method to PhysicalMemory that attempts to write all cached resources directly, so that memory tracking can be avoided. The idea is both to avoid flushing buffer data and to avoid raising the sequence number when data is written, as raising it causes buffer and texture handles to be re-checked and can make performance worse. Xenoblade Chronicles 2 and Xenoblade Chronicles: Definitive Edition both net a significant performance increase from this.
Implemented by riperiperi in #2624.
Only make render target 2D textures layered if needed
In some cases, games can have a bogus value written as the render target texture depth. This can cause very large 2D array textures to be created. This is not only bad for performance, as it makes the system use more resources, but it can also cause out-of-memory (OOM) errors and potentially a few other errors. Normally the non-base layers of the texture are not accessed at all, as the game will only render to a single layer. The extra layers only matter when the shader writes a non-zero value to gl_Layer, which changes the target layer. To fix this issue, the code has been changed to only ever use 2D arrays when one of the vertex, tessellation, or geometry shaders writes to gl_Layer. This solves an issue where The Legend of Heroes: Zero no Kiseki was crashing on boot because a 1080p array texture with 257 layers was being created, which would take several GBs of memory and cause all sorts of issues.
Fixed by gdkchan in #2646.
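To give a feel for the numbers, the following Python sketch combines the "only layered if needed" rule with a rough memory estimate. The function names are hypothetical, and the 4 bytes per pixel figure is an assumption made for illustration; the actual texture format used by the game is not stated here.

```python
def select_render_target(depth, stages_writing_layer):
    """Pick the target type from the (possibly bogus) depth value.

    stages_writing_layer lists the shader stages that write gl_Layer
    (hypothetical representation for this sketch).
    """
    if stages_writing_layer:
        return ("2D array", max(depth, 1))
    # No stage writes gl_Layer, so only the base layer can ever be rendered to.
    return ("2D", 1)


def texture_bytes(width, height, layers, bytes_per_pixel=4):
    # Assumes a 4 bytes per pixel format, ignoring mipmaps and padding.
    return width * height * layers * bytes_per_pixel


layered = texture_bytes(1920, 1080, 257)   # 2,131,660,800 bytes, about 2 GB
single = texture_bytes(1920, 1080, 1)      # 8,294,400 bytes, about 8 MB
target = select_render_target(257, [])     # no gl_Layer writes
```

With this assumed format, a single such array texture already takes about 2 GB, so a handful of them would quickly exhaust memory, while the plain 2D target needs only about 8 MB.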
Replace CacheResourceWrite with more general "precise" write
The goal of CacheResourceWrite was to notify GPU resources when they were modified directly, by looking up the modified address/size in a structure and calling a method on each resource. The downside of this is that each resource cache has to be queried individually, they all have to implement their own way to do this, and it can only signal to resources using the same PhysicalMemory instance. This new method adds the ability to signal a write as "precise" on the tracking, which signals a special handler (if present) that can be used to avoid unnecessary flush actions, or maybe even more. For buffers, precise writes specifically do not flush, and instead punch a hole in the modified range list to indicate that the data on the GPU has been replaced. This fixes some rendering issues in Mario + Rabbids Kingdom Battle and Rune Factory 4 that were introduced with the aforementioned fast Inline2Memory buffer write change.
Implemented by riperiperi in #2684.
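"Punching a hole" in the modified range list can be illustrated with a small Python sketch. The representation below (a list of half-open (start, end) tuples and a punch_hole function) is hypothetical and much simpler than the emulator's data structures.

```python
def punch_hole(modified, start, end):
    """Remove [start, end) from a list of half-open modified ranges."""
    result = []
    for range_start, range_end in modified:
        if range_end <= start or range_start >= end:
            result.append((range_start, range_end))  # untouched by the write
            continue
        # Keep whatever sticks out on either side of the written region.
        if range_start < start:
            result.append((range_start, start))
        if range_end > end:
            result.append((end, range_end))
    return result


# The GPU modified bytes 0..100 of a buffer, then the CPU precisely
# overwrote bytes 40..60, so only the outer parts still need flushing.
remaining = punch_hole([(0, 100)], 40, 60)
```

The ranges left in the list are the only parts whose GPU data is still newer than memory, so a later flush can skip the hole entirely.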
Force copy when auto-deleting a texture with dependencies
When a texture is deleted by falling to the bottom of the AutoDeleteCache, its data is flushed to preserve any GPU writes that occurred. This ensures that the data appears in any textures recreated in the future, but it didn't account for a texture that already existed with a copy dependency. The change forces copy dependencies to complete if a texture falls out of the AutoDeleteCache. This is done via a full sync of the storage texture right now, as that's the best way to ensure all the rules are followed.
This fixes broken lighting caused by pausing in SMO's Metro Kingdom.
Fixed by riperiperi in #2687.
Use normal memory store path for DC ZVA
The DC ZVA instruction is used as an optimized way to clear memory in homebrew applications. Changing the method used to zero the memory to the one introduced with the "POWER" update, which allows fast memory accesses, can speed this up significantly as well.
Implemented by riperiperi in #2693.
Optimize fast register allocator
This optimizes the JIT's fast register allocator, which is used the first time a game is played. This reduces the boot time when the game is launched with PPTC disabled, or on the first run (as there is no PPTC cache built at this point).
Implemented by FICTURE7 in #2637.
Report 1080p resolution when in docked mode
The GetDefaultDisplayResolution service function was returning a 720p resolution even when docked. While this is technically correct, most of the benefit of enabling docked mode on the emulator is getting a higher resolution, so increasing the resolution, in this case, is more desirable.
This allows Tsukihime -A piece of blue glass moon- to render at a higher resolution when docked.
You might need to load the images at full screen to see the difference.
Implemented by gdkchan in #2618.
Implement GetVaRegions on nvservices
This implements the GetVaRegions ioctl, which is used to get the ranges of the address space that the application can use. It returns two ranges, one for small pages and one for big pages. The Vulkan driver uses this to calculate the usable address space size. This fixes a crash on Quake due to VK_ERROR_OUT_OF_DEVICE_MEMORY being returned by the guest driver, caused by the fact that it assumed that the usable address space size was 0, which would fail the check for any buffer size that is greater than 0.
The game can progress further now but crashes due to Sockets issues.
Implemented by gdkchan in #2621.
HOS: Cleanup the project
This cleans up the HOS (Horizon OS) project, as it has seen a tremendous amount of change. Leftovers that are no longer needed have been removed from the code, and some things that were in the wrong place have been moved to the correct one.
Fixed by AcK77 in #2634.
Amadeus: Update to REV10
The 13.0.0 update for the Nintendo Switch introduced Bluetooth audio, but it also introduced a lot of hidden changes within the OS, including a new audio renderer revision (REV10). At the moment no games use it, but eventually they will.
Implemented by Thog in #2654.
VI: Unify resolutions values and accurate implementation of them
This continues the work started with the change to report a 1080p resolution when docked. It makes the values and checks related to displays closer to the original hardware. The changes are as follows: AM's GetDefaultDisplayResolution and GetDefaultDisplayResolutionChangeEvent functions now carry more information on what the services do; the VI:U/VI:M/VI:S GetDisplayService calls are now much more accurate; and finally, the IApplicationDisplay calls GetRelayService, GetSystemDisplayService, GetManagerDisplayService, GetIndirectDisplayTransactionService, ListDisplays, OpenDisplay, OpenDefaultDisplay, CloseDisplay, and GetDisplayResolution are now properly implemented.
Implemented by AcK77 in #2640.
IRS: Stub some service calls
This stubs some IR service calls, as at the moment we do not support the IR sensor in the right Joy-Con. This allows games such as Night Vision and Spy Alarm to boot and makes Doukoku Soshite playable.
It is worth noting that those games were already playable before by enabling the "Ignore missing services" hack in the settings, but this change makes the hack unnecessary, so those games can now be played out of the box.
Implemented by AcK77 in #2665.
NVDEC (H264): Use separate contexts per channel and decode frames in DTS order
When H264 support was implemented on NVDEC, it was noticed that the frames were not in the correct order. At the time, we tried to fix it but couldn’t find the root cause of the issue, so to avoid further delaying it we used a workaround where it would ignore the VIC input surface address, and instead use the address of the last NVDEC frame decoded. This approach had issues, the most noticeable one being that it can lead to the presentation of duplicate frames because if there is more than one consecutive VIC copy operation, it will copy the same frame more than once.
The result is that the H264 videos are usually not smooth, and the frame pacing is irregular. On top of the existing problems, it also has another issue when multiple videos are decoded at once. There is no guarantee that the NVDEC decode and VIC copy for a given channel will happen in order. These issues only encouraged us to dig into this problem once more. Fortunately, this time the endeavor was a bit more successful.
The problem is that FFmpeg will deliver the frames in Presentation Time Stamp (PTS) order, while NVDEC is supposed to output them in Decoding Time Stamp (DTS) order. That is, not all frames in an H264 video are decoded in the same order they are supposed to be displayed on the screen, but FFmpeg always returns them in display order, which matches neither the order that NVDEC is supposed to use nor the one the game expects. Using a more efficient and non-hacky solution fixes several issues that the original implementation had.
H264 video playback should be smoother now, without duplicate frames. Some minor issues, like a few games flashing a green frame when the video starts, have also been fixed. The missing field_pic_order_in_frame_present_flag has also been added to the stream PPS, which fixes decoding errors in Layton's Mystery Journey, but the video is still not rendered properly due to VIC issues.
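The difference between the two orderings is easiest to see with a tiny example. The Python sketch below uses made-up frame names and timestamps (they are assumptions for illustration, not data from any game) for a short stream with two B-frames that depend on a later P-frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    dts: int  # when the frame must be decoded
    pts: int  # when the frame must be shown


# P3 has to be decoded before B1 and B2 because they reference it,
# even though it is displayed after them.
stream = [
    Frame("I0", dts=0, pts=0),
    Frame("P3", dts=1, pts=3),
    Frame("B1", dts=2, pts=1),
    Frame("B2", dts=3, pts=2),
]


def presentation_order(frames):
    # The order a decoder that reorders for display hands frames back.
    return [f.name for f in sorted(frames, key=lambda f: f.pts)]


def decode_order(frames):
    # The order in which frames are actually decoded.
    return [f.name for f in sorted(frames, key=lambda f: f.dts)]


shown = presentation_order(stream)   # I0, B1, B2, P3
decoded = decode_order(stream)       # I0, P3, B1, B2
```

If the consumer expects the second order but receives the first, each surface ends up holding a different frame than the one it was supposed to, which is consistent with the ordering problems described above.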
A more notable improvement came from another change to use a separate FFmpeg context per channel. Before, all channels shared the same context. This becomes a problem when there is more than one video being decoded at once, as the context stores previously decoded frames, which are used to predict content on future frames, a technique employed by video codecs to reduce file size by not encoding the same information more than once. One of the issues of sharing the same context for different videos is that it would cause frame data from the wrong video to be used, among other issues. To sum it up, it causes severe image corruption, which you can see below.
You might be wondering why the Hatsune Miku game is decoding multiple videos in this clip in the first place. What happens here is that most of this scene is not actually rendered by the GPU, and instead uses pre-rendered videos. The only thing there that is actually 3D is the Hatsune Miku model. You can think of it like a sandwich: there is a video for background elements, another for foreground elements (light effects and others), and Miku is right in the middle. So what we have here is 2 videos being decoded at the same time.
With this change that uses separate contexts per channel, the issue is now gone and the clip can finally render properly.
Hatsune Miku: Project DIVA MEGA 39's is not the only game to benefit from this. No More Heroes 3 also had a similar issue, and has also been fixed with this change. It may also have improved other games, such as Just Dance that had similar issues, but we did not test this one.
Fixed by gdkchan in #2671.
CLKRST: Stub/Implement IClkrstManager and IClkrstSession calls
This stubs and implements some clkrst calls. Some are stubbed because they are used to overclock the Switch hardware and it's pointless in our case as we are emulating the system.
Implemented by AcK77 in #2692.
GUI: Replace FileChooserDialog with FileChooserNative
The UI framework we currently use (GTK) has its own file chooser dialog, but many have seen that it’s not very fun to work with when it comes to handling multiple files. It also does not match the native OS look that most are used to. This makes it so it uses the OS native file chooser dialog instead of using GTK’s.
Implemented by AcK77 in #2633.
Adjustments to framerate metric and addition of frame time
Our old FPS indicator used a weighted average with a decay rate of 0.5, weighting frame to frame. This caused the perceived FPS and Ryujinx's reported FPS to always feel slightly off from one another, and because of this, the bottom value could feel off. The FPS monitor now shows instantaneous FPS rather than any form of weighted average, and all performance metrics now update every 750ms rather than every 1000ms. Frame time is usually much more valuable in determining how "smoothly" a game is running, so a frame time metric has been added to make results easier to analyze.
Implemented by MutantAura in #2638.
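The effect of the weighted average can be sketched in a few lines of Python. The exact formula the old indicator used is not spelled out here, so the blend below (previous average times the decay plus the new sample times the remainder) is an assumption, and the function names are hypothetical.

```python
def weighted_fps(samples, decay=0.5):
    # Assumed form of the old metric: each new sample is blended with the
    # running average, so past values keep influencing the reading.
    average = samples[0]
    for sample in samples[1:]:
        average = average * decay + sample * (1.0 - decay)
    return average


def instantaneous_fps(frame_count, elapsed_seconds):
    # Frames presented during the last update window only.
    return frame_count / elapsed_seconds


def frame_time_ms(frame_count, elapsed_seconds):
    # Average time spent per frame during the last update window.
    return elapsed_seconds * 1000.0 / frame_count


# A game running at 60 FPS that suddenly drops to 30 FPS:
lagging = weighted_fps([60.0, 60.0, 30.0])      # reads 45, not 30
# 45 frames presented in one 750 ms update window:
current = instantaneous_fps(45, 0.75)
per_frame = frame_time_ms(45, 0.75)
```

In this example the weighted reading sits halfway between the old and new rates right after the drop, while the instantaneous value and the frame time reflect only the latest window.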
Remove file error popup
If you've ever stored games for Ryujinx on an external hard drive and then unplugged it from the PC, an error would pop up every time you opened Ryujinx, as it could not locate the games in the directory. This change removes the pop-up, as it had become more redundant than helpful. Note that the error will still be saved in the log.
Implemented by bobhope9848 in #2547.
Update game metadata when stopping emulation
The metadata for games would not be updated if you had stopped emulation. This small change fixes this issue and metadata now properly updates if you use the stop emulation button.
Fixed by Nistenf in #2610.
Fix GTK3 mapping for single quote key
If you had tried to use the quote button as a mapped key for Ryujinx, this wouldn’t work. This was because the single quote key (') was incorrectly mapped to the GTK key quotedbl.
Fixed by Nistenf in #2612.
Implement a "Pause Emulation" option & hotkey
A long-requested feature was the ability to pause emulation. This can be useful if the game doesn’t allow you to pause at a certain moment. It can be toggled by hitting F5.
Implemented by mpnico in #2428.
Quick README update for game compatibility
This updates our README file to show the new total number of playable games, which went from 2100 in May to 2400 in September.
Implemented by Mou-Ikkai in #2694.
Add Linux Unicorn patch + desc.
This adds some info on compiling Unicorn with the necessary patch on Linux. Note that we do not use or ship Unicorn for CPU emulation. It is only used to unit test our own CPU emulator.
Implemented by mgielda in #2609.
But wait, there’s something in the distance there!
No, your eyes are not deceiving you. Work on getting more applets to work, such as the player select applet above, is ongoing. This is a very complicated thing to do, as many things need to be implemented for it to be functional. Stay tuned for more news later!
As always we would like to thank everyone who has contributed to the emulator so far whether it was through Patreon, reporting bugs, or code contributions. You all have made this project what it is today.
New code contributors September 2021:
We have an active Patreon campaign with specific goals and restructured subscriber benefits/tiers, so please consider becoming a patron to help push Ryujinx forward!