Capacitor Bank: Long term TPS drain with large multiblock sizes (probable memory leak) #680
Is there an existing issue for this?
Current Behavior
Symptom: multi-second delays on some (but not all) ticks on the dedicated server, and multi-second lockups on the integrated server (singleplayer), ultimately leading the tick watchdog to crash the server (160-second tick).
TL;DR / likely cause: a memory leak in machine block entity capability caching. Duplicate invalidation listeners are repeatedly added to the neighbouring LazyOptional capabilities, which eventually drives the VM to its memory limit and makes the GC thrash.
--
The long version of the story is that I was gradually building up a large cap bank to act as the central storage in my base. Around the time I hit one million RF/t of generation (though probably not deeply relevant), my server (a personal server for me and my SO) began hitting extremely low TPS after several hours of uptime, before ultimately tripping the tick watchdog check and crashing (160-second tick). When I started the server again it would be fine for several hours, then rinse and repeat.
I did a ton of Spark and Observable profiling to try to ascertain the cause (this is in ATM9 and we're near the end game, so there was a lot to check) but could not find an obvious culprit. Then, during one of the TPS-drain events, I noticed that my cap bank block entities were somewhat slower than before (probably not relevant, as you'll see) and on that hunch decided to remove the capacitor bank. The issue went away and has not recurred.
The Vibrant capacitor bank was around 192 blocks in size.
I raised this on the EnderIO Discord and we had some conversations about potential causes. During this time I was able to reproduce the problem with a 3x8x8 multiblock after 2.5 hours in a singleplayer world using the same modpack. Since it was singleplayer, it locked up the entire client, which was a clue, since (if I recall correctly) even in integrated-server mode the server ticks run on a separate thread from the client. The lock-ups lasted about 4 seconds each and happened every 4 to 8 seconds.
Here is the test setup at the time it was built (on the left is EIO's Creative Energy block, on the right an energy trash can from the Trash Cans mod).
Here are two Spark reports taken while the lock-ups were occurring. There isn't much to look at in the profiler view:
https://spark.lucko.me/7GlO6pfBby
https://spark.lucko.me/halqg3TAUb
In these, the most expensive tick cost was mob ticking, but ATM9 has tons of mods that mixin to mob ticks, and the total time spent there was nowhere near the length of each lock-up.
After gathering the data, I split the multiblock in half, which immediately resolved the issue: the 4-second lock-ups stopped happening almost instantly.
These do capture one pertinent fact: 14.1 GB of RAM was in use (out of 14.3 GB allotted in the Java configuration), with an average GC old-generation collection time of 4.4 seconds, which matches the length of each lock-up almost exactly.
Furthermore, that explains why the entire client locked up despite Minecraft running the server and client ticks on separate threads: full old-generation collections are stop-the-world pauses that halt every thread in the VM.
So this is actually a memory leak: the VM eventually runs out of allocatable memory, the GC thrashes trying to free up enough space, and it likely has to do a lot of compaction to satisfy the next allocation(s).
This explains why it took about 3-4 hours on my personal server (which has 16 GB of RAM but does not run the client portion) and about 2.5 hours on my local machine (a roughly equivalent VM memory limit, but with both the server and client allocations in the working set).
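As an aside (not part of the tooling I originally used for this report), the GC behaviour Spark is reporting can also be cross-checked with the standard GarbageCollectorMXBean API; a self-contained probe like the sketch below prints cumulative collection counts and times per collector, and a rapidly growing total on the old-generation collector is the same thrashing signature described above:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // Cumulative counts/times since VM start; diff successive samples to see the trend.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10_000); // sample every 10 seconds
        }
    }
}
```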
@Rover656 noticed that capability caching could be at fault here.
This does make sense on reading the code. At least on the 1.20.1 branch, the updateCapabilityCaches() method of MachineBlockEntity ultimately calls addInvalidationListener() on the capabilities it finds in its 6 cardinal directions. That method adds an invalidation listener to the LazyOptional returned by the capable block in order to mark its own cache as invalid (to be rebuilt on the next server tick). However, no effort is made to prevent adding superfluous listeners to the same LazyOptional objects, so if an ICapabilityProvider reuses the same capability object (the most sensible thing to do, since they are invalidate-able), EIO will just keep adding additional listeners every time the cache is marked invalid. This has two deleterious effects: it acts as a memory leak, and it makes other mods' invalidation notifications take longer than they need to. The latter is limited to the function call itself and a single field assignment (isCapabilityCacheDirty = true), so it's not too likely to be noticed even when the number of duplicate listeners is quite high.

The Forge API does not include a method for removing invalidation listeners, so unfortunately a simple approach such as tracking all registered listeners within the MachineBlockEntity and removing them prior to updating the cache is not possible (at least on 1.20.1).
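For illustration, here is a minimal sketch of that pattern written against Forge 1.20.1's LazyOptional API. The class, field, and method names are hypothetical (this is not the actual EnderIO code); the point is simply that nothing prevents addListener() from being called repeatedly on the same LazyOptional instance:

```java
import net.minecraft.core.Direction;
import net.minecraft.world.level.block.entity.BlockEntity;
import net.minecraftforge.common.capabilities.ForgeCapabilities;
import net.minecraftforge.common.util.LazyOptional;
import net.minecraftforge.energy.IEnergyStorage;

class NeighbourEnergyCacheSketch {
    private boolean isCapabilityCacheDirty = true;
    private LazyOptional<IEnergyStorage> cachedNeighbourEnergy = LazyOptional.empty();

    // Called on the server tick whenever the cache has been marked dirty.
    void rebuildCache(BlockEntity neighbour, Direction towardsSelf) {
        cachedNeighbourEnergy = neighbour.getCapability(ForgeCapabilities.ENERGY, towardsSelf);
        // If the neighbour hands back the same LazyOptional instance on every lookup,
        // this registers one more listener per rebuild and none of them are ever removed.
        cachedNeighbourEnergy.addListener(ignored -> isCapabilityCacheDirty = true);
        isCapabilityCacheDirty = false;
    }
}
```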
The other alternative would be to keep a weak set of the LazyOptionals that already have listeners added, and consult that set prior to adding a new one. The listener itself need not be memoized.
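A minimal sketch of that weak-set approach (hypothetical names, not a patch against the real EnderIO classes) could look like the following, using a Set backed by a WeakHashMap so the tracking set itself never pins the neighbours' LazyOptionals:

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

import net.minecraftforge.common.util.LazyOptional;

class InvalidationListenerDeduplicator {
    // Weakly-referenced set: entries disappear once the neighbour's LazyOptional is
    // invalidated and garbage collected, so the dedupe set cannot become a leak itself.
    private final Set<LazyOptional<?>> listened =
            Collections.newSetFromMap(new WeakHashMap<>());

    /** Registers the invalidation listener only the first time a given LazyOptional is seen. */
    <T> void addListenerOnce(LazyOptional<T> cap, Runnable markCacheDirty) {
        if (cap.isPresent() && listened.add(cap)) {
            cap.addListener(ignored -> markCacheDirty.run());
        }
    }
}
```

As far as I can tell LazyOptional does not override equals()/hashCode(), so the set deduplicates on object identity, which is exactly what is wanted here.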
I'm hoping to confirm that this is indeed the issue via reproduction in an EIO-only dev environment; I'll update this issue as I learn more.
If you are interested in the full Discord discussion, you can find it here: https://discord.com/channels/373534853259329536/373534853259329538/1249677198537789471
Expected Behavior
Not leading to OOM / GC thrashing.
How to Test
Should be repeatable by:
- Building a sufficiently large capacitor bank multiblock (a 3x8x8 bank was enough in my testing) with a creative energy source feeding one side and an energy sink draining the other.
- Leaving the world running for a few hours (about 2.5 hours with a ~14 GB heap in my case).
- Watching heap usage and old-generation GC time climb (e.g. with Spark) until multi-second lock-ups begin.
Local Environment
ATM9 modpack (Minecraft 1.20.1, Forge). Singleplayer client with ~14.3 GB allotted to Java; dedicated personal server with 16 GB of RAM.
Any Additional Information?
No response
I've done a bit of testing and there's a bigger issue at play:
MachineBlocks delete capability caches every time an adjacent block is updated, and machines force an update on themselves the moment their energy changes. Adding a cache for invalidations does at least slow the leak down, but only if you never clear it (creating a leak to destroy the leak).
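To spell out that trade-off (hypothetical names, not the actual EnderIO code): a strongly-referenced dedupe set only prevents duplicates for as long as it is kept, so clearing it together with the capability cache brings the duplicate listeners straight back, while never clearing it turns the set itself into the thing that grows without bound:

```java
import java.util.HashSet;
import java.util.Set;

import net.minecraftforge.common.util.LazyOptional;

class StrongDedupeTradeoff {
    private final Set<LazyOptional<?>> seen = new HashSet<>(); // strong references

    <T> void listen(LazyOptional<T> cap, Runnable markDirty) {
        if (seen.add(cap)) { // true only the first time this exact instance is seen
            cap.addListener(ignored -> markDirty.run());
        }
    }

    void onAdjacentBlockUpdate() {
        // Clearing here means the next rebuild re-adds listeners to the same LazyOptional
        // instances, reintroducing the duplicate-listener growth; never clearing means
        // 'seen' strongly pins every LazyOptional ever observed ("a leak to destroy the leak").
        seen.clear();
    }
}
```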