Capacitor Bank: Long term TPS drain with large multiblock sizes (probable memory leak) #680

Closed
opened 2024-06-10 23:37:53 +00:00 by rezonant · 1 comment
rezonant commented 2024-06-10 23:37:53 +00:00 (Migrated from github.com)

### Is there an existing issue for this?

- [x] I did not find any existing issues.

### Current Behavior

Symptom: multi-second delays on some (but not all) server ticks, and multi-second lockups on the integrated server (singleplayer), ultimately causing the tick watchdog to crash the server (160-second tick).
TL;DR / likely cause: a memory leak in machine block entity capability caching, caused by adding duplicate invalidation listeners to LazyOptional capabilities; the VM eventually hits its memory limit and the GC starts thrashing.

--

The long version of the story is that I was gradually building up a large cap bank to act as the central storage in my base. Around the time I hit one million RF/t of generation (though that is probably not deeply relevant), my server (a personal server for me and my SO) began hitting extremely low TPS after several hours, before ultimately failing the tick watchdog check and crashing (a 160-second tick). When I started the server again it would be fine for several hours, then rinse and repeat.

I did a ton of Spark and Observable profiling to try to ascertain the cause (this is in ATM9 and we're near the end game, so there was tons to check) but could not find an obvious culprit. Then during one of the TPS drain events I noticed that my cap bank entities were somewhat slower than before (probably not relevant as you'll see) and on that hunch decided to remove the capacitor bank. The issue went away and has not recurred.

The Vibrant capacitor bank was around 192 blocks large.

I raised this on the EnderIO Discord and we had some conversations about potential causes. During this time I was able to reproduce the problem with a 3 × 8 × 8 multiblock after 2.5 hours in a singleplayer world using the same modpack. Since it was singleplayer, it locked up the entire client, which was a clue: IIRC, even in integrated-server mode the server ticks run on a separate thread from the client. The lockups lasted about 4 seconds each and happened every 4-8 seconds.

Here is the test setup at the time it was built (the left is EIO's Creative Energy block, the right is an energy trash can from the Trash Cans mod).

![image](https://github.com/Team-EnderIO/EnderIO/assets/409140/9e952ccf-0ee7-454e-b511-58cd3d8bde98)

Here are two Spark profiles taken while the lockups were occurring. There isn't much to look at in the profiler view.

https://spark.lucko.me/7GlO6pfBby
https://spark.lucko.me/halqg3TAUb

In these, the most expensive tick cost was around mob ticks, but ATM9 has tons of mods that mixin to mob ticks, and the total time was nowhere near the length of each lockup.

After gathering the data, I split the multiblock in half, which immediately resolved the issue: the 4-second lockups stopped happening almost instantly.

These do capture one pertinent fact: 14.1 GB of RAM was in use (out of 14.3 GB allotted in the Java configuration), with an average GC old-generation pause of 4.4 seconds, which is bang on for the length of each lockup.

![image](https://github.com/Team-EnderIO/EnderIO/assets/409140/87db6035-fcb5-4d55-89fb-fc95bd41849e)

Furthermore, that explains why the entire client locked up, despite Minecraft running the server/client ticks on separate threads.

So this is actually a memory leak which eventually causes the VM to run out of allocatable memory, causing the GC to thrash looking for enough space to free up, and likely having to do a lot of compaction to enable the next allocation(s).

This explains why it took about 3-4 hours on my personal server (which has 16 GB of RAM, but is not running the client portion) and 2.5 hours on my local machine (a roughly equivalent VM memory limit, but with both the server and client allocations in the working set).
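For anyone wanting to confirm the GC angle on their own setup, unified JVM logging makes the old-generation pauses visible directly. This is a generic Java 17 flag, nothing EnderIO- or ATM9-specific:

```
-Xlog:gc*:file=gc.log:time,uptime,level,tags
```

Pauses in the resulting gc.log that line up with the 4-second lockups, alongside a heap that stays pinned near the configured maximum even after full GCs, point in the same direction as the Spark GC statistics above.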

@Rover656 noticed that capability caching could be at fault here:

> I reckon this could well be our capability caching; if each cell in this arrangement is caching its neighbours that could explain memory consumption?
> ...
> It looks to be maybe more to do with the invalidation listener
> yeah one thing that isn't being done is removal of the invalidation listener when we clear and refill the cache
> that will absolutely lead to a heap of handlers just lying dormant
> do that in all 6 directions of a multiblock, I imagine every tick as we recompute when neighbours change (need to check this claim by checking out 1.20.1 tomorrow) and boom you get a lot of listeners

This does make sense on reading the code. At least on the 1.20.1 branch, the updateCapabilityCaches() method of MachineBlockEntity ultimately calls addInvalidationListener() on the capabilities it finds in its 6 cardinal directions. That method adds an invalidation listener on the LazyOptional returned by the capable block, in order to mark its own cache as invalid (to be rebuilt on the next server tick). However, no effort is made to prevent adding superfluous listeners to the same LazyOptional objects, so if an ICapabilityProvider reuses the same capability object (which would be the most sensible thing to do, since they are invalidate-able), EIO will just keep adding additional listeners every time the cache is marked invalid. This has two deleterious effects: it acts as a memory leak, and it makes other mods' invalidation notifications take longer than they need to, though that cost is limited primarily to the function call itself and a single value assignment (`isCapabilityCacheDirty = true`), so it's not too likely to be noticed even when the number of duplicate listeners is quite high.
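To illustrate the pattern described above, here is a simplified sketch, not the actual EnderIO source: the cache field and its layout are placeholders of my own, while LazyOptional.addListener is the real Forge hook the listeners end up registered through.

```java
// Assumed context: a Forge 1.20.1 BlockEntity subclass; uses java.util.Map/EnumMap,
// net.minecraft.core.Direction, net.minecraft.world.level.block.entity.BlockEntity,
// net.minecraftforge.common.capabilities.ForgeCapabilities,
// net.minecraftforge.common.util.LazyOptional, net.minecraftforge.energy.IEnergyStorage.
private final Map<Direction, LazyOptional<IEnergyStorage>> capabilityCache =
    new EnumMap<>(Direction.class); // placeholder per-side cache
private boolean isCapabilityCacheDirty = false;

private void updateCapabilityCaches() {
    capabilityCache.clear();
    for (Direction side : Direction.values()) {
        BlockEntity neighbour = level.getBlockEntity(worldPosition.relative(side));
        if (neighbour == null) {
            continue;
        }
        LazyOptional<IEnergyStorage> cap =
            neighbour.getCapability(ForgeCapabilities.ENERGY, side.getOpposite());
        capabilityCache.put(side, cap);

        // If the neighbour returns the same LazyOptional instance on every query
        // (legal and sensible for an ICapabilityProvider), this attaches one more
        // listener per rebuild; nothing ever removes them, so they accumulate and
        // are kept alive by the neighbour's LazyOptional.
        cap.addListener(ignored -> isCapabilityCacheDirty = true);
    }
}
```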

The Forge API does not include a method for removing invalidation listeners, so unfortunately a simple approach such as tracking all registered listeners within the MachineBlockEntity and removing them prior to updating the cache is not possible (at least for 1.20.1).

The other alternative would be to keep a weak set of the LazyOptionals that already have listeners added, and consult that set before adding a new one. The listener itself need not be memoized.
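A minimal sketch of that idea follows; the names are illustrative rather than an existing EnderIO API, and it relies on LazyOptional using default identity equality, which is what makes a WeakHashMap-backed set workable here.

```java
// Uses java.util.Collections, java.util.Set, java.util.WeakHashMap.
// Weakly track the LazyOptionals that already carry one of our listeners, so a
// cache rebuild can skip re-registering; weak references mean the set never keeps
// a neighbour's (possibly stale) capability alive on its own.
private final Set<LazyOptional<?>> listenedCapabilities =
    Collections.newSetFromMap(new WeakHashMap<>());

private void addInvalidationListenerOnce(LazyOptional<?> cap) {
    // Set.add returns false when this exact instance has been seen before, so each
    // LazyOptional gets at most one of our listeners regardless of rebuild count.
    if (listenedCapabilities.add(cap)) {
        cap.addListener(invalidated -> {
            isCapabilityCacheDirty = true;
            // Drop the invalidated optional; a replacement LazyOptional published
            // by the neighbour later will get its own listener.
            listenedCapabilities.remove(invalidated);
        });
    }
}
```

A strong set would defeat the purpose, since it would itself pin every stale LazyOptional; and as the comment below notes, a guard like this only slows the growth, and only if the guard set is not cleared along with the cache.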

I'm hoping to confirm that this is indeed the issue via reproduction in an EIO-only dev environment; I'll update this issue as I learn more.

If you are interested in the full Discord discussion, you can find it here: https://discord.com/channels/373534853259329536/373534853259329538/1249677198537789471

### Expected Behavior

Not leading to OOM / GC thrashing.

### How to Test

Should be repeatable by:

  1. Making a sufficiently large vibrant capacitor bank (for my testing I used 192 blocks, 3 × 8 × 8)
  2. Adding an EIO Creative Energy block to pump in energy
  3. Adding a reasonable "load" (for my testing I used the Energy Trash Can from the Trash Cans mod with unlimited speed)
  4. Watching memory usage climb until GC thrashing occurs (a heap dump can show what is accumulating; see the sketch after this list)
  5. Once the client/server is locking up, breaking the capacitor bank to reduce its size. I cut mine roughly in half; I happened to cut through it in a way that left one of the two resulting multiblocks with the energy source and the other with the energy load.
  6. The GC thrashing stops.
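A hedged way to verify what is accumulating during step 4: take a heap dump while memory is climbing and look for an outsized number of listener (NonNullConsumer) instances reachable from the neighbouring blocks' LazyOptionals. Standard JDK tooling is enough; the paths and PIDs below are placeholders, and recent Spark builds also have an in-game heapdump command if that is more convenient.

```
# Either command produces an .hprof file that can be opened in VisualVM or Eclipse MAT.
jcmd <minecraft-pid> GC.heap_dump /path/to/heap.hprof
jmap -dump:live,format=b,file=/path/to/heap.hprof <minecraft-pid>
```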

### Local Environment

- **Operating System**: Replicated on Linux and Windows
- **Java version**: 17.0.8
- **Minecraft Version**: 1.20.1, forge-47.2.20
- **Version**: EnderIO 1.20.1-6.0.25-alpha
- **Other Mods**: ATM9-0.2.58 without any customizations

### Any Additional Information?

No response

DoctorNefario commented 2024-06-11 03:43:35 +00:00 (Migrated from github.com)

I've done a bit of testing and there's a bigger issue at play: `MachineBlock`s [delete capability caches](https://github.com/Team-EnderIO/EnderIO/blob/9dd55a67db759386a0a3f42f9400e880a2e2b1a7/src/machines/java/com/enderio/machines/common/block/MachineBlock.java#L83) every time an adjacent block is updated, and machines [force update themselves](https://github.com/Team-EnderIO/EnderIO/blob/9dd55a67db759386a0a3f42f9400e880a2e2b1a7/src/machines/java/com/enderio/machines/common/blockentity/base/PoweredMachineBlockEntity.java#L216) the moment their energy changes. Adding a cache for invalidations does at least *slow down* the leak, but only if you never clear it. (Creating a leak to destroy the leak)
