[INFRAUTILS-66] Threadswatcher causing safepoint pauses due to calling FindDeadlocks every 500ms Created: 25/Nov/19  Updated: 29/Nov/19  Resolved: 29/Nov/19

Status: Resolved
Project: infrautils
Component/s: metrics
Affects Version/s: Oxygen-SR4, Neon, Sodium, Magnesium
Fix Version/s: Magnesium, Sodium SR2, Neon SR3

Type: Bug Priority: High
Reporter: Tomas Cere Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

We experience frequent safepoint pauses seen in this log:

// code          
vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2653.486: FindDeadlocks                    [     870          1              1    ]      [     1     0     2     1     0    ]  1
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2653.986: FindDeadlocks                    [     870          0              0    ]      [     0     0     0     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2654.486: FindDeadlocks                    [     870          0              3    ]      [     0     0     0     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2654.986: FindDeadlocks                    [     870          0              2    ]      [     0     0     0     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2655.486: FindDeadlocks                    [     871          0              1    ]      [     0     0     0     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2655.986: FindDeadlocks                    [     871          0              0    ]      [     0     0     0     1     0    ]  0
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2656.486: FindDeadlocks                    [     870          0              0    ]      [     0     0     0     1     0    ]  0

 Which was traced to ThreadsWatcher calling FindDeadlocks every 500ms.
 



 Comments   
Comment by Robert Varga [ 25/Nov/19 ]

This results in large pauses as evidenced by:

2019-11-15T15:43:36.928+0000: 3304.685: Total time for which application threads were stopped: 9.2916061 seconds, Stopping threads took: 0.0001933 seconds
2019-11-15T15:44:47.185+0000: 3374.942: Total time for which application threads were stopped: 9.5493942 seconds, Stopping threads took: 0.0001182 seconds
2019-11-15T15:46:00.042+0000: 3447.800: Total time for which application threads were stopped: 12.4064250 seconds, Stopping threads took: 0.0001305 seconds
2019-11-15T15:47:06.771+0000: 3514.528: Total time for which application threads were stopped: 6.1348460 seconds, Stopping threads took: 0.0005121 seconds
2019-11-15T15:48:12.704+0000: 3580.462: Total time for which application threads were stopped: 5.5681979 seconds, Stopping threads took: 0.0002656 seconds
2019-11-15T15:49:20.000+0000: 3647.758: Total time for which application threads were stopped: 6.8645785 seconds, Stopping threads took: 0.0001139 seconds
2019-11-15T15:50:22.866+0000: 3710.623: Total time for which application threads were stopped: 2.6894180 seconds, Stopping threads took: 0.0002664 seconds
2019-11-15T15:51:29.921+0000: 3777.678: Total time for which application threads were stopped: 6.7841151 seconds, Stopping threads took: 0.0001327 seconds
2019-11-15T15:52:35.962+0000: 3843.719: Total time for which application threads were stopped: 5.8258694 seconds, Stopping threads took: 0.0001516 seconds
2019-11-15T15:53:54.924+0000: 3922.681: Total time for which application threads were stopped: 4.7865003 seconds, Stopping threads took: 0.0002752 seconds
2019-11-15T15:55:13.234+0000: 4000.992: Total time for which application threads were stopped: 3.0976428 seconds, Stopping threads took: 0.0002604 seconds
2019-11-15T15:57:38.405+0000: 4146.162: Total time for which application threads were stopped: 4.2689397 seconds, Stopping threads took: 0.0001862 seconds
2019-11-15T15:58:45.222+0000: 4212.980: Total time for which application threads were stopped: 6.5856216 seconds, Stopping threads took: 0.0014271 seconds

i.e. every 60 seconds we force a stop-the-world for 3-12 seconds. I think this facility should be turned off by default.

Comment by Michael Vorburger [ 26/Nov/19 ]

Fixing this sounds like a great idea! JIRA assigns this to me automatically (default assignee), but I'm guessing you'd like to fix this? Go for it! (Unassigned from myself just to clarify I'm not actually going to work on this myself - but I'm happy to review any related Gerrit.)

Generated at Wed Feb 07 20:02:10 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.