[CONTROLLER-7] System can go into livelock after forwarding state is established between multiple switches Created: 11/Apr/13  Updated: 19/Oct/17  Resolved: 21/Apr/16

Status: Resolved
Project: controller
Component/s: adsal
Affects Version/s: 0.4.0
Fix Version/s: None

Type: Bug
Reporter: Gary Berger Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Other


External issue ID: 9

 Description   

There is currently no liveness check based on echo/echo_reply that would flag a switch as down. SwitchHandler could treat NIO channel read errors as a hint to start polling the switch at a fixed interval.
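As an illustration of the missing echo-based liveness check, here is a minimal sketch. The class name EchoLivenessTracker, its methods, and the timeout value are all hypothetical and do not come from the controller code base; it only assumes that something records the time of each echo_reply per switch and that a switch with no recent reply can be treated as down.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical liveness tracker; not part of the controller code base. */
class EchoLivenessTracker {
    private final long timeoutMillis;
    // switch id -> timestamp (ms) of the last echo_reply seen for that switch
    private final Map<Long, Long> lastReply = new ConcurrentHashMap<>();

    EchoLivenessTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record that an echo_reply arrived from the given switch. */
    void onEchoReply(long switchId, long nowMillis) {
        lastReply.put(switchId, nowMillis);
    }

    /** A switch is considered down if it never replied or its last reply is too old. */
    boolean isDown(long switchId, long nowMillis) {
        Long last = lastReply.get(switchId);
        return last == null || nowMillis - last > timeoutMillis;
    }

    /** Stop tracking a switch once it has been cleaned up. */
    void forget(long switchId) {
        lastReply.remove(switchId);
    }
}
```

Timestamps are passed in rather than read from the clock so the check can be exercised without waiting; a real handler would presumably send echo requests on its own schedule and call isDown from a periodic task.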

Procedure:

1. Use the SimpleForwarding sample app to provide learning_bridge capability.
2. Establish a topology of roughly 15 switches using the mininet simulator.
3. Do a pairwise ping to establish flow state.
4. Bounce the switches by exiting mininet.

A number of services keep trying to read from the channel, and SwitchEvent objects and queue entries pile up. References to these objects are never released, which keeps them in OldGen space until the heap is exhausted.
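One way to keep such a backlog from exhausting the heap is to bound the queue and drop events when it is full. The sketch below is an assumption about a possible mitigation, not the controller's actual design: BoundedEventQueue and its method names are hypothetical, and whether the real switch-events queue is unbounded is only suggested by the LinkedBlockingQueue$Node counts in the histogram.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded wrapper around an event queue; names are illustrative only. */
class BoundedEventQueue<E> {
    private final BlockingQueue<E> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedEventQueue(int capacity) {
        // A capacity-limited LinkedBlockingQueue rejects offers once it is full.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Non-blocking enqueue: returns false and counts a drop when the queue is full. */
    boolean publish(E event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Returns the next event, or null if the queue is empty. */
    E poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

Dropping events trades completeness for bounded memory; a growing drop counter would also be an early signal that a producer is stuck on a dead channel.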

Heap Histogram

 num  #instances      #bytes  class name
----------------------------------------------
  1:    18815125   451563000  java.util.concurrent.LinkedBlockingQueue$Node
  2:    18814799   451555176  o.o.c.p.openflow.core.internal.SwitchEvent
  3:       86428    13728920  <constMethodKlass>
  4:       75675    13373088  [C
  5:       86428    11764480  <methodKlass>
  6:        8782    10267656  <constantPoolKlass>
  7:        8782     6374792  <instanceKlassKlass>

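The two queue-related classes at the top of the histogram account for roughly 903 MB between them. The heap grows rapidly, filling Eden and OldGen space until no new objects can be created.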

Heap GC

    S0     S1      E      O      P   YGC    YGCT   FGC    FGCT     GCT
  0.00 100.00 100.00 100.00  99.65   306   7.199    40  55.495  62.694
  0.00 100.00 100.00 100.00  99.65   306   7.199    40  55.495  62.694
  0.00 100.00 100.00 100.00  99.65   306   7.199    40  55.495  62.694
  0.00 100.00 100.00 100.00  99.65   306   7.199    40  55.495  62.694
  0.00 100.00 100.00 100.00  99.62   306   7.199    40  55.495  62.694
  0.00 100.00 100.00 100.00  99.62   306   7.199    40  55.495  62.694
  0.00   0.00  67.58 100.00  99.62   306   7.199    40  58.027  65.226
  0.00   5.77 100.00 100.00  99.62   306   7.199    40  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226
  0.00 100.00 100.00 100.00  99.62   306   7.199    41  58.027  65.226

A timer must be implemented to clean up switch events, and possibly to trigger a cleanup of new messages (Statistics, FlowMods), based on a callback for AsynchronousCloseException.
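The proposed timer and callback could look roughly like the sketch below. SwitchEventReaper, its nested Event class, and the method names are hypothetical stand-ins, not controller classes; the sketch assumes the I/O layer can report a per-switch failure cause, and it uses the standard java.nio.channels.AsynchronousCloseException and ScheduledExecutorService APIs.

```java
import java.nio.channels.AsynchronousCloseException;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic cleaner for events of closed switches. */
class SwitchEventReaper {
    /** Hypothetical stand-in for the controller's SwitchEvent. */
    static final class Event {
        final long switchId;
        Event(long switchId) { this.switchId = switchId; }
    }

    private final Queue<Event> events;
    // ids of switches whose channel was reported closed
    private final Set<Long> closed = ConcurrentHashMap.newKeySet();

    SwitchEventReaper(Queue<Event> events) {
        this.events = events;
    }

    /** Callback for channel failures; only a close exception marks the switch as closed. */
    void onChannelError(long switchId, Throwable cause) {
        if (cause instanceof AsynchronousCloseException) {
            closed.add(switchId);
        }
    }

    /**
     * One sweep: removes queued events that belong to closed switches.
     * The returned count is approximate if producers are still enqueuing.
     */
    int sweep() {
        int before = events.size();
        events.removeIf(e -> closed.contains(e.switchId));
        return before - events.size();
    }

    /** Runs sweep() periodically on the given executor. */
    ScheduledFuture<?> start(ScheduledExecutorService executor, long periodSeconds) {
        return executor.scheduleWithFixedDelay(this::sweep, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }
}
```

Keeping sweep() separate from the scheduling makes it callable directly, for example from the close callback itself when an immediate cleanup is preferred over waiting for the next timer tick.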



 Comments   
Comment by Muthukumaran Kothandaraman [ 05/Jun/13 ]

Gary,

I took a look at your observation and would like a clarification. The Controller's EventHandler thread seems to clear events by scanning the switch-events queue; when switches are bounced, a switch-error event is sent to the switch-events queue and consumed by EventHandler for cleanup.

But your observation seems to indicate that the events are residual. Am I missing something or misunderstanding your observation?

Regards
Muthu

Comment by Carol Sanders [ 04/May/15 ]

This bug is part of the project to move all ADSAL-associated component bugs to ADSAL.

Comment by Robert Varga [ 21/Apr/16 ]

AD-SAL was removed, hence won't fix.

Generated at Wed Feb 07 19:51:57 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.