[BGPCEP-111] PCEP request occasionally not found in outstanding DB Created: 18/Jul/14 Updated: 03/Mar/19 Due: 25/Jul/14 Resolved: 29/Jul/14 |
|
| Status: | Verified |
| Project: | bgpcep |
| Component/s: | General |
| Affects Version/s: | Bugzilla Migration |
| Fix Version/s: | Bugzilla Migration 1.0 |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Dana Kutenicsova |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: Linux |
||
| External issue ID: | 1394 |
| Description |
|
When doing performance test, sometimes the situation described by the following log fragments happens, during add-lsp: 2014-07-18 03:40:04.005 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4] 2014-07-18 03:40:04.007 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4] DEBUG o.o.p.pcep.impl.PCEPSessionImpl - PCEP message Pcrpt [_pcrptMessage=PcrptMessage [_reports=[Reports [ ... _srp=Srp [_operationId=SrpIdNumber [_value=1], ... 2014-07-18 03:40:04.007 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4] WARN o.o.b.p.t.p.Stateful07TopologySessionListener - Request ID SrpIdNumber [_value=1] not found in outstanding DB This results in add-lsp never returning any output. |
| Comments |
| Comment by Vratko Polak [ 18/Jul/14 ] |
|
Hypothesis: Apparently, ODL was assuming that it can add request to waitingRequests upon sending Pcinitiate sooner than Pcrpt handling will look for it. As log evidences, this is not always so. |
| Comment by Vratko Polak [ 18/Jul/14 ] |
|
My idea on how the fix could look like: |
| Comment by Robert Varga [ 18/Jul/14 ] |
|
So the logs provided point actually to the fact that the request should have been on the queue: the listener writing the 'sent to socket' message is attached to the same future the listener which moves the message. The processing happens on the same thread, so the other listener should have been invoked, too. https://git.opendaylight.org/gerrit/9162 adds trace messages to track lifecycle of the list, please retry with that. |
| Comment by Vratko Polak [ 21/Jul/14 ] |
|
> please retry with that Basically the same. Excerpts: 2014-07-21 09:18:24.295 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] 2014-07-21 09:18:24.296 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.p.p.i.PCEPByteToMessageDecoder - Received to decode: ... 2014-07-21 09:18:24.300 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.b.p.t.p.AbstractTopologySessionListener - Removed request SrpIdNumber [_value=3] object null 2014-07-21 09:18:24.300 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] WARN o.o.b.p.t.p.Stateful07TopologySessionListener - Request ID SrpIdNumber [_value=3] not found in outstanding DB 2014-07-21 09:18:24.320 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.b.p.t.p.AbstractTopologySessionListener - Request SrpIdNumber [_value=3] sent to peer (object org.opendaylight.bgpcep.pcep.topology.provider.PCEPRequest@20deec50) > the logs provided point actually to the fact that the request should have been on the queue It was invoked, just 20 milliseconds too late. |
| Comment by Vratko Polak [ 22/Jul/14 ] |
|
My testing reached statistical significance for me to say: It is suspicious that I have never seen this bug when testing against single PCC (others present but do nothing, maybe apart keepalives). |
| Comment by Vratko Polak [ 22/Jul/14 ] |
|
Training to get Change-Id right (and without syntax errors). |
| Comment by Robert Varga [ 23/Jul/14 ] |
|
So looking at the traces, it seems that netty does not deliver the notification of the request leaving the socket first, but rather continues to process the message from the router. Your fixes are just papering over the problem, not fixing it, simply because membership on a particular list hold the logical state of the request. In particular, even with your fix, if the PCC closes the connection after we have pushed it, but have not received the future notification, we will inform the client of an Unsent error, where in fact it is Noack. So what we really need to do is get netty to order the events as they occur. |
| Comment by Vratko Polak [ 23/Jul/14 ] |
|
> So what we really need to do is get netty to order the events as they occur. Agreed. > In particular, even with your fix, if the PCC closes the connection after we have pushed it, but have not received the future notification, we will inform the client of an Unsent error, where in fact it is Noack. I would have to think more about interaction between onSessionDown and !isSuccess. > Your fixes are just papering over the problem So it is not worth to do a quick and dirty fix for "RPC hangs" bug if that would introduce "false unsent" bug? |
| Comment by Vratko Polak [ 29/Jul/14 ] |
|
This bug is verified to not appear in ODL based on this build: https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/1741 |