[BGPCEP-111] PCEP request occasionally not found in outstanding DB Created: 18/Jul/14  Updated: 03/Mar/19  Due: 25/Jul/14  Resolved: 29/Jul/14

Status: Verified
Project: bgpcep
Component/s: General
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration 1.0

Type: Bug
Reporter: Vratko Polak Assignee: Dana Kutenicsova
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: PC


External issue ID: 1394

 Description   

When doing performance test, sometimes the situation described by the following log fragments happens, during add-lsp:

2014-07-18 03:40:04.005 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4]
TRACE o.o.p.pcep.impl.PCEPSessionImpl - Message sent to socket: Pcinitiate [_pcinitiateMessage=PcinitiateMessage [_requests=[Requests [_srp=Srp [_operationId=SrpIdNumber [_value=1], ...

2014-07-18 03:40:04.007 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4] DEBUG o.o.p.pcep.impl.PCEPSessionImpl - PCEP message Pcrpt [_pcrptMessage=PcrptMessage [_reports=[Reports [ ... _srp=Srp [_operationId=SrpIdNumber [_value=1], ...

2014-07-18 03:40:04.007 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-4] WARN o.o.b.p.t.p.Stateful07TopologySessionListener - Request ID SrpIdNumber [_value=1] not found in outstanding DB

This results in add-lsp never returning any output.



 Comments   
Comment by Vratko Polak [ 18/Jul/14 ]

Hypothesis: Apparently, ODL was assuming that it can add request to waitingRequests upon sending Pcinitiate sooner than Pcrpt handling will look for it. As log evidences, this is not always so.

Comment by Vratko Polak [ 18/Jul/14 ]

My idea on how the fix could look like:

https://git.opendaylight.org/gerrit/#/c/9160/

Comment by Robert Varga [ 18/Jul/14 ]

So the logs provided point actually to the fact that the request should have been on the queue: the listener writing the 'sent to socket' message is attached to the same future the listener which moves the message. The processing happens on the same thread, so the other listener should have been invoked, too.

https://git.opendaylight.org/gerrit/9162 adds trace messages to track lifecycle of the list, please retry with that.

Comment by Vratko Polak [ 21/Jul/14 ]

> please retry with that

Basically the same. Excerpts:

2014-07-21 09:18:24.295 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7]
TRACE o.o.p.pcep.impl.PCEPSessionImpl - Message sent to socket: Pcinitiate ...

2014-07-21 09:18:24.296 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.p.p.i.PCEPByteToMessageDecoder - Received to decode: ...

2014-07-21 09:18:24.300 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.b.p.t.p.AbstractTopologySessionListener - Removed request SrpIdNumber [_value=3] object null

2014-07-21 09:18:24.300 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] WARN o.o.b.p.t.p.Stateful07TopologySessionListener - Request ID SrpIdNumber [_value=3] not found in outstanding DB

2014-07-21 09:18:24.320 PDT [nettyThreadgroupModule$NioEventLoopGroupCloseable-7-7] TRACE o.o.b.p.t.p.AbstractTopologySessionListener - Request SrpIdNumber [_value=3] sent to peer (object org.opendaylight.bgpcep.pcep.topology.provider.PCEPRequest@20deec50)

> the logs provided point actually to the fact that the request should have been on the queue
> The processing happens on the same thread, so the other listener should have been invoked, too.

It was invoked, just 20 milliseconds too late.

Comment by Vratko Polak [ 22/Jul/14 ]

My testing reached statistical significance for me to say: It is suspicious that I have never seen this bug when testing against single PCC (others present but do nothing, maybe apart keepalives).
But, logs do not show anything interesting regarding communication on other Pcep sessions (usually handled by different threads and with long enough time differences).

Comment by Vratko Polak [ 22/Jul/14 ]

Training to get Change-Id right (and without syntax errors).
https://git.opendaylight.org/gerrit/#/c/9160/

Comment by Robert Varga [ 23/Jul/14 ]

So looking at the traces, it seems that netty does not deliver the notification of the request leaving the socket first, but rather continues to process the message from the router.

Your fixes are just papering over the problem, not fixing it, simply because membership on a particular list hold the logical state of the request.

In particular, even with your fix, if the PCC closes the connection after we have pushed it, but have not received the future notification, we will inform the client of an Unsent error, where in fact it is Noack.

So what we really need to do is get netty to order the events as they occur.

Comment by Vratko Polak [ 23/Jul/14 ]

> So what we really need to do is get netty to order the events as they occur.

Agreed.

> In particular, even with your fix, if the PCC closes the connection after we have pushed it, but have not received the future notification, we will inform the client of an Unsent error, where in fact it is Noack.

I would have to think more about interaction between onSessionDown and !isSuccess.
For commit b2c0fb9b66d35776f0bf4e6cea614c7c1e419d35 line 216, are we sure PCC has definitely not received the message?

> Your fixes are just papering over the problem

So it is not worth to do a quick and dirty fix for "RPC hangs" bug if that would introduce "false unsent" bug?

Comment by Vratko Polak [ 29/Jul/14 ]

This bug is verified to not appear in ODL based on this build: https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/1741

Generated at Wed Feb 07 19:12:03 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.