[BGPCEP-512] Missing ORIGIN attribute can break app-peer Created: 09/Aug/16 Updated: 03/Mar/19 Resolved: 07/Sep/16 |
|
| Status: | Resolved |
| Project: | bgpcep |
| Component/s: | BGP |
| Affects Version/s: | Bugzilla Migration |
| Fix Version/s: | Bugzilla Migration |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 6373 |
| Description |
|
In CSIT Userfeatures test job, "Ibgp Peer Lsp" suite sends update message without ORIGIN attribute (a current bug in the suite). ODL disconnects the peer, but fails to perform correct cleanup. Every suite after "Ibgp Peer Lsp" fails at test cases which expect consequences initiated by app-peer (each time app-peer is configured anew). Test cases for bgp-rib consequences from external peer are still passing Looking at karaf.log there seem to be a transaction chain left in failed state. Parts on karaf.log lines: 2016-08-09 01:15:25,444 | ERROR | entLoopGroup-5-7 | BGPDocumentedException | 89 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.0.SNAPSHOT | Error = WELL_KNOWN_ATTR_MISSING 2016-08-09 01:15:25,444 | WARN | entLoopGroup-5-7 | BGPSessionImpl | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | BGP session encountered error 2016-08-09 01:15:25,447 | INFO | entLoopGroup-5-7 | BGPPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Session with peer 10.29.12.108 terminated: BGPTerminationReason {error=WELL_KNOWN_ATTR_MISSING}2016-08-09 01:15:25,452 | INFO | entLoopGroup-5-7 | BGPSessionImpl | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Closing session: BGPSessionImpl {channel=[id: 0x58e9ec32, L:/10.29.12.199:1790 ! R:/10.29.12.108:17900], state=IDLE}2016-08-09 01:15:25,455 | WARN | ult-dispatcher-6 | ShardDataTree | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-58-txn-2: Data validation failed 2016-08-09 01:15:25,457 | ERROR | lt-dispatcher-16 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Failed to prepare transaction member-1-datastore-operational-fe-0-chn-58-txn-2 on backend 2016-08-09 01:15:25,458 | WARN | ult-dispatcher-6 | ConcurrentDOMDataBroker | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Tx: DOM-CHAIN-57-2 Error during phase CAN_COMMIT, starting Abort 2016-08-09 01:15:25,461 | ERROR | CommitFutures-0 | BGPPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed. 2016-08-09 01:15:56,268 | WARN | lt-dispatcher-20 | ShardDataTree | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-62-txn-0: Conflicting modification for path 2016-08-09 01:15:56,269 | ERROR | lt-dispatcher-20 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Failed to prepare transaction member-1-datastore-operational-fe-0-chn-62-txn-0 on backend 2016-08-09 01:15:56,274 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed. 2016-08-09 01:16:10,609 | ERROR | n-dispatcher-146 | DataTreeChangeListenerActor | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Error notifying listener org.opendaylight.protocol.bgp.rib.impl.ApplicationPeer@284e83ce |
| Comments |
| Comment by Ajay L [ 10/Aug/16 ] |
|
Tried to repro this in local setup. Able to see "chain failed" error for BGPPeer which is kind of expected when update message has mandatory attribute missing as the peer connection is dropped in this case Not able to repro the "chain failed" error for ApplicationPeer. Below is the message seen prior to that. So it hits "Node was created by other transaction" when trying to write data into adj-rib-in of app peer 2016-08-09 01:15:55,992 | INFO | h for user karaf | command | 307 - org.apache.karaf.log.command - 3.0.6 | ROBOT MESSAGE: Starting test Configure_App_Peer ]/peer/peer[ {(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)peer-id=bgp://10.29.12.199}]/adj-rib-in/tables/tables[ {(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)afi=(urn:opendaylight:params:xml:ns:yang:bgp-types?revision=2013-09-19)ipv4-address-family, (urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)safi=(urn:opendaylight:params:xml:ns:yang:bgp-types?revision=2013-09-19)unicast-subsequent-address-family}]/afi. To get more details about the failure, is it possible to re-run this suite with debugs turned on and share the karaf logs? I can see that Robot keywords are available to turn-on debugs but unsure how to trigger the test from ODL Jenkins after making the change. If you can throw some light on that for my understanding, that will be great too KarafKeywords.Execute_Controller_Karaf_Command_On_Background log:set ${CONTROLLER_BGP_LOG_LEVEL} org.opendaylight.bgpcep |
| Comment by Vratko Polak [ 11/Aug/16 ] |
|
> unsure how to trigger the test from ODL Jenkins after making the change Most people cannot trigger jobs on RelEng. I can (as a releng/builder committer) so I have triggered [0]. The Robot variable values can be overridden using -v option in TESTOPTIONS job parameter. Usually Sandbox [1] is the platform to run such tests, but Sandbox results are deleted every Saturday. [0] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-userfeatures-only-boron/872 |
| Comment by Ajay L [ 11/Aug/16 ] |
|
(In reply to Vratko Polák from comment #2) Thanks. I accessed the logs at https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-userfeatures-only-boron/872/archives/karaf.log.gz and unable to find the error there. Can you please re-run the test to see if it we can get the error to happen? $ grep "chain failed" karaf-2.log Errors from previous log file for reference: $ grep "chain failed" karaf-1.log |
| Comment by Vratko Polak [ 12/Aug/16 ] |
|
Yeah, the consequences on the following suites are different in this debug run. There is still the main issue that this expected error: leads to this unexpected consequence: ]/peer/peer[ {(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)peer-id=bgp://10.29.13.36}] does not exist. Cannot apply modification to its children.]]} I am not sure if there is anything wrong in the subsequent test, between test start: and test giving up waiting on correct result: Scheduled #875 to see if the behavior repeats. |
| Comment by Ajay L [ 12/Aug/16 ] |
|
(In reply to Vratko Polák from comment #4) ]/peer/peer[ {(urn:opendaylight:params:xml:ns:yang:bgp- > rib?revision=2013-09-25)peer-id=bgp://10.29.13.36}] does not exist. Cannot Thanks. #875 seems to have failed due to some infra issue . . . |
| Comment by Ajay L [ 17/Aug/16 ] |
|
Looking at past few test runs on Jenkins, the last time this error (i.e. "chain failed" on ApplicationPeer) was seen was on job #882 (ref. [0]). #883 to #886 does not have karaf.log, and #887 to latest #892 does not have this error. Request you to run the test again a few more times with DEBUG enabled to see if debugs can be collected when problem happens. Else we can park the issue till the error happens consistently again [0] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-userfeatures-only-boron/ |
| Comment by Vratko Polak [ 17/Aug/16 ] |
|
#894 scheduled. |
| Comment by Milos Fabian [ 17/Aug/16 ] |
|
Looks like "close" is invoked twice after the termination - see "Closing session:" info log message. |
| Comment by Ajay L [ 17/Aug/16 ] |
|
(In reply to Milos Fabian from comment #8) Yeah thats right. Also noticed while trying to repro that connected-peer-counter was getting decremented twice in BGPPeer:dropConnection() and becoming negative. However I am not sure that it is causing the issue with application-peer mentioned in this bug. So I will open separate bug to track those |
| Comment by Ajay L [ 22/Aug/16 ] |
|
https://bugs.opendaylight.org/show_bug.cgi?id=6468 has been opened for the duplicate cleanup issue There seem to be some changes pending in this area (e.g. https://git.opendaylight.org/gerrit/#/c/44229/). Once those are merged, please try again and if issue is still seen attach karaf.log with DEBUG enabled. Thx |
| Comment by Vratko Polak [ 07/Sep/16 ] |
|
When doing Sandbox testing for this suite change [0], this Bug was not seen, so it was probably fixed. |