[BGPCEP-512] Missing ORIGIN attribute can break app-peer Created: 09/Aug/16  Updated: 03/Mar/19  Resolved: 07/Sep/16

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 6373

 Description   

In CSIT Userfeatures test job, "Ibgp Peer Lsp" suite sends update message without ORIGIN attribute (a current bug in the suite).

ODL disconnects the peer, but fails to perform correct cleanup.
There is a suite running before "Ibgp Peer Lsp" which configures and then deconfigures app-peer. "Ibgp Peer Lsp" suite itself does not interact with app-peer.

Every suite after "Ibgp Peer Lsp" fails at test cases which expect consequences initiated by app-peer (each time app-peer is configured anew). Test cases for bgp-rib consequences from external peer are still passing

Looking at karaf.log there seem to be a transaction chain left in failed state.
Full log: https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-userfeatures-only-boron/864/archives/karaf.log.gz

Parts on karaf.log lines:

2016-08-09 01:15:25,444 | ERROR | entLoopGroup-5-7 | BGPDocumentedException | 89 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.0.SNAPSHOT | Error = WELL_KNOWN_ATTR_MISSING
org.opendaylight.protocol.bgp.parser.BGPDocumentedException: Well known mandatory attribute missing: ORIGIN

2016-08-09 01:15:25,444 | WARN | entLoopGroup-5-7 | BGPSessionImpl | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | BGP session encountered error
io.netty.handler.codec.DecoderException: org.opendaylight.protocol.bgp.parser.BGPDocumentedException: Well known mandatory attribute missing: ORIGIN

2016-08-09 01:15:25,447 | INFO | entLoopGroup-5-7 | BGPPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Session with peer 10.29.12.108 terminated: BGPTerminationReason

{error=WELL_KNOWN_ATTR_MISSING}

2016-08-09 01:15:25,452 | INFO | entLoopGroup-5-7 | BGPSessionImpl | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Closing session: BGPSessionImpl

{channel=[id: 0x58e9ec32, L:/10.29.12.199:1790 ! R:/10.29.12.108:17900], state=IDLE}

2016-08-09 01:15:25,455 | WARN | ult-dispatcher-6 | ShardDataTree | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-58-txn-2: Data validation failed

2016-08-09 01:15:25,457 | ERROR | lt-dispatcher-16 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Failed to prepare transaction member-1-datastore-operational-fe-0-chn-58-txn-2 on backend

2016-08-09 01:15:25,458 | WARN | ult-dispatcher-6 | ConcurrentDOMDataBroker | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Tx: DOM-CHAIN-57-2 Error during phase CAN_COMMIT, starting Abort

2016-08-09 01:15:25,461 | ERROR | CommitFutures-0 | BGPPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
TransactionCommitFailedException{message=Data did not pass validation.

2016-08-09 01:15:56,268 | WARN | lt-dispatcher-20 | ShardDataTree | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-62-txn-0: Conflicting modification for path

2016-08-09 01:15:56,269 | ERROR | lt-dispatcher-20 | LocalThreePhaseCommitCohort | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Failed to prepare transaction member-1-datastore-operational-fe-0-chn-62-txn-0 on backend

2016-08-09 01:15:56,274 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
OptimisticLockFailedException

{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}

2016-08-09 01:16:10,609 | ERROR | n-dispatcher-146 | DataTreeChangeListenerActor | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Error notifying listener org.opendaylight.protocol.bgp.rib.impl.ApplicationPeer@284e83ce
java.lang.IllegalStateException: Transaction chain has failed



 Comments   
Comment by Ajay L [ 10/Aug/16 ]

Tried to repro this in local setup. Able to see "chain failed" error for BGPPeer which is kind of expected when update message has mandatory attribute missing as the peer connection is dropped in this case

Not able to repro the "chain failed" error for ApplicationPeer. Below is the message seen prior to that. So it hits "Node was created by other transaction" when trying to write data into adj-rib-in of app peer

2016-08-09 01:15:55,992 | INFO | h for user karaf | command | 307 - org.apache.karaf.log.command - 3.0.6 | ROBOT MESSAGE: Starting test Configure_App_Peer
2016-08-09 01:15:56,268 | WARN | lt-dispatcher-20 | ShardDataTree | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-62-txn-0: Conflicting modification for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)bgp-rib/rib/rib[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)id=example-bgp-rib}

]/peer/peer[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)peer-id=bgp://10.29.12.199}

]/adj-rib-in/tables/tables[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)afi=(urn:opendaylight:params:xml:ns:yang:bgp-types?revision=2013-09-19)ipv4-address-family, (urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)safi=(urn:opendaylight:params:xml:ns:yang:bgp-types?revision=2013-09-19)unicast-subsequent-address-family}

]/afi.
2016-08-09 01:15:56,269 | WARN | ult-dispatcher-2 | ConcurrentDOMDataBroker | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | Tx: DOM-CHAIN-61-0 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException

{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}

To get more details about the failure, is it possible to re-run this suite with debugs turned on and share the karaf logs?

I can see that Robot keywords are available to turn-on debugs but unsure how to trigger the test from ODL Jenkins after making the change. If you can throw some light on that for my understanding, that will be great too

KarafKeywords.Execute_Controller_Karaf_Command_On_Background log:set ${CONTROLLER_BGP_LOG_LEVEL} org.opendaylight.bgpcep
KarafKeywords.Execute_Controller_Karaf_Command_On_Background log:set ${CONTROLLER_BGP_LOG_LEVEL} org.opendaylight.protocol

Comment by Vratko Polak [ 11/Aug/16 ]

> unsure how to trigger the test from ODL Jenkins after making the change

Most people cannot trigger jobs on RelEng. I can (as a releng/builder committer) so I have triggered [0]. The Robot variable values can be overridden using -v option in TESTOPTIONS job parameter.

Usually Sandbox [1] is the platform to run such tests, but Sandbox results are deleted every Saturday.

[0] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-userfeatures-only-boron/872
[1] https://nexus.opendaylight.org/content/repositories/site/org.opendaylight.releng.builder/master/jenkins.html#jenkins_sandbox

Comment by Ajay L [ 11/Aug/16 ]

(In reply to Vratko Polák from comment #2)
> > unsure how to trigger the test from ODL Jenkins after making the change
>
> Most people cannot trigger jobs on RelEng. I can (as a releng/builder
> committer) so I have triggered [0]. The Robot variable values can be
> overridden using -v option in TESTOPTIONS job parameter.
>
> Usually Sandbox [1] is the platform to run such tests, but Sandbox results
> are deleted every Saturday.
>
> [0]
> https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-
> userfeatures-only-boron/872
> [1]
> https://nexus.opendaylight.org/content/repositories/site/org.opendaylight.
> releng.builder/master/jenkins.html#jenkins_sandbox

Thanks. I accessed the logs at https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-userfeatures-only-boron/872/archives/karaf.log.gz and unable to find the error there. Can you please re-run the test to see if it we can get the error to happen?

$ grep "chain failed" karaf-2.log
2016-08-11 14:17:41,591 | ERROR | CommitFutures-18 | BGPPeer | 324 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Transaction chain failed.
$

Errors from previous log file for reference:

$ grep "chain failed" karaf-1.log
2016-08-09 01:15:25,461 | ERROR | CommitFutures-0 | BGPPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:15:56,274 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:16:22,182 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:16:22,176 | ERROR | CommitFutures-4 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:21:37,723 | ERROR | CommitFutures-1 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:21:37,740 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:21:37,741 | ERROR | CommitFutures-3 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:22:36,498 | ERROR | CommitFutures-1 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:22:36,498 | ERROR | CommitFutures-4 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
2016-08-09 01:22:36,499 | ERROR | CommitFutures-0 | ApplicationPeer | 185 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.0.SNAPSHOT | Transaction chain failed.
$

Comment by Vratko Polak [ 12/Aug/16 ]

> https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-userfeatures-only-boron/872/archives/karaf.log.gz

Yeah, the consequences on the following suites are different in this debug run.

There is still the main issue that this expected error:
2016-08-11 14:17:41,576 | ERROR | ntLoopGroup-11-8 | BGPDocumentedException | 311 - org.opendaylight.bgpcep.bgp-parser-api - 0.7.0.SNAPSHOT | Error = WELL_KNOWN_ATTR_MISSING
org.opendaylight.protocol.bgp.parser.BGPDocumentedException: Well known mandatory attribute missing: ORIGIN

leads to this unexpected consequence:
2016-08-11 14:17:41,591 | ERROR | CommitFutures-18 | BGPPeer | 324 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Transaction chain failed.
TransactionCommitFailedException{message=Data did not pass validation., errorList=[RpcError [message=Data did not pass validation., severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ModifiedNodeDoesNotExistException: Node /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)bgp-rib/rib/rib[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)id=example-bgp-rib}

]/peer/peer[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)peer-id=bgp://10.29.13.36}

] does not exist. Cannot apply modification to its children.]]}

I am not sure if there is anything wrong in the subsequent test, between test start:
2016-08-11 14:18:12,906 | INFO | h for user karaf | command | 302 - org.apache.karaf.log.command - 3.0.7 | ROBOT MESSAGE: Starting test L3vpn_Ipv4_To_Odl

and test giving up waiting on correct result:
2016-08-11 14:18:34,622 | DEBUG | ntLoopGroup-11-1 | BGPNotificationMessageParser | 313 - org.opendaylight.bgpcep.bgp-parser-impl - 0.7.0.SNAPSHOT | BGP Notification message was parsed: err = PEER_DECONFIGURED, data = [80, 101, 101, 114, 32, 68, 101, 45, 99, 111, 110, 102, 105, 103, 117, 114, 101, 100].

Scheduled #875 to see if the behavior repeats.

Comment by Ajay L [ 12/Aug/16 ]

(In reply to Vratko Polák from comment #4)
> > https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-userfeatures-only-boron/872/archives/karaf.log.gz
>
> Yeah, the consequences on the following suites are different in this debug
> run.
>
> There is still the main issue that this expected error:
> 2016-08-11 14:17:41,576 | ERROR | ntLoopGroup-11-8 | BGPDocumentedException
> | 311 - org.opendaylight.bgpcep.bgp-parser-api - 0.7.0.SNAPSHOT | Error =
> WELL_KNOWN_ATTR_MISSING
> org.opendaylight.protocol.bgp.parser.BGPDocumentedException: Well known
> mandatory attribute missing: ORIGIN
>
> leads to this unexpected consequence:
> 2016-08-11 14:17:41,591 | ERROR | CommitFutures-18 | BGPPeer
> | 324 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Transaction
> chain failed.
> TransactionCommitFailedException{message=Data did not pass validation.,
> errorList=[RpcError [message=Data did not pass validation., severity=ERROR,
> errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null,
> cause=org.opendaylight.yangtools.yang.data.api.schema.tree.
> ModifiedNodeDoesNotExistException: Node
> /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09-25)bgp-rib/
> rib/rib[

{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2013-09- > 25)id=example-bgp-rib}

]/peer/peer[

{(urn:opendaylight:params:xml:ns:yang:bgp- > rib?revision=2013-09-25)peer-id=bgp://10.29.13.36}

] does not exist. Cannot
> apply modification to its children.]]}
>
> I am not sure if there is anything wrong in the subsequent test, between
> test start:
> 2016-08-11 14:18:12,906 | INFO | h for user karaf | command
> | 302 - org.apache.karaf.log.command - 3.0.7 | ROBOT MESSAGE: Starting test
> L3vpn_Ipv4_To_Odl
>
> and test giving up waiting on correct result:
> 2016-08-11 14:18:34,622 | DEBUG | ntLoopGroup-11-1 |
> BGPNotificationMessageParser | 313 -
> org.opendaylight.bgpcep.bgp-parser-impl - 0.7.0.SNAPSHOT | BGP Notification
> message was parsed: err = PEER_DECONFIGURED, data = [80, 101, 101, 114, 32,
> 68, 101, 45, 99, 111, 110, 102, 105, 103, 117, 114, 101, 100].
>
> Scheduled #875 to see if the behavior repeats.

Thanks. #875 seems to have failed due to some infra issue
https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-userfeatures-only-boron/875/console

. . .
SSH not responding on 10.29.13.79. Retrying in 10 seconds...
SSH not responding on 10.29.13.79. Retrying in 10 seconds...
SSH not responding on 10.29.13.79. Retrying in 10 seconds...
SSH not responding on 10.29.13.79. Retrying in 10 seconds...
SSH not responding on 10.29.13.79 after tries. Giving up.
Process 9622 failed to copy ssh keys.
Build step 'Execute shell' marked build as failure
Destroying nodes: [DFW/b72077cd-4885-43b0-8537-10a8b75928f2, DFW/64517f31-5f55-4d84-98ea-a7358b136dc6]
[ssh-agent] Stopped.
Robot results publisher started...
-Parsing output xml:
No public addresses found, so using private address.
No public addresses found, so using private address.
Failed!

Comment by Ajay L [ 17/Aug/16 ]

Looking at past few test runs on Jenkins, the last time this error (i.e. "chain failed" on ApplicationPeer) was seen was on job #882 (ref. [0]). #883 to #886 does not have karaf.log, and #887 to latest #892 does not have this error. Request you to run the test again a few more times with DEBUG enabled to see if debugs can be collected when problem happens. Else we can park the issue till the error happens consistently again

[0] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-userfeatures-only-boron/

Comment by Vratko Polak [ 17/Aug/16 ]

#894 scheduled.

Comment by Milos Fabian [ 17/Aug/16 ]

Looks like "close" is invoked twice after the termination - see "Closing session:" info log message.

Comment by Ajay L [ 17/Aug/16 ]

(In reply to Milos Fabian from comment #8)
> Looks like "close" is invoked twice after the termination - see "Closing
> session:" info log message.

Yeah thats right. Also noticed while trying to repro that connected-peer-counter was getting decremented twice in BGPPeer:dropConnection() and becoming negative. However I am not sure that it is causing the issue with application-peer mentioned in this bug. So I will open separate bug to track those

Comment by Ajay L [ 22/Aug/16 ]

https://bugs.opendaylight.org/show_bug.cgi?id=6468 has been opened for the duplicate cleanup issue

There seem to be some changes pending in this area (e.g. https://git.opendaylight.org/gerrit/#/c/44229/). Once those are merged, please try again and if issue is still seen attach karaf.log with DEBUG enabled. Thx

Comment by Vratko Polak [ 07/Sep/16 ]

When doing Sandbox testing for this suite change [0], this Bug was not seen, so it was probably fixed.

[0] https://git.opendaylight.org/gerrit/45165

Generated at Wed Feb 07 19:13:18 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.