[BGPCEP-601] B and C: rejoining isolated node causes exabgp unable to connect to any cluster node Created: 22/Nov/16  Updated: 03/Mar/19  Resolved: 20/Feb/17

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Peter Gubka Assignee: Claudio David Gasparini
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks BGPCEP-392 BGP scaling target not met in 3-node ... Confirmed
is blocked by MDSAL-212 Isolated node is not notified by EOS ... Resolved
External issue ID: 7222

 Description   

The problem observed in https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-ha-only-boron/86 and https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-periodic-bgpclustering-ha-only-carbon/82/

The suite with karaf restart works well, but the suite when the bgp leader is isolated does not work well.
https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-ha-only-boron/86/archives/
Robot logs shows that ExaGbp is connected well to node3 (10.29.13.32).
Connected to peer neighbor 10.29.13.32 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open (out)
After the node isolation it reconnected to node2(10.29.12.215), but connection was broken just after.
Connected to peer neighbor 10.29.12.215 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open (out)
Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,0)] error[Cease / Unspecific]

Tue, 22 Nov 2016 03:44:18 | INFO | 22747 | reactor | New peer setup: neighbor 10.29.13.32 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open
Tue, 22 Nov 2016 03:44:18 | INFO | 22747 | reactor | New peer setup: neighbor 10.29.12.30 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open
Tue, 22 Nov 2016 03:44:18 | INFO | 22747 | reactor | New peer setup: neighbor 10.29.12.215 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open
Tue, 22 Nov 2016 03:44:18 | WARNING | 22747 | configuration | Loaded new configuration successfully
Tue, 22 Nov 2016 03:44:21 | INFO | 22747 | network | Peer 10.29.12.30 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:21 | INFO | 22747 | network | Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:21 | INFO | 22747 | network | Connected to peer neighbor 10.29.13.32 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open (out)
Tue, 22 Nov 2016 03:44:25 | INFO | 22747 | network | Peer 10.29.12.30 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:26 | INFO | 22747 | network | Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:30 | INFO | 22747 | network | Peer 10.29.12.30 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:31 | INFO | 22747 | network | Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:35 | INFO | 22747 | network | Peer 10.29.12.30 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:36 | INFO | 22747 | network | Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:41 | INFO | 22747 | network | Peer 10.29.12.30 ASN 64496 out loop, peer reset, message [notification received (6,5)] error[Cease / Connection Rejected]
Tue, 22 Nov 2016 03:44:42 | INFO | 22747 | network | Connected to peer neighbor 10.29.12.215 local-ip 10.29.13.14 local-as 64496 peer-as 64496 router-id 10.29.13.14 family-allowed in-open (out)
Tue, 22 Nov 2016 03:44:43 | INFO | 22747 | network | Peer 10.29.12.215 ASN 64496 out loop, peer reset, message [notification received (6,0)] error[Cease / Unspecific]
Tue, 22 Nov 2016 03:44:43 | INFO | 22747 | network | Peer 10.29.13.32 ASN 64496 out loop, peer reset, message [notification received (6,0)] error[Cease / Unspecific]

In the logs of node 2 we can find
2016-11-22 03:44:42,074 | INFO | entLoopGroup-8-6 | AbstractBGPSessionNegotiator | 264 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.2.SNAPSHOT | BGP Session with peer [id: 0x059dff3e, L:/10.29.12.215:1790 - R:/10.29.13.14:42429] established successfully.
...
2016-11-22 03:44:43,120 | WARN | ult-dispatcher-2 | ConcurrentDOMDataBroker | 172 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.SNAPSHOT | Tx: DOM-CHAIN-12-1 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException

{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}

So it looks like the peer configuration was deleted and should not have been done.

This bug may be similar to https://bugs.opendaylight.org/show_bug.cgi?id=6788, but logs dont show the same problem.



 Comments   
Comment by Claudio David Gasparini [ 07/Dec/16 ]

partial fix
master
https://git.opendaylight.org/gerrit/#/c/48732/
Boron
https://git.opendaylight.org/gerrit/#/c/48767/

Comment by Claudio David Gasparini [ 11/Jan/17 ]

the error is cause of BUG-7322. Once its fixed we shouldn't longer experiment this wrong behavior under BGP.

Comment by Tomas Cere [ 09/Feb/17 ]

The issue is that the bgp services are trying to access the datastore to delete some data once the service managed by singleton tries to close them. Since the datastore cant make any progress once isolated the futures fail with unchecked exceptions which seem to prevent callbacks to other services.

https://git.opendaylight.org/gerrit/#/c/51621/ should improve unchecked exception handling in singleton service, but the underlying issue that the service is trying to write something into the datastore when it cant make progress doesn't seem right. If you need to cleanup the datastore maybe when the service is started it could do the cleanup as the first thing?

Comment by Claudio David Gasparini [ 16/Feb/17 ]

master
https://git.opendaylight.org/gerrit/#/c/51893/

Comment by Claudio David Gasparini [ 16/Feb/17 ]

boron
https://git.opendaylight.org/gerrit/#/c/51974/

Generated at Wed Feb 07 19:13:34 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.