[BGPCEP-878] BGP does not reconnect after partitioned cluster heals
Created: 28/Aug/19  Updated: 14/Dec/19  Resolved: 27/Nov/19

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: None
Fix Version/s: Neon SR3, Magnesium, Sodium SR1

Type: Bug
Priority: Medium
Reporter: Ajay Lele
Assignee: Ajay Lele
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Steps to reproduce:

  1. 3-node cluster with the bgp and open-config shards local.
  2. A BGP connection is established to node1.
  3. node1 is isolated from node2 and node3.
  4. The BGP connection drops.
  5. The isolation is removed and node1 rejoins the cluster.
  6. The BGP connection is never reestablished.
  7. The NPE below is seen in karaf.log:
2019-08-20T19:35:34,739 | INFO  | opendaylight-cluster-data-shard-dispatcher-46 | ShardManager                     | 282 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | shard-manager-operational Received follower initial sync status for member-1-shard-default-operational status sync done true
2019-08-20T19:35:34,748 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-52 | ClusterSingletonServiceGroupImpl | 335 - org.opendaylight.mdsal.singleton-dom-impl - 2.5.1 | Service group bgp-rib-service-group service org.opendaylight.protocol.bgp.rib.impl.config.BGPClusterSingletonService@17397d1 failed to start, attempting to continue
java.lang.NullPointerException: null
        at org.opendaylight.protocol.bgp.rib.impl.AdjRibInWriter.transform(AdjRibInWriter.java:149) ~[223:org.opendaylight.bgpcep.bgp-rib-impl:0.10.1]
        at org.opendaylight.protocol.bgp.rib.impl.ApplicationPeer.instantiateServiceInstance(ApplicationPeer.java:154) ~[223:org.opendaylight.bgpcep.bgp-rib-impl:0.10.1]
        at org.opendaylight.protocol.bgp.rib.impl.config.AppPeer$BgpAppPeerSingletonService.instantiateServiceInstance(AppPeer.java:135) ~[223:org.opendaylight.bgpcep.bgp-rib-impl:0.10.1]
        at org.opendaylight.protocol.bgp.rib.impl.config.AppPeer.instantiateServiceInstance(AppPeer.java:88) ~[223:org.opendaylight.bgpcep.bgp-rib-impl:0.10.1]
        at java.util.HashMap$Values.forEach(HashMap.java:981) [?:?]
        at org.opendaylight.protocol.bgp.rib.impl.config.BGPClusterSingletonService.instantiateServiceInstance(BGPClusterSingletonService.java:98) [223:org.opendaylight.bgpcep.bgp-rib-impl:0.10.1]
        at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ensureServicesStarting(ClusterSingletonServiceGroupImpl.java:636) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.tryReconcileState(ClusterSingletonServiceGroupImpl.java:563) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.reconcileState(ClusterSingletonServiceGroupImpl.java:458) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:339) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:238) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23) [335:org.opendaylight.mdsal.singleton-dom-impl:2.5.1]
        at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44) [282:org.opendaylight.controller.sal-distributed-datastore:1.8.1]
        at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33) [282:org.opendaylight.controller.sal-distributed-datastore:1.8.1]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38) [274:org.opendaylight.controller.sal-clustering-commons:1.8.1]
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) [37:com.typesafe.akka.actor:2.5.11]
        at akka.actor.Actor.aroundReceive(Actor.scala:517) [37:com.typesafe.akka.actor:2.5.11]
        at akka.actor.Actor.aroundReceive$(Actor.scala:515) [37:com.typesafe.akka.actor:2.5.11]
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) [37:com.typesafe.akka.actor:2.5.11]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) [37:com.typesafe.akka.actor:2.5.11]
        at akka.actor.ActorCell.invoke(ActorCell.scala:559) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [37:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [37:com.typesafe.akka.actor:2.5.11] 
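
The trace above shows the NPE propagating out of a `HashMap$Values.forEach` call inside `BGPClusterSingletonService.instantiateServiceInstance`, after which the service group logs "failed to start, attempting to continue". A minimal sketch of that general pattern follows: an exception thrown for one element of a `forEach` aborts the iteration, so later elements are never started. The class `ForEachAbortSketch`, the method `startAll`, and the peer names are hypothetical and are not bgpcep code. Whether this is what keeps the BGP connection from being reestablished is an assumption, not something the report states.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not the bgpcep classes. It only demonstrates that an
// exception thrown for one entry inside forEach() skips all remaining entries.
public class ForEachAbortSketch {

    /** Starts each peer in insertion order; returns the names that started. */
    static List<String> startAll(Map<String, Runnable> peers) {
        List<String> started = new ArrayList<>();
        try {
            peers.forEach((name, start) -> {
                start.run();          // may throw, as in the NPE above
                started.add(name);
            });
        } catch (RuntimeException e) {
            // Assumed to mirror the caller that logs the failure and continues.
        }
        return started;
    }

    static List<String> demo() {
        // LinkedHashMap keeps the order deterministic for this illustration.
        Map<String, Runnable> peers = new LinkedHashMap<>();
        peers.put("app-peer", () -> { throw new NullPointerException(); });
        peers.put("bgp-peer", () -> { });
        return startAll(peers);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints []
    }
}
```

In this sketch the second peer is never started because the first one throws, even though the caller swallows the exception.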


 Comments   
Comment by Ajay Lele [ 14/Dec/19 ]

The change made to BgpPeer.class in the original commit causes the regression below in a CSIT test. Further investigation is in progress.

2019-12-13T03:32:33,408 | ERROR | opendaylight-cluster-data-notification-dispatcher-51 | DataTreeChangeListenerActor      | 292 - org.opendaylight.controller.sal-clustering-commons - 1.9.3 | member-1-shard-default-config: Error notifying listener org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl@70bb4e76
java.lang.IllegalStateException: Previous peer instance was not closed.
	at com.google.common.base.Preconditions.checkState(Preconditions.java:507) ~[36:com.google.guava:25.1.0.jre]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpPeer.start(BgpPeer.java:137) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpPeer.restart(BgpPeer.java:149) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BGPClusterSingletonService.restartNeighbors(BGPClusterSingletonService.java:367) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl.lambda$rebootNeighbors$6(BgpDeployerImpl.java:209) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at java.util.HashMap$Values.forEach(HashMap.java:981) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl.rebootNeighbors(BgpDeployerImpl.java:209) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl.handlePeersChange(BgpDeployerImpl.java:195) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl.handleModifications(BgpDeployerImpl.java:160) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.protocol.bgp.rib.impl.config.BgpDeployerImpl.onDataTreeChanged(BgpDeployerImpl.java:144) ~[238:org.opendaylight.bgpcep.bgp-rib-impl:0.11.3]
	at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:42) ~[287:org.opendaylight.controller.sal-binding-broker-impl:1.9.3]
	at org.opendaylight.controller.sal.core.compat.LegacyDOMDataBrokerAdapter$ProxyListener.onDataTreeChanged(LegacyDOMDataBrokerAdapter.java:353) ~[298:org.opendaylight.controller.sal-core-compat:1.9.3]
	at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:82) [300:org.opendaylight.controller.sal-distributed-datastore:1.9.3]
	at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:43) [300:org.opendaylight.controller.sal-distributed-datastore:1.9.3]
	at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:40) [292:org.opendaylight.controller.sal-clustering-commons:1.9.3]
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) [41:com.typesafe.akka.actor:2.5.26]
	at akka.actor.Actor.aroundReceive(Actor.scala:539) [41:com.typesafe.akka.actor:2.5.26]
	at akka.actor.Actor.aroundReceive$(Actor.scala:537) [41:com.typesafe.akka.actor:2.5.26]
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) [41:com.typesafe.akka.actor:2.5.26]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) [41:com.typesafe.akka.actor:2.5.26]
	at akka.actor.ActorCell.invoke(ActorCell.scala:581) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.Mailbox.run(Mailbox.scala:229) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:241) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [41:com.typesafe.akka.actor:2.5.26]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [41:com.typesafe.akka.actor:2.5.26] 
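
The trace above shows `BgpPeer.restart` calling `BgpPeer.start`, where a `Preconditions.checkState` call fails with "Previous peer instance was not closed." The sketch below illustrates only the kind of state guard that produces such a message: a restart path that reaches `start()` while the previous instance is still set. The class `PeerRestartSketch` and its fields and methods are hypothetical, and it uses a plain `IllegalStateException` rather than Guava to stay self-contained. It does not claim to reproduce the actual regression, whose cause the comment says is still under investigation.

```java
// Hypothetical stand-in, not the bgpcep BgpPeer class.
public class PeerRestartSketch {
    private Object instance; // non-null while a peer instance is open

    void start() {
        // Same kind of guard as the checkState seen in the trace.
        if (instance != null) {
            throw new IllegalStateException("Previous peer instance was not closed.");
        }
        instance = new Object();
    }

    void close() {
        instance = null;
    }

    /** Restart that closes the previous instance first. */
    void restart() {
        close();
        start();
    }

    /** Restart that skips the close step and therefore trips the guard. */
    void restartWithoutClose() {
        start();
    }

    static String demo() {
        PeerRestartSketch peer = new PeerRestartSketch();
        peer.start();
        peer.restart(); // succeeds: the previous instance was closed first
        try {
            peer.restartWithoutClose();
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints Previous peer instance was not closed.
    }
}
```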