[CONTROLLER-1016] Clustering : BGP - Linkstate topology missing. Created: 11/Nov/14  Updated: 04/Dec/14  Resolved: 04/Dec/14

Status: Verified
Project: controller
Component/s: mdsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Moiz Raja
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File bug2353.tar.xz     File conf.tar.xz     File karaf_20141112.log.xz    
Issue Links:
Blocks
is blocked by CONTROLLER-996 Clustering : Exception thrown in Shar... Resolved
is blocked by CONTROLLER-997 Clustering : Occasional failure to cr... Resolved
is blocked by CONTROLLER-1005 Clustering: Write Tx commit may fail ... Resolved
is blocked by CONTROLLER-1006 Clustering : TransactionChain id crea... Resolved
is blocked by CONTROLLER-1007 Clustering : CDS sometimes creates tr... Resolved
External issue ID: 2353

 Description   

When Karaf is started with clustering (3 nodes, no replication, no persistence, a quiet period after installing odl-mdsal-clustering) and the odl-bgpcep-all feature is then installed, several errors appear in the logs (to be attached shortly).

It is not clear what the primary cause is (or even whether the bug is in BGP or in clustering), so I am reporting only the symptoms:

The linkstate topology is missing links and nodes, but the ipv4 topology is complete.

Possible direct (as opposed to primary) cause from the log:

2014-11-11 17:00:54,661 | ERROR | CommitFutures-8 | RIBImpl | 259 - org.opendaylight.bgpcep.bgp-rib-impl - 0.3.2.Helium-SR1 | Failed to commit RIB modification
TransactionCommitFailedException

{message=preCommit encountered an unexpected failure, errorList=[RpcError [message=preCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-default-operational)]] after [5000 ms]]]}

 Comments   
Comment by Vratko Polak [ 11/Nov/14 ]

Attachment bug2353.tar.xz has been added with description: Archive with curl output and karaf.log from 3 instances (the third one did not even have the ipv4 topology).

Comment by Robert Varga [ 11/Nov/14 ]

Looks like a clustering problem. Since neither replication nor persistence is enabled, this should be working locally.

I do not think we can recover from this failure in the app – we have no way of affecting the datastore, nor do we have visibility into how the DS heals.

Comment by Moiz Raja [ 11/Nov/14 ]

Vratko,

Could you please try this with a controller built from master. There were a couple of issues with the transaction chaining implementation in the clustered datastore that may be affecting you.

Also I will need to check the configuration to see how persistence/replication were disabled.

Comment by Vratko Polak [ 11/Nov/14 ]

> Could you please try this with a controller built from master.

Will do that tomorrow.

> Also I will need to check the configuration to see how
> persistence/replication were disabled.

Persistence: The old way, using
system:property shard.persistent false
and I just realized it does not work that way anymore.
Is the new way documented somewhere on a wiki page?

But snapshot and journal directories were not present, so hopefully the behavior did not differ much from a truly no-persistence setup.

Replication:
I attached an archive with the clustering configuration, but also the BGP and PCEP ODL configuration and the XRVR configuration.

Comment by Vratko Polak [ 11/Nov/14 ]

Attachment conf.tar.xz has been added with description: Archive containing various configuration, as present when logs were gathered.

Comment by Vratko Polak [ 11/Nov/14 ]

I forgot to mention that the same ODL and XRVR configuration was manually tested against a non-clustered setup (basically everything the same, just the odl-mdsal-clustering feature not installed), and the linkstate topology was indeed there.

Comment by Moiz Raja [ 12/Nov/14 ]

I added some notes on how to disable persistence. This is applicable to post-stable/helium code only.

https://wiki.opendaylight.org/view/Running_and_testing_an_OpenDaylight_Cluster#How_do_we_disable_persistence.3F

Persistence is enabled by default in stable/helium, so I wonder why the snapshot and journal directories were not created for you.

Comment by Moiz Raja [ 12/Nov/14 ]

I looked at the attached configuration and it seems to be correct. I can only suspect that you are being hit by the following bugs:

2318
2319
2337
2339
2340

All of these are resolved on master but need to be merged to stable/helium.

Comment by Vratko Polak [ 12/Nov/14 ]

> Could you please try this with a controller built from master.

I tried the build from
https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/2719/artifact/distributions/extra/karaf/target/distribution-karaf-0.3.0-SNAPSHOT.tar.gz

What I got is the same bug as described in
https://bugs.opendaylight.org/show_bug.cgi?id=1918#c2
so clustering did not even start on two of the three nodes.
It would take me some more time to work around that.

>> snapshot and journal directories were not present,
>
> I wonder why snapshots and journal directory were not created for you.

Pardon my weak English. I meant to say that those directories were not created before starting karaf. Of course they were created during the karaf run.

Comment by Vratko Polak [ 12/Nov/14 ]

> https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/2719/artifact/distributions/extra/karaf/target/distribution-karaf-0.3.0-SNAPSHOT.tar.gz

The bug is also present on master, and the errors look similar.
The first dangerous-looking error is actually this one:

2014-11-12 11:39:23,329 | ERROR | lt-dispatcher-17 | OneForOneStrategy | 234 - com.typesafe.akka.slf4j - 2.3.4 | Node identifier contains different value: (urn:opendaylight:params:xml:ns:yang:bgp-linkstate?revision=2013-11-25)isis-area-id[[B@7e1ac7e4] than value itself: [B@10398a6a

which was also present in the Helium branch logs; I just did not recognize it then.
In the master branch I spotted it because it comes directly before the more familiar clustering error:

2014-11-12 11:39:28,362 | ERROR | CommitFutures-1 | RIBImpl | 263 - org.opendaylight.bgpcep.bgp-rib-impl - 0.4.0.SNAPSHOT | Broken chain in RIB KeyedInstanceIdentifier

{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.rib.rev130925.bgp.rib.Rib, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.rib.rev130925.BgpRib, org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.rib.rev130925.bgp.rib.Rib[key=RibKey [_id=Uri [_value=example-bgp-rib-2]]]]}

transaction DOM-CHAIN-3-11
TransactionCommitFailedException

{message=preCommit encountered an unexpected failure, errorList=[RpcError [message=preCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-default-operational/shard-member-2-txn-46#-86928823)]] after [5000 ms]]]}

Comment by Vratko Polak [ 12/Nov/14 ]

Attachment karaf_20141112.log.xz has been added with description: master branch compressed full log from .10 node

Comment by Tom Pantelis [ 12/Nov/14 ]

>
> 2014-11-12 11:39:23,329 | ERROR | lt-dispatcher-17 | OneForOneStrategy
> | 234 - com.typesafe.akka.slf4j - 2.3.4 | Node identifier contains different
> value:
> (urn:opendaylight:params:xml:ns:yang:bgp-linkstate?revision=2013-11-25)isis-
> area-id[[B@7e1ac7e4] than value itself: [B@10398a6a
>

This error emanates from the ImmutableLeafSetEntryNodeBuilder:

ImmutableLeafSetEntryNode(final YangInstanceIdentifier.NodeWithValue nodeIdentifier, final T value, final Map<QName, String> attributes) {
    super(nodeIdentifier, value, attributes);
    Preconditions.checkArgument(nodeIdentifier.getValue().equals(value),
            "Node identifier contains different value: %s than value itself: %s",
            nodeIdentifier, value);
}

The value is a byte[] (as evidenced by the toString output "[B@..."), so I can see why the values don't match: byte[]#equals only checks reference equality. The code needs to check element equality in the case of an array.

So this appears to be a bug in ImmutableLeafSetEntryNodeBuilder, but why doesn't it manifest with the in-memory data store (IMDS)? I suspect that with the IMDS the two byte[] vars happen to be the same instance/reference. However, in the CDS the data is serialized/de-serialized, so this results in different instances and the equality check fails.
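
A minimal standalone sketch of the mismatch (illustrative code only, not the yangtools implementation): with two byte[] instances holding the same content, Object#equals returns false while an element-wise comparison returns true.

import java.util.Arrays;
import java.util.Objects;

public class ByteArrayEqualityDemo {
    public static void main(final String[] args) {
        // Two distinct instances with identical content, as after serialization/de-serialization.
        final byte[] a = new byte[] { 0x0a, 0x0b };
        final byte[] b = new byte[] { 0x0a, 0x0b };

        // Arrays inherit Object#equals, which is reference equality, so this prints false.
        System.out.println("equals:        " + a.equals(b));

        // Element-wise comparison prints true.
        System.out.println("Arrays.equals: " + Arrays.equals(a, b));

        // Objects.deepEquals handles arrays and plain values alike.
        System.out.println("deepEquals:    " + Objects.deepEquals(a, b));
    }
}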

Comment by Moiz Raja [ 12/Nov/14 ]

Tom, this is related to the byte[] serialization defect. That was fixed; however, you can also have a NodeIdentifier with a byte[] as the value (leaf-lists), and that is where this problem occurs.

We can fix this in the ValueSerializer, but I suspect there is more to this bug than just that. I will be investigating this further...
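
As a rough sketch of the kind of comparison involved (the types and names below are simplified stand-ins, not the actual NodeWithValue/ValueSerializer API), the identifier's value needs to be compared by content when it happens to be a byte[]:

import java.util.Arrays;

// Simplified stand-in for a leaf-list entry identifier carrying a value;
// not the real YangInstanceIdentifier.NodeWithValue class.
final class NodeWithValueStub {
    private final Object value;

    NodeWithValueStub(final Object value) {
        this.value = value;
    }

    Object getValue() {
        return value;
    }
}

public class LeafSetValueCheck {
    // Compares the identifier's value with the entry value, falling back to
    // content comparison when both are byte[] (the case de-serialization breaks).
    static boolean valuesMatch(final NodeWithValueStub id, final Object value) {
        final Object idValue = id.getValue();
        if (idValue instanceof byte[] && value instanceof byte[]) {
            return Arrays.equals((byte[]) idValue, (byte[]) value);
        }
        return idValue == null ? value == null : idValue.equals(value);
    }

    public static void main(final String[] args) {
        // After a serialize/de-serialize round trip the two arrays are distinct instances.
        final byte[] original = { 0x49, 0x53 };
        final byte[] deserialized = original.clone();

        final NodeWithValueStub id = new NodeWithValueStub(original);
        System.out.println(valuesMatch(id, deserialized)); // prints true with content comparison
    }
}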

Comment by Moiz Raja [ 14/Nov/14 ]

https://git.opendaylight.org/gerrit/#/c/12820/ - yangtools
https://git.opendaylight.org/gerrit/#/c/12794/ - controller
https://git.opendaylight.org/gerrit/#/c/12806/ - bgpcep

Comment by Moiz Raja [ 15/Nov/14 ]

https://git.opendaylight.org/gerrit/#/c/12827/ - controller - stable/helium

Comment by Moiz Raja [ 17/Nov/14 ]

Yangtools patch merged

https://git.opendaylight.org/gerrit/#/c/12820/ - yangtools:master
https://git.opendaylight.org/gerrit/#/c/12884/ - yangtools:stable/helium
