[CONTROLLER-1311] Clustering : Exception in BGP scale test Created: 13/May/15 Updated: 30/Jun/15 Resolved: 30/Jun/15 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Dana Kutenicsova | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 3195 |
| Priority: | High |
| Description |
|
BGP scale test with 100k routes failed with the following exception: 2015-05-11 12:57:40,915 | ERROR | ult-dispatcher-2 | Shard | 205 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-default-operational: Error handling BatchedModifications for Tx member-1-txn-313834 |
| Comments |
| Comment by Jozef Behran [ 13/May/15 ] |
|
Currently masked by |
| Comment by Moiz Raja [ 02/Jun/15 ] |
|
Does this exception still happen? I've not seen it happen in the BGP scale test. |
| Comment by Moiz Raja [ 09/Jun/15 ] |
|
Because of changes in the CDS code, this particular situation with the BatchedModifications being sent to the Shard can only happen in the remote case; the local case is handled differently. |
| Comment by Tom Pantelis [ 12/Jun/15 ] |
|
This exception means that it attempted to create a new read/write tx on the chain before the previous one was readied. However, notice that the tx ID (member-1-txn-313834) associated with the BatchedModifications message is the same as the tx ID reported as still being open. Ruling out a tx ID being erroneously reused, that means a previous BatchedModifications message from the front-end for the same tx ID had already created a read/write tx on the chain and put an entry in the cohortCache. A subsequent BatchedModifications message should find the entry already in the cohortCache and not create a new read/write tx. In this case, however, it found no entry in the cache and tried to create a new chained tx, which indicates there was enough time between the BatchedModifications messages for the entry to be timed out of the cache. Unfortunately we don't have the whole log to confirm this, but that appears to be what happened. The code has since been refactored so that entries no longer age out of the cache between BatchedModifications messages; they can only be aged out once they're enqueued on ready. So it doesn't appear this issue can occur now. However, looking at the code further, there appear to be some places with incomplete cleanup in some error code paths. |
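The following is a minimal, standalone sketch of the failure mode described in the comment above: a cohort cache whose entries expire on a timer, where a long gap between two BatchedModifications messages for the same tx ID causes the second lookup to miss and the shard to wrongly attempt a new chained tx. It assumes a Guava cache with time-based eviction purely for illustration; the class and field names (CohortCacheExpirySketch, CohortEntry), the 1-second timeout, and the cache library are hypothetical and are not the actual CDS implementation.
{code:java}
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative only: not the CDS code, just the timing-based cache-eviction
// scenario described in the comment above.
public class CohortCacheExpirySketch {

    static final class CohortEntry {
        final String txId;
        CohortEntry(String txId) { this.txId = txId; }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical cohort cache with a short, time-based expiration.
        Cache<String, CohortEntry> cohortCache = CacheBuilder.newBuilder()
                .expireAfterAccess(1, TimeUnit.SECONDS)
                .build();

        String txId = "member-1-txn-313834";

        // First BatchedModifications message: creates the chained read/write tx
        // and caches its cohort entry.
        cohortCache.put(txId, new CohortEntry(txId));

        // Simulate a long gap before the next message for the same tx ID.
        Thread.sleep(1500);

        // Second BatchedModifications message: the entry has aged out, so the
        // lookup misses and the code path would try to create a new chained tx,
        // triggering the "previous tx not readied" exception.
        CohortEntry entry = cohortCache.getIfPresent(txId);
        System.out.println(entry == null
                ? "Cache miss -> would erroneously create a new chained tx"
                : "Cache hit -> reuse existing chained tx");
    }
}
{code}
Per the comment, the refactored code avoids this by only allowing entries to age out once they are enqueued on ready, so the gap between BatchedModifications messages no longer matters.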
| Comment by Tom Pantelis [ 16/Jun/15 ] |
| Comment by Jozef Behran [ 30/Jun/15 ] |
|
Not observed during the last 2 weeks of testing. |