[CONTROLLER-1618] Scale testing: akka.pattern.AskTimeoutException ... after [30000 ms] Created: 22/Mar/17 Updated: 25/Jul/23 Resolved: 12/Apr/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Guy Sela | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| External issue ID: | 8041 | ||||||||
| Description |
|
Running scale tests on latest Boron code with a 3-node cluster. ODL 3 interesting errors: java.lang.RuntimeException: java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan?revision=2015-06-02)elan-interfaces/elan-inter ], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan?revision=2015-06-02 ], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=nu ], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan?revision=2015-06-02)elan-interfaces/ ], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.la ], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:netvirt:elan?revision=2015-06-02)elan-interfaces/elan-interface/elan-interface[ {(urn:opend aylight:netvirt:elan?revision=2015-06-02)name=tun816c0459241}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.Exception: Error creating READ_ONLY tr Ran jolokia/read/org.opendaylight.controller:Category=Shards,name=member-X-shard-topology-operational,type=DistributedOperationalDatastore on all ODLs. ,"value": {"ReadWriteTransactionCount":0,"SnapshotIndex":2645417,"InMemoryJournalLogSize":6,"ReplicatedToAllIndex":2645417,"Leader":"member-3-shard-topology-operational","LastIndex":2645423,"RaftState":"Follower","LastApplied":2645423,"LastCommittedTransactionTime":"1970-01-01 02:00:00.000","LastLogIndex":2645423,"LastLeadershipChangeTime":"2017-03-21 18:06:45.018","PeerAddresses":"member-2-shard-topology-operational: , member-3-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.153:2550\/user\/shardmanager-operational\/member-3-shard-topology-operational","WriteOnlyTransactionCount":0,"FollowerInitialSyncStatus":true,"FollowerInfo":[],"FailedReadTransactionsCount":0,"Voting":true,"StatRetrievalTime":"2.828 ms","CurrentTerm":1,"LastTerm":1,"FailedTransactionsCount":0,"PendingTxCommitQueueSize":0,"VotedFor":"member-3-shard-topology-operational","SnapshotCaptureInitiated":false,"CommittedTransactionsCount":0,"TxCohortCacheSize":0,"PeerVotingStates":"member-2-shard-topology-operational: true, member-3-shard-topology-operational: true","LastLogTerm":1,"StatRetrievalError":null,"CommitIndex":2645423,"SnapshotTerm":1,"AbortTransactionsCount":0,"ReadOnlyTransactionCount":0,"ShardName":"member-1-shard-topology-operational","LeadershipChangeCount":1,"InMemoryJournalDataSize":7452},"timestamp":1490175963,"status":200} ,"value": {"ReadWriteTransactionCount":0,"SnapshotIndex":2645983,"InMemoryJournalLogSize":3,"ReplicatedToAllIndex":2645983,"Leader":"member-3-shard-topology-operational","LastIndex":2645986,"RaftState":"Follower","LastCommittedTransactionTime":"1970-01-01 02:00:00.000","LastApplied":2645985,"LastLogIndex":2645986,"LastLeadershipChangeTime":"2017-03-21 18:06:45.011","PeerAddresses":"member-1-shard-topology-operational: , member-3-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.153:2550\/user\/shardmanager-operational\/member-3-shard-topology-operational","WriteOnlyTransactionCount":0,"FollowerInitialSyncStatus":true,"FollowerInfo":[],"FailedReadTransactionsCount":0,"StatRetrievalTime":"11.85 ms","Voting":true,"CurrentTerm":1,"LastTerm":1,"FailedTransactionsCount":0,"PendingTxCommitQueueSize":0,"VotedFor":"member-3-shard-topology-operational","SnapshotCaptureInitiated":false,"CommittedTransactionsCount":0,"TxCohortCacheSize":0,"PeerVotingStates":"member-1-shard-topology-operational: true, member-3-shard-topology-operational: true","LastLogTerm":1,"StatRetrievalError":null,"CommitIndex":2645985,"SnapshotTerm":1,"AbortTransactionsCount":0,"ReadOnlyTransactionCount":0,"ShardName":"member-2-shard-topology-operational","LeadershipChangeCount":1,"InMemoryJournalDataSize":3726},"timestamp":1490175977,"status":200} ,"value":{"ReadWriteTransactionCount":881228,"SnapshotIndex":2646308,"InMemoryJournalLogSize":5,"ReplicatedToAllIndex":2646308,"Leader":"member-3-shard-topology-operational","LastIndex":2646313,"RaftState":"Leader","LastCommittedTransactionTime":"2017-03-22 11:46:26.848","LastApplied":2646313,"LastLeadershipChangeTime":"2017-03-21 18:06:44.977","LastLogIndex":2646313,"PeerAddresses":"member-1-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.142:2550\/user\/shardmanager-operational\/member-1-shard-topology-operational, member-2-shard-topology-operational: akka.tcp:\/\/opendaylight-cluster-data@172.16.1.136:2550\/user\/shardmanager-operational\/member-2-shard-topology-operational","FollowerInitialSyncStatus":false,"WriteOnlyTransactionCount":0,"FollowerInfo":[{"timeSinceLastActivity":"00:00:00.070","active":true,"matchIndex":2646308,"voting": |
| Comments |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl2.tar.gz has been added with description: ODL2 Logs |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl3-part1.tar.gz has been added with description: Odl3 logs part 1 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl3-part2.tar.gz has been added with description: Odl3 logs part 2 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part1.tar.gz has been added with description: ODL1 logs part 1 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part2.tar.gz has been added with description: ODL1 logs part 2 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part3.tar.gz has been added with description: ODL1 logs part 3 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part4.tar.gz has been added with description: ODL1 logs part 4 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part5.tar.gz has been added with description: ODL1 logs part 5 |
| Comment by Guy Sela [ 22/Mar/17 ] |
|
Attachment odl1-part6.tar.gz has been added with description: ODL1 logs part 6 |
| Comment by Tom Pantelis [ 22/Mar/17 ] |
|
Where's the rest of the jolokia output from ODL3? ODL3 was the leader so it's important to see all the output. In any event, there were over 2.6M transactions generate in shard topology-operational - pretty high volume. However the timeout errors occurred for shard inventory-config. What volume of transactions are occurring for inventory-config? It would be useful to see the jolokia output. This is probably the same as It would be useful to profile the CPU/memory usage and GC activity as the test is running. Also how much memory was allocated to the JVM? Is it enough? |
| Comment by Guy Sela [ 23/Mar/17 ] |
|
>> Where's the rest of the jolokia output from ODL3? ODL3 was the leader so it's >> important to see all the output. >> In any event, there were over 2.6M transactions generate in shard topology-operational - pretty high volume. This is probably the same as >> It would be useful to profile the CPU/memory usage and GC activity as the test >> is running. Also how much memory was allocated to the JVM? Is it enough? |
| Comment by Robert Varga [ 23/Mar/17 ] |
|
That number of transactions is alarming – I think we should be teaching netvirt about transaction chains and pingpong... At any rate, this is definitely a duplicate of 5280. |
| Comment by Guy Sela [ 23/Mar/17 ] |
|
1) How do you know how many transactions happen per second ? This must be the more interesting number. |
| Comment by Guy Sela [ 23/Mar/17 ] |
|
CPU usage is around 120% on a 45 cores system (Which means it can get up to 4500%). |
| Comment by Guy Sela [ 23/Mar/17 ] |
|
Some of the time the system is at around 300% CPU from potential 4500% |
| Comment by Robert Varga [ 23/Mar/17 ] |
|
Re. 1) it's just a guess – the combination of the two things brings about batching, which reduces the number of transactions significantly – 100:1 is usual in BGP. It also has the added benefit of spreading the workload across multiple threads. Re. 2) there are no hard guidelines on when AskTimeouts start to happen – they happen if the backend cannot keep up with the frontend and that depends on a large number of factors:
|
| Comment by Tom Pantelis [ 24/Mar/17 ] |
|
The TransactionCreationRateLimit JMX attribute under DistributedOperationalDatastore gives a pretty good idea of transactions per sec. This is the calculated rate limit used to throttle transactions on the front-end. |