[BGPCEP-494] PCRpt received with bandwidth reoptimization object leads to loop causing OOM Created: 20/Jul/16  Updated: 03/Mar/19  Resolved: 10/Aug/16

Status: Resolved
Project: bgpcep
Component/s: PCEP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Ajay L Assignee: Ajay L
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: PNG File Pending-pcrpt-objects.png     PNG File Processed-pcrpt-objects.png     PNG File lsp-symbolic-path-name.png     File pcrpt-oom-repro.pcap    
External issue ID: 6242

 Description   

The issue occurs in a real network environment. If a PCRpt message is received with a bandwidth reoptimization object (ref: https://tools.ietf.org/html/rfc5440#section-7.7), the controller loops and ultimately fails with a heap OutOfMemory error. The crafted packet used to reproduce the issue is attached. It is not clear at this point why the PCC is including a bandwidth reoptimization object in the PCRpt.



 Comments   
Comment by Ajay L [ 20/Jul/16 ]

Attachment pcrpt-oom-repro.pcap has been added with description: PCRpt used to reproduce the issue

Comment by Al Goddard [ 22/Jul/16 ]

Cisco DE is investigating if/how a 0x5 / 0x2 BW object could be sent from the router, but current XR code is only expected to send 0x5 / 0x5 BW Sample.

Comment by Ajay L [ 25/Jul/16 ]

Based on analysis of the OOM heap dump, the controller is receiving a PCRpt with objects in the following sequence:
LSP
ERO
LSPA
Bandwidth
Bandwidth Reoptimization

RFC 5440 (https://tools.ietf.org/html/rfc5440#section-7.7) describes the 2 types of bandwidth objects. Its description of the R bit of the RP object (section 7.4.1) specifies under what scenario the reoptimization bandwidth object is used:

" o R (Reoptimization - 1 bit): when set, the requesting PCC specifies
that the PCReq message relates to the reoptimization of an
existing TE LSP. For all TE LSPs except zero-bandwidth LSPs, when
the R bit is set, an RRO (see Section 7.10) MUST be included in
the PCReq message to show the path of the existing TE LSP. Also,
for all TE LSPs except zero-bandwidth LSPs, when the R bit is set,
the existing bandwidth of the TE LSP to be reoptimized MUST be
supplied in a BANDWIDTH object (see Section 7.7). This BANDWIDTH
object is in addition to the instance of that object used to
describe the desired bandwidth of the reoptimized LSP. For zero-
bandwidth LSPs, the RRO and BANDWIDTH objects that report the
characteristics of the existing TE LSP are optional."

<request>::= <RP>
             <END-POINTS>
             [<LSPA>]
             [<BANDWIDTH>]
             [<metric-list>]
             [<RRO>[<BANDWIDTH>]]
             [<IRO>]
             [<LOAD-BALANCING>]
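
For reference, the two BANDWIDTH variants discussed here differ only in the Object-Type field of the PCEP common object header: class 5, type 1 for the requested bandwidth and type 2 for the bandwidth of an existing TE LSP being reoptimized (RFC 5440 sections 7.2 and 7.7). The following is a minimal Python sketch of decoding such an object from raw bytes. It is not the ODL parser; the function name and the keys of the returned dictionary are made up for illustration.

```python
import struct

BANDWIDTH_CLASS = 5
# Object-Type values for the BANDWIDTH object, per RFC 5440 section 7.7.
BANDWIDTH_TYPES = {
    1: "requested bandwidth",
    2: "bandwidth of existing TE LSP (reoptimization)",
}


def decode_bandwidth_object(data: bytes) -> dict:
    """Decode one PCEP BANDWIDTH object: 4-byte object header + 4-byte body.

    Hypothetical helper for illustration only.
    """
    if len(data) < 8:
        raise ValueError("a BANDWIDTH object is 8 bytes long")
    # Header: Object-Class (8 bits), OT (4) | Res (2) | P (1) | I (1), Length (16).
    obj_class, type_and_flags, length = struct.unpack("!BBH", data[:4])
    if obj_class != BANDWIDTH_CLASS:
        raise ValueError(f"not a BANDWIDTH object: class {obj_class}")
    if length != 8:
        raise ValueError(f"unexpected BANDWIDTH object length {length}")
    obj_type = type_and_flags >> 4
    # Body: bandwidth as a 32-bit IEEE float, in bytes per second.
    (bandwidth,) = struct.unpack("!f", data[4:8])
    return {
        "type": obj_type,
        "meaning": BANDWIDTH_TYPES.get(obj_type, "unknown"),
        "p_flag": bool(type_and_flags & 0x02),
        "i_flag": bool(type_and_flags & 0x01),
        "bandwidth": bandwidth,
    }
```

With this layout, the header bytes 05 20 00 08 identify the type 2 (reoptimization) object, and 05 10 00 08 identify the type 1 object.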

Stateful PCEP draft (ref: https://tools.ietf.org/html/draft-ietf-pce-stateful-pce-15) introduces PCRpt message and it is defined as below:

<PCRpt Message> ::= <Common Header>
                    <state-report-list>
Where:

   <state-report-list> ::= <state-report>[<state-report-list>]

   <state-report> ::= [<SRP>]
                      <LSP>
                      <path>
Where:
   <path>::= <intended_path><attribute-list>[<actual_path>]

Where:
   <intended_path> is represented by the ERO object defined in
   section 7.9 of [RFC5440].
   <attribute-list> is defined in [RFC5440] and extended by
   PCEP extensions.
   <actual_path> is represented by the RRO object defined in
   section 7.10 of [RFC5440].

Now attribute-list in RFC 5440 includes the bandwidth object:

<attribute-list>::=[<LSPA>]
                   [<BANDWIDTH>]
                   [<metric-list>]
                   [<IRO>]

<metric-list>::=<METRIC>[<metric-list>]

So technically the bandwidth reoptimization object can be expected within a PCRpt message.

Proposed fix will make 2 changes:
1. Add bandwidth reoptimization object to the list of possible objects in PCRpt message
2. Gracefully handle condition where unexpected object is present in received message
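
The two changes can be illustrated with a small Python sketch. This is not the ODL code and makes no claim about where exactly the original loop was; it only shows one way to consume the objects of a state report in grammar order so that every iteration makes progress and an unexpected object is reported rather than retried. Objects are modelled as plain name strings, the function names are hypothetical, and placing the reoptimization bandwidth directly after BANDWIDTH is an assumption.

```python
def parse_state_report(objects, start, accept_reopt_bandwidth):
    """Try to parse one state report beginning at objects[start].

    Returns (consumed_names, next_index), or (None, start) if no valid
    report starts here. Hypothetical helper for illustration only.
    """
    i = start
    consumed = []

    def take(name):
        nonlocal i
        if i < len(objects) and objects[i] == name:
            consumed.append(name)
            i += 1
            return True
        return False

    take("SRP")                          # optional
    if not take("LSP") or not take("ERO"):
        return None, start               # LSP and intended path are mandatory
    optional = ["LSPA", "BANDWIDTH"]
    if accept_reopt_bandwidth:           # change 1: allow the type 2 object
        optional.append("BANDWIDTH_REOPT")
    for name in optional:
        take(name)
    while take("METRIC"):                # metric-list
        pass
    take("IRO")
    take("RRO")                          # optional actual path
    return consumed, i


def parse_pcrpt(objects, accept_reopt_bandwidth=True):
    """Split a PCRpt object list into state reports plus unexpected objects."""
    reports, unexpected = [], []
    i = 0
    while i < len(objects):
        report, nxt = parse_state_report(objects, i, accept_reopt_bandwidth)
        if report is None:
            # Change 2: record and skip the object so the loop always advances.
            unexpected.append(objects[i])
            i += 1
        else:
            reports.append(report)
            i = nxt
    return reports, unexpected
```

Fed the sequence from the heap dump (LSP, ERO, LSPA, BANDWIDTH, BANDWIDTH_REOPT), the sketch returns a single report when the reoptimization object is accepted. With acceptance disabled, it returns the report without that object and lists it as unexpected, and the loop still terminates.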

Comment by Ajay L [ 25/Jul/16 ]

(In reply to Al Goddard from comment #1)
> Cisco DE is investigating if/how a 0x5 / 0x2 BW object could be sent from
> the router, but current XR code is only expected to send 0x5 / 0x5 BW Sample.

Thx Al for the update. FYI - in the PCRpt sent by CRS which causes the issue, the bandwidth object has bandwidth value as zero whereas the bandwidth reoptimization object has non-zero bandwidth value

Comment by Ajay L [ 25/Jul/16 ]

master: https://git.opendaylight.org/gerrit/42434
stable/beryllium: https://git.opendaylight.org/gerrit/42435

Comment by Al Goddard [ 25/Jul/16 ]

(In reply to Ajay L from comment #3)
> (In reply to Al Goddard from comment #1)
> > Cisco DE is investigating if/how a 0x5 / 0x2 BW object could be sent from
> > the router, but current XR code is only expected to send 0x5 / 0x5 BW Sample.
>
> Thx Al for the update. FYI - in the PCRpt sent by CRS which causes the
> issue, the bandwidth object has bandwidth value as zero whereas the
> bandwidth reoptimization object has non-zero bandwidth value


From Cisco PCEP DE:

Hi Al,

I checked the code, and do not see a way for the PCE report message to ever contain the type 2 BW object.
This is based on the code provided in 5.3.3, and the code in the SMU/ engineering images.

For now, until such a packet is actually seen, I would be disinclined to think that was the trigger.
As discussed, I will cook up a test XRv image for you to actually send this object from the router though.

Thanks,
Jon

Comment by Ajay L [ 25/Jul/16 ]

(In reply to Al Goddard from comment #5)
> (In reply to Ajay L from comment #3)
> > (In reply to Al Goddard from comment #1)
> > > Cisco DE is investigating if/how a 0x5 / 0x2 BW object could be sent from
> > > the router, but current XR code is only expected to send 0x5 / 0x5 BW Sample.
> >
> > Thx Al for the update. FYI - in the PCRpt sent by CRS which causes the
> > issue, the bandwidth object has bandwidth value as zero whereas the
> > bandwidth reoptimization object has non-zero bandwidth value
>
> ----
> From Cisco PCEP DE:
>
> Hi Al,
>
> I checked the code, and do not see a way for the PCE report message to ever
> contain the type 2 BW object.
> This is based on the code provided in 5.3.3, and the code in the SMU/
> engineering images.
>
> For now, until such a packet is actually seen, I would be disinclined to
> think that was the trigger.
> As discussed, I will cook up a test XRv image for you to actually send this
> object from the router though.
>
> Thanks,
> Jon

Attaching a couple of screenshots from the heap dump analysis which show the various objects, including the bandwidth and reoptimization bandwidth objects, received in the PCRpt (after parsing by ODL code)

Comment by Ajay L [ 25/Jul/16 ]

Attachment Processed-pcrpt-objects.png has been added with description: Processed PCRpt objects

Comment by Ajay L [ 25/Jul/16 ]

Attachment Pending-pcrpt-objects.png has been added with description: Pending PCRpt objects

Comment by Al Goddard [ 26/Jul/16 ]

Additional info/request from Cisco DE:

Is this from the heap dump after the heap went OOM?
There are a number of steps between a message being sent and it arriving parsed and settled as internal objects in the controller heap.

Am I correct that the 0x5/0x5 BW object is not visible here?

If it is possible to correlate this back to an actual message, that would help.
If that is not possible, correlating back to an LSP would help also. (assuming router state is available)
Ideally, we would have a procedure to replicate this, or was this a one-off?

Comment by Ajay L [ 26/Jul/16 ]

(In reply to Al Goddard from comment #9)
> Additional info/request from Cisco DE:
>
>
> Is this is from the heap dump after the heap went OOM?
> There are a number of steps between a message being sent and it arriving
> parsed and settled as internal objects in the controller heap.

Agree. But analysis so far does not show any issue in ODL parsing logic

>
> Am I correct that the 0x5/0x5 BW object is not visible here?

0x5/0x5 BW object? Did u mean 0x5/0x1 or 0x5/0x2? I see both of those objects in the heap dump

>
> If it is possible to correlate this back to an actual message, that would
> help.
> If that is not possible, correlating back to an LSP would help also.
> (assuming router state is available)

Attaching a screenshot showing LSP symbolic name which is "DCCRS2_t7"

> Ideally, a procedure to replicate this would be ideal, or was this a one-off?

Agree. But I think this has been seen only once so far

Comment by Ajay L [ 26/Jul/16 ]

Attachment lsp-symbolic-path-name.png has been added with description: LSP symbolic path-name

Comment by Al Goddard [ 27/Jul/16 ]

Can you answer the two questions from Cisco:

The image with the knob to change the bandwidth value is building now and should be available in a few hours.
I also wanted to address a couple of the outstanding points/questions (see below, responses in bold).

Thanks,
Jon

1. Regarding this:
> Is this is from the heap dump after the heap went OOM?
> There are a number of steps between a message being sent and it arriving
> parsed and settled as internal objects in the controller heap.

Agree. But analysis so far does not show any issue in ODL parsing logic

My understanding is that it was the parsing of the object that resulted in the eventual looping until OOM state. Was it something else?
So far, the analysis of messages captured before and since does not point to a problem here.

Can you answer: _______________

2. Do you send an RP object R-bit=1 in any messages? Spec shows this object as part of a PCRep message, not a PCRpt message.

The router implements RFC5440 as well as stateful PCEP drafts, so it can originate and process PCRep messages which do contain the RP object.
My understanding (from the attached) was that the memory was being interpreted as a PCRpt message. (Unless this screenshot is a result of the “crafted” message that was used.)

Can you answer: _______________

3. Let me know if there are any specific debugs to validate the LSP 7 (shown below) would be sending this object.

The ‘dump-messages’ debug can provide such low-level debugging of all messages originating from or arriving on the router, if this can be recreated.

Comment by Ajay L [ 28/Jul/16 ]

(In reply to Al Goddard from comment #12)
> Can you answer the two questions from Cisco:
>
>
>
> The image with the knob to change the bandwidth value is building now and
> should be available in a few hours.
> I also wanted to address a couple of the outstanding points/questions (see
> below, responses in bold).
>
> Thanks,
> Jon
>
> 1. Regarding this:
> > Is this is from the heap dump after the heap went OOM?
> > There are a number of steps between a message being sent and it arriving
> > parsed and settled as internal objects in the controller heap.
>
> Agree. But analysis so far does not show any issue in ODL parsing logic
>
> My understanding is that it was the parsing of the object that resulted in
> the eventual looping until OOM state. Was it something else?
> So far, the analysis of messages captured before and since does not point to
> a problem here.
>
> Can you answer: _______________

Processing of the objects in the PCRpt message caused the loop. I was referring to the fact that the issue was in the processing of the objects, not in de-serializing or parsing the data received from the wire into objects. So we still believe that a type=2 bandwidth object was somehow received

>
> 2. Do you send an RP object R-bit=1 in any messages? Spec shows this
> object as part of at PCRep message, not a PCRpt message.
>
> The router implements RFC5440 as well as stateful PCEP drafts, so it can
> originate and process PCRep messages which do contain the RP object.
> My understanding (from the attached) was that the memory was being
> interpreted as a PCRpt message. (Unless this screenshot is a result of the
> “crafted” message that was used.)'
>
> Can you answer: _______________
>

The screenshot is from the OOM heap dump seen in the ATT setup, not from the crafted message. We do not believe a PCRep is being interpreted as a PCRpt here. Per RFC 5440: "PCRep is a PCEP message sent by a PCE to a requesting PCC in response to a previously received PCReq message." So a PCRep is supposed to be received by the PCC (the router in this case), not by the controller
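
As an aside on telling the two messages apart on the wire: the PCEP common header carries a Message-Type field, so a capture can be checked directly. Below is a minimal Python sketch of decoding that header. The helper name is hypothetical, and the type values for PCRpt (10) and PCUpd (11) are stated from my reading of the IANA PCEP registry rather than taken from this ticket.

```python
import struct

# Message-Type values: 1-7 from RFC 5440; 10 and 11 assumed from the
# IANA PCEP registry for the stateful extensions.
MESSAGE_TYPES = {
    1: "Open", 2: "Keepalive", 3: "PCReq", 4: "PCRep",
    5: "PCNtf", 6: "PCErr", 7: "Close", 10: "PCRpt", 11: "PCUpd",
}


def decode_common_header(data: bytes) -> dict:
    """Decode the 4-byte PCEP common header (hypothetical helper)."""
    if len(data) < 4:
        raise ValueError("the PCEP common header is 4 bytes long")
    # Ver (3 bits) | Flags (5 bits), Message-Type (8), Message-Length (16).
    ver_flags, msg_type, length = struct.unpack("!BBH", data[:4])
    return {
        "version": ver_flags >> 5,
        "flags": ver_flags & 0x1F,
        "type": msg_type,
        "name": MESSAGE_TYPES.get(msg_type, "unknown"),
        "length": length,
    }
```

Under these assumptions, a header starting with bytes 20 0a is a version 1 PCRpt, while 20 04 is a PCRep.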

>
> 3. Let me know if there are any specific debugs to validate the LSP 7 (shown
> below) would be sending this object.
>
> The ‘dump-messages’ debug can provide such low-level debugging of all
> messages originating from or arriving on the router, if this can be
> recreated..

Comment by Milos Fabian [ 28/Jul/16 ]

master: https://git.opendaylight.org/gerrit/#/c/42434/
