VRRP mastership switchover taking too long

A VRRP mastership switchover usually completes almost instantly, but sometimes it takes much longer. When the switchover was tested repeatedly by deleting a logical interface, a delay of 20-30 seconds was observed in some of the runs.

This slow VRRP convergence can occur when the current master router is unable to send a VRRP packet with “priority 0” to the backup peer before the logical interface goes down. According to RFC 3768, a priority value of zero (0) has a special meaning: it indicates that the current master has stopped participating in VRRP. It is used to trigger backup routers to transition to master quickly, without having to wait for the current master to time out. If the priority=0 packet leaves the master router and reaches the backup router, the mastership switchover completes within a second. If, however, disabling the logical interface with the commands set interfaces ae12.2001 disable and commit takes the interface down before the priority=0 packet can go out, the backup router never receives it and has to wait until the hold-timer expires. Because of this race condition between committing the configuration change (disabling the logical interface) and sending the priority=0 packet, the long VRRP convergence is seen only some of the time.
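The difference between the two outcomes can be illustrated with the timer formulas from RFC 3768 (Skew_Time and Master_Down_Interval). The following is a minimal sketch, not Junos code: the function names are hypothetical, the 1-second advertisement interval is the RFC default rather than a value taken from the affected routers, and the result is not meant to reproduce the 20-30 second delay, which depends on the timers actually configured.

```python
# Sketch of the RFC 3768 backup-router timers. Function names are
# hypothetical; times are in seconds.

def skew_time(backup_priority: int) -> float:
    """Skew_Time = (256 - Priority) / 256, per RFC 3768."""
    if not 1 <= backup_priority <= 254:
        raise ValueError("backup priority must be in the range 1-254")
    return (256 - backup_priority) / 256


def master_down_interval(backup_priority: int,
                         advertisement_interval: float = 1.0) -> float:
    """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    return 3 * advertisement_interval + skew_time(backup_priority)


def backup_wait(backup_priority: int,
                advertisement_interval: float = 1.0,
                got_priority_zero: bool = False) -> float:
    """Upper bound on how long the backup waits before becoming master.

    If the priority=0 advertisement arrives, the backup shortens its
    Master_Down_Timer to Skew_Time. Otherwise it waits for the full
    Master_Down_Interval, measured from the last advertisement received.
    """
    if got_priority_zero:
        return skew_time(backup_priority)
    return master_down_interval(backup_priority, advertisement_interval)


if __name__ == "__main__":
    # Priority 90 is assumed here for the backup router.
    print("priority=0 received:", backup_wait(90, got_priority_zero=True))
    print("priority=0 lost:    ", backup_wait(90))
```

With these assumptions, a backup that receives the priority=0 packet takes over in well under a second, while one that misses it has to sit through several advertisement intervals first. Longer configured intervals stretch the second case proportionally.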

Topology

The messages.0 log is from the problem case. mgd sent a notification to dcd but then appeared to stall, and IFL ae12.2001 went down. mgd resumed and sent the notification to vrrpd one second later, but by then the IFL was already down.

The messages.1 log is from the ideal case. mgd sent the notification to vrrpd first, and only then did IFL ae12.2001 go down. In this scenario, the priority=0 packet could be advertised.

This race condition between committing the configuration (disabling the logical interface) and sending the priority=0 packet is caused by a software issue, which was resolved by PR 559366. The issue is reproducible in Junos OS releases 10.4, 11.1, 11.2, 11.3 and 11.4R1.

The fix is available in 11.4R2, 12.1 and later releases. On affected releases, the issue can be worked around by first setting a lower priority value on the current master (lower than 90), which initiates the VRRP mastership switchover. The IFL can then be disabled after the switchover is completed. If the physical cable goes down on the master, the long downtime cannot be avoided, because the backup is unable to detect the downed link beyond the Layer 2 switch (L2SW) and has to wait until the hold-timer expires.
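The reasoning behind the workaround can be sketched as a simple priority comparison. This is an illustration only: the function name is hypothetical, the value 90 is assumed to be the backup router's priority (the text above only says "lower than 90"), and preemption is assumed to be enabled on the backup.

```python
# Sketch of why lowering the master's priority triggers a switchover.
# Assumes RFC 3768 behaviour; the function name is hypothetical.

def backup_takes_over(master_priority: int,
                      backup_priority: int,
                      preempt: bool = True) -> bool:
    """Return True if the backup should take mastership from a live master.

    With preemption enabled, a backup ignores advertisements whose priority
    is lower than its own, lets its Master_Down_Timer expire, and becomes
    master. Without preemption it keeps deferring to the current master.
    """
    return preempt and master_priority < backup_priority


if __name__ == "__main__":
    backup = 90  # assumed backup priority
    for master in (100, 90, 80):
        print(master, "->", backup_takes_over(master, backup))
```

Because the master is still up while this happens, the handover is planned and does not depend on the priority=0 packet getting out before the interface goes down.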

About the author

Prasanna
