RPD_OS_MEMHIGH

This article describes the RPD_OS_MEMHIGH syslog message.

This log message indicates that the routing protocol daemon (rpd) is using too much Routing Engine memory and is running out of memory.

The message includes the amount and percentage of Routing Engine memory used, as shown in the example below:

Note that the message reports the amount of memory used in KB, the percentage of available memory used, or both.

Note: Under normal conditions, this message should not appear.

This message is reported every minute in the syslog message file under the following circumstances:

  • When the routing protocol daemon (rpd) process is using all available Routing Engine DRAM (on Routing Engines with a maximum of 2 GB of DRAM), or
  • When the memory utilization of the routing protocol daemon (rpd) process reaches the 2 GB memory limit (on Routing Engines with 4 GB of DRAM).

Symptoms of Memory Exhaustion

Symptoms of memory exhaustion include the following:

a. Persistent swapping activity

When the rpd is using all available memory, portions of the memory already in use might be swapped out to disk to accommodate new memory allocation requests.

In this case, the kernel writes the following type of message in the syslog file:

Note: This might happen on Routing Engines that have a maximum of 2 GB of DRAM, where the rpd can use all available DRAM. It might not happen on Routing Engines that have 4 GB of DRAM, where the rpd can use a maximum of only 2 GB of DRAM.

Under normal conditions, this message also should not appear.

b. Failure to add routes

When the rpd exhausts all available memory, the kernel might fail to allocate memory for new prefixes.

In this case, the kernel writes the following type of message in the syslog file:

Under normal conditions, this message also should not appear.

There are two possible causes for this problem:

  • Excessive use of system resources

A suboptimal design of the routing policies and routing protocols can lead to situations where the system needs most of its available resources, or even exceeds the limit.

  • Memory leakage by rpd

The rpd could be consuming all available memory because of a software issue in the routing protocol area of Junos OS.

Perform these steps to determine the cause and resolve the problem (if any):

1. Collect the show command output on the Routing Engine.

Capture the output to a file (in case you have to open a technical support case). To do this, configure your SSH client or terminal emulator to log the session.

2. Analyze the show command output.

a. Review the events that occurred at or just before the appearance of the RPD_OS_MEMHIGH message. These events might help identify the cause.

b. Check the number of routes installed in the routing table. Consult your Juniper Account or Sales team for scaling guidelines to verify that the node is not out of scale for routing table entries.

c. Check whether a peer mistakenly advertised an excessive number of routes. Configure prefix-limit to protect the node.

d. Check the number of MPLS label-switched paths (LSPs) and the amount of memory allocated for these LSPs. Consult your Juniper Account or Sales team for scaling guidelines to verify that the node is not out of scale for label-switched paths.

If the cause of high memory utilization is scaling, optimize the policy and routing protocol configuration. Alternatively, upgrade the Routing Engine hardware.

e. Identify the trigger, then correlate it to a specific event or change.
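Reviewing the events logged at or just before the first RPD_OS_MEMHIGH message (step a) can be partly automated. The following is a minimal Python sketch, not a Juniper tool. It assumes the syslog file has been copied off the router, that each line starts with a classic "Mon DD HH:MM:SS" timestamp, and that the year is supplied by the caller because that timestamp format does not carry one. The function names and the 10-minute default window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

MARKER = "RPD_OS_MEMHIGH"


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp (assumed format)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def events_before_memhigh(lines, year, window_minutes=10):
    """Return the lines logged in the window that ends at the first
    RPD_OS_MEMHIGH line, excluding the RPD_OS_MEMHIGH lines themselves."""
    parsed = []
    for line in lines:
        try:
            parsed.append((parse_ts(line, year), line))
        except ValueError:
            continue  # skip lines without a parsable timestamp

    first = next((ts for ts, line in parsed if MARKER in line), None)
    if first is None:
        return []  # the message never appears in this file

    start = first - timedelta(minutes=window_minutes)
    return [line for ts, line in parsed
            if start <= ts <= first and MARKER not in line]
```

A typical use would be to read the saved message file with `open(path).read().splitlines()` and pass the result in, widening `window_minutes` if nothing relevant shows up.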

Identify the Trigger

1. Determine when the memory leak started and the memory growth trend over time.

Provide memory utilization graphs from your SNMP monitoring tool, if available, for routers reporting the message.

From the memory graphs, estimate the following:

  • How quickly the routers exhaust all available memory.
  • How quickly memory has grown over certain fixed periods of time (for example, every month).
  • When the memory leak started.
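If the SNMP monitoring tool can export the raw samples, the three estimates above can also be computed rather than read off the graph. The sketch below is one possible approach under stated assumptions: samples are taken at known times expressed in days, memory is in KB, growth is roughly linear, and `limit_kb` is whatever ceiling applies to the Routing Engine in question. All function names and the `min_step_kb` threshold are hypothetical.

```python
def linear_fit(times, values):
    """Least-squares slope and intercept for values over times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def days_until_exhaustion(days, used_kb, limit_kb):
    """Estimate the days left before usage reaches limit_kb,
    extrapolating the fitted growth rate from the last sample.
    Returns None when memory is not growing."""
    slope, _ = linear_fit(days, used_kb)
    if slope <= 0:
        return None
    return (limit_kb - used_kb[-1]) / slope


def leak_start_index(used_kb, min_step_kb):
    """Index of the last sample before usage first jumps by at least
    min_step_kb between consecutive samples, or None if it never does."""
    for i in range(1, len(used_kb)):
        if used_kb[i] - used_kb[i - 1] >= min_step_kb:
            return i - 1
    return None
```

The slope returned by `linear_fit` is the growth rate in KB per day; multiplying it by 30 gives an approximate monthly figure. A linear fit is only a first approximation, so it is worth checking the result against the graph.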

Note: Often, the beginning of the memory leak corresponds on the graph to a significant memory gradient.

Correlate the Trigger to a Specific Event or Change

2. Identify which task inside the rpd is consuming more memory.
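One way to approach this is to take two snapshots of per-task memory usage some time apart and rank the tasks by growth. The sketch below assumes the per-task figures have already been collected into plain dictionaries mapping a task name to bytes in use; how they are collected and parsed is not shown, and the function name is hypothetical.

```python
def top_growing_tasks(before, after, top_n=5):
    """Rank tasks by memory growth between two snapshots.

    before, after: dicts mapping task name -> bytes in use.
    Tasks missing from the first snapshot are treated as starting at 0.
    Only tasks that actually grew are returned.
    """
    growth = {task: after[task] - before.get(task, 0) for task in after}
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return [(task, delta) for task, delta in ranked[:top_n] if delta > 0]
```

A task that keeps leading this ranking across several snapshot pairs, while the number of routes and LSPs stays stable, is a reasonable candidate to mention when searching for Problem Reports or opening a case.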

Does the trigger correlate to any of the following:

  • Junos OS software upgrade?
  • Software configuration changes?
  • Events related to network topology changes or external causes? (Events such as protocol flaps or specific messages or errors can be reported in the syslog message file.)

Yes – Search for existing Problem Reports. If none is found but the problem is reproducible in the lab, file a new Problem Report.

Note: Interim or final solutions depend on the previous step. As an interim solution, it might be necessary to disable specific features, or to stop provisioning the system with specific configuration, if those features or configuration changes contribute to the memory leak. An interim solution could also include restarting the rpd in a controlled manner on the most-affected routers. This includes not only routers that are already running out of resources, but also those where memory utilization is already high and resources are about to run out.

No – If the trigger is not identified or the problem is not reproducible, open a case with your technical support representative to investigate the issue further. Continue to Step 3.

3. Monitor other routers in the network.

The actions listed above can be executed for the rest of the routers in the network and can be repeated periodically.

This effort can help identify other routers where the rpd might be leaking memory even though resources are not yet exhausted.

For the other routers in the network, follow these recommendations:

  • Group all systems reporting the same symptoms.
  • Group all systems not reporting the same symptoms.
  • Identify common elements among these two groups.
  • Determine the commonalities among the routers reporting the same symptoms, and the differences between the two groups.

Some of the elements to take into account are listed below:

  • Junos OS software version.
  • Configuration changes.
  • Number of commits.
  • Events, periodic messages, or errors reported in the system message file.
  • Amount of memory leaked over the same periods of time.
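The grouping described above can be expressed as a small set comparison: keep the traits shared by every affected router, then drop any trait that also appears on an unaffected router. The sketch below assumes a hand-built inventory in which each router maps to a dictionary of attributes (for example software version or an enabled feature); the attribute names, values, and function name are placeholders chosen for illustration.

```python
def distinguishing_traits(inventory, affected):
    """Return the attribute/value pairs shared by all affected routers
    and absent from every unaffected router.

    inventory: dict mapping router name -> dict of attribute -> value
               (values must be hashable).
    affected:  set of router names reporting the symptoms.
    """
    affected_rows = [attrs for name, attrs in inventory.items()
                     if name in affected]
    healthy_rows = [attrs for name, attrs in inventory.items()
                    if name not in affected]
    if not affected_rows:
        return {}

    # Start from the first affected router and intersect with the rest.
    common = set(affected_rows[0].items())
    for attrs in affected_rows[1:]:
        common &= set(attrs.items())

    # Remove anything that an unaffected router also has.
    for attrs in healthy_rows:
        common -= set(attrs.items())

    return dict(common)
```

An empty result does not rule out a common cause; it may only mean that the relevant attribute is not yet in the inventory.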

Note: Simulations in a lab environment can also help to further identify the triggers.

Do other routers in the network report the message?

  • Yes – The traits that the affected routers have in common give a clue about the root cause.
  • No – Continue to Step 4.

4. If these efforts do not resolve the problem, contact your technical support representative to investigate the issue further.

About the author

Prasanna
