Troubleshoot the 'DCD_CONFIG_WRITE_FAILED' log message

The DCD_CONFIG_WRITE_FAILED log message is generated in Junos by the device control daemon (dcd) when an interface configuration write to the kernel fails.

The following cases have been captured:

Case 1 – Incorrect encapsulation set on the interface causes the error message to be generated.
Case 2 – Illegal encapsulation for per-unit-scheduler.

Other possible causes for this log message are:

Out of Order Messages:

These messages occur when rtsock messages are sent to the kernel out of order. For example, an IFL (logical interface) is added to a nonexistent IFD (physical interface) because the IFD add message reaches the kernel after the IFL add message. This might also occur when an IFD is deleted before its IFLs are deleted.

In these cases, the rtsock trace clearly indicates the order in which the messages are sent to the kernel.
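The ordering rule described above can be checked mechanically once the trace has been reduced to a list of events. The following is a minimal sketch, not a parser for real rtsockmon output: the (operation, kind, name, parent) tuple format and the function name are assumptions made for illustration, and the trace would have to be converted into that shape first.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical, simplified event model: (operation, kind, name, parent).
# Real rtsockmon output is richer and must be parsed into this shape first.
Event = Tuple[str, str, str, str]


def find_ordering_violations(events: Iterable[Event]) -> List[str]:
    """Report IFL adds without a live IFD and IFD deletes with live IFLs."""
    live_ifds: Set[str] = set()
    live_ifls: Dict[str, str] = {}  # IFL name -> parent IFD name
    problems: List[str] = []

    for op, kind, name, parent in events:
        if kind == "ifd" and op == "add":
            live_ifds.add(name)
        elif kind == "ifd" and op == "delete":
            orphans = sorted(ifl for ifl, p in live_ifls.items() if p == name)
            if orphans:
                problems.append(
                    f"IFD {name} deleted before IFL(s) {', '.join(orphans)}"
                )
            live_ifds.discard(name)
        elif kind == "ifl" and op == "add":
            if parent not in live_ifds:
                problems.append(f"IFL {name} added before IFD {parent}")
            live_ifls[name] = parent
        elif kind == "ifl" and op == "delete":
            live_ifls.pop(name, None)
    return problems
```

An IFD that already existed before the trace started is reported as missing by this sketch, so the trace should begin before the interfaces in question are created.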

Device is over-provisioned:

This is a rare scenario that occurs only when on the order of 45k or more IFLs are created. After a certain number of IFLs (roughly 8k per PIC), the kernel generates the No space left on device message, and every subsequent IFL creation request fails. DCD keeps the failed entries in the config thread and tries to push them to the kernel again after a timeout.

Trying to re-add existing entities:

Attempts to re-add existing entities, such as an IFD, IFL, or IFF, also lead to this error. Make sure that a delete message is sent before the add message for the failed entity.

IH Case – Interface Hierarchy Use in Channelized Interfaces:

In this case, you will not see any error messages from DCD or the kernel. However, there will be issues on the PFE side, and the PFE might core or generate errors when the IH IFD messages reach the PFE out of order.

When deleting and adding IH IFDs, you need to maintain the ordering between the two peer channels and send the deletes to the PFE as a batch, so that the allocation request for time slots succeeds. For example, if you have multiple coc1 partitions on a coc3, all the deletes of the coc1 partitions have to be sent to the kernel as a batch. dcd_ih_sort_queue is the function that sorts the config thread to enforce this ordering.
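The batching rule can be illustrated with a stable sort. This is only a sketch of the ordering idea and not the actual dcd_ih_sort_queue implementation; the PendingOp record, the batch_deletes_first function, and the interface names in the usage note are hypothetical.

```python
from typing import Dict, List, NamedTuple


class PendingOp(NamedTuple):
    """Hypothetical queued operation; not a real dcd data structure."""
    action: str  # "delete" or "add"
    parent: str  # parent channelized IFD, for example a coc3
    name: str    # partition being deleted or added, for example a coc1


def batch_deletes_first(queue: List[PendingOp]) -> List[PendingOp]:
    """Group operations by parent and move each parent's deletes ahead of its adds."""
    # Remember where each parent first appears so parents keep their order.
    first_seen: Dict[str, int] = {}
    for index, op in enumerate(queue):
        first_seen.setdefault(op.parent, index)

    # sorted() is stable, so operations keep their relative order
    # inside each (parent, delete-or-add) group.
    return sorted(
        queue,
        key=lambda op: (first_seen[op.parent], op.action != "delete"),
    )
```

For a queue that interleaves delete and add operations for two partitions of the same parent, the result places both deletes first and both adds after them, which is the batch ordering the paragraph above describes.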

ENOMEM:

The kernel can generate this error in the following cases:

The kernel is out of memory and unable to allocate new resources. This occurs only when very large configurations (roughly more than 65k IFLs) are being used.
A state change occurred on the entity before all the daemons were done with the previous state update. First, check whether the configuration is being applied successfully; most of the time, the configuration is applied successfully in spite of the error message.

Case 1 – dcd config write failed due to ‘illegal media change’:

The following errors were generated when the dcd config write error occurred:

May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 625, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-7/0/0.101
May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 626, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-7/0/0.102
May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 627, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-7/0/0.103
May 7 22:42:56 xxx-gw4.m320 rpd[1855]: bgp_ifachange_group:6329: NOTIFICATION sent to 114.6.4.110 (External AS 45727): code 6 (Cease) subcode 6 (Other Configuration Change), Reason: Interface change for the peer-group
May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 623, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-7/0/0.566
May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 587, ifAdminStatus down(2), ifOperStatus down(2), ifName ge-7/0/0.32767
May 7 22:42:56 xxx-gw4.m320 /kernel: ge-4/1/3: Illegal media change. Flexible-Ethernet-Services is invalid
May 7 22:42:56 xxx-gw4.m320 dcd[1905]: DCD_CONFIG_WRITE_FAILED: Interface ge-4/1/3, configuration write failed for an IFD CHANGE: Operation not supported
May 7 22:42:56 xxx-gw4.m320 /kernel: ge-7/0/0: Illegal media change. Flexible-Ethernet-Services is invalid
May 7 22:42:56 xxx-gw4.m320 dcd[1905]: DCD_CONFIG_WRITE_FAILED: Interface ge-7/0/0, configuration write failed for an IFD CHANGE: Operation not supported
May 7 22:42:56 xxx-gw4.m320 mib2d[1901]: SNMP_TRAP_LINK_DOWN: ifIndex 624, ifAdminStatus up(1), ifOperStatus down(2), ifName ge-4/1/3.566

In the above logs, each DCD_CONFIG_WRITE_FAILED message is immediately preceded by a kernel log that indicates an illegal media change. This means that the interface does not support the Flexible-Ethernet-Services encapsulation. To resolve the issue, configure a correct (supported) encapsulation on the interface.
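Pairing each DCD_CONFIG_WRITE_FAILED entry with the kernel message that precedes it for the same interface can be scripted. The following is a minimal sketch that assumes the two syslog line layouts shown in this article; the regular expressions and the correlate function are illustrative and may need adjusting for other message formats.

```python
import re
from typing import Dict, Iterable, List

# Matches kernel lines such as
#   "... /kernel: ge-4/1/3: Illegal media change. ..."
#   "... /kernel: %KERN-4: ds-2/2/0:1: illegal encapsulation ..."
KERNEL_RE = re.compile(
    r"/kernel: (?:%\S+: )?(?P<ifname>[A-Za-z0-9/:.-]+): (?P<reason>.+)$"
)

# Matches the dcd line with or without a "%DAEMON-3-" style prefix.
DCD_RE = re.compile(
    r"DCD_CONFIG_WRITE_FAILED: Interface (?P<ifname>[^,]+), "
    r"configuration write failed for an (?P<op>.+?): (?P<error>.+)$"
)


def correlate(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Attach the most recent kernel reason to each dcd write failure."""
    last_kernel_reason: Dict[str, str] = {}
    findings: List[Dict[str, str]] = []

    for raw in lines:
        line = raw.rstrip("\n")
        kernel = KERNEL_RE.search(line)
        if kernel:
            last_kernel_reason[kernel["ifname"]] = kernel["reason"]
            continue
        dcd = DCD_RE.search(line)
        if dcd:
            findings.append({
                "interface": dcd["ifname"],
                "operation": dcd["op"],
                "error": dcd["error"],
                "kernel_reason": last_kernel_reason.get(
                    dcd["ifname"], "(no preceding kernel message)"
                ),
            })
    return findings
```

Passing the lines of a messages file to correlate returns one record per write failure, so the kernel reason for each affected interface can be read without scanning the surrounding SNMP and BGP entries.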

Case 2 – dcd config write failed due to ‘illegal encapsulation for per-unit-scheduler’:

The following errors were generated when the dcd config write error occurred:

Jun 25 12:26:26 xxx-01-re0 /kernel: %KERN-4: ds-2/2/0:1: illegal encapsulation for per-unit-scheduler
Jun 25 12:26:26 xxx-01-re0 dcd[1454]: %DAEMON-3-DCD_CONFIG_WRITE_FAILED: Interface ds-2/2/0:1, configuration write failed for an IFD CHANGE: Invalid argument

According to the above logs, the kernel reported an illegal encapsulation for per-unit-scheduler on the ds-2/2/0:1 interface. The encapsulation configured on the interface itself is PPP, so it is not immediately clear why the kernel reports this error.

Check whether the interface is a member of an LSQ (multilink bundle) interface that is configured with MLPPP encapsulation. In that case, the kernel is reporting that per-unit-scheduler is not compatible with the ds-2/2/0:1 physical interface, because that interface is part of the logical LSQ interface and therefore carries MLPPP. The solution is to configure per-unit-scheduler only on the multilink LSQ interface and not on the physical interface.

The cause might not always be this obvious from the logs. For further troubleshooting, the following information is required to effectively debug such an issue:

/var/log/messages* (full DCD log messages).

dcd.log (configure traceoptions under the interfaces hierarchy with the file set to dcd.log and the flag set to all).

Full kernel log messages:

set system syslog file kern kernel any
set system syslog time-format millisecond

rtsockmon messages, captured from the shell:

rtsockmon -t >/var/log/rtsockmon.log

When the logs rotate, make sure to obtain the *.gz files.

Start collecting all of this data before the error occurs, because earlier configuration changes can lead to this error. Then share the information with Engineering for further analysis.
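Because rotated logs are compressed, a search for the error has to read both the plain files and the *.gz files. The following is a minimal sketch of such a search, assuming the logs have been copied to a host where Python is available; the function names are illustrative, and the default /var/log/messages* pattern is taken from the list above.

```python
import glob
import gzip
from typing import Iterator, List


def iter_log_lines(pattern: str) -> Iterator[str]:
    """Yield lines from every file matching pattern, plain or gzip-compressed."""
    for path in sorted(glob.glob(pattern)):
        # Rotated logs end in .gz; everything else is read as plain text.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")


def find_write_failures(pattern: str = "/var/log/messages*") -> List[str]:
    """Return every log line that mentions DCD_CONFIG_WRITE_FAILED."""
    return [
        line for line in iter_log_lines(pattern)
        if "DCD_CONFIG_WRITE_FAILED" in line
    ]
```

Files are read in sorted filename order, which is not necessarily chronological order, so use the timestamps in the returned lines to sequence the events.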
