My team ran into a nasty ACI bug (CSCva68310) that prevents you from adding nodes to an ACI fabric during setup. Here’s a quick write-up so that the next poor soul who spends WAY too much time struggling with fabric provisioning can hopefully get it fixed straightaway.
The team unboxed a brand new trio of APIC-CLUSTER-L3 servers, ran the initial setup from the KVM, and connected them to a Nexus 93180-YC leaf. The 93180 was connected to a 9336 spine. Nothing complicated at all.
No matter what the team did, the leaf and spine sat in “Inactive” mode and would not change state to Active during fabric discovery.
From the GUI, here’s what it showed:
On the leaf, ‘show discoveryissues’ reported two errors: “Registration to all PM shards is not complete” and “Policy download is not complete”.
Since we only had one APIC member online and hadn’t even discovered the fabric yet, how could it be a shard issue?
After completely wiping the fabric (APICs with ‘acidiag touch clean’, ‘acidiag touch setup’, ‘acidiag reboot’, and the spine/leaf with ‘setup-clean-config.sh’), the issue still persisted.
After the fabric wipe, we saw that the fabric (out of the box) was throwing Fault Code F3031. The description of that fault code is:
- “Failed to parse the subject line as a valid ACI fabric certificate AND Invalid Serial Number AND Invalid Product ID”
The problem here is not a self-signed certificate. It’s the Cisco Manufacturer Installed Certificate (MIC) that is put on the APIC at the factory. The only way to fix this is to call TAC and have them replace your MIC.
From the bug notes:
- Correct pattern: /serialNumber=PID: SN:/CN=
- Incorrect pattern: /CN=/serialNumber=PID: SN:
You can see your CN in the fault code at the bottom. Ours was obviously incorrect.
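If you want to sanity-check a subject line quickly, the two patterns above come down to field order: serialNumber before CN is good, CN before serialNumber is bad. Here is a minimal Python sketch of that check. The function name and the sample strings are made up for illustration, and it assumes you have already copied the subject text out of the fault yourself; it does not talk to the APIC.

```python
def classify_mic_subject(subject: str) -> str:
    """Classify a certificate subject line by the order of its fields.

    Returns "correct" when /serialNumber= comes before /CN=,
    "incorrect" when /CN= comes first, and "unknown" when either
    field is missing from the string.
    """
    sn_pos = subject.find("/serialNumber=")
    cn_pos = subject.find("/CN=")
    if sn_pos == -1 or cn_pos == -1:
        return "unknown"
    return "correct" if sn_pos < cn_pos else "incorrect"


if __name__ == "__main__":
    # Placeholder values only (hypothetical PID and serial number);
    # substitute the subject shown in your own F3031 fault.
    good = "/serialNumber=PID:EXAMPLE-PID SN:EXAMPLE0001/CN=EXAMPLE0001"
    bad = "/CN=EXAMPLE0001/serialNumber=PID:EXAMPLE-PID SN:EXAMPLE0001"
    print(classify_mic_subject(good))
    print(classify_mic_subject(bad))
```

This only compares the positions of the two fields. It does not validate the certificate itself, so treat an "incorrect" result as a reason to call TAC, not as proof.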
Bug ID CSCva68310 matches this almost perfectly. However, the bug notes say that your Fabric Authentication Policy has to be set to Strict (instead of Permissive). This was NOT the case for us: the bug applied out of the box with no security policy changes.
Call Cisco TAC, have them reference the Bug ID, and tell them you need your MIC certificate replaced. You need to have the certificate replaced on ALL APICs that have this fault. Replacing it on just the primary APIC will not allow the remaining APICs to join the fabric.
Please reach out if you have any questions or comments or need assistance with your ACI fabric (troubleshooting, analysis, audit, automation — my favorite). And Cisco, please fix the cert format.