Troubleshooting

Instance connectivity issues

Can’t reach instance at external IP address

The following situations may cause an instance to be unreachable at its external IP address:

Instance not running: The instance state may have changed or failed to stay up because of certain system issues. The first thing to check is whether the instance state is running. You can also check the serial console for errors during boot.
Access denied by VPC firewall rules: By default, VPCs are configured to allow inbound ICMP and TCP port 22 traffic from any external host. If the firewall rules have been changed to limit access, that could cause the instance to be unresponsive to ping or ssh at its external IP address. If you are accessing the instance using other ports or protocols, verify that the firewall rules allow that traffic.
Guest OS or application errors: Certain application processes may have exited and no longer serve external requests. Check the instance’s serial console for errors.

If the instance in question is newly created, you can check if any of the following conditions apply:

Instance failed to boot: This can happen when a boot disk is not found or the disk image is missing some bootloader dependencies. You can check the bootloader messages through the instance’s serial console. Based on the information from the serial console, you can troubleshoot the booting issue following the troubleshooting tips below.
No network interface: Instances with external IP addresses must also have a VPC network interface regardless of the need to communicate with other instances on the private network. If an instance is created without a network interface, it will be unable to send or receive external traffic.
sshd not running or SSH key missing: If you can connect to the instance’s external IP on port 22 but cannot SSH to it, the sshd process may not be running on the instance. Also check that you are using a valid login user and a private key matching one of the public keys in your user profile. Note that disk images from Linux distros usually have the sshd service enabled by default and the root user disabled.

Can’t reach instance at private IP address

Inter-VM traffic within or between VPC subnets is governed by firewall and routing rules. In addition to the causes mentioned above, other issues specific to private IP addresses are:

Access denied by VPC firewall rules: By default, VPCs are configured to allow all inter-VM traffic for instances within the same subnet. If the firewall rules have been changed to limit access, they may cause the instance to be unreachable from other VMs on the private IP. Check that you have the firewall rules configured to allow inbound traffic on the related protocols and ports.
Inter-subnet routing misconfigured: Subnets within the same VPC have routes to each other by default. Custom routes configured to drop or redirect traffic will override the default system routes if they affect the same destination IP address. Review the custom routers attached to the subnets involved and see if they have the correct route targets and destinations specified.
Inter-VPC subnet routing: Subnet routing across VPCs is not supported at this time. Traffic between instances in different VPCs, whether they are within the same project or different projects, can only go through their external IP addresses.

No outbound internet connectivity

Instances get their outbound access through internet gateways. The outbound access requires an instance’s external IPs (which are typically allocated from the default pool) to match with the pool attached to the corresponding VPC’s internet gateway (which is typically the system-created default gateway and linked to the default pool during VPC creation). A mismatch between them is possible whenever there is a change in the silo’s default IP pool. The previously created internet gateways are still linked to the original default pool but new instances get their external IPs from the new default pool.

To rectify the connectivity issue, you can follow one of the solutions below:

If the goal is to keep using both IP pools, you can link the new default pool to the default gateway via the attach IP pool to internet gateway API or its equivalent CLI command. This has to be done for each of the VPCs in all the projects of the silo.
If the goal is to totally replace the use of the old pool with the new one, besides attaching the new pool to the default gateway, you will also need to detach the old one from it and delete and recreate any instances created prior to the silo default pool setting change. The gateway IP pool change will need to be done on all VPCs in the silo.
If the goal is to use the old IP pool only for certain special cases without recreating the instances or linking both IP pools to the default gateway, you can swap the default gateway IP pool attachment from the old pool to the new one, as outlined in the option above. For an existing instance to work, you can create a floating IP from the new pool and attach it to the instance. Another alternative is to create a custom gateway attached to the old IP pool, create a custom router with the custom gateway as target, and attach the router to the VPC subnet associated with the instance.

Severe packet loss on inter-VPC traffic

Traffic between VPCs, including traffic to the external IP address of an instance and the Oxide external API, will exit the rack and hairpin back in. Some routing platforms require additional configuration to allow this to work correctly.

For example, we have found that Cisco NX-OS devices have ip redirects enabled by default (this default setting may not be visible when the operator inspects the configuration). When ip redirects is enabled, these hairpinned packets are forwarded to the router’s CPU for additional processing (in this case the CPU performs additional lookups so it can generate an ICMP redirect message, but this message will not be useful in this deployment scenario). Cisco routers also have a CPU protection mechanism in place, called Control Plane Policing (CoPP). The built-in CoPP configuration (often called a "policy") will rate limit the number of packets per second that the CPU is allowed to process. In our testing on Cisco NX-OS, even when set to the most lenient CoPP policy, we observed more than 100K packets dropped per second while generating significant inter-VPC traffic.

Also, as of writing this document, Cisco NX-OS support documentation recommends proactively disabling ip redirects on all L3 interfaces.

If you have a Cisco NX-OS device and want to see if you are being impacted by this issue, you can perform the following steps:

In the Cisco management console, show copp policy will display which CoPP profile is enabled.

You can run show policy-map interface control-plane | egrep "class-map|dropped" to check if the policer is dropping traffic.

If a class-map is found to be dropping traffic, you can observe the rate of packet drops using:

$ watch interval 1 show policy-map interface control-plane class <class-map-name>

You can watch the various class-map counters while a user generates inter-VPC traffic. If you see the drop counters increase, we recommend configuring no ip redirects on the interfaces connected to the rack.

Instance boot / disk issues

Serial console stuck on UEFI Shell

Oxide’s hypervisor supports UEFI boot using the OVMF firmware. When the UEFI bootrom cannot find a bootable disk, it falls back to the shell to prompt for user intervention. Here are some possible causes:

the boot disk image is corrupted or created with an incorrect block size*
the boot disk supports only BIOS/legacy boot
there is no boot disk attached to the instance

To remediate the boot issue, check that the instance has the correct boot disk image specified.

* Use the block size specified in the Linux distro image details; if the information is unavailable, try with 512 bytes.

Serial console blank or showing errors

The web UI or CLI may fail to connect to the instance’s serial port when the control plane has lost the state of the instance, potentially due to an unexpected restart of the Propolis server backing the instance. There are other less common connectivity issues that may be rectified by stopping and starting the instance. (Note: Stop/start is different from instance reboot; the former involves recreating the Propolis server and network interfaces.)

If you see "device driver not found" errors on the serial console, it is likely that the boot disk image does not meet some of the prerequisites described in the image requirements. Contact Oxide Support if you need any assistance with getting your disk image to work.

If you notice any NVMe or disk I/O errors from the guest OS, the problems may be related to the storage volumes backing the instance. Look for any guest OS prompts for corrective actions (e.g., chkdsk, fsck) and follow those instructions. If there is none, you can try stopping and starting the instance to trigger a volume self-repair cycle. If the above actions do not resolve the issue, please contact Oxide Support for assistance.
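The stop/start cycle mentioned above (as opposed to a reboot) can be scripted. The following is a minimal sketch, not an official procedure. It assumes that the oxide CLI provides instance stop, instance view, and instance start subcommands taking --project and --instance, and that instance view prints JSON containing a run_state field; verify these against your CLI version. The function name, the 5-second poll interval, and the 60-attempt limit are arbitrary choices made for illustration.

```shell
# Stop an instance, wait until it reports "stopped", then start it again.
# Assumption: `oxide instance view` prints JSON with a "run_state" field.
stop_start_instance() {
    project="$1"
    instance="$2"

    oxide instance stop --project "$project" --instance "$instance" || return 1

    # Poll until the instance has fully stopped; starting earlier may fail.
    tries=0
    until oxide instance view --project "$project" --instance "$instance" \
            | grep -q '"run_state": *"stopped"'; do
        tries=$((tries + 1))
        if [ "$tries" -ge 60 ]; then
            echo "timed out waiting for $instance to stop" >&2
            return 1
        fi
        sleep 5
    done

    oxide instance start --project "$project" --instance "$instance"
}
```

For example, stop_start_instance my-project my-instance would cycle the instance named my-instance in the project my-project (both names are placeholders).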

Important

Please contact Oxide Support to report any abnormal instance behavior even if you are able to work around the issue with instance stop/start.

Disk rejected due to 20-char name match

The serial number of a disk presented to the guest OS uses the first 20 characters of the disk name. If the first 20 characters match across more than one disk being attached to a single instance, you will hit a serial number collision which will cause one of the matching disks to fail to be detected by that instance.

For example, if your instance had two disks named:

mydisk-test1234-data1
mydisk-test1234-data2

During boot, you will see an error similar to the following:

Duplicate cntlid 0 with nvme1, subsys nqn.2014.08.org.nvmexpress:01de01demydisk-test1234-data, rejecting

To resolve this issue, ensure your disk naming convention is unique within those first 20 characters.
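Before attaching several disks to one instance, you can check their names for this kind of collision. The sketch below is a hypothetical helper (find_serial_collisions is not part of any Oxide tooling) that prints every 20-character prefix shared by more than one of the names passed to it.

```shell
# Print each 20-character prefix that is shared by two or more disk names.
# No output means the names are safe to attach to the same instance.
find_serial_collisions() {
    printf '%s\n' "$@" | cut -c1-20 | sort | uniq -d
}
```

Running find_serial_collisions mydisk-test1234-data1 mydisk-test1234-data2 prints mydisk-test1234-data, the prefix the two example disks have in common.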

Instance performance issues

tsc treated as unreliable by guest

Due to a current limitation in Oxide’s hypervisor, guests may spuriously detect the tsc timing register as unreliable, causing them to switch to the acpi_pm clocksource. This will cause timestamp syscalls, such as clock_gettime(3) on Linux and QueryPerformanceCounter on Windows, to be vastly slower, which can have significant performance impacts.

On a Linux guest you may see events in dmesg along these lines:

$ sudo dmesg | grep -e tsc -e clocksource
[    1.267335] tsc: Marking TSC unstable due to clocksource watchdog
[    4.464911] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    4.466898] clocksource: Switched to clocksource acpi_pm

You can check the current clocksource on Linux with the following command:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

To mitigate this issue on Linux, you can add the kernel boot parameter tsc=reliable to force the guest to use tsc as its clocksource. This issue is being tracked in omicron#8001.
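To check for this condition from a script (for example, in a provisioning or health check), the clocksource file can be read and compared against tsc. This is a minimal sketch: the function name and message text are illustrative, and the optional path argument exists only so the function can be exercised against a test file.

```shell
# Report whether the guest is currently using the tsc clocksource.
# Returns 0 if it is, 1 if another clocksource is active, 2 on read failure.
check_clocksource() {
    path="${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}"
    current="$(cat "$path")" || return 2
    if [ "$current" = "tsc" ]; then
        echo "ok: clocksource is tsc"
    else
        echo "warning: clocksource is $current; consider booting with tsc=reliable"
        return 1
    fi
}
```

Calling check_clocksource with no arguments inspects the running guest.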

No samples captured by Linux `perf-record(1)`

The Oxide hypervisor exposes the CPU performance monitoring unit (PMU) to guest VMs, but does not currently support the interrupts used by Linux’s perf-record(1) tool to collect samples.

On VMs where the PMU is not available, perf(1) falls back to software events for recording. On Oxide, it recognizes that the PMU is available but does not realize that driving event sampling via the performance counter interrupt is not currently possible, and so does not fall back to software events.

To force perf-record(1) to use software events, execute it as:

sudo perf record -e task-clock

This issue is being tracked as propolis#894.

Note that perf-stat(1) is able to capture events from the PMU.

Disk/image import exceptions

Can’t delete disk after canceled image import

When you abort a disk import CLI command (e.g., with Ctrl-C) or cancel an image import on the web UI, it is possible that the incomplete import fails to fully unwind. If so, you may see a disk left in the importing_from_bulk_writes state and attempts to delete it will fail. The state is meant to prevent the disk from being put in use or deleted inadvertently.

To remove the disk, make two API calls in the order below:

import stop: change the disk state to import_ready
finalize: change the disk state to detached

Once the disk is detached, you will be able to delete it. Read the Disks and Snapshots guide to learn more about disk states.
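The two calls above can be issued through oxide api. The sketch below makes several assumptions that you should check against the API reference for your release: that the endpoints are /v1/disks/{disk}/bulk-write-stop and /v1/disks/{disk}/finalize, that the project is passed as a query parameter, and that finalize accepts an empty JSON object as its body. The function name is hypothetical.

```shell
# Move a stuck disk from importing_from_bulk_writes to detached.
# Assumption: endpoint paths and the empty finalize body match your release.
unwind_disk_import() {
    project="$1"
    disk="$2"

    # Step 1 (import stop): importing_from_bulk_writes -> import_ready
    oxide api "/v1/disks/${disk}/bulk-write-stop?project=${project}" -X POST \
        || return 1

    # Step 2 (finalize): import_ready -> detached
    printf '{}' | oxide api "/v1/disks/${disk}/finalize?project=${project}" \
        -X POST --input - || return 1
}
```

After unwind_disk_import my-project my-disk succeeds (placeholder names), the disk can be deleted as usual.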

Image import on web console stalls

If the issue happens in Firefox on macOS, it may be a known sporadic HTTP/2 problem in the web UI. You can work around it by using a different browser to import the image.

Oxide API/Console access issues

Oxide API/Console connection time-out

The problem may be caused by rack or network configurations, or a rack service outage. Users with the fleet administrator role and other necessary infrastructure access can follow the instructions here to troubleshoot the issue.

"Something went wrong" error on the web console

The error usually means that the user does not have access to any projects in the silo. This may be caused by identity provider configuration issues (e.g., incorrect group attribute mapping, missing group/role assignments in the identity provider system). You will need to contact the rack operator to get the access rectified.

If you are the operator, you can find the troubleshooting tips in the Operator FAQ.

"401 Unauthorized" error in API requests

API client requests are authenticated with credentials stored in the oxide configuration file on your workstation or in the session environment variables. Request authentication fails when your credentials cannot be found in either place or if the API token is not valid for the host specified.
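A quick first check is whether the session environment variables are set at all. The sketch below assumes the variables are named OXIDE_HOST and OXIDE_TOKEN; the text above does not name them, so confirm the names in the CLI documentation. The function name is hypothetical, and the function only tests for presence, not whether the token is valid for that host.

```shell
# Report which credential environment variables are missing.
# Assumption: the CLI reads OXIDE_HOST and OXIDE_TOKEN.
check_oxide_env() {
    status=0
    if [ -z "${OXIDE_HOST:-}" ]; then
        echo "OXIDE_HOST is not set"
        status=1
    fi
    if [ -z "${OXIDE_TOKEN:-}" ]; then
        echo "OXIDE_TOKEN is not set"
        status=1
    fi
    return "$status"
}
```

If both variables are unset, the credentials must come from the oxide configuration file instead.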

Replacing expired silo TLS certificates

Silo TLS certificates cannot be updated via the web console; this option is only available via the oxide CLI tool or the REST API. If a silo’s certificate has expired but you have already logged in via the CLI, you can pass the --insecure flag to oxide api:

oxide --insecure api /v1/certificates -X POST --input - <<EOF
{
    "description": "<CERT_DESCRIPTION>",
    "name": "<CERT_NAME>",
    "service": "external_api",
    "cert": $(jq -Rs . < cert.pem),
    "key": $(jq -Rs . < key.pem)
}
EOF

If you have not already logged in via the CLI, oxide auth login will fail due to the expired certificate. You can work around this by allowing your web browser to trust the expired certificate, logging in, and then using your browser developer tools to locate your session cookie. The cookie can then be used to authenticate against the REST API:

curl --insecure --cookie "session=<SESSION_COOKIE>" -H "Content-Type: application/json" -X POST https://siloname.sys.example.com/v1/certificates --data @- <<EOF
{
    "description": "<CERT_DESCRIPTION>",
    "name": "<CERT_NAME>",
    "service": "external_api",
    "cert": $(jq -Rs . < cert.pem),
    "key": $(jq -Rs . < key.pem)
}
EOF

Support bundles

When you are encountering a problem with the rack, Oxide Support may ask you to collect a support bundle. This will capture system and diagnostic information from the rack control plane and save it to a zip archive on internal rack storage.

If you have fleet Administrator access, you can trigger the creation of a new bundle with:

oxide bundle create

This will typically take several minutes to complete. You can view all support bundles present on the rack with:

oxide bundle list

A bundle in collecting state is still being created. When it is complete, it will transition to active state and you will be able to access it.

To download the bundle zip to your computer, use:

oxide bundle download --id $BUNDLE_ID --output $BUNDLE_FILE

Bundles will typically be several GiB, and you may have only one or two files you need to view. The Oxide CLI has a simple interface to view the contents of a bundle. This can be executed remotely against the rack with:

oxide bundle inspect --id $BUNDLE_ID

Or if you have already downloaded the bundle zip file, you can view that with:

oxide bundle inspect --path $BUNDLE_FILE

This will show a list of files in the bundle, and by pressing Enter you can view the contents of a given file. Note that the contents of the bundle archive are not stable and may be changed in future releases.

When you are confident that the support bundle is no longer needed, you can remove it from the rack with:

oxide bundle delete --bundle-id $BUNDLE_ID

When sending a bundle to Oxide Support, you will be asked to provide an SSH public key, and will be given the login information for an ephemeral host to upload the bundle to.

To upload the bundle via rsync(1) with username abc123 and host IP 10.0.0.10, you would run:

rsync -avh --progress my_support_bundle.zip abc123@10.0.0.10:
