Instance connectivity issues
Can’t reach instance at external IP address
The following situations may cause an instance to be unreachable at its external IP address:
Instance not running: The instance state may have changed or failed to stay up because of certain system issues. The first thing to check is whether the instance state is running. You can also check the serial console for errors during boot.
Access denied by VPC firewall rules: By default, VPCs are configured to allow inbound ICMP and TCP port 22 traffic from any external host. If the firewall rules have been changed to limit access, that could cause the instance to be unresponsive to ping or ssh at its external IP address. If you are accessing the instance using other ports or protocols, verify that the firewall rules allow that traffic.
Guest OS or application errors: Certain application processes may have exited and no longer serve external requests. Check the instance’s serial console for errors.
If the instance in question is newly created, you can check if any of the following conditions apply:
Instance failed to boot: This can happen when a boot disk is not found or the disk image is missing some bootloader dependencies. You can check the bootloader messages through the instance’s serial console. Based on the information from the serial console, you can troubleshoot the booting issue following the troubleshooting tips below.
No network interface: Instances with external IP addresses must also have a VPC network interface regardless of the need to communicate with other instances on the private network. If an instance is created without a network interface, it will be unable to send or receive external traffic.
sshd not running or SSH key missing: If you can connect to the instance’s external IP on port 22, but cannot SSH to it, the sshd process may not be running on the instance. The other thing you may want to check is that you are using a private key matching one of the public keys in your user profile and a valid login user. Note that disk images from Linux distros usually have the sshd service enabled by default and the root user disabled.
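The checks above can be walked through from a client machine with a small script. This is a minimal sketch, not an Oxide-provided tool: the function names `diagnose` and `probe` are hypothetical, and it assumes `ping`, an `nc` that supports `-z` and `-w`, and an OpenSSH client recent enough to accept `StrictHostKeyChecking=accept-new`.

```shell
#!/bin/sh
# Hypothetical helper: map probe results to the likely causes listed above.
# Each argument is an exit status (0 = success) for ping, TCP port 22, and SSH login.
diagnose() {
    ping_rc=$1; port_rc=$2; ssh_rc=$3
    if [ "$ssh_rc" -eq 0 ]; then
        echo "reachable"
    elif [ "$port_rc" -eq 0 ]; then
        echo "port 22 open but login failed: check sshd, SSH key, and login user"
    elif [ "$ping_rc" -eq 0 ]; then
        echo "ping ok but port 22 closed: check firewall rules and sshd"
    else
        echo "no response: check instance state, network interface, and firewall rules"
    fi
}

# Hypothetical helper: run the three probes against an external IP, then diagnose.
probe() {
    ip=$1; user=$2
    ping -c 3 "$ip" >/dev/null 2>&1; ping_rc=$?
    nc -z -w 5 "$ip" 22 >/dev/null 2>&1; port_rc=$?
    # BatchMode avoids password prompts; accept-new avoids the first-connection host key prompt.
    ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=accept-new \
        "$user@$ip" true >/dev/null 2>&1; ssh_rc=$?
    diagnose "$ping_rc" "$port_rc" "$ssh_rc"
}
```

For example, `probe <EXTERNAL_IP> <LOGIN_USER>` prints a single line naming the area to look at next. The result is only a hint: a firewall rule that blocks ICMP but allows TCP port 22, for instance, falls under the second or first branch rather than the third.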
Can’t reach instance at private IP address
Inter-VM traffic within or between VPC subnets is governed by firewall and routing rules. In addition to the causes mentioned above, other issues specific to private IP addresses are:
Access denied by VPC firewall rules: By default, VPCs are configured to allow all inter-VM traffic for instances within the same subnet. If the firewall rules have been changed to limit access, they may cause the instance to be unreachable from other VMs on the private IP. Check that you have the firewall rules configured to allow inbound traffic on the related protocols and ports.
Inter-subnet routing misconfigured: Subnets within the same VPC have routes to each other by default. Custom routes configured to drop or redirect traffic will override the default system routes if they affect the same destination IP address. Review the custom routers attached to the subnets involved and see if they have the correct route targets and destinations specified.
Inter-VPC subnet routing: Subnet routing across VPCs is not supported at this time. Traffic between instances in different VPCs, whether they are within the same project or different projects, can only go through their external IP addresses.
No outbound internet connectivity
Instances get their outbound access through internet gateways. Outbound access requires an instance’s external IPs (typically allocated from the default pool) to come from a pool attached to the corresponding VPC’s internet gateway (typically the system-created default gateway, which is linked to the default pool during VPC creation). A mismatch is possible whenever the silo’s default IP pool changes: previously created internet gateways remain linked to the original default pool, while new instances get their external IPs from the new default pool.
To rectify the connectivity issue, you can follow one of the solutions below:
If the goal is to keep using both IP pools, you can link the new default pool to the default gateway via the attach IP pool to internet gateway API or its equivalent CLI command. This has to be done for each of the VPCs in all the projects of the silo.
If the goal is to totally replace the use of the old pool with the new one, besides attaching the new pool to the default gateway, you will also need to detach the old one from it and delete and recreate any instances created prior to the silo default pool setting change. The gateway IP pool change will need to be done on all VPCs in the silo.
If the goal is to use the old IP pool only for certain special cases without recreating the instances or linking both IP pools to the default gateway, you can swap the default gateway IP pool attachment from the old pool to the new one, as outlined in the option above. For an existing instance to work, you can create a floating IP from the new pool and attach it to the instance. Another alternative is to create a custom gateway attached to the old IP pool, create a custom router with the custom gateway as target, and attach the router to the VPC subnet associated with the instance.
Instance boot issues
Serial console stuck on UEFI Shell
Oxide’s hypervisor supports UEFI boot using the OVMF firmware. When the UEFI bootrom cannot find a bootable disk, it falls back to the shell to prompt for user intervention. Here are some possible causes:
the boot disk image is corrupted or created with an incorrect block size*
the boot disk supports only BIOS/legacy boot
there is no boot disk attached to the instance
To remediate the boot issue, check that the instance has the correct boot disk image specified.
* Use the block size specified in the Linux distro image details; if the information is unavailable, try with 512 bytes.
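One way to check the "BIOS/legacy only" cause before uploading an image is to look for an EFI partition in its partition table. The sketch below is an assumption-laden illustration: `has_efi_partition` is a hypothetical helper, it expects a partition listing such as the output of `fdisk -l` on a raw image, and it only matches the text "EFI" in that listing, so it cannot tell whether the partition actually contains a working bootloader.

```shell
#!/bin/sh
# Hypothetical helper: read a partition listing on stdin (for example the
# output of `fdisk -l disk.img`) and report whether any line mentions EFI.
has_efi_partition() {
    if grep -qi 'EFI'; then
        echo "EFI partition found"
    else
        echo "no EFI partition found: image may be BIOS/legacy only"
        return 1
    fi
}
```

Usage, with a hypothetical image path: `fdisk -l disk.img | has_efi_partition`. Images in other formats (for example qcow2) would need to be converted to raw first, and an image whose file name contains "efi" would produce a false match because `fdisk` prints the path in its header.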
Serial console blank or showing errors
The web UI or CLI may fail to connect to the instance’s serial port when the control plane has lost the state of the instance, potentially due to an unexpected restart of the Propolis server backing the instance. There are other less common connectivity issues that may be rectified by stopping and starting the instance. (Note: Stop/start is different from instance reboot; the former involves recreating the Propolis server and network interfaces).
If you see "device driver not found" errors on the serial console, it is likely that the boot disk image does not meet some of the prerequisites described in the image requirements. Contact Oxide Support if you need any assistance with getting your disk image to work.
If you notice any NVMe or disk I/O errors from the guest OS, the problems may be related to the storage volumes backing the instance. Look for any guest OS prompts for corrective actions (e.g., chkdsk, fsck) and follow those instructions. If there are none, you can try stopping and starting the instance to trigger a volume self-repair cycle. If the above actions do not resolve the issue, please contact Oxide Support for assistance.
Instance performance issues
tsc treated as unreliable by guest
Due to a current limitation in Oxide’s hypervisor, guests may spuriously detect the tsc timing register as unreliable, causing them to switch to the acpi_pm clocksource.
This will cause timestamp syscalls, such as clock_gettime(3) on Linux and QueryPerformanceCounter on Windows, to be vastly slower, which can have significant performance impacts.
On a Linux guest you may see events in dmesg along these lines:
$ sudo dmesg | grep -e tsc -e clocksource
[ 1.267335] tsc: Marking TSC unstable due to clocksource watchdog
[ 4.464911] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 4.466898] clocksource: Switched to clocksource acpi_pm
You can check the current clocksource on Linux with the following command:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
To mitigate this issue on Linux, you can add a kernel boot parameter (see https://www.kernel.org/doc/html/v6.11/admin-guide/kernel-parameters.html) to force the guest to use tsc as its clocksource.
This is being tracked as propolis#328.
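The clocksource check can be scripted so it is easy to run across several guests. This is a minimal sketch: `check_clocksource` is a hypothetical helper, and the sysfs directory is a parameter only so the function can be pointed at a test fixture. The text above does not name the boot parameter; `clocksource=tsc`, mentioned in the comment, is one parameter from the kernel documentation that overrides the default clocksource and is shown here as an assumption, not as Oxide's recommended setting.

```shell
#!/bin/sh
# Hypothetical helper: report whether the guest is currently using tsc.
# $1 (optional) is the clocksource sysfs directory.
check_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    current=$(cat "$dir/current_clocksource") || return 2
    if [ "$current" = "tsc" ]; then
        echo "ok: clocksource is tsc"
    else
        # Assumption: adding clocksource=tsc to the kernel command line
        # (for example via GRUB_CMDLINE_LINUX on GRUB-based distros),
        # regenerating the bootloader config, and rebooting forces tsc.
        echo "warning: clocksource is $current"
        return 1
    fi
}
```

Running `check_clocksource` with no arguments inside the guest reads the live value. The neighbouring `available_clocksource` file in the same directory lists the clocksources the kernel is willing to use.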
Disk/image import exceptions
Can’t delete disk after canceled image import
When you abort a disk import CLI command (e.g., with Ctrl-C) or cancel an image import on the web UI, it is possible that the incomplete import fails to fully unwind. If so, you may see a disk left in the importing_from_bulk_writes state and attempts to delete it will fail. The state is meant to prevent the disk from being put in use or deleted inadvertently.
To remove the disk, make two API calls in the order below:
import stop: change the disk state to import_ready
finalize: change the disk state to detached
Once the disk is detached, you will be able to delete it. Read the Disks and Snapshots guide to learn more about disk states.
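The two calls can be issued with `oxide api`. The sketch below makes several assumptions that should be checked against the API reference before use: the endpoint paths `/v1/disks/{disk}/bulk-write-stop` and `/v1/disks/{disk}/finalize`, the `project` query parameter, and the empty JSON body for finalize are not taken from this guide. The function name `unstick_disk` is hypothetical, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set RUN="" in the environment to actually invoke the oxide CLI.
RUN=${RUN-echo}

# Hypothetical helper: move a stuck disk to import_ready, then to detached.
# Endpoint paths and the empty finalize body are assumptions (see lead-in).
unstick_disk() {
    disk=$1; project=$2
    # Step 1: import stop (importing_from_bulk_writes -> import_ready)
    $RUN oxide api "/v1/disks/$disk/bulk-write-stop?project=$project" -X POST &&
    # Step 2: finalize (import_ready -> detached)
    printf '{}\n' | $RUN oxide api "/v1/disks/$disk/finalize?project=$project" -X POST --input -
}
```

Calling `unstick_disk <DISK_NAME> <PROJECT_NAME>` prints the two commands in order so they can be reviewed first. The `&&` ensures finalize is attempted only if the import stop call succeeds.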
Image import on web console stalls
If the issue happens in Firefox on macOS, it may be a known sporadic HTTP/2 problem in the web UI. You can work around it by using a different browser to import the image.
Replacing expired silo TLS certificates
Silo TLS certificates cannot be updated via the web console; this option is only available via the oxide CLI tool or the REST API.
If a silo’s certificate has expired but you have already logged in via the CLI, you can pass the --insecure flag to oxide api:
oxide --insecure api /v1/certificates -X POST --input - <<EOF
{
"description": "<CERT_DESCRIPTION>",
"name": "<CERT_NAME>",
"service": "external_api",
"cert": $(jq -Rs . < cert.pem),
"key": $(jq -Rs . < key.pem)
}
EOF
If you have not already logged in to the CLI, oxide auth login will fail due to the expired certificate. You can work around this by allowing your web browser to trust the expired certificate, logging in, then using your browser developer tools to locate your session cookie. The cookie can then be used to authenticate against the REST API:
curl --insecure --cookie "session=<SESSION_COOKIE>" -H "Content-Type: application/json" -X POST https://siloname.sys.example.com/v1/certificates --data-binary @- <<EOF
{
"description": "<CERT_DESCRIPTION>",
"name": "<CERT_NAME>",
"service": "external_api",
"cert": $(jq -Rs . < cert.pem),
"key": $(jq -Rs . < key.pem)
}
EOF
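Before uploading a replacement, it can help to confirm that the new certificate file is itself still valid. This is a minimal sketch using the standard `openssl x509` options `-enddate` and `-checkend`; the function name `check_cert_validity` is hypothetical and it assumes a PEM-encoded certificate such as the `cert.pem` used above.

```shell
#!/bin/sh
# Hypothetical helper: print the certificate's notAfter date and report
# whether it has already expired. Returns 0 if valid, 1 if expired,
# 2 if the file cannot be parsed.
check_cert_validity() {
    cert=$1
    openssl x509 -in "$cert" -noout -enddate || return 2
    # -checkend 0 exits non-zero when the certificate is already past notAfter.
    if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
        echo "certificate is still valid"
    else
        echo "certificate has expired"
        return 1
    fi
}
```

For example, `check_cert_validity cert.pem` prints the expiry date followed by a one-line verdict. Passing a larger value to `-checkend` (in seconds) would instead flag certificates that are about to expire.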
Oxide API/Console connection time-out
The problem may be caused by rack or network configurations, or a rack service outage. Users with the fleet administrator role and other necessary infrastructure access can follow the instructions here to troubleshoot the issue.