Important Notes
This release comes with new features for multi-tenant management. As part of this change, silo creation now requires a set of resource quotas (vcpu, memory, storage) to be specified.
As part of the software update, all existing silos will have resource quotas configured to use all available fleet capacity initially. Fleet administrators can modify the quotas afterwards via the new `/v1/system/silos/{silo}/quotas` API. See the New Features section for more information about silo resource allocation.
A previous issue with disk deletion (omicron#3866) resulted in incomplete removal of backend data volumes and over-reported disk usage. The issue may manifest as 500 errors when a user attempts to delete a project that has no resources in it. Such partially removed disks will be un-deleted during the software update process. They are placed in the `faulted` state so that they can be cleanly deleted again. Please review and remove any faulted disks once the system has been updated.
The `disk_import_blocks_from_url` API endpoint for importing disk images from a remote URL is no longer supported. Please download image files to your local workstation and use the `disk_bulk_write` APIs to import them instead. (Note: The `oxide disk import` CLI command is not affected by this change.)
The Oxide CLI binaries (v0.2.0) have been updated for the new floating IP and resource quota endpoints.
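As an illustration of the quotas endpoint named in the notes above, the following Python sketch builds (but does not send) a quota update request. Only the path template `/v1/system/silos/{silo}/quotas` comes from these notes. The HTTP method (`PUT`), the bearer-token header, the body field names (`cpus`, `memory`, `storage`), and the host name are assumptions made for illustration; please check the API reference before relying on any of them.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def quotas_path(silo: str) -> str:
    """Fill the {silo} segment of the quotas path given in the release notes."""
    # Percent-encode the silo name so it stays a single path segment.
    return f"/v1/system/silos/{quote(silo, safe='')}/quotas"


def build_quota_update(base_url: str, silo: str, token: str, quotas: dict) -> Request:
    """Build a quota update request without sending it."""
    # ASSUMPTION: PUT with a JSON body and bearer-token authentication.
    body = json.dumps(quotas).encode("utf-8")
    return Request(
        base_url.rstrip("/") + quotas_path(silo),
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# ASSUMPTION: hypothetical host, token, and body field names.
example_quotas = {"cpus": 64, "memory": 256 * 2**30, "storage": 10 * 2**40}
req = build_quota_update("https://oxide.example.com", "engineering", "TOKEN", example_quotas)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches the rack.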
System Requirements
Please refer to v1.0.0 release notes.
Installation
Oxide Computer Model 0 must be installed and configured under the guidance of Oxide technicians. The requirement may change in future releases.
Upgrade Compatibility
Upgrades from versions 3 and 4 are both supported. We recommend shutting down all running instances on the rack before the software update commences.
All existing setup and data (e.g., projects, users, instances) should remain intact after the software update.
New Features
This release includes several new features for rack management and VM instance networking:
Silo resource allocation
Fleet administrators can now set limits on virtual resources (vcpu, memory, storage) usable by individual silos. Quotas are set during silo creation and can be modified afterwards to levels at or above the current utilization. Silo administrators can query the capacity and current utilization via API or CLI to independently manage the rack resources allocated to them.
Quotas are enforced when a new disk is provisioned and when an instance is started. If the action causes the silo’s aggregate resource usage to exceed its quotas, users will receive an `InsufficientCapacity` error. Please refer to the new Silo Management guide for details on usable capacity and utilization calculations as well as some possible exception scenarios.
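The two quota rules described above (a provisioning action fails if it would push aggregate usage over the quota, and a quota can only be changed to a level at or above current utilization) can be sketched in Python as follows. This is a simplified model for illustration, not the control plane's implementation: the `Resources` type, the function names, and the units are hypothetical, and the real usable-capacity calculations are described in the Silo Management guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical bundle of the three quota dimensions."""
    vcpus: int
    memory_bytes: int
    storage_bytes: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.vcpus + other.vcpus,
            self.memory_bytes + other.memory_bytes,
            self.storage_bytes + other.storage_bytes,
        )

    def fits_within(self, limit: "Resources") -> bool:
        # Every dimension must be at or below the limit.
        return (
            self.vcpus <= limit.vcpus
            and self.memory_bytes <= limit.memory_bytes
            and self.storage_bytes <= limit.storage_bytes
        )


class InsufficientCapacity(Exception):
    """Stand-in for the error named in the release notes."""


def check_provision(usage: Resources, request: Resources, quota: Resources) -> Resources:
    """Return the projected usage, or raise if it would exceed the quota."""
    projected = usage + request
    if not projected.fits_within(quota):
        raise InsufficientCapacity("request would exceed the silo quota")
    return projected


def check_quota_update(usage: Resources, new_quota: Resources) -> None:
    """A new quota must be at or above current utilization."""
    if not usage.fits_within(new_quota):
        raise ValueError("new quota is below current utilization")


GIB = 2**30
quota = Resources(vcpus=16, memory_bytes=64 * GIB, storage_bytes=1024 * GIB)
usage = Resources(vcpus=12, memory_bytes=32 * GIB, storage_bytes=100 * GIB)

# Starting a 4-vCPU instance lands exactly on the vCPU quota, which is allowed.
usage = check_provision(usage, Resources(4, 8 * GIB, 0), quota)
```

In this model a request that lands exactly on the quota succeeds, because only exceeding the quota is treated as an error.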
Marking sled non-provisionable
This new API allows operators to temporarily exclude a sled from new workload placement. The action may be required when diagnosing and mitigating the impact of unexpected sled issues (e.g., unresponsive sleds, unscheduled reboots). This operator API is a precursor for the sled maintenance and replacement feature set.
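The effect of the flag on placement can be pictured with a small Python sketch. It is a toy model only: the `Sled` type, its fields, and the most-free-vCPUs selection rule are hypothetical and do not describe the control plane's actual placement logic. The point it illustrates is that a non-provisionable sled is skipped for new workloads.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sled:
    """Hypothetical sled record for illustration."""
    serial: str
    free_vcpus: int
    provisionable: bool = True


def pick_sled(sleds: Iterable[Sled], vcpus_needed: int) -> Optional[Sled]:
    """Pick a sled for a new instance, skipping non-provisionable sleds."""
    candidates = [
        s for s in sleds
        if s.provisionable and s.free_vcpus >= vcpus_needed
    ]
    # ASSUMPTION: prefer the sled with the most free vCPUs; None if nothing fits.
    return max(candidates, key=lambda s: s.free_vcpus, default=None)


sleds = [
    Sled("sled-a", free_vcpus=48),
    Sled("sled-b", free_vcpus=96, provisionable=False),  # under investigation
    Sled("sled-c", free_vcpus=16),
]
chosen = pick_sled(sleds, vcpus_needed=8)
print(chosen.serial if chosen else "no sled available")
```

Here `sled-b` has the most free capacity but is never chosen while it is marked non-provisionable.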
Floating IP address
Floating IPs are permanent, project-scoped resources which bind an individual IP address from a given IP Pool. They allow for well-known addresses to be allocated (explicitly or automatically) and assigned to target instances, making it easier to host services from a consistent address. Floating IPs are allocated or de-allocated only when instances are created or destroyed at this time. They can also be used along with an ephemeral IP so that an instance can be accessed on more than one external IP address. Please refer to the user guides (Configuring Guest Networking and Managing Floating IPs) for more information.
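The difference between explicit and automatic allocation from an IP pool can be sketched in Python with the standard `ipaddress` module. This is a minimal model under stated assumptions: the `IpPool` class, its single contiguous range, and the first-free selection rule are hypothetical and are not taken from the product.

```python
import ipaddress
from typing import Optional


class IpPool:
    """Hypothetical pool covering one inclusive address range."""

    def __init__(self, first: str, last: str) -> None:
        self.first = ipaddress.ip_address(first)
        self.last = ipaddress.ip_address(last)
        self.allocated = set()

    def allocate(self, requested: Optional[str] = None):
        """Allocate the requested address, or the first free one if none is given."""
        if requested is not None:
            # Explicit allocation: the caller names a well-known address.
            addr = ipaddress.ip_address(requested)
            if not (self.first <= addr <= self.last):
                raise ValueError(f"{addr} is outside the pool")
            if addr in self.allocated:
                raise ValueError(f"{addr} is already allocated")
        else:
            # Automatic allocation: scan the range for the first free address.
            addr = None
            for n in range(int(self.first), int(self.last) + 1):
                candidate = type(self.first)(n)
                if candidate not in self.allocated:
                    addr = candidate
                    break
            if addr is None:
                raise ValueError("pool is exhausted")
        self.allocated.add(addr)
        return addr

    def release(self, addr) -> None:
        """Return an address to the pool."""
        self.allocated.discard(ipaddress.ip_address(str(addr)))


pool = IpPool("203.0.113.10", "203.0.113.12")  # documentation address range
auto_ip = pool.allocate()                      # automatic
fixed_ip = pool.allocate("203.0.113.12")       # explicit
print(auto_ip, fixed_ip)
```

Because the address stays reserved until it is released, it can be handed to a replacement instance without the service address changing.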
Bug fixes:
Security fix: CVE-2023-50913, an SSRF in Oxide software that could allow an attacker to access the ClickHouse metrics datastore.
Spurious errors were returned for snapshot or disk deletions after multiple delete requests on the same snapshot (omicron#3866, omicron-PR#4547)
Disk create or instance start requests under high concurrency failed to complete (omicron#3304)
Instances sometimes failed to boot up when they were created under very high concurrency (propolis#535)
IP address could not be left blank when adding a NIC to an instance (console#1438)
Image sizes were not available on image picker (console#1824)
Project picker showed only the first 20 projects (console#1817)
Disk snapshot action did not provide any UI feedback (console#1815)
BGP configuration was not applied to switches after upgrade (omicron#4474)
BGP failed to handle `ConnectRetryTimerExpires` in Active state (maghemite#93)
Link parameters from rack setup were not persisted in the control plane datastore (omicron#4470)
Link config API did not allow for setting link autonegotiation (omicron#4458)
RoT on production Rev E gimlet incorrectly prohibited software update (omicron#4420)
Reliability improvements:
Support for non-power-of-2 multipath route selection (dendrite-PR#685)
Background tasks to populate NAT entries of sleds and instances during normal and unexpected restarts (omicron#3631)
Better handling of racing VM suspend conditions (propolis#559, propolis#561)
Better handling of project resource usage mismatch conditions (omicron#4426)
Storage backend reliability and performance improvements (crucible-PR#1014, crucible#1038, crucible#1021, crucible-PR#991, crucible-PR#1019, crucible-PR#1047)
Firmware update:
None in this release
Known Behavior and Limitations
End-user features
Feature Area | Known Issue/Limitation | Issue Number |
---|---|---|
Image/snapshot management | Disks in | |
Image/snapshot management | Image upload sometimes stalls with HTTP/2 on Firefox. | |
Image/snapshot management | Unable to create snapshots for disks attached to stopped instances. As a workaround, users can detach a disk temporarily for snapshotting and re-attach it to the instance afterwards. | |
Image/snapshot management | The ability to modify image metadata is not available at this time. | |
Instance orchestration | The ability to select which SSH keys are passed to a new instance is not available at this time. | |
Instance orchestration | Instance or disk provisioning requests may fail due to unhandled sled or storage failure on rare occasions. Users can retry the requests to work around the failures. | |
Instance orchestration | Disk volume backend repair may fail to complete under heavy large-write workloads, preventing instances from starting or stopping. | |
Instance orchestration | Instances no longer transition to the failed state when the propolis zone has crashed or is gone. | |
Telemetry | Guest VM vcpu and memory metrics are unavailable at this time. | - |
VPC and routing | Inter-subnet traffic routing is not available by default. Router and routing rules will be supported in future releases. | |
Operator features
Feature Area | Known Issue/Limitation | Issue Number |
---|---|---|
Access control | Device tokens do not expire. | |
Control plane | Sled and physical storage availability status are not available in the inventory UI and API yet. | |
Control plane | When sleds attached to the switches are restarted outside of rack cold-start, a full rack power cycle may be required to re-propagate sled NAT configurations. | |
Control plane | Operator-driven software update is currently unavailable. All updates need to be performed by Oxide technicians. | - |
Control plane | Operator-driven instance migration across sleds is currently unavailable. Instance migrations need to be performed by Oxide technicians. | - |
Network management | End users cannot query the names of non-default IP pools. The information needs to be provided by the administrators manually at this time. | |
Telemetry | Hardware metrics such as temperatures, fan speeds, and power consumption are not exposed to the control plane at this time. | - |
User management | User offboarding from the rack is not supported at this time. Apart from updating the identity provider to remove obsolete users from the relevant groups, operators will need to remove any IAM roles granted directly to those users in silos and projects. | |