Important Notes

  1. This release comes with new features for multi-tenant management. As part of this change, silo creation now requires a set of resource quotas (vcpu, memory, storage) to be specified.

    1. As part of the software update, all existing silos will have resource quotas configured to use all available fleet capacity initially. Fleet administrators can modify the quotas afterwards via the new /v1/system/silos/{silo}/quotas API.

    2. See the New Features section for more information about silo resource allocation.

  2. A previous issue with disk deletion (omicron#3866) resulted in incomplete removal of backend data volumes and over-reported disk usage. The issue may manifest as 500 errors when a user attempts to delete a project that has no resources in it. Such partially removed disks will be un-deleted during the software update process and marked as faulted so that they can be cleanly deleted again. Please review and remove any faulted disks once the system has been updated.

  3. The disk_import_blocks_from_url API endpoint for importing disk images from a remote URL is no longer supported. Please download image files to your local workstation and use the disk_bulk_write APIs to import them instead. (Note: The oxide disk import CLI command is not affected by this change.)

  4. The Oxide CLI binaries (v0.2.0) have been updated for the new floating IP and resource quota endpoints.
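
For item 3 above, the following is a minimal sketch of how a locally downloaded image file might be split into chunks for the disk_bulk_write APIs. The chunk size, the function name bulk_write_chunks, and the request body field names (offset, base64_encoded_data) are assumptions made for illustration; consult the API reference for the actual request schema and size limits. The sketch only prepares request bodies and does not send them.

```python
import base64
from typing import Dict, Iterator, Union

# Assumed chunk size for illustration; the real API limit may differ.
CHUNK_SIZE = 512 * 1024


def bulk_write_chunks(
    data: bytes, chunk_size: int = CHUNK_SIZE
) -> Iterator[Dict[str, Union[int, str]]]:
    """Yield one hypothetical bulk-write request body per chunk of image data.

    Each body carries the byte offset of the chunk within the disk and the
    chunk contents encoded as base64 text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield {
            "offset": offset,
            "base64_encoded_data": base64.b64encode(chunk).decode("ascii"),
        }
```

Each yielded body would then be submitted in order while the disk is in the importing_from_bulk_writes state. The oxide disk import CLI command handles this workflow on the user's behalf.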

Installation

Oxide Computer Model 0 must be installed and configured under the guidance of Oxide technicians. This requirement may change in future releases.

Upgrade Compatibility

Upgrades from versions 3 and 4 are both supported. We recommend shutting down all running instances on the rack before the software update commences.

All existing setup and data (e.g. projects, users, instances) should remain intact after the software update.

New Features

This release includes several new features for rack management and VM instance networking:

Silo resource allocation

Fleet administrators can now set limits on virtual resources (vcpu, memory, storage) usable by individual silos. Quotas are set during silo creation and can be modified afterwards to levels at or above the current utilization. Silo administrators can query the capacity and current utilization via API or CLI to independently manage the rack resources allocated to them.

Quotas are enforced when a new disk is provisioned and when an instance is started. If the action causes the silo’s aggregate resource usage to exceed its quotas, users will receive an InsufficientCapacity error. Please refer to the new Silo Management guide for details on usable capacity and utilization calculations as well as some possible exception scenarios.
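
The enforcement rule described above can be sketched as follows. This is an illustrative model only, not the control plane implementation: the Resources type, its field names, and the helper functions are hypothetical, and the actual usable-capacity calculations are covered in the Silo Management guide.

```python
from dataclasses import dataclass


class InsufficientCapacity(Exception):
    """Raised when a request would push silo usage above its quota."""


@dataclass(frozen=True)
class Resources:
    # Hypothetical units chosen for illustration.
    vcpus: int
    memory_gib: int
    storage_gib: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.vcpus + other.vcpus,
            self.memory_gib + other.memory_gib,
            self.storage_gib + other.storage_gib,
        )

    def fits_within(self, limit: "Resources") -> bool:
        """True if every dimension is at or below the given limit."""
        return (
            self.vcpus <= limit.vcpus
            and self.memory_gib <= limit.memory_gib
            and self.storage_gib <= limit.storage_gib
        )


def check_quota(quota: Resources, usage: Resources, request: Resources) -> Resources:
    """Return the projected usage, or raise if it would exceed the quota."""
    projected = usage + request
    if not projected.fits_within(quota):
        raise InsufficientCapacity(f"{projected} exceeds {quota}")
    return projected


def quota_update_allowed(new_quota: Resources, usage: Resources) -> bool:
    """Quotas may only be set at or above the current utilization."""
    return usage.fits_within(new_quota)
```

In this model, a silo with a quota of 8 vCPUs that already uses 6 can start a 2-vCPU instance but not a 4-vCPU one, and its quota cannot be lowered below 6 vCPUs.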

Marking sled non-provisionable

This new API allows operators to temporarily exclude a sled from new workload placement. The action may be required when diagnosing and mitigating the impact of unexpected sled issues (e.g. unresponsive sleds, unscheduled reboots). This operator API is a precursor for the sled maintenance and replacement feature set.
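
As a rough illustration of the effect on workload placement, the sketch below filters non-provisionable sleds out of the candidate list for new instances. The Sled type, its fields, and the placement_candidates function are hypothetical names, not the control plane's scheduler; the sketch assumes that workloads already running on an excluded sled are left untouched, since the API only concerns new placement.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sled:
    # Hypothetical model of a sled for illustration only.
    serial: str
    provisionable: bool = True
    instances: List[str] = field(default_factory=list)


def placement_candidates(sleds: List[Sled]) -> List[Sled]:
    """Return the sleds that may receive new workloads."""
    return [sled for sled in sleds if sled.provisionable]


def mark_non_provisionable(sled: Sled) -> None:
    """Exclude a sled from new placement; existing instances are not moved."""
    sled.provisionable = False
```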

Floating IP address

Floating IPs are permanent, project-scoped resources which bind an individual IP address from a given IP Pool. They allow for well-known addresses to be allocated (explicitly or automatically) and assigned to target instances, making it easier to host services from a consistent address. Floating IPs are allocated or de-allocated only when instances are created or destroyed at this time. They can also be used along with an ephemeral IP so that an instance can be accessed on more than one external IP address. Please refer to the user guides (Configuring Guest Networking and Managing Floating IPs) for more information.
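
The difference between explicit and automatic allocation can be sketched with a toy IP pool. The IpPool class and its methods are hypothetical and only model the selection of an address from a range; they do not reflect the actual API or allocation order used by the product.

```python
import ipaddress
from typing import Optional


class IpPool:
    """Toy model of an IP pool covering one inclusive address range."""

    def __init__(self, first: str, last: str) -> None:
        self.first = ipaddress.ip_address(first)
        self.last = ipaddress.ip_address(last)
        self.allocated = set()

    def allocate(self, address: Optional[str] = None) -> str:
        """Allocate a specific address, or the lowest free one if none is given."""
        if address is not None:
            ip = ipaddress.ip_address(address)
            if not (self.first <= ip <= self.last):
                raise ValueError(f"{ip} is outside the pool range")
            if ip in self.allocated:
                raise ValueError(f"{ip} is already allocated")
        else:
            ip = None
            for n in range(int(self.first), int(self.last) + 1):
                # Keep the same address family as the pool range.
                candidate = type(self.first)(n)
                if candidate not in self.allocated:
                    ip = candidate
                    break
            if ip is None:
                raise ValueError("pool is exhausted")
        self.allocated.add(ip)
        return str(ip)

    def release(self, address: str) -> None:
        """Return an address to the pool."""
        self.allocated.discard(ipaddress.ip_address(address))
```

A caller that needs a well-known address passes it explicitly; otherwise the pool picks a free one.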

Bug fixes:

  • Security fix: CVE-2023-50913 SSRF in Oxide software that could allow an attacker to access the ClickHouse metrics datastore.

  • Spurious errors were returned for snapshot or disk deletions after multiple delete requests on the same snapshot (omicron#3866, omicron-PR#4547)

  • Disk create or instance start requests under high concurrency failed to complete (omicron#3304)

  • Instances sometimes failed to boot when they were created under very high concurrency (propolis#535)

  • IP address could not be left blank when adding NIC to instance (console#1438)

  • Image sizes were not available on image picker (console#1824)

  • Project picker showed only the first 20 projects (console#1817)

  • Disk snapshot action did not provide any UI feedback (console#1815)

  • BGP configuration was not applied to switches after upgrade (omicron#4474)

  • BGP failed to handle ConnectRetryTimerExpires in Active state (maghemite#93)

  • Link parameters from rack setup were not persisted in the control plane datastore (omicron#4470)

  • Link config API did not allow for setting link autonegotiation (omicron#4458)

  • RoT on production Rev E gimlet incorrectly prohibited software update (omicron#4420)

  • Reliability improvements:

    • Support for non-power-of-2 multipath route selection (dendrite-PR#685)

    • Background tasks to populate NAT entries of sleds and instances during normal and unexpected restarts (omicron#3631)

    • Better handling of racing VM suspend conditions (propolis#559, propolis#561)

    • Better handling of project resource usage mismatch conditions (omicron#4426)

  • Storage backend reliability and performance improvements (crucible-PR#1014, crucible#1038, crucible#1021, crucible-PR#991, crucible-PR#1019, crucible-PR#1047)

Firmware update:

  • None in this release

Known Behavior and Limitations

End-user features

| Feature Area | Known Issue/Limitation | Issue Number |
|---|---|---|
| Image/snapshot management | Disks in importing_from_bulk_writes state cannot be deleted directly. The current procedures to unstick a canceled disk import are not obvious to CLI users. | omicron#2987 |
| Image/snapshot management | Image upload sometimes stalls with HTTP/2 on Firefox. | omicron#3559 |
| Image/snapshot management | Unable to create snapshots for disks attached to stopped instances. As a workaround, users can detach a disk temporarily for snapshotting and re-attach it to the instance afterwards. | omicron#3289 |
| Image/snapshot management | The ability to modify image metadata is not available at this time. | omicron#2800 |
| Instance orchestration | The ability to select which SSH keys are passed to a new instance is not available at this time. | omicron#3056 |
| Instance orchestration | Instance or disk provisioning requests may fail due to unhandled sled or storage failure on rare occasions. Users can retry the requests to work around the failures. | omicron#3480, omicron#2483 |
| Instance orchestration | Disk volume backend repair may fail to complete under heavy large-write workloads, preventing instances from starting or stopping. | crucible#837 |
| Instance orchestration | Instances no longer transition to failed state when the propolis zone has crashed or is gone. | omicron#4709 |
| Telemetry | Guest VM vcpu and memory metrics are unavailable at this time. | - |
| VPC and routing | Inter-subnet traffic routing is not available by default. Router and routing rules will be supported in future releases. | omicron#2232 |

Operator features

| Feature Area | Known Issue/Limitation | Issue Number |
|---|---|---|
| Access control | Device tokens do not expire. | omicron#2302 |
| Control plane | Sled and physical storage availability status is not yet available in the inventory UI and API. | omicron#2035 |
| Control plane | When sleds attached to the switches are restarted outside of rack cold-start, a full rack power cycle may be required to re-propagate sled NAT configurations. | omicron#3631 |
| Control plane | Operator-driven software update is currently unavailable. All updates need to be performed by Oxide technicians. | - |
| Control plane | Operator-driven instance migration across sleds is currently unavailable. Instance migrations need to be performed by Oxide technicians. | - |
| Network management | End users cannot query the names of non-default IP pools. The information needs to be provided by the administrators manually at this time. | omicron#2148 |
| Telemetry | Hardware metrics such as temperatures, fan speeds, and power consumption are not exposed to the control plane at this time. | - |
| User management | User offboarding from the rack is not supported at this time. Apart from updating the identity provider to remove obsolete users from the relevant groups, operators will need to remove any IAM roles granted directly to those users in silos and projects. | omicron#2587 |