Control Plane

Components

The control plane of the Oxide Rack includes the following components:

  • Nexus: This core component of the control plane is a monolithic program that hosts all user-facing APIs (both operator and customer) and also provides APIs internal to the rack for other components to report faults, generate alerts, and so on. Nexus is also responsible for background control plane activity, including utilization management, server failure detection and recovery, and the like.

  • sled-agent: The sled agent runs on each server (including those functioning as switches) to create, update, and destroy instances, storage resources, and networking resources. It is also responsible for inventorying hardware, reporting faults and other events to Nexus, and managing software updates on the server.

  • bootstrap-agent: The bootstrap agent runs on each server (including those functioning as switches) during rack and server initialization to establish rack-level trust, unlock storage, launch the sled agent, and configure new devices.

  • management-gateway-service: MGS is a bridge between service processors (SPs) and the rest of the control plane. It runs on each of the servers adjacent to a Sidecar switch, forwarding requests from Nexus onto the management network.

  • dendrite: Dendrite runs on each of the two servers adjacent to a Sidecar switch. It is the driver for managing data plane code, including programs developed in P4, on a Tofino ASIC.

  • cockroachdb: The control plane data storage system is a distributed CockroachDB database using consensus to provide strong consistency and high availability.

  • clickhouse: The metric data store, based on ClickHouse, holds telemetry data collected from components in the rack.

  • oximeter: The telemetry service collects metric data from the other components and stores it in ClickHouse. More details can be found in the Telemetry section below.

  • ntp: The NTP service synchronizes the clocks of all servers in the rack with an external NTP server.

  • internal-dns: The internal DNS supports service discovery and dynamic service endpoint lookup among the control plane components.

  • external-dns: The external DNS integrates with the customer's on-premises DNS and provides name service for the rack’s Console and API endpoints. The name service will be extended to users’ virtual machine instances in future product releases.

  • crucible-pantry: The pantry service operates on volumes that are not attached to instances and performs background tasks such as taking snapshots, scrubbing, and importing data from external sources.

Note
Except for the agents, which are instantiated on a per-server basis, all control plane services run as distributed clusters on the rack to provide scalability and availability.

Telemetry

Oximeter is the system that describes, generates, collects, and stores telemetry data in the Oxide Rack. The system is made up of the following components:

  • producer: A producer is a software process that registers a data source with the control plane, generates measurements, and provides an HTTP endpoint from which oximeter pulls data.

  • collector: A collector is a software process that ingests metrics from one or more producers.

  • db: A library for interacting with the telemetry database, ClickHouse.

Note
While the producer itself is software, it may be collecting the underlying data from hardware such as a sensor, drive, NIC, etc.
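
The pull relationship between a producer and a collector can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not Oximeter's actual API: the names Sample, Producer, FanProducer, and Collector are hypothetical, the "fan:speed" timeseries name is made up, a plain method call stands in for the HTTP request, and an in-memory Vec stands in for ClickHouse.

```rust
/// One measurement belonging to a named timeseries (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub timeseries: String,
    pub timestamp_secs: u64,
    pub value: f64,
}

/// Hypothetical producer interface. In the rack the collector pulls over
/// HTTP; a plain method call stands in for that request here.
pub trait Producer {
    fn id(&self) -> &str;
    /// Hand over every sample gathered since the previous pull.
    fn collect(&mut self) -> Vec<Sample>;
}

/// Software producer that reports readings taken from a fan sensor.
pub struct FanProducer {
    id: String,
    pending: Vec<Sample>,
}

impl FanProducer {
    pub fn new(id: &str) -> Self {
        FanProducer { id: id.to_string(), pending: Vec::new() }
    }

    /// Buffer one reading until the collector next asks for data.
    pub fn record(&mut self, timestamp_secs: u64, rpm: f64) {
        self.pending.push(Sample {
            timeseries: "fan:speed".to_string(), // assumed name, for illustration
            timestamp_secs,
            value: rpm,
        });
    }
}

impl Producer for FanProducer {
    fn id(&self) -> &str {
        &self.id
    }

    fn collect(&mut self) -> Vec<Sample> {
        // Drain the buffer so each sample is handed over exactly once.
        std::mem::take(&mut self.pending)
    }
}

/// Collector that pulls from every registered producer.
/// The `stored` Vec is an in-memory stand-in for the ClickHouse database.
pub struct Collector {
    producers: Vec<Box<dyn Producer>>,
    stored: Vec<Sample>,
}

impl Collector {
    pub fn new() -> Self {
        Collector { producers: Vec::new(), stored: Vec::new() }
    }

    /// Register a producer so that later polls include it.
    pub fn register(&mut self, producer: Box<dyn Producer>) {
        self.producers.push(producer);
    }

    /// Pull once from each producer and return how many samples were stored.
    pub fn poll_once(&mut self) -> usize {
        let mut count = 0;
        for producer in self.producers.iter_mut() {
            let samples = producer.collect();
            count += samples.len();
            self.stored.extend(samples);
        }
        count
    }

    pub fn stored(&self) -> &[Sample] {
        &self.stored
    }
}
```

In this sketch the producer only buffers data and the collector decides when to fetch it, which mirrors the pull model described above: a producer that is never polled simply accumulates samples until the next pull.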

Oximeter provides two traits for describing data: Target and Metric. A target is the entity being measured, such as an HTTP service or a hardware component like a fan. A metric describes a feature of the target that is measured at regular intervals. Keeping with these examples, a metric could be the number of 500-level responses a service generates or the current speed of a fan. Each target and metric combination forms a uniquely identifiable timeseries in the ClickHouse database.
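
To make the target and metric pairing concrete, here is a minimal Rust sketch. It assumes simplified, hypothetical versions of the two traits and does not reproduce the real Oximeter definitions; the struct names, the field names, and the key format built by timeseries_key are all invented for illustration.

```rust
/// Hypothetical, simplified stand-in for a "target" trait:
/// the thing being measured.
pub trait Target {
    fn name(&self) -> &'static str;
    /// Identifying fields, for example which service or which fan.
    fn fields(&self) -> Vec<(&'static str, String)>;
}

/// Hypothetical, simplified stand-in for a "metric" trait:
/// a feature of the target that is measured.
pub trait Metric {
    fn name(&self) -> &'static str;
    fn fields(&self) -> Vec<(&'static str, String)>;
    fn value(&self) -> f64;
}

/// Target: an HTTP service, identified by its name.
pub struct HttpService {
    pub name: String,
}

/// Metric: the number of 500-level responses generated so far.
pub struct ServerErrors {
    pub count: u64,
}

/// Target: a fan, identified by its slot.
pub struct Fan {
    pub slot: u32,
}

/// Metric: the current fan speed.
pub struct FanSpeed {
    pub rpm: f64,
}

impl Target for HttpService {
    fn name(&self) -> &'static str {
        "http_service"
    }
    fn fields(&self) -> Vec<(&'static str, String)> {
        vec![("name", self.name.clone())]
    }
}

impl Metric for ServerErrors {
    fn name(&self) -> &'static str {
        "server_errors"
    }
    fn fields(&self) -> Vec<(&'static str, String)> {
        Vec::new()
    }
    fn value(&self) -> f64 {
        self.count as f64
    }
}

impl Target for Fan {
    fn name(&self) -> &'static str {
        "fan"
    }
    fn fields(&self) -> Vec<(&'static str, String)> {
        vec![("slot", self.slot.to_string())]
    }
}

impl Metric for FanSpeed {
    fn name(&self) -> &'static str {
        "speed"
    }
    fn fields(&self) -> Vec<(&'static str, String)> {
        Vec::new()
    }
    fn value(&self) -> f64 {
        self.rpm
    }
}

/// Build a key that identifies one timeseries from a target and metric pair.
/// The "target:metric{field=value,...}" layout is an assumption of this sketch.
pub fn timeseries_key<T: Target, M: Metric>(target: &T, metric: &M) -> String {
    let mut fields: Vec<String> = target
        .fields()
        .into_iter()
        .chain(metric.fields())
        .map(|(key, value)| format!("{}={}", key, value))
        .collect();
    // Sort so the key does not depend on the order fields were declared in.
    fields.sort();
    format!("{}:{}{{{}}}", target.name(), metric.name(), fields.join(","))
}
```

With these definitions, two fans in different slots yield different keys for the same metric, so their measurements land in separate timeseries, while repeated measurements of the same fan share one key.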

Most services in the Oxide Rack are designed to use Oximeter to produce timeseries data. In future product releases, the telemetry infrastructure will be expanded to include monitoring and alerting capabilities that build on the metrics stored in ClickHouse, as well as more client tools for integrating with system monitoring applications external to the Oxide Rack.
