OPC-SDK-RFC-008: CNF Runtime Chassis and Resource Governance
Status: Draft for Implementation
Version: 1.0.0
Date: 2026-05-19
Audience: NF implementers, platform engineers, SREs, security reviewers
1. Abstract
This RFC defines the common Rust runtime chassis used by every OpenPacketCore CNF. It standardizes process startup, task supervision, shutdown, health probes, admin endpoints, runtime pools, resource budgets, panic policy, configuration bootstrap, signal handling, telemetry initialization, memory behavior, and operational debug surfaces.
The goal is that AMF, SMF, UPF, NRF, PCF, SEPP, SMSC, and all other CNFs share one predictable runtime skeleton instead of each inventing its own Tokio setup, shutdown behavior, health semantics, and task lifecycle.
2. Scope
2.1 In Scope
- Runtime initialization.
- Tokio worker and blocking pool configuration.
- Task supervision and cancellation.
- Startup and readiness phases.
- Graceful shutdown and drain.
- Health and admin HTTP endpoints.
- Runtime resource budgets and backpressure hooks.
- Panic and fatal-error policy.
- Metrics, logging, and tracing bootstrap.
- Memory, allocator, and OOM behavior.
- Common CLI/env/bootstrap contract.
2.2 Out of Scope
- NF-specific protocol logic.
- Kubernetes controller behavior. See RFC 009.
- Node/NIC scheduling and SR-IOV contracts. See RFC 011.
- Config commit semantics. See RFC 001.
3. Design Goals
3.1 Security
- Fail closed when required bootstrap security material is unavailable.
- Keep debug endpoints disabled or authorization-gated in production.
- Ensure panic output and fatal-error reports are redacted.
- Make shutdown safe: no partial config writes, key leaks, or unaudited emergency exits.
3.2 Performance
- Avoid runtime-pool contention between management, control, crypto, and data-plane work.
- Bound queues, tasks, memory, and blocking work.
- Make health checks cheap and non-blocking.
- Provide predictable drain behavior under load.
3.3 Maintainability
- Provide one reusable
opc-runtimecrate. - Make lifecycle phases explicit and testable.
- Provide standard task naming and metrics.
- Keep per-NF custom code in callbacks, not in process scaffolding.
3.4 Functionality
- Support control-plane, data-plane, and library-like CNF profiles.
- Support local developer mode and production mode.
- Support graceful restart, termination, and Kubernetes probe integration.
- Support runtime introspection without exposing secrets.
4. Runtime Crate
The shared crate is opc-runtime.
crates/opc-runtime/
src/
lib.rs
bootstrap.rs
profile.rs
supervisor.rs
task.rs
shutdown.rs
health.rs
admin.rs
resources.rs
panic.rs
telemetry.rs
memory.rs
signals.rs
testkit.rs
Every NF binary SHOULD be a thin wrapper around opc_runtime::run.
5. Runtime Profile
#![allow(unused)] fn main() { pub struct RuntimeProfile { pub mode: RuntimeMode, pub nf_kind: NetworkFunctionKind, pub instance_id: InstanceId, pub async_workers: WorkerCount, pub blocking_threads: ThreadLimit, pub crypto_threads: ThreadLimit, pub management_threads: ThreadLimit, pub max_tasks: usize, pub max_queued_bytes: usize, pub shutdown_grace: Duration, pub drain_timeout: Duration, } }
Profiles:
dev: permissive, local files, debug endpoints enabled on loopback.lab: production-like, explicit waivers allowed.production: fail closed, debug gated, strict resource limits.conformance: deterministic test profile.perf: optimized benchmark profile with fixed CPU/resource assumptions.
6. Startup State Machine
Every CNF starts through:
| Phase | Purpose |
|---|---|
ProcessInit | parse CLI/env, install panic hook, initialize logging |
TelemetryInit | metrics/tracing/logging exporters |
SecurityInit | identity, trust bundles, key providers |
ConfigBootstrap | load initial config through RFC 001 |
ResourcePreflight | verify CPU, memory, filesystem, devices |
ServiceBind | bind listeners but do not report ready |
PeerWarmup | optional NRF registration, discovery, backend connection |
Ready | readiness probe returns success |
Draining | termination accepted, new work limited |
Stopped | all supervised tasks exited |
Startup MUST fail closed in production if any required phase fails.
7. Task Supervision
7.1 Task Model
All long-lived tasks MUST be registered with the supervisor:
#![allow(unused)] fn main() { pub struct TaskSpec { pub name: TaskName, pub kind: TaskKind, pub criticality: Criticality, pub restart: RestartPolicy, pub shutdown: ShutdownPolicy, } }
Task kinds:
listenerprotocol-workersession-workermanagement-workerbackground-syncmetrics-exporterwatchertimer
7.2 Criticality
| Criticality | Behavior on Failure |
|---|---|
fatal | Transition CNF to fatal shutdown |
degrade | Mark degraded and optionally restart |
best-effort | Log/metric and continue |
Critical task failures MUST be visible through readiness and alarm state.
7.3 Restart Policy
Restart policy MUST include:
- max restarts per window,
- backoff,
- jitter,
- failure classification,
- whether restart is allowed after config changes.
Unbounded task restart loops are forbidden.
8. Runtime Pool Isolation
The runtime MUST expose separate execution domains:
- async I/O workers,
- blocking/CPU pool,
- crypto pool,
- management pool,
- data-plane workers where applicable.
Data-plane CNFs SHOULD integrate with RFC 011 CPU pinning and IRQ affinity. Management-plane work MUST NOT execute on data-plane pinned workers.
9. Resource Governance
9.1 Budgets
Each CNF declares:
#![allow(unused)] fn main() { pub struct ResourceBudget { pub max_heap_bytes: Option<usize>, pub max_tasks: usize, pub max_channels: usize, pub max_queue_bytes: usize, pub max_request_body_bytes: usize, pub max_open_files: usize, pub max_backend_connections: usize, } }
Budgets MUST be profile-configurable and observable.
9.2 Backpressure
The runtime provides shared primitives:
- bounded mpsc channels,
- byte-accounted queues,
- weighted semaphores,
- admission guards,
- deadline propagation,
- cancellation tokens.
Unbounded channels are forbidden in production runtime code unless an RFC 006 waiver exists.
9.3 Memory Behavior
The runtime SHOULD:
- expose allocator metrics where available,
- support an optional hardened allocator profile,
- fail fast on configured memory-budget breach,
- avoid memory-heavy debug dumps in production,
- support heap profile endpoints only under explicit authorization.
10. Shutdown and Drain
10.1 Signals
The runtime MUST handle:
SIGTERM: graceful drain.SIGINT: graceful drain in dev, configurable in production.- fatal internal errors: controlled shutdown path when possible.
10.2 Drain Sequence
Drain order:
- Stop accepting new external work.
- Mark readiness false.
- Notify NRF/deregister where applicable.
- Stop management writes except emergency recovery.
- Drain protocol workers up to timeout.
- Flush audit and evidence breadcrumbs.
- Checkpoint local state where applicable.
- Shut down listeners and background tasks.
Each NF can add steps but MUST preserve safety ordering.
10.3 Kubernetes Integration
terminationGracePeriodSeconds MUST be at least shutdown_grace plus probe
latency margin. PreStop hooks MAY call admin drain but MUST NOT be the only
drain mechanism.
11. Health and Admin Surface
11.1 Endpoints
Default admin listener:
/livez/readyz/startupz/metrics/debug/runtimegated/debug/tasksgated/debug/config-versiongated
Production debug endpoints MUST require authorization or be disabled.
11.2 Health Semantics
/livez means the process event loop is alive. It MUST NOT depend on external
peers.
/readyz means the CNF can serve its intended role. It SHOULD include:
- config applied,
- critical tasks healthy,
- required listeners bound,
- required security material valid,
- required backends reachable according to NF policy.
12. Panic and Fatal Error Policy
12.1 Panics
Production builds MUST install a panic hook that:
- redacts secrets,
- records task name,
- increments fatal metrics,
- emits a structured fatal log,
- triggers supervisor policy.
Panics in parser or protocol handlers are bugs and MUST be covered by RFC 005 fuzzing regression tests.
12.2 unwrap and expect
Runtime and NF code MUST avoid unwrap and expect outside tests, build
scripts, and explicitly justified invariants. Justifications MUST be grep-able
and evidence-linked.
13. Bootstrap Contract
CLI/env values are limited to bootstrap concerns:
- config bootstrap source,
- management bind address,
- admin bind address,
- production/dev mode,
- identity socket path,
- tracing exporter endpoint,
- initial log level,
- feature gates for explicit waivers.
Dense protocol behavior MUST come from canonical config, not env vars.
14. Telemetry Initialization
The runtime initializes:
- structured JSON logging,
- OpenTelemetry tracing,
- Prometheus metrics,
- build info,
- runtime profile info,
- panic/fatal counters.
Required metrics:
opc_runtime_build_info{nf,version,git_sha}opc_runtime_tasks{nf,kind,state}opc_runtime_task_restarts_total{nf,task}opc_runtime_queue_depth{nf,queue}opc_runtime_queue_bytes{nf,queue}opc_runtime_shutdown_total{nf,reason}opc_runtime_panic_total{nf,task}opc_runtime_memory_bytes{nf,kind}opc_runtime_ready{nf}
15. Time and Clocks
The runtime MUST provide a clock abstraction for tests:
#![allow(unused)] fn main() { pub trait Clock: Send + Sync { fn now(&self) -> Timestamp; fn monotonic(&self) -> Instant; } }
Security expiry and audit timestamps use wall clock plus monotonic sequencing where required. Timers use monotonic time.
16. Module Ownership
| Module | Responsibility |
|---|---|
opc-runtime-bootstrap | CLI/env/profile loading |
opc-runtime-supervisor | task registry, restart, failure policy |
opc-runtime-shutdown | signal handling and drain orchestration |
opc-runtime-health | health model and probe endpoints |
opc-runtime-admin | gated debug/admin routes |
opc-runtime-resources | budgets, queues, semaphores |
opc-runtime-telemetry | logging, metrics, tracing init |
opc-runtime-testkit | fake clock, fake tasks, shutdown tests |
Agents implementing NF business logic should consume opc-runtime; they should
not fork startup/shutdown code.
17. Testing Requirements
17.1 Unit Tests
- Startup state transitions.
- Task restart/backoff.
- Fatal vs degraded task failure.
- Bounded queue byte accounting.
- Panic hook redaction.
- Health state aggregation.
- Clock abstraction.
17.2 Integration Tests
- SIGTERM drains in order.
- Readiness flips false before listeners stop.
- NRF deregistration hook is called during drain.
- Background task failure degrades readiness.
- Debug endpoints are disabled or authorized in production.
17.3 Fault Injection
- Hung task on shutdown.
- Task restart loop.
- Telemetry exporter unavailable.
- Missing identity socket.
- Memory budget breach.
- Queue saturation.
- Panic in a worker task.
17.4 Performance Gates
/livezp99 under 1 millisecond in healthy process.- Supervisor task spawn overhead negligible relative to direct spawn in NF startup tests.
- Runtime metrics collection does not allocate on every scrape for static metric sets.
- Queue admission overhead p99 under 10 microseconds.
18. Acceptance Criteria
This RFC is implemented when:
- Every NF binary uses
opc-runtimefor startup, supervision, health, and shutdown. - Long-lived tasks are supervised and named.
- Readiness semantics are consistent across CNFs.
- Shutdown drains safely and predictably.
- Production debug endpoints are gated or disabled.
- Runtime pools and queues are bounded.
- Panic and fatal-error handling is redacted and observable.
- Runtime behavior is covered by shared testkit and fault injection tests.