OPC-SDK-RFC-013: Fault Management and Alarm Substrate
Status: Draft for Implementation
Version: 1.0.0
Date: 2026-05-19
Audience: SREs, NF implementers, operator authors, observability engineers
1. Abstract
This RFC defines the OpenPacketCore fault management and alarm substrate. It standardizes alarm identity, severity, probable cause, affected object, raise/update/clear semantics, deduplication, suppression, correlation, Kubernetes condition mapping, gNMI/NETCONF notification projection, external fault-management sink integration, and evidence requirements.
Metrics, logs, and traces describe behavior. Alarms describe actionable service faults. Carrier CNFs need both.
2. Scope
2.1 In Scope
- Alarm model and lifecycle.
- Severity and probable-cause taxonomy.
- Affected-object naming.
- Raise, update, clear, acknowledge, suppress.
- Alarm correlation and deduplication.
- Mapping to Kubernetes conditions and events.
- Mapping to gNMI/NETCONF notifications.
- External FM sink integration.
- Alarm metrics, audit, and tests.
2.2 Out of Scope
- Full OSS/BSS ticketing implementation.
- Vendor-specific FM protocols unless implemented as adapters.
- Raw log aggregation.
- Performance SLO alerting rules outside CNF-generated alarms.
3. Design Goals
3.1 Security
- Alarms must not leak secrets or raw subscriber identifiers.
- Alarm administration must be authorized.
- Suppression and acknowledgement are audited.
- LI/security alarms must preserve regulated handling boundaries.
3.2 Performance
- Raising an alarm must be cheap and non-blocking.
- Alarm storms must be deduplicated and rate-limited.
- External sink outages must not block packet or request handling.
3.3 Maintainability
- One alarm vocabulary across all CNFs.
- Stable alarm IDs and probable causes.
- Generated YANG notification projection.
- Shared testkit for alarm lifecycle.
3.4 Functionality
- Support active and historical alarms.
- Support severity changes.
- Support clear conditions.
- Support suppression windows.
- Support external sinks and local query.
4. Alarm Model
```rust
pub struct Alarm {
    pub alarm_id: AlarmId,
    pub alarm_type: AlarmType,
    pub severity: Severity,
    pub probable_cause: ProbableCause,
    pub affected_object: AffectedObject,
    pub tenant: Option<TenantId>,
    pub slice: Option<Snssai>,
    pub region: Option<RegionId>,
    pub text: RedactedText,
    pub details: AlarmDetails,
    pub raised_at: Timestamp,
    pub updated_at: Timestamp,
    pub cleared_at: Option<Timestamp>,
    pub correlation_id: Option<CorrelationId>,
}
```
AlarmId MUST be stable for the same active fault instance.
5. Severity
Severity levels:
| Severity | Meaning |
|---|---|
| critical | service outage, data loss, security boundary failure |
| major | serious degradation or redundancy loss |
| minor | limited impairment with workaround |
| warning | approaching fault or policy exception |
| indeterminate | fault detected but impact unknown |
| cleared | fault no longer active |
Severity mapping MUST be consistent across CNFs.
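One way to keep the mapping consistent is to centralize the levels in a single enum that every CNF imports. The following is a minimal sketch, not the SDK's actual type: the `rank` ordering (in particular where `indeterminate` sits), the `parse` helper, and the `worst` function are assumptions made for illustration.

```rust
/// Severity levels from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Severity {
    Critical,
    Major,
    Minor,
    Warning,
    Indeterminate,
    Cleared,
}

impl Severity {
    /// Hypothetical rank used to pick the worst alarm; higher is worse.
    /// The RFC names the levels but does not define a numeric order.
    pub fn rank(self) -> u8 {
        match self {
            Severity::Critical => 5,
            Severity::Major => 4,
            Severity::Minor => 3,
            Severity::Warning => 2,
            Severity::Indeterminate => 1,
            Severity::Cleared => 0,
        }
    }

    /// Parses the lowercase names used in the severity table.
    pub fn parse(name: &str) -> Option<Severity> {
        match name {
            "critical" => Some(Severity::Critical),
            "major" => Some(Severity::Major),
            "minor" => Some(Severity::Minor),
            "warning" => Some(Severity::Warning),
            "indeterminate" => Some(Severity::Indeterminate),
            "cleared" => Some(Severity::Cleared),
            _ => None,
        }
    }

    /// Every level except `cleared` describes an active fault.
    pub fn is_active(self) -> bool {
        self != Severity::Cleared
    }
}

/// Returns the worst severity in a set, or `None` for an empty set.
pub fn worst(severities: &[Severity]) -> Option<Severity> {
    severities.iter().copied().max_by_key(|s| s.rank())
}
```

Keeping the ranking in one function means a readiness or presentation layer never has to re-derive the order on its own.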
6. Probable Cause Taxonomy
The SDK maintains a versioned taxonomy:
- `config-apply-failed`
- `config-drift-detected`
- `certificate-expiring`
- `certificate-expired`
- `identity-unavailable`
- `authorization-policy-invalid`
- `session-store-unavailable`
- `lease-lost`
- `backend-timeout`
- `nrf-unreachable`
- `sbi-overload`
- `peer-unreachable`
- `packet-drop-threshold`
- `dataplane-preflight-failed`
- `storage-corruption`
- `audit-chain-invalid`
- `key-unavailable`
- `li-delivery-failed`
- `charging-export-failed`
- `privacy-policy-violation`
Per-NF causes may be added but MUST be namespaced.
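The namespacing rule can be enforced at registration time. The sketch below assumes a `<namespace>:<cause>` syntax (for example `upf:gtpu-path-failure`); the RFC does not specify the separator, so the colon, the kebab-case check, and the names `classify_cause` and `CauseKind` are all hypothetical.

```rust
/// A few entries of the shared taxonomy listed above (not the full list).
const SHARED_CAUSES: &[&str] = &[
    "config-apply-failed",
    "certificate-expired",
    "lease-lost",
    "nrf-unreachable",
];

#[derive(Debug, PartialEq, Eq)]
pub enum CauseKind<'a> {
    /// Cause defined by the shared SDK taxonomy.
    Shared(&'a str),
    /// Per-NF cause carrying an explicit namespace.
    PerNf { namespace: &'a str, cause: &'a str },
}

/// True for non-empty lowercase kebab-case tokens such as `lease-lost`.
fn is_kebab(s: &str) -> bool {
    !s.is_empty()
        && !s.starts_with('-')
        && !s.ends_with('-')
        && s.chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

/// Accepts shared causes and namespaced per-NF causes; rejects the rest.
/// The `namespace:cause` syntax is an assumption, not defined by the RFC.
pub fn classify_cause(name: &str) -> Result<CauseKind<'_>, String> {
    if SHARED_CAUSES.contains(&name) {
        return Ok(CauseKind::Shared(name));
    }
    match name.split_once(':') {
        Some((ns, cause)) if is_kebab(ns) && is_kebab(cause) => {
            Ok(CauseKind::PerNf { namespace: ns, cause })
        }
        Some(_) => Err(format!("malformed namespaced cause: {name}")),
        None => Err(format!("cause {name} is neither shared nor namespaced")),
    }
}
```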
7. Affected Object
Affected objects use structured names:
```rust
pub enum AffectedObject {
    NfInstance { kind: NfKind, instance: InstanceId },
    Interface { nf: InstanceId, name: String },
    Peer { nf: InstanceId, peer_id: String },
    SessionStore { nf: InstanceId, shard: Option<String> },
    Slice { snssai: Snssai },
    Tenant { tenant: TenantId },
    Certificate { key_id: KeyId },
    DataPlaneQueue { nf: InstanceId, interface: String, queue: u16 },
}
```
Raw subscriber identifiers MUST NOT be affected-object names.
8. Alarm Lifecycle
States:
- `raised`
- `updated`
- `acknowledged`
- `suppressed`
- `cleared`
- `expired`
Lifecycle rules:
- A repeated raise with same dedup key updates the active alarm.
- Clear requires a matching active alarm; a clear with no match is a no-op that only increments the `opc_alarm_clear_without_active_total` metric.
- Acknowledgement does not clear.
- Suppression does not delete history.
- Severity downgrade is an update, not clear plus raise.
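The lifecycle rules above can be sketched as a small in-memory manager. This is an illustration under simplifying assumptions, not the `opc-alarm-manager` API: the dedup key is an opaque string, suppression and history are omitted, and the names `AlarmManager`, `Outcome`, and `ActiveAlarm` are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Critical,
    Major,
    Minor,
    Warning,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Raised,
    Updated,
    Cleared,
    ClearWithoutActive,
}

#[derive(Debug, Clone)]
pub struct ActiveAlarm {
    pub severity: Severity,
    pub acknowledged: bool,
    pub occurrences: u64,
}

#[derive(Default)]
pub struct AlarmManager {
    active: HashMap<String, ActiveAlarm>,
    /// Stand-in for the opc_alarm_clear_without_active_total counter.
    pub clear_without_active: u64,
}

impl AlarmManager {
    /// A repeated raise with the same key updates the active alarm.
    /// A severity change (including a downgrade) is also just an update.
    pub fn raise(&mut self, key: &str, severity: Severity) -> Outcome {
        if let Some(alarm) = self.active.get_mut(key) {
            alarm.severity = severity;
            alarm.occurrences += 1;
            return Outcome::Updated;
        }
        let alarm = ActiveAlarm { severity, acknowledged: false, occurrences: 1 };
        self.active.insert(key.to_string(), alarm);
        Outcome::Raised
    }

    /// Acknowledgement marks the alarm but leaves it active.
    pub fn acknowledge(&mut self, key: &str) -> bool {
        match self.active.get_mut(key) {
            Some(alarm) => {
                alarm.acknowledged = true;
                true
            }
            None => false,
        }
    }

    /// Clearing without a matching active alarm only bumps a counter.
    pub fn clear(&mut self, key: &str) -> Outcome {
        if self.active.remove(key).is_some() {
            Outcome::Cleared
        } else {
            self.clear_without_active += 1;
            Outcome::ClearWithoutActive
        }
    }

    pub fn get(&self, key: &str) -> Option<&ActiveAlarm> {
        self.active.get(key)
    }

    pub fn active_count(&self) -> usize {
        self.active.len()
    }
}
```

Because duplicates only touch the existing map entry, a storm of identical raises grows a counter rather than memory.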
9. Deduplication and Correlation
Dedup key:
alarm_type || probable_cause || affected_object || tenant || slice
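Plain concatenation of these fields is ambiguous (`"ab" || "c"` equals `"a" || "bc"`), so an implementation needs some canonical encoding. The sketch below length-prefixes each field; this encoding, the string-typed fields, and the `DedupKey` name are assumptions for illustration, not something the RFC prescribes.

```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DedupKey {
    pub alarm_type: String,
    pub probable_cause: String,
    pub affected_object: String,
    pub tenant: Option<String>,
    pub slice: Option<String>,
}

/// Appends one field as `<len>:<value>|`, or `-|` when the field is absent.
fn push_field(out: &mut String, value: Option<&str>) {
    match value {
        Some(v) => {
            out.push_str(&v.len().to_string());
            out.push(':');
            out.push_str(v);
        }
        None => out.push('-'),
    }
    out.push('|');
}

impl DedupKey {
    /// Canonical, unambiguous string form of the key (hypothetical encoding).
    pub fn canonical(&self) -> String {
        let mut out = String::new();
        push_field(&mut out, Some(&self.alarm_type));
        push_field(&mut out, Some(&self.probable_cause));
        push_field(&mut out, Some(&self.affected_object));
        push_field(&mut out, self.tenant.as_deref());
        push_field(&mut out, self.slice.as_deref());
        out
    }
}
```

The struct also derives `Hash` and `Eq`, so it can be used directly as a map key when a string form is not required.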
Correlation groups related alarms, such as:
- NRF unavailable causing SBI discovery failures.
- certificate expiry causing mTLS failures.
- session store outage causing lease-lost alarms.
Correlation MUST NOT hide critical alarms; it only helps presentation.
10. Suppression
Suppression may be:
- maintenance window,
- known outage,
- test mode,
- dependency alarm correlation.
Suppression requires authorization and audit. Security-critical alarms SHOULD NOT be suppressible unless carrier policy explicitly allows it.
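A suppression decision can be sketched as a pure function over a set of time-bounded rules. The example assumes rules are matched by probable cause and a Unix-seconds window, and that carrier policy is a single boolean; `SuppressionRule`, `SuppressionPolicy`, and `allow_security_suppression` are hypothetical names. Authorization and audit of rule creation are out of scope for the sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SuppressionReason {
    MaintenanceWindow,
    KnownOutage,
    TestMode,
    DependencyCorrelation,
}

pub struct SuppressionRule {
    pub reason: SuppressionReason,
    /// Probable cause this rule applies to.
    pub cause: String,
    /// Window start, Unix seconds, inclusive.
    pub start_unix: u64,
    /// Window end, Unix seconds, exclusive.
    pub end_unix: u64,
}

pub struct SuppressionPolicy {
    pub rules: Vec<SuppressionRule>,
    /// Assumed carrier policy switch; off by default in this sketch.
    pub allow_security_suppression: bool,
}

impl SuppressionPolicy {
    /// Returns the reason an alarm is suppressed at `now_unix`, if any.
    /// Security-critical alarms are never suppressed unless policy allows it.
    pub fn is_suppressed(
        &self,
        cause: &str,
        security_critical: bool,
        now_unix: u64,
    ) -> Option<SuppressionReason> {
        if security_critical && !self.allow_security_suppression {
            return None;
        }
        self.rules
            .iter()
            .find(|r| r.cause == cause && r.start_unix <= now_unix && now_unix < r.end_unix)
            .map(|r| r.reason)
    }
}
```

A suppressed alarm would still be recorded in history; the function only decides whether it is presented or forwarded.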
11. Storage
The alarm store MUST support:
- active alarm query,
- historical alarm query,
- append-only lifecycle events,
- bounded retention,
- tenant/slice filtering,
- tamper-evident audit for admin actions.
Local storage may use RFC 001 persistence for management alarms. High-volume alarm history SHOULD be exported to an external FM system.
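The append-only and bounded-retention requirements can be illustrated with a ring of lifecycle events. This sketch assumes retention is bounded by event count (the RFC does not say whether the bound is by count, age, or size) and keeps everything in memory; `History` and `LifecycleEvent` are hypothetical names.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LifecycleEvent {
    /// Monotonic sequence number; never reused, even after eviction.
    pub seq: u64,
    pub alarm_key: String,
    pub tenant: Option<String>,
    /// Lifecycle event name, e.g. "raised" or "cleared".
    pub event: String,
}

pub struct History {
    events: VecDeque<LifecycleEvent>,
    max_events: usize,
    next_seq: u64,
    /// Number of events dropped by the retention bound.
    pub evicted: u64,
}

impl History {
    pub fn new(max_events: usize) -> Self {
        History {
            events: VecDeque::new(),
            max_events: max_events.max(1), // keep at least one event
            next_seq: 0,
            evicted: 0,
        }
    }

    /// Appends an event; existing events are never modified, only evicted
    /// from the front once the retention bound is reached.
    pub fn append(&mut self, alarm_key: &str, tenant: Option<&str>, event: &str) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        if self.events.len() == self.max_events {
            self.events.pop_front();
            self.evicted += 1;
        }
        self.events.push_back(LifecycleEvent {
            seq,
            alarm_key: alarm_key.to_string(),
            tenant: tenant.map(|t| t.to_string()),
            event: event.to_string(),
        });
        seq
    }

    /// Tenant-filtered view of the retained history, oldest first.
    pub fn by_tenant(&self, tenant: &str) -> Vec<&LifecycleEvent> {
        self.events
            .iter()
            .filter(|e| e.tenant.as_deref() == Some(tenant))
            .collect()
    }

    pub fn len(&self) -> usize {
        self.events.len()
    }
}
```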
12. Projection to Kubernetes
Alarms map to Kubernetes Conditions and Events:
- critical/major active alarms can drive `Ready=False` or `Degraded=True` according to NF policy,
- warning alarms usually do not change readiness,
- clear events update conditions when no other active alarm holds the state.
Condition reason strings MUST be stable.
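One way to satisfy the "no other active alarm holds the state" rule is to recompute conditions from the full active set on every change instead of toggling them per event. The sketch below assumes an NF policy where critical alarms drive `Ready=False` and major alarms drive `Degraded=True`; the policy, the reason strings, and the names `impact_of` and `derive_conditions` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Critical,
    Major,
    Minor,
    Warning,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Impact {
    NotReady,
    Degraded,
    NoImpact,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Conditions {
    pub ready: bool,
    pub degraded: bool,
    /// Stable reason string; the values here are illustrative only.
    pub reason: &'static str,
}

/// Example NF policy: which severities affect readiness.
pub fn impact_of(severity: Severity) -> Impact {
    match severity {
        Severity::Critical => Impact::NotReady,
        Severity::Major => Impact::Degraded,
        Severity::Minor | Severity::Warning => Impact::NoImpact,
    }
}

/// Derives conditions from the severities of all currently active alarms.
pub fn derive_conditions(active: &[Severity]) -> Conditions {
    let not_ready = active.iter().any(|s| impact_of(*s) == Impact::NotReady);
    let degraded = active.iter().any(|s| impact_of(*s) == Impact::Degraded);
    let reason = if not_ready {
        "CriticalAlarmActive"
    } else if degraded {
        "MajorAlarmActive"
    } else {
        "NoBlockingAlarms"
    };
    Conditions { ready: !not_ready, degraded, reason }
}
```

With this shape, clearing one of two critical alarms leaves `Ready=False` in place, because the remaining alarm still appears in the active set.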
13. Projection to gNMI/NETCONF
The alarm subsystem MUST expose:
- active alarms operational tree,
- alarm history operational tree,
- notifications for raise/update/clear,
- authorized acknowledge/suppress operations.
YANG notification generation SHOULD use RFC 002 metadata and RFC 006 evidence tags.
14. External FM Sinks
Sink adapters:
- webhook,
- Kafka/NATS,
- OpenTelemetry events,
- SNMP/NETCONF adapter where needed,
- carrier OSS adapter.
External sink failure MUST:
- raise a sink alarm,
- buffer within limits if policy allows,
- never block fast paths,
- expose drop counters.
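The "buffer within limits, never block, count drops" behavior maps naturally onto a bounded channel with a non-blocking send. The sketch below uses the standard library's `sync_channel` and `try_send`; the `SinkHandle` type, string-typed events, and the policy of dropping the newest event when the buffer is full are assumptions for illustration. Raising the sink alarm itself is left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Producer side of a bounded sink queue.
pub struct SinkHandle {
    tx: SyncSender<String>,
    dropped: AtomicU64,
}

impl SinkHandle {
    /// Creates a handle plus the receiver a sink worker would drain.
    /// `capacity` should be at least 1; 0 would make every send a rendezvous.
    pub fn new(capacity: usize) -> (SinkHandle, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (SinkHandle { tx, dropped: AtomicU64::new(0) }, rx)
    }

    /// Offers an event without blocking. Returns false and increments the
    /// drop counter when the buffer is full or the worker has gone away.
    pub fn offer(&self, event: String) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Value a metrics exporter could publish as a drop counter.
    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}
```

The caller on the fast path only ever pays for a `try_send`; delivery, retry, and backoff would live in the worker that owns the receiver.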
15. Alarm Sources
Common sources:
- RFC 001 config commit failures,
- RFC 003 identity/key/cert failures,
- RFC 004 session store and lease failures,
- RFC 007 SBI overload/discovery failures,
- RFC 008 runtime task failures,
- RFC 009 lifecycle migration failures,
- RFC 011 data-plane preflight and drop thresholds,
- RFC 010 privacy/legal-hold/export failures.
16. Observability
Required metrics:
- `opc_alarm_active{severity,cause}`
- `opc_alarm_events_total{event,severity,cause}`
- `opc_alarm_suppressed_total{cause}`
- `opc_alarm_sink_delivery_total{sink,outcome}`
- `opc_alarm_sink_queue_depth{sink}`
- `opc_alarm_clear_without_active_total{cause}`
Alarm text MUST be redacted through RFC 010.
17. Configuration Model
Shared YANG groupings SHOULD include:
- `alarms/severity-policy`
- `alarms/suppression`
- `alarms/sinks`
- `alarms/retention`
- `alarms/readiness-impact`
- `alarms/correlation`
Per-NF YANG may add alarm thresholds, such as packet drop ratio or peer outage duration.
18. Module Ownership
| Module | Responsibility |
|---|---|
| opc-alarm-model | alarm structs, severity, causes |
| opc-alarm-store | active/history store |
| opc-alarm-manager | raise/update/clear/dedup |
| opc-alarm-policy | suppression and readiness impact |
| opc-alarm-k8s | condition/event mapping |
| opc-alarm-yang | gNMI/NETCONF operational projection |
| opc-alarm-sink | external sink adapters |
| opc-alarm-testkit | alarm lifecycle fixtures |
Agents adding new alarms must add taxonomy entries, tests, and evidence tags.
19. Testing Requirements
19.1 Unit Tests
- Dedup key stability.
- Severity transition.
- Clear behavior.
- Suppression authorization.
- Redaction.
- Readiness impact policy.
19.2 Integration Tests
- Runtime task failure raises alarm.
- Alarm maps to Kubernetes condition.
- Alarm notification appears on gNMI subscription.
- External sink receives raise/update/clear.
- Sink outage buffers or drops according to policy.
19.3 Fault Injection
- Alarm storm.
- Sink outage.
- Store unavailable.
- Unauthorized suppression attempt.
- Duplicate raise from many tasks.
19.4 Performance Gates
- Alarm raise common path does not block longer than 100 microseconds.
- Alarm storm of 10,000 duplicate events deduplicates without unbounded memory.
- External sink outage does not impact protocol request p99.
20. Acceptance Criteria
This RFC is implemented when:
- Every CNF uses the shared alarm model and manager.
- Alarm severity and probable cause taxonomy are stable and versioned.
- Raise/update/clear semantics are deterministic.
- Kubernetes conditions and events are derived consistently.
- gNMI/NETCONF alarm operational state and notifications are available.
- Suppression and acknowledgement are authorized and audited.
- External sink failures do not block service paths.
- Alarm behavior is covered by shared testkit and evidence.