vscode/extensions/copilot/docs/monitoring/otel-data-flow.html
Zhichao Li 4a4411e88e Native OTel instrumentation for Copilot CLI (background, terminal, debug panel) (#4507)
* add OTel instrumentation spec and plan for all agents

* feat: OTel instrumentation for Copilot CLI background agent

- Add agentOTelEnv.ts config derivation helpers (CLI + Claude)
- Enable SDK OtelLifecycle via env vars before LocalSessionManager ctor
- Add invoke_agent copilotcli wrapper span with traceparent propagation
- Forward OTel env vars to terminal CLI sessions
- Update spec and plan docs for all agents
- 33 tests passing (14 new + 19 existing)

* feat: filter debug-panel-only spans from OTLP export

Spans with non-standard gen_ai.operation.name values (content_event,
user_message) are excluded from external OTLP export while remaining
visible in the Agent Debug Log panel via onDidCompleteSpan.

Only GenAI-conventional operations (invoke_agent, chat, execute_tool,
embeddings, execute_hook) are exported to the user's collector.
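
A minimal sketch of the export filter described above, assuming a simplified span shape. `SpanLike` and `shouldExportToOtlp` are hypothetical names used only for illustration, and treating a missing gen_ai.operation.name as non-exportable is an assumption of this sketch, not something the commit states.

```typescript
// Operation names listed in the commit message as GenAI-conventional.
const EXPORTABLE_OPERATION_NAMES: ReadonlySet<string> = new Set([
  'invoke_agent',
  'chat',
  'execute_tool',
  'embeddings',
  'execute_hook',
]);

// Hypothetical, simplified span shape (not the real OTel ReadableSpan).
interface SpanLike {
  name: string;
  attributes: Record<string, unknown>;
}

// Returns true when the span should go to the external OTLP collector.
// Spans that fail this check can still be shown in the debug panel.
function shouldExportToOtlp(span: SpanLike): boolean {
  const op = span.attributes['gen_ai.operation.name'];
  // Assumption: a missing or non-string operation name is not exported.
  return typeof op === 'string' && EXPORTABLE_OPERATION_NAMES.has(op);
}
```

With this shape, a `content_event` or `user_message` span is rejected by the filter while a `chat` span passes.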

* fix: add IOTelService to CopilotCLISessionService ctor in participant test

* fix: pass chatSessionId to CapturingToken for debug panel routing

The CapturingToken was created without chatSessionId, so the debug panel
couldn't route copilotcli OTel spans to the correct session view.

Also: Copilot CLI runtime only supports otlp-http (not gRPC). Terminal
CLI sessions require an HTTP-compatible OTLP endpoint.

* docs: add CLI HTTP-only limitation to spec and dual-port Aspire setup to test plan

* fix: forward OTel env vars to CLI terminal sessions

- Include OTel env vars in terminal profile provider path (dropdown)
  which previously only set shell info without auth/OTel env
- Pass empty env to deriveCopilotCliOTelEnv for terminal sessions so
  vars are always included regardless of process.env pollution from
  the in-process background agent
- Update test plan to use Grafana LGTM stack

* fix: add CHAT_SESSION_ID to attributes in CopilotCLISession

* docs: update OTel instrumentation specification for Copilot CLI and Claude Code

* feat: bridge SDK native OTel spans to Agent Debug panel

Replace synthetic span approach (PR #4494) with a bridge SpanProcessor
that forwards SDK-native spans from the Copilot CLI runtime's
BasicTracerProvider into the extension's IOTelService event stream.

This gives the debug panel the full SDK span hierarchy (subagents,
permissions, hooks, nested tool calls) — identical to what Grafana shows.

Architecture:
- Add injectCompletedSpan() to IOTelService interface for external span
  injection without OTLP re-export
- Create CopilotCliBridgeSpanProcessor that converts ReadableSpan to
  ICompletedSpanData, injects copilot_chat.chat_session_id from a
  traceId→sessionId map, and fires onDidCompleteSpan
- Install bridge on SDK's TracerProvider via internal
  MultiSpanProcessor._spanProcessors array (OTel SDK v2 removed the
  public addSpanProcessor API, but this internal array is the same
  pattern the SDK itself uses in forceFlush)
- Propagate traceparent from extension root span to SDK via
  otelLifecycle.updateParentTraceContext() so all spans share a traceId
- Filter bridge to only forward spans from registered CLI sessions
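
The bridge behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual CopilotCliBridgeSpanProcessor: `ReadableSpanLike`, `CompletedSpanSketch`, and `CopilotCliBridgeSketch` are hypothetical stand-ins for the real OTel and extension types, and the installation step via the SDK's internal processor array is deliberately left out.

```typescript
// Hypothetical, reduced view of an SDK span (stand-in for OTel's ReadableSpan).
interface ReadableSpanLike {
  name: string;
  spanContext(): { traceId: string; spanId: string };
  parentSpanContext?: { spanId: string };
  attributes: Record<string, unknown>;
}

// Hypothetical, reduced view of the data handed to the debug panel.
interface CompletedSpanSketch {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  attributes: Record<string, unknown>;
}

class CopilotCliBridgeSketch {
  // traceId -> chat session id, filled in when a CLI session starts.
  private readonly sessionsByTraceId = new Map<string, string>();

  // `inject` plays the role of IOTelService.injectCompletedSpan.
  constructor(private readonly inject: (span: CompletedSpanSketch) => void) {}

  registerTrace(traceId: string, sessionId: string): void {
    this.sessionsByTraceId.set(traceId, sessionId);
  }

  unregisterTrace(traceId: string): void {
    this.sessionsByTraceId.delete(traceId);
  }

  // Called when an SDK span ends; forwards only spans of registered sessions.
  onEnd(span: ReadableSpanLike): void {
    const { traceId, spanId } = span.spanContext();
    const sessionId = this.sessionsByTraceId.get(traceId);
    if (sessionId === undefined) {
      return; // Not a registered CLI session: drop silently.
    }
    this.inject({
      name: span.name,
      traceId,
      spanId,
      parentSpanId: span.parentSpanContext?.spanId,
      attributes: {
        ...span.attributes,
        'copilot_chat.chat_session_id': sessionId,
      },
    });
  }
}
```

Keeping the traceId lookup inside `onEnd` means the filter and the session-id injection share one map, so a span can never be forwarded without a session to route it to.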

Code changes:
- copilotCliBridgeSpanProcessor.ts: new bridge processor
- copilotcliSession.ts: remove all synthetic spans (chat, tool, error),
  keep root invoke_agent span + traceparent propagation + bridge wiring
- copilotcliSessionService.ts: install bridge after first session
  creation, wire bridge + SDK trace context updater to sessions
- IOTelService: add injectCompletedSpan to interface + all impls
- Remove outdated synthetic span tests
- Add OTel data flow architecture diagram (HTML)

* fix: update span processing to use parent span context and enhance subagent event identification

* feat: add display names for tool call and subagent events

* docs: merge arch and spec into single developer guide

Combine agent_monitoring_arch.md (foreground-only) and agent-otel-spec.md
(all agents) into a single comprehensive developer reference covering all
four agent paths, bridge architecture, and SDK internal access warnings.

* docs: fix stale addSpanProcessor reference in data flow diagram

* chore: move plan and test docs to offline archive

These documents are reference material for the OTel sprint, not needed
in the shipped PR. Archived to ~/Documents/copilot-otel-archive/.

* test: add bridge SpanProcessor unit tests

13 tests covering: traceId filtering, parentSpanContext conversion,
CHAT_SESSION_ID injection, attribute flattening, event conversion,
HrTime→ms conversion, unregister/shutdown behavior.
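
For reference, the HrTime→ms conversion mentioned above amounts to the following. OTel represents HrTime as a `[seconds, nanoseconds]` pair; the helper name `hrTimeToMs` is hypothetical and this sketch is not the converter used in the bridge.

```typescript
// OTel HrTime: [whole seconds, additional nanoseconds].
type HrTime = [number, number];

// Converts an HrTime pair to milliseconds (may be fractional).
function hrTimeToMs(time: HrTime): number {
  const [seconds, nanos] = time;
  return seconds * 1_000 + nanos / 1_000_000;
}

// Duration of a span in milliseconds, from its start and end HrTime.
function durationMs(start: HrTime, end: HrTime): number {
  return hrTimeToMs(end) - hrTimeToMs(start);
}
```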

* test: add span event identification and naming tests

7 tests covering invoke_agent identification logic: top-level skip,
SDK wrapper skip (no agent name), subagent detection (name attribute
and span name parsing), unknown/missing operation name handling.

* fix: always enable SDK OTel for debug panel regardless of user config

The CLI SDK's OtelLifecycle must always initialize so the bridge
processor can forward native spans to the debug panel. When user
OTel is disabled, COPILOT_OTEL_ENABLED is still set but no OTLP
endpoint is configured — the SDK creates spans (for debug panel)
but doesn't export to any external collector.

The bridge installation is also now unconditional — it installs
even when user OTel is disabled.

* chore: remove transient sprint plan

* fix: suppress SDK OTLP export when user OTel is disabled

When user OTel is disabled, force the SDK to use a file exporter that
writes to /dev/null instead of letting it default to OTLP. Also clear any
OTEL_EXPORTER_OTLP_ENDPOINT left over from previous sessions to
prevent orphaned traces in Grafana.
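
The decision described above can be pictured as a small planning function. This is a sketch only: `UserOTelConfig`, `SdkExportPlan`, and `planSdkExport` are hypothetical names, the plan is an abstract object rather than the real environment variables the SDK reads, and falling back to the file sink when the user enabled OTel without an endpoint is an assumption.

```typescript
// Hypothetical view of the user's OTel settings.
interface UserOTelConfig {
  enabled: boolean;
  otlpHttpEndpoint?: string;
}

// Hypothetical plan: the SDK is always enabled (the debug panel needs spans),
// only the export target changes.
type SdkExportPlan =
  | { sdkEnabled: true; exporter: 'otlp-http'; endpoint: string }
  | { sdkEnabled: true; exporter: 'file'; path: string };

function planSdkExport(user: UserOTelConfig): SdkExportPlan {
  if (user.enabled && user.otlpHttpEndpoint) {
    // User opted in: export to their HTTP-compatible OTLP endpoint.
    return { sdkEnabled: true, exporter: 'otlp-http', endpoint: user.otlpHttpEndpoint };
  }
  // User OTel disabled (or, by assumption, no endpoint): spans are still
  // created for the debug panel, but the exporter writes to /dev/null.
  return { sdkEnabled: true, exporter: 'file', path: '/dev/null' };
}
```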

* docs: add background agents section to user monitoring guide

Cover Copilot CLI (background + terminal) and Claude Code agent
tracing in the user-facing guide. Includes span hierarchy examples,
service.name filtering table, and CLI HTTP-only limitation note.

* docs: remove Claude Code from user guide (not yet supported)

* fixup! feat: OTel instrumentation for Copilot CLI background agent

* fix: address PR review comments

- Use GenAiOperationName constants in EXPORTABLE_OPERATION_NAMES (avoids drift)
- Remove unnecessary delete of OTEL_EXPORTER_OTLP_ENDPOINT from process.env
- Replace 'as any' OTel mocks with typed NoopOTelService in terminal tests
- Clarify comment on empty env arg for terminal OTel env derivation
- Add ExportResultCode.SUCCESS comment for clarity

* fixup! fix: always enable SDK OTel for debug panel regardless of user config

* fix: handle SDK native hook spans in debug panel

The SDK's OtelSessionTracker creates 'hook {type}' spans with
github.copilot.hook.type attributes (not gen_ai.operation.name).
These were silently dropped by completedSpanToDebugEvent. Now
detected by span name prefix and converted to Hook: {type} events.
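
A rough sketch of the name-prefix detection described above, assuming the span name has the form 'hook {type}'. The function name `hookEventName` is hypothetical, and this is not the actual completedSpanToDebugEvent logic.

```typescript
const HOOK_SPAN_PREFIX = 'hook ';

// Returns the debug-panel event name for an SDK hook span,
// or undefined when the span name does not look like 'hook {type}'.
function hookEventName(spanName: string): string | undefined {
  if (!spanName.startsWith(HOOK_SPAN_PREFIX)) {
    return undefined;
  }
  const hookType = spanName.slice(HOOK_SPAN_PREFIX.length).trim();
  // An empty type after the prefix is treated as "not a hook span".
  return hookType.length > 0 ? `Hook: ${hookType}` : undefined;
}
```

Matching on the span name rather than on gen_ai.operation.name is what lets these spans through, since the SDK does not set that attribute on them.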

* docs: add execute_hook spans for Claude hook executions to monitoring documentation

* docs: add hook spans to CLI trace hierarchy in user guide
2026-03-20 22:53:20 +00:00
