08 May 2026

Telemetry guide for engineers

When your code runs in production, you cannot attach a debugger or read console output.
Telemetry is how you observe what your system is doing at runtime. Without it, failures are invisible until a customer complains.
Good telemetry means you find out about problems before users do, and when something goes wrong, you have the data to diagnose it quickly.

Foundation

The four pillars

There are four pillars of telemetry. Each answers a different question. Together they give you complete observability; no single one is sufficient on its own.

Logs, what happened?

A log is a timestamped record of something that happened: free-form text with structured context attached. Logs answer: what happened, and what was the state at that moment?

import logger from "awesomeLogger";

logger.info("Order received", { orderId, createdAt });
logger.warn("Order not in DB, creating a new order", { orderId });
logger.error("Storing data to DB failed", { error, orderId });

Use the right level for the right situation:

Level   Use when
info    Normal, expected operations worth recording (order received, claim approved)
warn    Something unexpected happened but the system recovered (fail-open, fallback triggered)
error   Something failed and action may be needed

Three rules to always follow. Attach structured context as the second argument rather than only interpolating IDs into the message string. Log at decision points and failure paths, not on every function call. Never log sensitive data such as PII, tokens, or card numbers.
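
The three rules above can be sketched as a small structured logger. This is an illustration under stated assumptions, not the API of any real logging library: `createLogger`, `SENSITIVE_KEYS`, and the in-memory sink are hypothetical names, and the redaction only looks at top-level keys.

```typescript
// Hypothetical sketch: a tiny structured logger, not a real library's API.
type LogLevel = "info" | "warn" | "error";
type LogContext = { [key: string]: unknown };

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  context: LogContext;
}

// Assumed list of context keys that must never reach the log output.
const SENSITIVE_KEYS = new Set(["password", "token", "cardNumber", "email"]);

// Replace sensitive top-level values with a placeholder before writing.
function redact(context: LogContext): LogContext {
  const safe: LogContext = {};
  for (const [key, value] of Object.entries(context)) {
    safe[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return safe;
}

// The sink is injectable so this example can collect entries in memory.
function createLogger(sink: (entry: LogEntry) => void) {
  const write = (level: LogLevel) => (message: string, context: LogContext = {}) =>
    sink({ timestamp: new Date().toISOString(), level, message, context: redact(context) });
  return { info: write("info"), warn: write("warn"), error: write("error") };
}

// Usage: the message stays constant; identifiers travel as structured context.
const logEntries: LogEntry[] = [];
const exampleLogger = createLogger((entry) => {
  logEntries.push(entry);
});
exampleLogger.info("Order received", { orderId: "ord-123", token: "secret" });
```

Keeping the message constant and putting identifiers in the context object means every "Order received" entry can be grouped by message and filtered by orderId.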

Metrics, how often, how fast, how many failures?

A metric is a numeric measurement recorded at a point in time, aggregated over time to produce charts and alerts. Metrics answer: how often? how fast? how many failures?

import { metrics } from "cool-metrics";

metrics.record(METRICS.RECORD.SOMETHING.UPDATE_SUCCESS_COUNT, 1);
metrics.record(METRICS.RECORD.SOMETHING.UPDATE_DURATION_MS, timer.getTime());

The standard pattern for any operation is to count the attempt, then branch on the result:

// runMyOp is a placeholder name for the function that wraps your operation
async function runMyOp() {
  // 1. Count the attempt
  metrics.record(METRICS.MY_AREA.MY_OP.COUNT_REQUESTS, 1);

  // Start timing just before the operation itself
  const timer = startTimer();
  const result = await doSomething();

  if (!result.success) {
    // 2a. Count the failure
    metrics.record(METRICS.MY_AREA.MY_OP.FAIL_COUNT, 1);
    return result;
  }

  // 2b. Count the success + record how long it took
  metrics.record(METRICS.MY_AREA.MY_OP.SUCCESS_COUNT, 1);
  metrics.record(METRICS.MY_AREA.MY_OP.DURATION_MS, timer.getTime());
  return result;
}

Add a metric for any operation that can succeed or fail, any operation whose frequency or latency matters for alerting, and any fallback or downgrade path, so you can alert if it fires too often. Define all metric paths centrally in constants.ts; never hardcode strings at the call site.
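
One way to avoid repeating the count-then-branch pattern at every call site is a small wrapper. The sketch below is an assumption-laden illustration: `withMetrics`, `METRIC_PATHS`, `RecordMetric`, and all paths except refund/fail-count are hypothetical, and the `record` function stands in for whatever your metrics client exposes. It also counts thrown errors as failures, which the pattern above does not cover.

```typescript
// Hypothetical sketch: names and paths are illustrative, not a real metrics API.
const METRIC_PATHS = {
  REFUND: {
    COUNT_REQUESTS: "refund/request-count",
    SUCCESS_COUNT: "refund/success-count",
    FAIL_COUNT: "refund/fail-count",
    DURATION_MS: "refund/duration-ms",
  },
} as const;

interface OperationMetrics {
  COUNT_REQUESTS: string;
  SUCCESS_COUNT: string;
  FAIL_COUNT: string;
  DURATION_MS: string;
}

type RecordMetric = (path: string, value: number) => void;

// Wraps an async operation so every call is counted, classified, and timed.
async function withMetrics<T extends { success: boolean }>(
  record: RecordMetric,
  paths: OperationMetrics,
  operation: () => Promise<T>,
  now: () => number = Date.now, // injectable clock, useful in tests
): Promise<T> {
  record(paths.COUNT_REQUESTS, 1); // 1. count the attempt
  const startedAt = now();
  try {
    const result = await operation();
    if (!result.success) {
      record(paths.FAIL_COUNT, 1); // 2a. count the failure
      return result;
    }
    record(paths.SUCCESS_COUNT, 1); // 2b. count the success and its duration
    record(paths.DURATION_MS, now() - startedAt);
    return result;
  } catch (error) {
    // A thrown error is also a failure: count it, then let the caller handle it.
    record(paths.FAIL_COUNT, 1);
    throw error;
  }
}
```

A call site would then pass its own recorder and operation, for example `withMetrics((path, value) => metrics.record(path, value), METRIC_PATHS.REFUND, () => doSomething())`.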

Events, who, what, and why?

An event is a structured record of a significant occurrence, with rich key-value attributes attached. Unlike logs, which are text for humans, events are designed for machine querying. Events answer: what exactly happened, with full context, so I can query and group it?

Emit an event when something significant happens at the business level (process blocked, inventory shortage, payment fallback triggered), when you need queryable and filterable context rather than just a count, or when you want to answer questions like "which stores triggered this most?" or "which claims were affected?"

Metrics vs events rule of thumb: use a metric when you need a number. Use an event when you need to know who, what, and why.

Events are stored and queryable, so they have a cost. Use them for meaningful business occurrences, not for routine happy-path operations that happen thousands of times a minute.
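
To make the difference from a metric concrete, here is a minimal sketch of emitting events and then asking "which stores triggered this most?". Everything in it is hypothetical: `emitEvent`, `countByAttribute`, the event names, and the in-memory array that stands in for a real event store and its query language.

```typescript
// Hypothetical sketch: an in-memory stand-in for an event store, not a vendor API.
type EventAttributes = { [key: string]: string | number | boolean };

interface TelemetryEvent {
  eventType: string;
  action: string;
  attributes: EventAttributes;
}

const emittedEvents: TelemetryEvent[] = [];

// Records one business-level occurrence together with its queryable attributes.
function emitEvent(eventType: string, action: string, attributes: EventAttributes): void {
  emittedEvents.push({ eventType, action, attributes });
}

// Counts matching events per value of one attribute, most frequent first.
function countByAttribute(action: string, attribute: string): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const item of emittedEvents) {
    if (item.action !== action || !(attribute in item.attributes)) continue;
    const key = String(item.attributes[attribute]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

emitEvent("inventoryEvent", "inventory-shortage", { storeId: "store-7", sku: "A1" });
emitEvent("inventoryEvent", "inventory-shortage", { storeId: "store-2", sku: "B4" });
emitEvent("inventoryEvent", "inventory-shortage", { storeId: "store-7", sku: "C9" });

// "Which stores triggered this most?"
const shortagesByStore = countByAttribute("inventory-shortage", "storeId");
```

A counter metric could tell you there were three shortages; only the attributes on the events let you see that two of them came from the same store.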

Traces, where did this request spend its time?

A trace is a record of a request's journey across multiple services or function calls, with timing for each step. Traces answer: where did this request spend its time? which step was slow?

If you use a tool like New Relic in production, its agent typically captures traces automatically, so you do not need to emit them manually. The slow operations you see in New Relic distributed tracing come from the agent instrumenting your HTTP calls and database queries.

You only need to think about traces if you are adding custom spans.
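
For intuition about what a custom span captures, the sketch below times a named unit of work and remembers which span it ran inside. It is a simplified, synchronous illustration: `withSpan`, `SpanRecord`, and the span names other than completedRefund are hypothetical. A real agent or tracing SDK provides its own span API, so check its documentation rather than copying this.

```typescript
// Hypothetical sketch of what a custom span records; real agents expose their own API.
interface SpanRecord {
  spanName: string;
  parentName: string | null;
  durationMs: number;
}

const finishedSpans: SpanRecord[] = [];
const activeSpanNames: string[] = []; // stack of spans that are currently open

// Runs `work` inside a named span and records its duration and parent span.
function withSpan<T>(spanName: string, work: () => T, now: () => number = Date.now): T {
  const parentName = activeSpanNames[activeSpanNames.length - 1] ?? null;
  activeSpanNames.push(spanName);
  const startedAt = now();
  try {
    return work();
  } finally {
    // Record the span even if `work` throws, so failed steps still show up.
    finishedSpans.push({ spanName, parentName, durationMs: now() - startedAt });
    activeSpanNames.pop();
  }
}

// Usage: two child steps nested inside one parent span.
withSpan("completedRefund", () => {
  withSpan("loadOrder", () => "order loaded");
  withSpan("callPaymentApi", () => "payment reversed");
});
```

The parent and child durations together are what let a trace view show which step inside a slow request took the time.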


In practice

How they work together

Imagine a refund process fails for a client. Here's what each pillar gives you:

Pillar   What you see
Metric   refund/fail-count spikes on the dashboard → alert fires
Log      logger.error('Refund order failed', { orderId, error }) → tells you which order and the raw error
Event    event.custom('refundEvent', 'refund-failed', { orderId, clientId, clientName }) → lets you query "which client had the most refund failures this week?"
Trace    Shows that completedRefund took 8s before timing out → points to an API latency issue

Each pillar answers a different question. You need all four for complete observability.
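
The refund scenario could look roughly like this in code. This is a sketch under assumptions: the `Telemetry` interface, `refundOrder`, `callRefundApi`, and the constants are hypothetical, and the three manual pillars are injected so the example stays self-contained. The trace is not emitted here because, as described earlier, an agent would normally capture it.

```typescript
// Hypothetical sketch: the Telemetry interface and every name here are assumptions.
type TelemetryAttributes = { [key: string]: unknown };

interface Telemetry {
  recordMetric(path: string, value: number): void;
  logError(message: string, context: TelemetryAttributes): void;
  emitEvent(eventType: string, action: string, attributes: TelemetryAttributes): void;
}

interface RefundRequest {
  orderId: string;
  clientId: string;
  clientName: string;
}

// Metric paths and event names live in one place, as they would in constants.ts.
const REFUND_METRICS = {
  SUCCESS_COUNT: "refund/success-count",
  FAIL_COUNT: "refund/fail-count",
} as const;
const REFUND_EVENT = { TYPE: "refundEvent", FAILED: "refund-failed" } as const;

// Attempts a refund and reports a failure through metric, log, and event.
async function refundOrder(
  request: RefundRequest,
  callRefundApi: (orderId: string) => Promise<void>,
  telemetry: Telemetry,
): Promise<boolean> {
  const { orderId, clientId, clientName } = request;
  try {
    await callRefundApi(orderId);
    telemetry.recordMetric(REFUND_METRICS.SUCCESS_COUNT, 1);
    return true;
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    // Metric: drives the dashboard and the alert.
    telemetry.recordMetric(REFUND_METRICS.FAIL_COUNT, 1);
    // Log: tells you which order failed and the raw error.
    telemetry.logError("Refund order failed", { orderId, error: reason });
    // Event: lets you group failures by client later.
    telemetry.emitEvent(REFUND_EVENT.TYPE, REFUND_EVENT.FAILED, { orderId, clientId, clientName });
    return false;
  }
}
```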


Before you ship

Checklist before marking a PR as ready

For any significant operation you add or modify, ask yourself:

- Does it log at the right level (info / warn / error) with structured context?
- Does it record a request count metric?
- Does it record success and failure counts separately?
- Does it record a duration metric?
- If it represents a significant business event, does it emit a custom event with enough attributes to query?
- Are all new metric paths and event names defined in constants.ts, not hardcoded inline?

Pitfalls

Common mistakes to avoid

Logging without structured context

A log with no context is nearly useless during an incident. Always attach the relevant identifiers as a second argument.

// Bad
logger.error("Something failed");

// Good
logger.error("Completing something failed", { somethingId, someoneId, error });

Recording success but not failure

If you only count successes, you can never alert on failures. Always record both branches.

// Bad: you can't alert on failures if you never count them
if (result.success) {
  metrics.record(METRICS.MY_OP.SUCCESS_COUNT, 1);
}

// Good
metrics.record(
  result.success ? METRICS.MY_OP.SUCCESS_COUNT : METRICS.MY_OP.FAIL_COUNT,
  1
);

Hardcoding metric strings

Hardcoded strings can't be refactored safely, are hard to track down reliably across the codebase, and are easy to mistype. Always define paths in constants.ts first.

// Bad
metrics.record("project/client/operations/something/count", 1);

// Good: add to constants.ts first, then reference it
metrics.record(METRICS.PROJECT.MY_CLIENT_COUNT, 1);
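
If the codebase is TypeScript, the constants file can also make the compiler reject unknown paths. The sketch below is one possible shape, not a prescribed layout: `LeafValues`, `MetricPath`, `recordMetric`, and the REFUND entry are hypothetical.

```typescript
// Hypothetical constants.ts sketch: the structure and helper names are assumptions.
const METRICS = {
  PROJECT: {
    MY_CLIENT_COUNT: "project/client/operations/something/count",
  },
  REFUND: {
    FAIL_COUNT: "refund/fail-count",
  },
} as const;

// Union of every string literal found in METRICS, derived automatically.
type LeafValues<T> = T extends string ? T : { [K in keyof T]: LeafValues<T[K]> }[keyof T];
type MetricPath = LeafValues<typeof METRICS>;

const recordedMetricCalls: Array<{ path: MetricPath; value: number }> = [];

// Accepts only paths that exist in METRICS; an unknown or mistyped path fails to compile.
function recordMetric(path: MetricPath, value: number): void {
  recordedMetricCalls.push({ path, value });
}

recordMetric(METRICS.PROJECT.MY_CLIENT_COUNT, 1);
// recordMetric("project/client/operations/somthing/count", 1); // compile-time error: typo in path
```

Because `MetricPath` is derived from `METRICS`, adding a new entry to the constants object is the only step needed to make a new path usable.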

Closing

Final thoughts

Telemetry is not something you bolt on after the fact. It is part of the implementation, as much as error handling or input validation.

The four pillars are not interchangeable. Logs tell you what happened. Metrics tell you how often and how fast. Events let you query who and why. Traces show you where time was spent. Each one fills a gap the others leave open.

The checklist above is the fastest way to build the habit. Before every PR, run through it. Over time it becomes automatic, and your on-call shifts get a lot quieter.