Stop Collecting Everything: A Better Approach to Network Telemetry

  • robertbmacdonald
  • Nov 13, 2025
  • 3 min read

After 15+ years in network engineering, I've learned that effective telemetry starts with one question: what matters to the business?


The telemetry paradox is real. You think collecting everything gives you visibility, and that you might need it "someday", but what it really creates is noise. And noise is expensive: in storage costs, in analyst time, and in your ability to spot real problems.


Start With Business Outcomes, Not Technical Metrics

Before you enable a single collector, ask what the business cares about.


For an e-commerce platform, it's transaction completion rates and page load times. For a financial trading firm, it's market data integrity and order-to-execution latency. For a SaaS provider, it's API availability and response times.


These aren't network metrics. They're business metrics.


Your job is to work backward from these outcomes to the technical telemetry that actually informs them. If a business metric degrades, what network data would help you diagnose why?


Yes, you need interface counters, CPU utilization, and bandwidth graphs. But you need to understand WHY you're collecting them first. Each metric should map to a business outcome you're trying to protect or improve.
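As a sketch of that backward mapping, here is a toy outcome-to-metrics table in Python. The outcome and metric names are invented for illustration, not taken from any particular monitoring stack:

```python
# Illustrative map from business outcomes to the network telemetry that
# informs them. Every name below is hypothetical.
OUTCOME_TELEMETRY_MAP = {
    "checkout_completion_rate": [
        "lb_frontend.if_out_errors",        # errors on the load-balancer uplink
        "dc_core.queue_drops",              # congestion on the core switches
    ],
    "api_p99_latency_ms": [
        "edge_router.if_utilization_pct",
        "firewall.session_table_saturation",
    ],
}

def telemetry_for_outcome(outcome: str) -> list[str]:
    """Return the network metrics worth checking when a business metric degrades."""
    return OUTCOME_TELEMETRY_MAP.get(outcome, [])
```

The useful property is the direction of the lookup: it starts from the degraded business metric, not from whatever the devices happen to export.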


Apply the USE Method to Telemetry Design

Once you've identified what business outcomes matter, use Brendan Gregg's USE Method to determine what to collect: Utilization, Saturation, and Errors.


For every network resource—interfaces, links, buffers, queues—you want metrics that answer these three questions.


Utilization: How busy is this resource? (Interface throughput, bandwidth percentage)

Saturation: Is this resource overloaded? (Queue depth, discards, buffer drops)

Errors: Is this resource malfunctioning? (CRC errors, frame errors, link flaps)


If a metric doesn't map to U, S, or E for a resource that impacts your business outcomes, you probably don't need it.
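One way to keep yourself honest is to inventory each resource's USE coverage explicitly, so gaps are visible. A minimal sketch; the interface counters are standard IF-MIB/EtherLike-MIB object names, but the resource naming and the class itself are my own illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceUSE:
    """USE-method metric inventory for a single network resource."""
    resource: str
    utilization: list = field(default_factory=list)
    saturation: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    def gaps(self) -> list:
        """USE categories with no metric assigned yet for this resource."""
        return [cat for cat in ("utilization", "saturation", "errors")
                if not getattr(self, cat)]

uplink = ResourceUSE(
    resource="core-uplink-1",
    utilization=["ifHCInOctets", "ifHCOutOctets"],   # IF-MIB byte counters
    errors=["ifInErrors", "dot3StatsFCSErrors"],     # IF-MIB / EtherLike-MIB
)
```

Here `uplink.gaps()` reports that no saturation metric is assigned yet, which is exactly the prompt to go find queue-depth or discard counters for that link.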


This framework dramatically reduces telemetry sprawl. You're collecting data because it directly answers diagnostic questions about resources that matter, not because you might need it someday.


Data Retention: Granularity Over Time

Not all metrics age the same way. You need a retention strategy that matches how you actually use the data.


For capacity planning, averages are NOT fine. Peaks matter. Bursts matter. How much they matter depends on the business, but in networking we must be aware of, and plan for, the peak. For a financial trading firm, peaks arrive when markets are most active and volatile, which is exactly when the opportunity is greatest. For retail businesses, capturing peaks is vital during busy periods like Black Friday and the Christmas season.


Error counters and saturation metrics need high granularity—1-second or 5-second intervals—but only for recent history. I keep them at full resolution for 48 hours, then downsample to 1-minute averages for 30 days, then 5-minute averages for a year. When I'm troubleshooting an active issue, I need to see exactly when errors spiked.


Utilization metrics can be smoothed more aggressively. 5-minute averages (and peaks!) are sufficient for most trending and forecasting work. Keep full-resolution data for a week at most.
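The tiered policy for error and saturation counters is easy to encode. A sketch using the example numbers above (adjust the tiers to your own retention decisions):

```python
from datetime import timedelta

# Retention tiers as (stored sample interval, keep-for) pairs, mirroring the
# example policy: full resolution for 48 hours, 1-minute averages for 30 days,
# 5-minute averages for a year.
TIERS = [
    (timedelta(seconds=1), timedelta(hours=48)),
    (timedelta(minutes=1), timedelta(days=30)),
    (timedelta(minutes=5), timedelta(days=365)),
]

def resolution_at(age: timedelta):
    """Return the sample interval stored for data of a given age, or None if expired."""
    for interval, keep_for in TIERS:
        if age <= keep_for:
            return interval
    return None
```

Most time-series databases implement this natively (downsampling, rollups, or retention policies); the point of the sketch is that the policy itself should be a deliberate choice, not a default.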


Flow data can be expensive and voluminous. I typically aggregate NetFlow to summary statistics by protocol and endpoint for 90 days. Unless you're doing forensic analysis, you probably don't need every flow record from three months ago. Be sure to understand your flow data source; sFlow telemetry is already sampled data.
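Aggregating flows down to summary statistics can be as simple as grouping by protocol and endpoint. A sketch with made-up record fields; real NetFlow/IPFIX exports carry many more:

```python
from collections import defaultdict

def summarize_flows(flows):
    """Collapse raw flow records into per-(protocol, destination) summaries."""
    totals = defaultdict(lambda: {"bytes": 0, "flows": 0})
    for f in flows:
        key = (f["proto"], f["dst"])
        totals[key]["bytes"] += f["bytes"]
        totals[key]["flows"] += 1
    return dict(totals)
```

The summaries answer "who talks to whom, over what, and how much", which covers most capacity and baseline questions, at a tiny fraction of the storage of raw flow records.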


This tiered approach keeps storage costs manageable while ensuring the data you need for troubleshooting is still available at the resolution that matters.


The Real Cost of "Collect Everything"

Storage costs are obvious, but they're not the only expense.


The hidden cost is signal-to-noise ratio. When you collect everything, you spend time wading through irrelevant data during incidents, miss actual problems because they're buried in dashboards, and create alert fatigue when you threshold everything "just in case."

Less data, collected with intention, beats comprehensive data with no clear purpose.


What This Looks Like in Practice

Here's how I approach telemetry for a new environment:


Define 3-5 key business metrics. Map those to network resources that could impact them. For each resource, collect USE metrics—nothing else initially.


Deploy this minimal stack. Use it for 30 days. Document what questions you can't answer.


Only then add the next tier of collection. Flow data where you need visibility into application behavior. Packet captures for specific troubleshooting workflows, not continuous collection.


This iterative approach keeps costs manageable and ensures every metric you collect has a clear purpose. Set a calendar reminder to reassess quarterly: which metrics did you actually query in the past 90 days? If you didn't use a metric, stop collecting it.
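The quarterly audit can be as simple as diffing your collection list against query history. A sketch, assuming you can export a last-queried timestamp per metric from your monitoring platform (the metric names are made up):

```python
from datetime import date, timedelta

def stale_metrics(last_queried, today, window=timedelta(days=90)):
    """Metrics with no query inside the review window: candidates to stop collecting."""
    cutoff = today - window
    return sorted(m for m, last in last_queried.items()
                  if last is None or last < cutoff)

audit_log = {
    "if_util_pct": date(2025, 11, 1),     # queried recently: keep
    "legacy_atm_cells": None,             # never queried: drop
    "old_buffer_stat": date(2025, 6, 1),  # last queried months ago: drop
}
```

Running this every quarter turns "stop collecting what you don't use" from a slogan into a checklist.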


The Bottom Line

Telemetry is not about comprehensiveness. It's about relevance.


Answer the business question first. Build the technical measurement stack second. And remember—you can always add more telemetry later, but you can't get back the time and money spent collecting data you'll never use.

 
 
 
