It's Always The Network (Until You Prove It Isn't)
- robertbmacdonald
- Oct 28
- 5 min read
"The network is slow."
I've heard this at 3am, in conference rooms, and in Slack channels more times than I can count. And about 80% of the time, after I dig into it, the network is fine. The problem is somewhere else entirely.
But here's the thing—you can't just say 'it's not the network' and walk away. You need data to either find the problem or clear the network so troubleshooting can move forward.
Over the years, I've developed a systematic approach based on Brendan Gregg's USE Method. It was originally designed for system performance troubleshooting, but it works perfectly for networks.
This framework helps me isolate issues quickly, usually in 20-30 minutes, and either confirm the network is at fault or narrow the investigation to the actual source.
Step 1: Define the Problem (Actually Define It)
"Slow" doesn't mean anything to me.
We need to know: slow for who? Slow since when? Slow compared to what baseline?
Bad problem statement: "The network is slow"
Good problem statement: "File transfers from server XYZ in datacenter A to server ZYX in datacenter B dropped from 800Mbps to 80Mbps starting yesterday at 2pm"
Another good one: "API response times between this app server and that database went from 5ms to 50ms this morning"
If you can't quantify it, you can't fix it.
Step 2: Identify Your Key Metrics
You need to match metrics to the problem type.
File transfers? You're looking at throughput in Mbps or Gbps, and actual transfer times.
Request/response latency issues? You want round-trip time and application response time.
Interactive applications feeling sluggish? Check jitter and packet loss.
Get specific numbers. "High latency" tells me nothing. "50ms when it should be 5ms" tells me exactly what we're dealing with.
Timestamps are critical. When did the problem start?
These metrics also become your success criteria: they're how you'll know whether the eventual fix actually worked.
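If you don't already have numbers, grab some before touching anything else. Here's a minimal sketch that samples TCP connect latency to a target service and reports min/avg/max and jitter. The hostname and port are placeholders, and connect time is only an approximation of network RTT (it includes the peer's accept cost), but it turns "slow" into concrete milliseconds:

```python
#!/usr/bin/env python3
"""Quantify "slow": sample TCP connect latency and report concrete numbers."""
import socket
import statistics
import time

TARGET = ("db01.example.net", 5432)   # placeholder: a reachable TCP service
SAMPLES = 20

def connect_ms(target, timeout=2.0):
    """Time a single TCP handshake in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection(target, timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

rtts = [connect_ms(TARGET) for _ in range(SAMPLES)]
# Jitter here is the mean difference between consecutive samples.
jitter = statistics.mean(abs(a - b) for a, b in zip(rtts, rtts[1:]))

print(f"min/avg/max = {min(rtts):.1f}/{statistics.mean(rtts):.1f}/{max(rtts):.1f} ms")
print(f"jitter (mean consecutive delta) = {jitter:.1f} ms")
```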
Step 3: Map the Network Path (Both Directions)
This is critical and often overlooked: network paths can be asymmetric.
The path your traffic takes on transmit might be completely different from the path it takes on receive. Different routing policies, different ISPs in some cases, different physical links.
I use traceroute, routing tables, MAC address tables, hash calculations for link bundles and load balancers, and flow data to map every single hop. Every switch, router, firewall, load balancer, WAN link. All of it gets documented.
You can't troubleshoot what you can't see.
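As a rough sketch of what "both directions" means in practice: run a trace from each end, not just yours. This assumes traceroute is installed on both hosts and you have SSH access to the far end; the hostnames and addresses are placeholders:

```python
#!/usr/bin/env python3
"""Capture the forward and reverse path between two hosts (sketch)."""
import subprocess

LOCAL_VIEW_OF_REMOTE = "10.20.30.40"   # placeholder: the remote server's address
REMOTE_HOST = "appserver-b"            # placeholder: SSH-reachable remote host
LOCAL_ADDR = "10.10.10.10"             # placeholder: this server's address

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, timeout=120).stdout

# Forward path: from here toward the remote server.
forward = run(["traceroute", "-n", LOCAL_VIEW_OF_REMOTE])

# Reverse path: log in to the far end and trace back toward us.
reverse = run(["ssh", REMOTE_HOST, "traceroute", "-n", LOCAL_ADDR])

print("=== forward path ===\n" + forward)
print("=== reverse path ===\n" + reverse)
```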
Step 4: Apply the USE Method
This is where the USE Method comes in. For every network element in that path, I check three things: Utilization, Saturation, and Errors.
Utilization (The Hardest to Get Right)
Most monitoring systems poll SNMP counters every 5 minutes and then show you an average. This is almost useless for troubleshooting.
Why? Because averages hide micro-bursts. You could have 99% utilization for 10 seconds that causes massive problems, but if it averages out to 40% over 5 minutes, your monitoring won't show it.
Use streaming telemetry or increase polling frequency where possible. Ideally, leverage advanced features like Cisco’s Nexus Micro-Burst Monitoring or Arista’s LANZ feature.
What I'm looking for: sustained utilization above 70-80% for extended periods OR significant changes in sustained utilization compared to baseline.
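If you can't get streaming telemetry, even a quick ad-hoc poll at one-second granularity beats a 5-minute average. Here's a sketch using the net-snmp CLI; the device name, community string, and ifIndex are placeholders for whatever interface sits in your path:

```python
#!/usr/bin/env python3
"""Poll an interface's octet counter every second to catch micro-bursts (sketch)."""
import subprocess
import time

DEVICE = "core-sw1.example.net"   # placeholder
COMMUNITY = "public"              # placeholder
IFINDEX = 10                      # placeholder: ifIndex of the interface
INTERVAL = 1.0                    # seconds; far finer than 5-minute polling

OID_IN_OCTETS = f"1.3.6.1.2.1.31.1.1.1.6.{IFINDEX}"    # IF-MIB ifHCInOctets
OID_HIGH_SPEED = f"1.3.6.1.2.1.31.1.1.1.15.{IFINDEX}"  # IF-MIB ifHighSpeed (Mbps)

def snmp_get(oid):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", DEVICE, oid],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())

speed_mbps = snmp_get(OID_HIGH_SPEED)
prev = snmp_get(OID_IN_OCTETS)
while True:  # Ctrl-C to stop
    time.sleep(INTERVAL)
    cur = snmp_get(OID_IN_OCTETS)
    mbps = (cur - prev) * 8 / INTERVAL / 1_000_000
    pct = 100 * mbps / speed_mbps if speed_mbps else 0.0
    print(f"in: {mbps:8.1f} Mbps  ({pct:5.1f}% of {speed_mbps} Mbps)")
    prev = cur
```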
Saturation (This Is Usually Where I Find Problems)
Saturation is not the same as utilization hitting 100%. Saturation is when traffic is arriving faster than it can be forwarded.
The number one indicator of saturation is packet discards. Not errors—discards. Specifically input discards and output discards (e.g. ifInDiscards and ifOutDiscards). I check discard counters on every hop in the path.
When you see non-zero discard counters incrementing, that's your smoking gun. Queues are overflowing. Traffic is being dropped because the device can't keep up.
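Here's the kind of quick watcher I mean, as a sketch: it polls ifInDiscards/ifOutDiscards on each hop from Step 3 and flags any counter that moves. The device names, community string, and ifIndex values are placeholders:

```python
#!/usr/bin/env python3
"""Watch discard counters on every hop and flag any that increment (sketch)."""
import subprocess
import time

COMMUNITY = "public"  # placeholder
# (device, ifIndex) for each hop in the mapped path -- placeholder values.
HOPS = [("edge-sw1", 5), ("core-sw1", 12), ("wan-rtr1", 3)]

OID_IN_DISCARDS = "1.3.6.1.2.1.2.2.1.13"   # IF-MIB ifInDiscards
OID_OUT_DISCARDS = "1.3.6.1.2.1.2.2.1.19"  # IF-MIB ifOutDiscards

def snmp_get(device, oid):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", device, oid],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())

def snapshot():
    return {
        (dev, idx): (
            snmp_get(dev, f"{OID_IN_DISCARDS}.{idx}"),
            snmp_get(dev, f"{OID_OUT_DISCARDS}.{idx}"),
        )
        for dev, idx in HOPS
    }

baseline = snapshot()
while True:  # Ctrl-C to stop
    time.sleep(10)
    current = snapshot()
    for key, (in_d, out_d) in current.items():
        d_in = in_d - baseline[key][0]
        d_out = out_d - baseline[key][1]
        if d_in or d_out:
            print(f"{key}: +{d_in} input discards, +{d_out} output discards -- saturation")
    baseline = current
```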
Errors (Physical Layer Problems)
Errors are different from discards. Errors usually indicate hardware or physical layer issues.
I look for CRC errors, which mean bad packets with failed checksums. I check input errors and output errors separately. Frame errors, runts, giants—all of these point to layer 1 or layer 2 problems.
Common causes: bad cables, dirty fiber connectors, wrong SFP or transceiver, speed/duplex mismatches (rare these days but still happens), or outright hardware failures.
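The device side can be checked exactly like the discards above (IF-MIB ifInErrors is 1.3.6.1.2.1.2.2.1.14, ifOutErrors is .20). On the server end, Linux exposes per-NIC error counters directly. A small sketch; the interface name is a placeholder, and not every driver exposes every counter:

```python
#!/usr/bin/env python3
"""Check a Linux host NIC for physical-layer error counters (sketch)."""
from pathlib import Path

IFACE = "eth0"  # placeholder: the server-facing interface
COUNTERS = ["rx_errors", "tx_errors", "rx_crc_errors", "rx_frame_errors"]

stats_dir = Path("/sys/class/net") / IFACE / "statistics"
for name in COUNTERS:
    path = stats_dir / name
    if path.exists():  # some drivers omit some counters
        print(f"{name:16s} {int(path.read_text())}")
```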
Step 5: Pause and Review
If I found utilization, saturation, or error issues anywhere in the network path, then the network is implicated, and I prioritize remediation based on what turned up.
Prioritize saturation if you see lots of discards. Prioritize errors if you see CRC failures and physical layer issues. Then address sustained high utilization for capacity planning.
I document everything with specific counters and timestamps. Then we proceed with remediation—QoS policies, link upgrades, cable replacement, whatever is needed.
But if the network path is clean? Zero significant discards, no errors, reasonable utilization? Then the network is exonerated, and I need to look elsewhere.
Step 6: When the Network Is Clean
This is where most "network" problems actually live.
If the USE method shows clean network metrics, I start looking at the host network stack—TCP window sizes, buffer configurations, socket drops, connection pooling. Then application behavior. Then server resources like CPU, memory, or disk I/O. Sometimes it's external dependencies like DNS or authentication services.
These deserve their own deep-dive posts, which I'll write in the future. But the point is: once you've cleared the network with data, you can confidently redirect the troubleshooting effort.
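As a starting point for the host-side look, here's a small Linux-only sketch that dumps the TCP-stack counters most likely to implicate the host itself: retransmits, buffer pruning, and listen-queue drops. It pairs the header and value lines in /proc/net/snmp and /proc/net/netstat rather than assuming specific field names, since those vary by kernel:

```python
#!/usr/bin/env python3
"""Quick look at the host TCP stack once the network path is clean (sketch)."""
from pathlib import Path

KEYWORDS = ("Retrans", "Prune", "Drop", "Overflow")

def read_counters(path):
    counters = {}
    lines = Path(path).read_text().splitlines()
    # Lines come in pairs: "Proto: name1 name2 ..." then "Proto: val1 val2 ..."
    for header, values in zip(lines[::2], lines[1::2]):
        proto = header.split(":")[0]
        names = header.split()[1:]
        vals = values.split()[1:]
        for name, val in zip(names, vals):
            counters[f"{proto}.{name}"] = int(val)
    return counters

stats = {**read_counters("/proc/net/snmp"), **read_counters("/proc/net/netstat")}
for name, value in sorted(stats.items()):
    if value and any(k in name for k in KEYWORDS):
        print(f"{name:32s} {value}")
```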
A Real Example
I was brought in to help a team of software developers. Their application had begun running extremely slowly. API calls that normally took milliseconds were taking seconds. Everyone assumed it was the network.
I mapped the path: app server to edge switch to core switch to WAN router to WAN router to core switch to edge switch to app server.
USE method results:
Utilization: 30-40% across all links
Saturation: zero discards anywhere in the path
Errors: zero CRC errors, clean interfaces
Twenty minutes into the call, I had cleared the network completely. The actual problem? The application wasn't draining socket buffers quickly enough. Once they knew where to look, they immediately began working on application code improvements.
That's the value of the systematic approach. No politics, no guessing, just data.
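For what it's worth, that particular failure mode is easy to spot from the server once you know to look: unread data piles up in the kernel's receive buffers. A sketch (Linux, parsing `ss -tn`; the threshold is arbitrary, and this isn't the exact command set from that engagement):

```python
#!/usr/bin/env python3
"""Flag sockets the application isn't draining (growing Recv-Q) -- sketch."""
import subprocess

THRESHOLD = 65536  # bytes sitting unread in the kernel receive buffer

out = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
for line in out.splitlines()[1:]:           # skip the header row
    fields = line.split()
    if len(fields) < 5:
        continue
    state, recv_q, send_q, local, peer = fields[:5]
    if int(recv_q) > THRESHOLD:
        print(f"{local} <- {peer}: {recv_q} bytes unread (Recv-Q) -- app not keeping up")
```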
Why This Works
Anyone can pull up a dashboard and point at a graph. The value is in having a repeatable methodology that either finds the problem or eliminates possibilities with confidence.
It's systematic, so it's faster. It's data-driven, so there's no politics. And it's reproducible, which builds credibility over time.
Sometimes the network really is the problem. But more often, it's not. Either way, you'll know quickly, and you'll have the data to back it up.
Kyberis Networks helps organizations troubleshoot complex network and infrastructure issues. If you’re dealing with persistent performance problems, feel free to reach out.