
The Hidden Problem Breaking Your Packet Analysis

  • robertbmacdonald
  • Dec 11, 2025
  • 7 min read

The Troubleshooting Scenario

You're deep into troubleshooting a performance issue. The application team reports intermittent slowdowns on a critical database connection. Response times spike from 2ms to 200ms randomly, then return to normal. Users are complaining, and you need answers.


You start with TCP analysis (referencing the techniques from my previous articles):


  • SACK blocks are present, indicating selective packet loss

  • Retransmission rate sits at 0.8% (well above the healthy 0.1% baseline)

  • Window scaling is configured correctly on both sides

  • No obvious congestion signals in the TCP window behavior

  • MSS and MTU settings look fine


The data points to packet loss somewhere in the path. Time to dig deeper.


You request packet captures from both endpoints:


  • Client-side capture (application server)

  • Server-side capture (database server)

  • Network path: application server → ToR switch → spine → ToR switch → database server

  • Baseline RTT measured at 0.4ms under normal conditions


The captures arrive. You open both in Wireshark, ready to correlate the two sides of the connection and pinpoint where packets are disappearing.


Then things get weird.


You start with the TCP handshake - the simplest interaction:


  • Client capture shows SYN sent at 10:15:23.450123

  • Server capture shows SYN received at 10:15:23.462891

  • The timestamps suggest 12.7ms of network delay for a 0.4ms path


That's already suspicious. But maybe there's congestion you haven't seen yet. You check the SYN-ACK response:


  • Server sends SYN-ACK at 10:15:23.463104 (213μs after receiving the SYN)

  • Client receives SYN-ACK at 10:15:23.450891 (768μs after it sent the original SYN)


Wait. Stop.


According to these timestamps, the client received the SYN-ACK at 10:15:23.450891. But the server didn't even receive the SYN until 10:15:23.462891 - 12 milliseconds later. The response arrived before the request was received. Time is flowing backwards.

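The handshake arithmetic above can be checked directly from the four capture timestamps. Here's a minimal Python sketch (timestamps hardcoded from the captures above; in practice you'd pull them from Wireshark or a library like pyshark):

```python
from datetime import datetime

# Handshake timestamps from the two captures (same day, HH:MM:SS.microseconds).
FMT = "%H:%M:%S.%f"

def ts(s: str) -> datetime:
    return datetime.strptime(s, FMT)

syn_sent     = ts("10:15:23.450123")  # client capture
syn_received = ts("10:15:23.462891")  # server capture
synack_sent  = ts("10:15:23.463104")  # server capture
synack_recvd = ts("10:15:23.450891")  # client capture

# Apparent one-way delays. Each one mixes two different clocks,
# so each is polluted by the full client/server skew.
fwd_us = (syn_received - syn_sent).total_seconds() * 1e6
rev_us = (synack_recvd - synack_sent).total_seconds() * 1e6

print(f"apparent client->server delay: {fwd_us:+.0f} us")  # +12768 us
print(f"apparent server->client delay: {rev_us:+.0f} us")  # -12213 us: "backwards in time"
```

One useful property falls out of this: adding the two apparent one-way delays cancels the skew entirely (+12,768μs − 12,213μs = 555μs), leaving the true network round trip excluding the server's 213μs of processing - in the neighborhood of the 0.4ms baseline. That cancellation is exactly the trick NTP itself relies on; what it can never recover is how the round trip splits between the two directions.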

You try to manually align the captures by offsetting one by 12ms. Some packets line up better. But then other parts of the flow break - packets appear to arrive before they're sent, responses precede requests. The offset isn't consistent throughout the capture.


You can't determine:


  • The actual network transit time for any given packet

  • Whether a retransmission at the client happened before or after the original packet arrived at the server

  • If the application's query processing started before or after the database received the request

  • The true sequence of events when both sides show simultaneous activity

  • Whether that 200ms delay is network transit, queuing, or application processing


Without accurate timestamps, you can't merge these captures into a single coherent timeline. You have two separate stories being told by clocks that disagree about what time it is.


Root cause analysis requires understanding causality - what happened first, what happened as a result. With broken time, causality is impossible.


You're flying blind.


The Revelation

The problem isn't your captures. It's not Wireshark. It's not even the network.


It's clock drift.


You check the time synchronization on both systems:


  • Application server: running chrony (NTP client); last sync: 45 seconds ago; current offset: +8.2ms (clock running fast)

  • Database server: running systemd-timesyncd (NTP client); last sync: 127 seconds ago; current offset: -4.5ms (clock running slow)

  • Total skew between the two systems: ~12.7ms

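The relative skew is the difference of the two reported offsets, not their sum or either one alone - worth making explicit, since the sign convention trips people up:

```python
# NTP-reported offsets (sign convention: + means the local clock is ahead).
app_server_offset_ms = +8.2   # chrony: clock running fast
db_server_offset_ms  = -4.5   # systemd-timesyncd: clock running slow

# Skew between the two capture clocks = difference of their offsets.
skew_ms = app_server_offset_ms - db_server_offset_ms
print(f"relative skew: {skew_ms:.1f} ms")  # 12.7 ms

# Compare against the quantity we're trying to measure.
baseline_rtt_ms = 0.4
print(f"skew is ~{skew_ms / baseline_rtt_ms:.0f}x the path RTT")
```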

Both systems think they're synchronized. By NTP standards, they are:


  • ±50ms is considered "acceptable" by most NTP implementations

  • ±10ms is considered "good" in production environments

  • Both servers are well within spec


But for packet-level analysis, you need microsecond precision. These clocks are off by 12,700 microseconds. You're trying to measure 400μs of network transit time with clocks that disagree by 12,700μs.


It's like trying to measure the thickness of a piece of paper with a yardstick.


Why did this happen?


  • NTP syncs periodically, not continuously (typically every 64-1024 seconds)

  • Clock drift occurs between sync intervals based on oscillator quality

  • Software timestamping happens in the kernel, affected by system load and scheduling delays

  • Network jitter impacts NTP's ability to accurately measure time offset

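The first two points combine into a simple error budget: drift in parts per million times seconds since the last sync. A quick sketch (the ppm figures are illustrative of typical uncorrected crystal oscillators, not measurements from any particular server):

```python
# Error accumulated between NTP polls for a given oscillator quality.
# 1 ppm of drift = 1 microsecond of error per second of wall time.
def drift_us(drift_ppm: float, seconds_since_sync: float) -> float:
    """Accumulated clock error in microseconds."""
    return drift_ppm * seconds_since_sync

for ppm in (1, 10, 50):
    for poll_s in (64, 1024):
        print(f"{ppm:>3} ppm over {poll_s:>4}s poll: {drift_us(ppm, poll_s):>8.0f} us")
```

Even a good 1 ppm oscillator accumulates 64μs of error over the shortest standard poll interval; a mediocre 50 ppm crystal left alone for a 1024-second poll drifts by over 50ms. Chrony's frequency discipline reduces this substantially, but the raw numbers show why the poll interval matters.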

The broader realization hits you: This isn't just about packet captures.

Every distributed system in your infrastructure suffers from this same problem:


  • Log aggregation across multiple servers

  • Distributed tracing timestamps

  • Security event correlation

  • Database replication monitoring

  • Causality determination in microservices


You've been troubleshooting in quicksand this whole time, building analysis on top of timestamps that are fundamentally unreliable.


Why This Matters Beyond Troubleshooting

The engineering principle: In distributed systems, time is not a free resource you can take for granted. It's infrastructure that must be explicitly designed, maintained, and monitored - just like your network, compute, or storage.


Here's a concrete example of why this matters in business terms.


In financial markets and trading systems (where I spent years of my career), time accuracy isn't just a troubleshooting inconvenience - it's a regulatory requirement with serious consequences:


MiFID II (Markets in Financial Instruments Directive) in Europe requires:


  • 100 microsecond maximum divergence from UTC for high-frequency algorithmic trading

  • 1 millisecond accuracy for other electronic trading activities

  • 1 second accuracy for voice trading

  • Traceable synchronization to UTC

  • Documented time sync procedures and monitoring

  • Penalties for non-compliance include fines and trading suspension


CAT (Consolidated Audit Trail) in US equity markets requires:


  • Synchronized timestamps across all market participants

  • Timestamps in millisecond or finer increments for all reportable events

  • Industry members already capturing finer granularity must report up to nanosecond precision

  • Traceability to NIST (National Institute of Standards and Technology)

  • Serious regulatory penalties for inaccurate reporting


Why regulators care: Market manipulation investigations, trade dispute resolution, and systemic risk analysis all depend on accurately reconstructing the sequence of events across multiple systems and venues. If your timestamps are wrong, you can't prove what happened when.


Without proper time synchronization, firms face:


  • Failed regulatory audits

  • Inability to definitively resolve trade disputes

  • Fines for inaccurate reporting

  • Potential trading restrictions


This isn't theoretical. Firms have been fined for timestamp accuracy violations. The infrastructure investment in precision time synchronization is mandatory, not optional.

If you're not actively managing time synchronization, you're building your observability, compliance, and troubleshooting capabilities on quicksand.


Why NTP Isn't Enough

NTP uses network round-trip time to estimate the time offset between client and server. The protocol assumes network delay is symmetric - the time from client to server equals the time from server to client (RFC 5905).


But in real networks, this assumption breaks constantly:


  • Asymmetric routing is common (forward path ≠ return path)

  • Variable queuing delays at every hop

  • Different link speeds on different paths

  • Bufferbloat and queue management algorithms

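NTP's estimator fits in a few lines, which makes the failure mode easy to demonstrate. The sketch below uses the standard four-timestamp offset formula from RFC 5905 with made-up delays to show how path asymmetry corrupts the estimate:

```python
# NTP clock-offset estimate (RFC 5905). Four timestamps per exchange:
#   t1: client sends request   (client clock)
#   t2: server receives it     (server clock)
#   t3: server sends response  (server clock)
#   t4: client receives it     (client clock)
# The formula implicitly assumes forward delay == return delay.
def ntp_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    return ((t2 - t1) + (t3 - t4)) / 2

# Illustrative exchange: the true client offset is 0 and the server replies
# instantly, but the path is asymmetric: 1 ms out, 9 ms back (10 ms RTT).
t1, t2, t3, t4 = 0.000, 0.001, 0.001, 0.010
est = ntp_offset(t1, t2, t3, t4)
print(f"estimated offset: {est * 1e3:+.1f} ms (true offset: 0.0 ms)")
# The error equals half the delay asymmetry: (1 ms - 9 ms) / 2 = -4 ms.
```

The error is invisible to the client: nothing in the exchange distinguishes an asymmetric path from a genuinely offset clock. That's a protocol-level limit, not an implementation bug.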

NTP's accuracy in production:


  • Best-case over the internet: ±1-5ms

  • Best-case on a local LAN: ±0.1-1ms

  • Typical production: ±1-50ms (sync intervals, jitter, asymmetric paths)

  • Under system load: worse - software timestamping in the kernel is subject to scheduling delays and CPU contention


The fundamental issue: NTP delivers millisecond-scale accuracy for systems that need microsecond-scale precision.


What Accuracy Do You Actually Need?

This is the critical question. The answer depends on what you're trying to measure or correlate.


For packet-level troubleshooting and network analysis:


  • Minimum: better than your network RTT

  • If RTT is 0.4ms, you need sub-400μs accuracy just to determine packet ordering

  • Target: 10x better than RTT (±40μs) for confident analysis

  • NTP at ±10ms is 250x worse than required

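The rule of thumb above reduces to one line of arithmetic - a small sketch (the 10x safety factor and ±10ms NTP figure are the working assumptions from this article, not universal constants):

```python
# Target clock accuracy for packet analysis: some factor better than the RTT
# you're trying to decompose.
def required_sync_accuracy_us(rtt_ms: float, safety_factor: int = 10) -> float:
    """Required clock accuracy in microseconds."""
    return rtt_ms * 1000 / safety_factor

rtt_ms = 0.4
target_us = required_sync_accuracy_us(rtt_ms)   # 40 us for a 0.4 ms path
ntp_accuracy_us = 10_000                         # +/-10 ms, typical NTP in production
print(f"target: +/-{target_us:.0f} us; NTP shortfall: {ntp_accuracy_us / target_us:.0f}x")
```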

For distributed systems observability:


  • Log correlation: ±1ms minimum baseline

  • APM/distributed tracing: ±100μs for meaningful waterfall charts

  • Database replication monitoring: ±100μs to separate actual lag from clock skew

  • Trading/financial systems: ±1μs or better for regulatory compliance


The reality: If you're doing performance engineering, capacity planning, or root cause analysis with NTP-synchronized clocks, your data has a ±1-50ms margin of error built into every timestamp. You're measuring microsecond-scale phenomena with millisecond-scale clocks.


You're building conclusions on garbage time data.


Check your current time sync. Run this on your Linux systems:


  • chronyc tracking (if using chrony)

  • ntpq -p (if using ntpd)

  • timedatectl timesync-status (for systemd-timesyncd)


Look at the "Offset" or "System time offset" value. That's your current accuracy. If it's measured in milliseconds, you have a problem.
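If you want to automate that check for fleets running chrony, a hedged sketch follows. It parses the "System time" line of `chronyc tracking` output; the field layout is assumed from recent chrony versions, and the sample output and hostname below are invented for illustration:

```python
import re

# Invented sample of `chronyc tracking` output (hostname is hypothetical).
SAMPLE = """\
Reference ID    : C0A80001 (ntp1.example.com)
Stratum         : 3
System time     : 0.008200000 seconds fast of NTP time
Last offset     : +0.000123456 seconds
"""

def system_offset_us(tracking_output: str) -> float:
    """Signed system-time offset in microseconds (+ means local clock is fast)."""
    m = re.search(r"System time\s*:\s*([\d.]+) seconds (fast|slow)", tracking_output)
    if m is None:
        raise ValueError("could not find 'System time' line in chronyc output")
    sign = 1 if m.group(2) == "fast" else -1
    return sign * float(m.group(1)) * 1e6

offset_us = system_offset_us(SAMPLE)
print(f"offset: {offset_us:+.0f} us")  # +8200 us on this sample
if abs(offset_us) > 100:               # 100 us budget: an illustrative threshold
    print("offset exceeds microsecond-scale budget; NTP alone won't cut it")
```

In practice you'd feed it live output via `subprocess.run(["chronyc", "tracking"], capture_output=True, text=True).stdout` and pick a threshold that matches your actual accuracy requirement.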


What's Next

So what's the solution? This is where Precision Time Protocol (PTP) enters the picture.


But simply switching from NTP to PTP isn't enough. There are critical decisions around:


  • Hardware timestamping vs software timestamping (and why it matters)

  • Network architecture options: boundary clocks vs transparent clocks vs end-to-end

  • PTP profiles and domains for different use cases

  • Client implementation, tuning, and validation

  • Cost and complexity tradeoffs at different accuracy requirements

  • Selecting your UTC reference source and establishing traceability


In the next article, we'll dig into PTP architectures and explore when software PTP with kernel timestamping is "good enough" versus when you need to invest in hardware-accelerated solutions with smart NICs and dedicated grandmaster clocks.


We'll also cover the monitoring and validation you need to ensure your time sync is actually working - because a misconfigured PTP deployment can be worse than NTP.


Until then: go check your clock offsets. You might be surprised what you find.


References

[1] Mills, D., et al. "Network Time Protocol Version 4: Protocol and Algorithms Specification." RFC 5905, Internet Engineering Task Force, June 2010. https://datatracker.ietf.org/doc/html/rfc5905


[2] European Securities and Markets Authority (ESMA). "Commission Delegated Regulation (EU) 2017/574 - RTS 25 on clock synchronisation." Official Journal of the European Union, June 2016. http://ec.europa.eu/finance/securities/docs/isd/mifid/rts/160607-rts-25_en.pdf


[3] U.S. Securities and Exchange Commission. "Rule 613 - Consolidated Audit Trail." https://www.sec.gov/about/divisions-offices/division-trading-markets/rule-613-consolidated-audit-trail


[4] U.S. Securities and Exchange Commission. "Order Granting Conditional Exemptive Relief, Pursuant to Section 36 of the Securities Exchange Act of 1934 and Rule 608(e) of Regulation NMS, Relating to Granularity of Timestamps Specified in Section 6.8(b) and Appendix D, Section 3 of the National Market System Plan Governing the Consolidated Audit Trail." Securities Exchange Act Release No. 88608, 85 FR 20743, April 14, 2020. https://www.federalregister.gov/documents/2020/04/14/2020-07789/order-granting-conditional-exemptive-relief-pursuant-to-section-36-of-the-securities-exchange-act-of


[5] FINRA. "Regulatory Notice 20-41: FINRA Amends Its Equity Trade Reporting Rules Relating to Timestamp Granularity." December 2020. https://www.finra.org/rules-guidance/notices/20-41


[6] Network Time Protocol Project. "Association Management." NTP.org Documentation. https://www.ntp.org/documentation/4.2.8-series/assoc/


[7] Network Time Protocol Project. "Poll Process." NTP.org Documentation. https://www.ntp.org/documentation/4.2.8-series/poll/


[8] Meinberg Radio Clocks. "Time Synchronization Accuracy with NTP." Knowledge Base. https://kb.meinbergglobal.com/kb/time_sync/time_synchronization_accuracy_with_ntp


[9] Meinberg Radio Clocks. "Time Synchronization Errors Caused by Network Asymmetries." Knowledge Base. https://kb.meinbergglobal.com/kb/time_sync/time_synchronization_errors_caused_by_network_asymmetries


[10] Lombardi, M.A., et al. "Practical Limitations of NTP Time Transfer." National Institute of Standards and Technology, 2016. https://tf.nist.gov/general/pdf/2776.pdf


[11] FSMLabs. "MiFID II - Ten Things You Need to Know About Clock Sync." April 2021. https://fsmlabs.com/mifid-ii-ten-things-you-need-to-know-about-clock-sync/

 
 
 
