The Millisecond Watchdog: Monitoring Rules for Low-Latency Trading

In standard web architecture, a 500ms latency spike is an annoyance. In low-latency trading, it is a bankruptcy risk.

When you are competing in microseconds, averages are lies. If your average latency is 10µs (microseconds) but your 99.9th percentile is 5ms, your strategy is already dead. You just don’t know it yet because your dashboard is smoothing out the “micro-bursts” that actually kill you.
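To make the point concrete, here is a minimal sketch comparing the mean against tail percentiles. The sample data and the helper name `percentile_nearest_rank` are made up for illustration; a production system would more likely feed a histogram than sort raw samples.

```python
from statistics import fmean


def percentile_nearest_rank(ordered, per_mille):
    """Nearest-rank percentile of an ascending list.

    per_mille is the percentile in tenths of a percent (990 = p99, 999 = p99.9).
    Integer arithmetic avoids floating-point rounding in the rank.
    """
    rank = (per_mille * len(ordered) + 999) // 1000  # ceil(per_mille * n / 1000)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 9,989 ticks handled in 5 µs, 11 ticks stalled for 5 ms.
samples_us = [5.0] * 9_989 + [5_000.0] * 11
ordered = sorted(samples_us)

mean_us = fmean(ordered)
p50_us = percentile_nearest_rank(ordered, 500)
p99_us = percentile_nearest_rank(ordered, 990)
p999_us = percentile_nearest_rank(ordered, 999)
max_us = ordered[-1]

print(f"mean={mean_us:.2f}us p50={p50_us}us p99={p99_us}us "
      f"p99.9={p999_us}us max={max_us}us")
```

On this made-up data the mean stays near 10µs while the p99.9 and the max sit at 5ms. A dashboard that plots only the mean would show nothing wrong.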

Most observability tools are built for web servers, not ticker plants. They poll too slowly. If you poll every 10 seconds, you miss the crash that happened at second 3 and recovered at second 4.

Below is the monitoring structure I mandate for any high-frequency system. We break it down into three domains: The Eyes (Market Data), The Hands (Execution), and The Brain (Post-Trade).

The Eyes: Market Data Feeds

If your system sees the price of Apple (AAPL) as $150.00, but the exchange sees it as $150.05, you are about to sell an asset for less than it is worth. This is “stale data,” and it is the silent killer of algorithms.

The Rules:

  • Feed Freshness (The “Speed of Light” Check)
    • The Problem: Your feed handler processes data slower than the exchange sends it.
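    • The Check: Compare the timestamp stamped by the Exchange (packet generation time) vs. the timestamp when your server received it. This only works if your clock is synchronized to the exchange’s (PTP or GPS); the skew also includes wire transit time.
    • The Logic (Python):
          # Alert if we are lagging behind the exchange clock.
          # Both timestamps are in microseconds.
          WARN_SKEW_US = 50
          CRITICAL_SKEW_US = 200
          latency_skew = local_receipt_time - exchange_packet_timestamp
          # Check the critical threshold first; otherwise the 50µs branch
          # would swallow every value above 200µs and the halt would never fire.
          if latency_skew > CRITICAL_SKEW_US:
              trigger_circuit_breaker("CRITICAL: Stale Prices - HALT TRADING")
          elif latency_skew > WARN_SKEW_US:
              trigger_alert("WARNING: Ticker Plant Lagging")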
  • Sequence Gap Detection
    • The Problem: UDP packets get dropped. A missed packet might contain the trade that cleared the book level you are trying to hit.
    • The Logic: IF (Current_Seq_Num != Last_Seq_Num + 1) -> TRIGGER_RECOVERY
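The gap rule above can be sketched as a small stateful detector. The class name and return convention are hypothetical. Unlike the one-line rule, this sketch assumes that a sequence number lower than expected is a duplicate or late packet (for example from a redundant feed line) and ignores it rather than triggering recovery.

```python
class SequenceGapDetector:
    """Tracks the last sequence number seen on one feed channel."""

    def __init__(self):
        self.last_seq = None  # no packet seen yet

    def on_packet(self, seq_num):
        """Return (first_missing, last_missing) if a gap opened, else None."""
        if self.last_seq is None:
            # First packet on this channel: nothing to compare against.
            self.last_seq = seq_num
            return None

        expected = self.last_seq + 1
        if seq_num == expected:
            self.last_seq = seq_num
            return None

        if seq_num > expected:
            # One or more packets were lost; the caller should start recovery
            # (retransmission request or snapshot) for this range.
            gap = (expected, seq_num - 1)
            self.last_seq = seq_num
            return gap

        # seq_num < expected: assumed duplicate or late packet, ignored here.
        return None


detector = SequenceGapDetector()
for seq in [1, 2, 3, 6, 5, 7]:
    gap = detector.on_packet(seq)
    if gap is not None:
        print(f"gap detected: missing {gap[0]}..{gap[1]}, trigger recovery")
```

Returning the missing range instead of a bare boolean lets the recovery path request exactly the packets it needs.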

The Hands: Order Execution

Once you decide to trade, how fast can you pull the trigger? This measures the “Tick-to-Trade” loop.

The Rules:

  • Tick-to-Trade Latency (Internal Processing)
    • The Problem: Your strategy logic is heavy, or a thread is getting context-switched by the OS, causing a “pause” in decision-making.
    • The Logic (Python):
          # Measure time from data arrival to order egress
          processing_time = order_sent_timestamp - tick_arrival_timestamp
          # We don't care about averages. We care about outliers.
          if processing_time > (baseline_latency + 3 * standard_deviation):
              log_warn("Micro-burst detected in Strategy Engine B")
  • Order-to-Ack Latency (Network Health)
    • The Problem: The network path to the exchange is congested. You sent the order fast, but it’s stuck in a switch buffer.
    • The Logic: Measure the Round Trip Time (RTT) between sending NewOrderSingle and receiving ExecutionReport (Ack).
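One way to measure Order-to-Ack is to key each outbound order by its client order ID (ClOrdID in FIX) and match the acknowledgement against it. The sketch below is illustrative only: the class, its method names, and the 100µs threshold in the usage lines are assumptions, and timestamps are plain integers in nanoseconds supplied by the caller.

```python
class OrderAckTracker:
    """Matches acks to sent orders and reports round-trip time in nanoseconds."""

    def __init__(self, threshold_ns):
        self.threshold_ns = threshold_ns
        self.pending = {}  # cl_ord_id -> send timestamp (ns)

    def on_order_sent(self, cl_ord_id, sent_ns):
        self.pending[cl_ord_id] = sent_ns

    def on_ack(self, cl_ord_id, ack_ns):
        """Return (rtt_ns, is_slow), or None for an unknown or repeated ack."""
        sent_ns = self.pending.pop(cl_ord_id, None)
        if sent_ns is None:
            return None
        rtt_ns = ack_ns - sent_ns
        return rtt_ns, rtt_ns > self.threshold_ns

    def overdue(self, now_ns):
        """Orders still waiting for an ack for longer than the threshold."""
        return [oid for oid, sent_ns in self.pending.items()
                if now_ns - sent_ns > self.threshold_ns]


tracker = OrderAckTracker(threshold_ns=100_000)  # hypothetical 100 µs budget
tracker.on_order_sent("ORD-1", sent_ns=1_000)
print(tracker.on_ack("ORD-1", ack_ns=51_000))
```

The `overdue` check matters as much as the RTT itself: an order that never gets acked produces no RTT sample at all, so it has to be found by scanning what is still pending.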

The Brain: Post-Trade & Reconciliation

This is the “sanity check” layer. It ensures that what your algorithm thinks it owns matches what the exchange says it owns.

The Rules:

  • The “Phantom Fill” Detector (Drop Copy Rec)
    • The Problem: Your algo thinks it bought 100 shares. The exchange says you bought 0. Or vice versa.
    • The Logic: Compare your internal state against the “Drop Copy” (a separate, read-only feed from the exchange that confirms all your trades). In SQL:
          -- Pseudo-query for real-time reconciliation
          SELECT *
          FROM internal_trades
          FULL OUTER JOIN drop_copy_trades
            ON internal_trades.order_id = drop_copy_trades.order_id
          WHERE internal_trades.id IS NULL
             OR drop_copy_trades.id IS NULL
    • Action: If this query returns ANY rows, fire a P0 Alert immediately. You have a position break.
  • The “Fat Finger” Reject Rate
    • The Problem: A bad deployment causes your algo to send invalid orders (e.g., selling stock you don’t have). The exchange rejects them.
    • The Logic: IF (Rejected_Orders / Total_Orders) > 5% within 10s -> KILL_SWITCH_ENABLE
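The reject-rate rule is a sliding-window ratio. A minimal sketch follows, using the 5% and 10-second figures from the rule. The class name is hypothetical, and the `min_orders` floor is my own assumption, added so that a single rejected order in a quiet window (1 of 1 = 100%) does not trip the switch.

```python
from collections import deque


class RejectRateMonitor:
    """Sliding-window reject ratio over the last `window_s` seconds."""

    def __init__(self, window_s=10.0, max_reject_ratio=0.05, min_orders=20):
        self.window_s = window_s
        self.max_reject_ratio = max_reject_ratio
        self.min_orders = min_orders  # assumption: ignore tiny sample sizes
        self.events = deque()         # (timestamp_s, was_rejected)
        self.rejects = 0

    def record(self, timestamp_s, rejected):
        """Record one order outcome; return True if the kill switch should fire."""
        self.events.append((timestamp_s, rejected))
        if rejected:
            self.rejects += 1

        # Evict outcomes that have aged out of the window.
        cutoff = timestamp_s - self.window_s
        while self.events and self.events[0][0] <= cutoff:
            _, was_rejected = self.events.popleft()
            if was_rejected:
                self.rejects -= 1

        total = len(self.events)
        if total < self.min_orders:
            return False
        return self.rejects / total > self.max_reject_ratio


monitor = RejectRateMonitor()
for i in range(95):
    monitor.record(i * 0.1, rejected=False)
for i in range(6):
    if monitor.record(9.5 + i * 0.1, rejected=True):
        print("KILL_SWITCH_ENABLE")
```

Keeping a running reject count alongside the deque makes each update cheap, which matters if this check runs on the order path rather than in a separate process.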

The Hardware Heartbeat

In low latency, the hardware is the software. You cannot ignore the physical layer.

  • NIC Discards: IF rx_discards > 0 on your Solarflare/Mellanox cards, your CPU is too slow to handle the incoming packet rate. You are flying blind.
  • Jitter (Variance): IF (Max_Latency - Min_Latency) > 10µs -> ALERT. A spread this wide usually means “noisy neighbors” on your server or improper CPU isolation.
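Both hardware checks reduce to simple arithmetic once the raw numbers are collected. The sketch below assumes the caller has already read the NIC discard counter (how you read it depends on the card and driver) and has a batch of latency samples in microseconds. The function names are hypothetical. Since NIC counters are typically cumulative, the sketch alerts on the increase between two polls rather than on the absolute value.

```python
def new_discards(previous_count, current_count):
    """Discards that occurred between two polls of a cumulative counter."""
    if current_count < previous_count:
        # Counter went backwards: assume a reset (driver reload or reboot)
        # and count everything since the reset as new.
        return current_count
    return current_count - previous_count


def jitter_exceeded(latencies_us, max_spread_us=10.0):
    """Return (alert, spread), where spread is max minus min latency."""
    spread = max(latencies_us) - min(latencies_us)
    return spread > max_spread_us, spread


if new_discards(previous_count=100, current_count=103) > 0:
    print("NIC is dropping packets: feed handler cannot keep up")

alert, spread = jitter_exceeded([4.0, 5.0, 16.5])
if alert:
    print(f"jitter spread {spread}us exceeds budget")
```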

Conclusion: The Kill Switch

The most important monitoring rule in trading is not a warning; it is an action.

Every metric above should feed into a unified Kill Switch. If the data is too stale, if the rejects are too high, or if the position break is real, the monitoring system must have the authority to pull the plug automatically.
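As a rough illustration of what “unified” could mean, the sketch below funnels three of the signals above into one latching switch. Everything here is hypothetical scaffolding: the class, the `evaluate` function, and the choice to make the switch latch (stay tripped until a human intervenes) are my assumptions, while the 200µs and 5% limits come from the rules earlier in this post.

```python
import threading

STALE_FEED_LIMIT_US = 200   # critical feed-lag threshold from the freshness rule
REJECT_RATIO_LIMIT = 0.05   # 5% threshold from the reject-rate rule


class KillSwitch:
    """Latching switch: once tripped, it stays tripped (no automatic reset)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reasons = []

    def trip(self, reason):
        with self._lock:
            self._reasons.append(reason)

    @property
    def tripped(self):
        with self._lock:
            return bool(self._reasons)

    @property
    def reasons(self):
        with self._lock:
            return list(self._reasons)


def evaluate(switch, feed_lag_us, reject_ratio, position_breaks):
    """Feed the latest metrics in; return True only if trading may continue."""
    if feed_lag_us > STALE_FEED_LIMIT_US:
        switch.trip(f"stale market data: {feed_lag_us}us behind exchange")
    if reject_ratio > REJECT_RATIO_LIMIT:
        switch.trip(f"reject ratio {reject_ratio:.1%} over limit")
    if position_breaks > 0:
        switch.trip(f"{position_breaks} position break(s) vs drop copy")
    return not switch.tripped


switch = KillSwitch()
if not evaluate(switch, feed_lag_us=40, reject_ratio=0.01, position_breaks=1):
    print("HALT:", "; ".join(switch.reasons))
```

The latch is the important design choice: a feed that recovers a second later should not silently re-enable trading, because the positions taken during the bad second still need to be checked.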

In high-frequency trading, it is better to be offline than to be wrong.

