Stateful protocol analysis at wire-speed
Practical tests conducted in 2016 on modern computer hardware, replaying pre-captured traffic dumps at full speed, have demonstrated that the trafMon probe can sustain a data rate of 2.3 Gbps with full-size (1500-byte) packets. Hence, when mirroring the traffic at an Ethernet switch, it is best to mirror to a 10 Gbps interface, so that the device does not drop excess mirrored packets, and to dimension the probe kernel capture buffer so as to cope with high-rate bursts of traffic.
Several optimisation techniques have been implemented in the probe to sustain this high rate. Hardware technology now mostly relies on multiplying CPUs and cores to keep increasing the performance of new computer models. The current design of the probe, split into two processes, could then be exploited even further by distributing the traffic flows over parallel chains of protocol dissection and analysis.
In-kernel systematic dissection
Care has been taken to conduct the packet handling work as far as possible without copying its kernel-resident content. The content signature of non-fragmented datagram units is computed this way. Irrelevant packets (e.g. uninteresting IP fragments, or packets not matching the configured set of flow class criteria) are rejected as early as possible.
For packets of interest, only the relevant results of protocol dissection are transmitted, via a circular buffer in shared memory, to the probe child process running on a different CPU core.
Flow classification sieve single traversal
To determine, on the basis of the in-kernel dissection of protocol headers, whether a packet belongs to one or more flow classes of interest, an efficient strategy has been implemented in the probe father process.
Each flow class defines a filter expression in terms of conditions applied to the data fields of network/transport layer protocols (IPv4, ICMP, UDP or TCP). Any Boolean expression can be formulated in up to three levels of Boolean connectives (AND, NOT AND, OR, NOT OR). When compiling the trafMon runtime XML configuration file, the probe builds a sieve with the expressions taken from all flow classes, ordered by the protocol fields involved, and organised in three layers.
A single sieve traversal suffices to determine which flow class(es), if any, the packet belongs to. Candidate flow classes have all their remaining expressions disabled as soon as one of their connectives is negated. Hence the type of processing and the relevance of the packet information are determined as efficiently as possible. Thanks to the ordered comparisons applied to the protocol fields, one after the other, the value under scrutiny stays in a processor register, so that the successive tests limit CPU accesses to external RAM.
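As an illustration, the single-traversal idea can be sketched as follows. This is a deliberately simplified model (only per-field AND conditions, no three-layer connectives); the field order, class names and values are hypothetical, not trafMon's actual configuration:

```python
# Simplified sketch of the flow-class sieve: one ordered pass over the
# dissected protocol fields, disabling candidate classes as soon as a
# condition fails. Field and class names are hypothetical.

FIELD_ORDER = ["ip_proto", "src_addr", "dst_addr", "src_port", "dst_port"]

# Each flow class: field -> set of accepted values (AND of field tests).
FLOW_CLASSES = {
    "ftp_ctl": {"ip_proto": {6}, "dst_port": {21}},
    "dns":     {"ip_proto": {17}, "dst_port": {53}},
    "ntp":     {"ip_proto": {17}, "dst_port": {123}},
}

def classify(packet):
    """Single traversal: each field is examined once, for all classes."""
    alive = set(FLOW_CLASSES)
    for field in FIELD_ORDER:            # value stays "in the register"
        value = packet.get(field)
        for name in list(alive):
            accepted = FLOW_CLASSES[name].get(field)
            if accepted is not None and value not in accepted:
                alive.discard(name)      # class disabled: no further tests
        if not alive:
            break                        # irrelevant packet rejected early
    return alive
```

Because every field is compared against all remaining candidate values before moving to the next field, the packet is classified (or rejected) in one pass.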
Preservation of only relevant information
When an inspected packet proves (potentially) relevant for further analysis by the probe child process, the father pushes only the minimum set of dissected data chunks to the circular buffer shared with the child process.
The first chunk consists of the structure with all relevant IPv4 fields and, relevant or not, the port numbers and checksum.
If TCP header fields are also needed, a second chunk is pushed; and if it is an FTP control packet [or HTTP when later implemented], a third chunk is pushed.
Otherwise, if the packet is a relevant DNS, NTP, SNMP or ICMP data unit, a corresponding second chunk is pushed.
If the packet is an IP fragment and the reassembled datagram needs a content signature (for one-way flow distributed observation), only then is an additional chunk with the packet content itself also pushed towards the child process.
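The chunk selection rules above can be summarised in a small sketch. The flag and chunk names are illustrative stand-ins, not the probe's actual data structures:

```python
# Illustrative sketch of which dissected chunks the father pushes for a
# given packet. Flag names ("ftp_control", "needs_signature", ...) are
# hypothetical; the real probe works on binary header structures.

def chunks_for(pkt):
    chunks = ["ipv4"]                   # always: IPv4 fields, ports, checksum
    if pkt.get("tcp"):
        chunks.append("tcp")
        if pkt.get("ftp_control"):      # [or HTTP, when later implemented]
            chunks.append("ftp")
    elif pkt.get("app") in ("dns", "ntp", "snmp", "icmp"):
        chunks.append(pkt["app"])
    if pkt.get("fragment") and pkt.get("needs_signature"):
        chunks.append("payload")        # the only case copying packet content
    return chunks
```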
True parallelism between header dissection and stateful analysis
While the probe father process loops over the packet-by-packet dissection, filtering and pushing of information into the shared-memory circular buffer, the probe child process fetches from this buffer, updates its stateful protocol analysis accordingly, publishes its data records and forwards its PDUs to the collector, in true parallelism. Indeed, the father locks itself onto CPU core #0, while the child's affinity is set to all other CPU cores.
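The core of such a father/child hand-over is a single-producer, single-consumer ring buffer. The following is a minimal in-process sketch of that logic only (the real probe places the buffer in shared memory between the two processes; capacity and item types are illustrative):

```python
# Minimal single-producer/single-consumer ring buffer sketch.
# One index is written only by the producer (tail) and one only by the
# consumer (head), which is what allows lock-free operation across the
# father and child processes in the real shared-memory setting.

class Ring:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0                   # advanced only by the consumer
        self.tail = 0                   # advanced only by the producer

    def push(self, item):
        """Father side: returns False when the buffer is full."""
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                # full: caller may drop or retry
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def pop(self):
        """Child side: returns None when the buffer is empty."""
        if self.head == self.tail:
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

One slot is kept unused to distinguish the full from the empty state, so a ring of capacity N holds at most N-1 items.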
Issue for subsequent fragments and unqualified FTP data connections
The excess (irrelevant) information copied to the child process is limited to two cases. The first is unclassified subsequent IP fragments for which the first fragment (carrying the necessary identifying protocol headers) has not been processed yet. The second is TCP segments between identified peers that could belong to FTP data transfer connections: their FTP control connection has been identified in the dissecting father process, but the session dialogues (hence the actual port numbers of the associated data connections) are dynamically analysed only in the child process.
To handle the first case efficiently, a cache of irrelevant fragment identifiers is kept in the father: a record with the pair of IPv4 addresses and the fragment ID is created for those datagrams whose second and subsequent fragments (those not yet captured) are irrelevant and can be skipped.
And for candidate FTP TCP data connections, another cache is created with the pairs of IP addresses of FTP sessions, each record being maintained until the FTP session terminates (FIN or RST detected on the FTP control connection). All TCP packets matching any such registered pair of hosts are then preserved and passed to the child process.
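Both father-side caches can be sketched as simple keyed sets. The function and field names below are hypothetical; they only illustrate the lookup logic:

```python
# Illustrative sketch of the two father-side caches:
#  - irrelevant_frags: datagrams whose remaining fragments can be skipped
#  - ftp_sessions: host pairs with a live FTP control connection

irrelevant_frags = set()    # {(src_ip, dst_ip, frag_id)}
ftp_sessions = set()        # {(ip_a, ip_b)} until FIN/RST on the control conn

def register_irrelevant(src, dst, frag_id):
    # First fragment dissected and found irrelevant: remember the datagram
    # so that its second and subsequent fragments are skipped.
    irrelevant_frags.add((src, dst, frag_id))

def keep_subsequent_fragment(src, dst, frag_id):
    # A subsequent fragment carries no protocol headers: keep it unless the
    # first fragment already told us the whole datagram is irrelevant.
    return (src, dst, frag_id) not in irrelevant_frags

def keep_tcp(src, dst):
    # Preserve any TCP packet between the hosts of a live FTP session:
    # only the child knows the dynamically negotiated data-connection ports.
    return (src, dst) in ftp_sessions or (dst, src) in ftp_sessions
```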
TCP connections start-stop sampling vs full analysis (retransmissions/window evolution)
The preservation of TCP packets from candidate FTP data connections depicted above can be significantly reduced (heavily alleviating the probe processing load) when only start-stop TCP sampling is requested for the specified FTP sessions. Then only the SYN (start) and FIN or RST (stop) packets of those TCP data connections must be further processed. The consequence is the lack of monitoring of TCP retransmissions and window evolution, as well as an uncertainty about the exact transported payload size (file size). Indeed, comparing the start and end TCP sequence numbers gives the size modulo 4 GBytes.
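The modulo-4 GB ambiguity follows directly from the 32-bit TCP sequence number arithmetic:

```python
# TCP sequence numbers are 32-bit and wrap around: subtracting the start
# from the end sequence number yields the payload size only modulo 2**32
# (4 GiB). A 5 GiB transfer is thus indistinguishable from a 1 GiB one.

def size_mod_4g(start_seq, end_seq):
    return (end_seq - start_seq) % 2**32
```

The same formula also handles a wrap during a small transfer (end sequence number numerically below the start).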
Distributed measurement of every packet under stringent traffic performance constraints
While, as explained above, the trafMon tool is able to monitor high volumes of heterogeneous multi-protocol traffic flows, it can also operate in quite stringent environments, encompassing highly sensitive real-time data flows with safety-of-life grade performance requirements.
Here, the challenge is the accuracy of the latency measurement of every packet in each direction, while injecting the distributed observations into the network in a controlled way so as to avoid impacting the monitored traffic.
For monitoring the one-way latency and packet losses [and jitter – still to do] of uni-directional data flows, observations must be collected from points distributed along the network paths travelled by the data flows (at least the source and destination sites).
This means that the disseminated probe computers and the central processing system running the trafMon collector must be precisely synchronised, to the millisecond or better. Using NTP over the WAN trunks could seem a solution, given the high transmission stability imposed on those links. But when trafMon measures a performance glitch, what part would be due to NTP slippage and what part to a real impact on the monitored flow?
The best way is then to use the now common off-the-shelf GNSS (Global Navigation Satellite System) based local NTP servers, such as a GPS NTP server. Locating such a stratum-1 NTP server (the GPS clock being stratum 0) on the same (short) LAN segment as each trafMon probe or collector system gives the necessary accuracy, independently of the network link quality.
Guaranteed observations centralisation through fully controlled protocol
Since the accuracy of the trafMon one-way measurements relies on timestamps taken at multiple locations in a distributed infrastructure, the central consolidation of these measurements is critical. The central trafMon collector has to be sure that it has received all observations sent by all probes before drawing conclusions about packets that would have been lost somewhere along the network path.
This is obtained through mechanisms such as acknowledgment of received probe observation PDUs, PDU re-sending capability, heartbeat exchanges and the like.
In most cases, the observations are sent in-band with the monitored traffic, unless another network path is available. This implies that during a network disruption no observations would be received by the central system, and that the observations made during the disruption would be lost forever if the probes dropped them too quickly.
The number of retries, the timeout and its increase constants can be custom specified. When one PDU, of any type, exhausts its number of retries, a so-called long-retry mode is entered. The failed PDU is then continuously retried at its highest frequency (initial timeout period). Definitional PDU types (flow descriptions, histogram slice definitions) and low-volume key information (individual file transfer records) are also continuously retried, but at a low pace (longest timeout period), as are those observation PDUs whose content relates to the time window just before the detected communication break (which supposedly contains meaningful observations for troubleshooting the cause of the network disruption). Upon the first acknowledgment coming from the collector, the long-retry mode ends, and the normal, regulated sending and retrying of probe PDUs is re-established.
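The retry schedule can be sketched as follows. All constants here are illustrative defaults, not trafMon's configured values:

```python
# Hedged sketch of the retry schedule: the timeout grows by a constant
# factor up to the retry limit, after which long-retry mode takes over.
# INITIAL_TIMEOUT, FACTOR and MAX_RETRIES are hypothetical constants.

INITIAL_TIMEOUT = 2.0    # seconds before the first retry
FACTOR = 2.0             # timeout multiplier per retry
MAX_RETRIES = 5

def timeouts():
    """Normal-mode timeout sequence for one PDU."""
    return [INITIAL_TIMEOUT * FACTOR**i for i in range(MAX_RETRIES)]

def long_retry_period(definitional):
    # Long-retry mode: a failed ordinary PDU is retried at its highest
    # frequency (initial timeout); definitional or low-volume key PDUs
    # are retried at a low pace (longest timeout) instead.
    return timeouts()[-1] if definitional else timeouts()[0]
```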
All the observations made by the distributed trafMon probes are centralised by the trafMon collector. The traffic generated by the probes sending one-way partial observations to the central processing system has been optimised to produce the smallest possible network footprint for the monitoring activity. Only the hash (typically 2 to 5 bytes long) and the size of the observed packet are sent, with timestamp(s) relative to the current quarter-second reference. Many observations are packed into compound datagrams, ordered by flow ID, each of which may contain up to several hundred observations. Compressed encoding reduces each observation to no more than a few bytes, while accuracy down to the millisecond is maintained. As a result, the volume of one-way PDU traffic is less than 1% of that of the monitored traffic.
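To give an idea of the compactness, one observation can be packed into 5 bytes when the hash is 2 bytes: since the timestamp is relative to a quarter-second reference, a millisecond offset (0-249) fits in a single byte. The layout below is hypothetical, not trafMon's actual wire format:

```python
import struct

# Illustrative 5-byte observation encoding (layout hypothetical):
#   2 bytes  signature hash
#   2 bytes  observed packet size
#   1 byte   millisecond offset within the current 1/4-second reference

def pack_obs(sig_hash, size, ms_offset):
    return struct.pack("!HHB", sig_hash, size, ms_offset)

def unpack_obs(blob):
    return struct.unpack("!HHB", blob)
```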
Whether carrying one-way observations or any other measurements, the probe PDU protocol must be kept under control so as to minimise its potential impact on the monitored data flows. Hence a maximum size is imposed on the probe PDUs (small packets induce less delay when serialised at the WAN line entrance, hence less impact on higher-priority packets behind them). Also, a minimum delay between consecutive probe PDU transmissions (first attempts or retries) is enforced, by assigning the queued PDUs to successive free slots of time cut into slices of equal periods.
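The slot-based pacing amounts to reserving one equal-length time slice per transmitted PDU. A minimal sketch (the 50 ms slot length is illustrative, not a trafMon constant):

```python
# Sketch of PDU send pacing: each queued PDU is assigned to the next
# free slot of time cut into slices of equal periods, guaranteeing a
# minimum delay between consecutive transmissions.

SLOT = 0.05   # hypothetical inter-PDU period, in seconds

def schedule(ready_times, slot=SLOT):
    """Map each queued PDU's ready-time to its actual send-time."""
    sends = []
    next_free = 0.0
    for ready in ready_times:           # assumed sorted by readiness
        send = max(ready, next_free)    # wait for the next free slot
        sends.append(send)
        next_free = send + slot         # reserve this slot
    return sends
```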
Reconciling semi-simultaneous packet observations, coping with signature collision
With only 2 bytes of packet signature hash, collisions within the same flow at approximately the same time are more than probable. But even in such a case, a correct match can be made by relying on the near-simultaneous occurrence of the corresponding observations from the separate probes.
The trafMon collector stores its partial one-way observations in a custom-implemented balanced BTree data structure, where a non-exact search always provides a pointer to the left or right sibling of the looked-up key. Furthermore, all elements of the BTree (the leaves) are organised in an ordered doubly linked list. Not only does this permit walking over the incomplete but obsolete records that identify packet losses; it also permits embedding an approximate timestamp as part of the search key (in addition to the flow ID and the possibly colliding signature hash). Having retrieved a near sibling, the collector can look at its younger and older neighbours to decide on the best-matching record to be complemented with the newly received partial probe observation. This way, even with twice the same signature for two different packets, the merging of partial observations from the distributed probes is achieved at best.
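The nearest-in-time matching idea can be demonstrated with a sorted list as a stand-in for the BTree (a non-exact search yields an insertion point, whose neighbours are then inspected; key layout and names are illustrative):

```python
import bisect

# Collision-tolerant matching sketch: partial observations are keyed by
# (flow_id, sig_hash, approx_timestamp). On a hash collision within the
# same flow, the record nearest in time to the new observation wins.
# A sorted list stands in for the collector's BTree with linked leaves.

def best_match(records, flow_id, sig, ts):
    """records: sorted list of (flow_id, sig, ts) keys; returns the
    best-matching existing key, or None."""
    key = (flow_id, sig, ts)
    i = bisect.bisect_left(records, key)
    # Inspect the near siblings on both sides of the insertion point.
    candidates = [records[j] for j in (i - 1, i) if 0 <= j < len(records)]
    candidates = [c for c in candidates if c[:2] == (flow_id, sig)]
    return min(candidates, key=lambda c: abs(c[2] - ts), default=None)
```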
Massive amount of fine-grain observations
The trafMon collector continuously outputs a quite significant amount of raw observation logs. These are regularly (every 10 minutes) bulk-loaded into temporary working tables, used to update the persistent listing tables or the counter aggregate tables at 1-minute, 1-hour and 1-day granularities.
The central processing system must have high-capacity solid-state disk drives and a large amount of high-speed DDR4 RAM. Furthermore, the persistent database tables are split into separate physical partitions (1 day, 8 days or 31 days of log, depending on granularity). This allows ancient fine-grain data to be simply dropped, so as to keep the disk usage within capacity.
Nevertheless, the database processing could become the bottleneck, and strategies from the big-data discipline would be welcome.