AWS Partner Network (APN) Blog
Boost High Frequency Trading Performance on Amazon EC2 with Chronicle Tune
By Roger Simmons, R&D Managing Director – Chronicle Software
By David Sung, Senior Solutions Architect, FSI – AWS
By Atiek Arian, Senior Manager – AWS
By Boris Litvin, Principal Solutions Architect – AWS
By Raj Pathak, Principal Solutions Architect – AWS
By Guy Bachar, Senior Solutions Architect – AWS
Cryptocurrency high frequency trading in the cloud
Cryptocurrency high-frequency trading firms run their primary workloads on AWS to lower latency to major cryptocurrency exchanges operating in the cloud. Latency is a crucial consideration in the optimization of their trading algorithms and overall profitability. Latency profiles in the cloud differ from those of traditional markets running in dedicated, on-premises co-location data centers. In previous blogs, we have discussed service features and architectures designed to reduce networking latency for crypto trading topologies on centralized exchanges. In this blog, you will learn how customers reduce tail latency by 66% to 95% by using the Chronicle Tune product to optimize their Amazon EC2 instances. In addition, we show how EC2 Cluster Placement Groups reduce network round-trip tail latency by 30% to 67%.
Understanding the impact of tail latency
Tail latency refers to the high latencies experienced by a small percentage of transactions in a system. Trading systems observe an increase in tail latency during periods of high market volatility and heavy load, whether due to increased trading activity or the need to process order book and market data at higher rates. Tail events occur when a relatively small percentage of events disproportionately affects profitability. A rise in tail latencies that coincides with these periods of elevated trading activity creates an inverse relationship between tail events and the overall profitability of trading strategies.
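To make the percentile terminology used throughout this post concrete, the short Java sketch below (not part of the benchmark suite) collects latency samples and reports the 99th and 99.999th percentiles using a simple nearest-rank calculation:

```java
import java.util.Arrays;

// Minimal sketch: collect latency samples and report tail percentiles.
// Not part of the Chronicle benchmark suite; shown only to illustrate
// what "99.999th percentile tail latency" means in practice.
public final class TailLatency {
    static long percentile(long[] sortedNanos, double p) {
        // Nearest-rank percentile on a sorted array of samples.
        int rank = (int) Math.ceil(p / 100.0 * sortedNanos.length);
        return sortedNanos[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = new long[2_000_000];
        for (int i = 0; i < samples.length; i++) {
            long t0 = System.nanoTime();
            // ... the operation being measured would go here ...
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        System.out.printf("p99     = %d ns%n", percentile(samples, 99.0));
        System.out.printf("p99.999 = %d ns%n", percentile(samples, 99.999));
    }
}
```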
Chronicle Tune solutions
While network effects impact tail latency, the configuration and tuning of compute is equally important for low-latency applications in capital markets. Chronicle Tune, a product from AWS Partner Chronicle Software, automates this process by configuring the OS for optimal performance and isolating CPUs for the application, with a focus on reducing latency and jitter.
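Chronicle Tune operates at the OS level; on the application side, latency-critical threads are then pinned to the reserved cores. The sketch below illustrates that application-side pattern, assuming the open-source OpenHFT Java-Thread-Affinity library (net.openhft:affinity) is on the classpath; it is an illustration of thread pinning in general, not of Chronicle Tune itself:

```java
import net.openhft.affinity.AffinityLock;

// Sketch: pin a latency-critical thread to a reserved CPU so the scheduler
// does not migrate it. Assumes the OpenHFT Java-Thread-Affinity library
// (net.openhft:affinity); the OS-level work (core isolation, IRQ steering,
// kernel settings) is what a tool like Chronicle Tune automates.
public final class PinnedWorker {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try (AffinityLock lock = AffinityLock.acquireLock()) {
                System.out.println("Worker pinned: " + lock);
                // ... busy-spinning event loop would run here ...
            }
        }, "pinned-worker");
        worker.start();
    }
}
```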
The testing covers both stand-alone performance on a single instance and networking performance across pairs of EC2 instances.
Benchmark setup and approach
The testing approach employed a set of relatively lightweight applications designed for low-latency application stacks. The test applications used the Chronicle Market Data Distributor (MDD), C++ Queue, and RingZero libraries, along with a utility that monitors environmental jitter from sources such as kernel and scheduler activity. By intentionally limiting the number of components, these applications offered a transparent view of typical low-latency performance.
Stand-alone setup
Stand-alone tests were conducted on the following EC2 instance types in the Tokyo (ap-northeast-1) region:
Figure 1: EC2 instance types setup for stand-alone tests
Stand-alone test results
Below are the stand-alone test results:
Jitter
This test ran for 75 seconds and built up a profile of the jitter observed by a pinned, spinning process, reported as a count of jitter events at different impacts (pause durations). The testing showed that Chronicle Tune optimizations reduced jitter occurrences by two orders of magnitude on the c6in.metal instance and three orders of magnitude on the m5zn.metal instance, and confined the impact duration to 1 µs on both tested instances.
Figure 2: Jitter stand-alone test results on m5zn and c6in instances
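For context, the idea behind this kind of jitter profile can be approximated with a busy-spinning loop that timestamps every iteration and buckets any pause above a threshold. The sketch below is a generic illustration, not the Chronicle jitter utility; the 1 µs threshold and the power-of-two buckets are assumptions, and in practice the sampling thread would also be pinned to an isolated core:

```java
// Sketch of a jitter sampler: a busy-spinning thread timestamps every
// iteration and counts any gap above a threshold as a jitter event,
// bucketed by pause duration. This approximates the idea behind the
// jitter utility used in the tests; it is not that utility.
public final class JitterSampler {
    public static void main(String[] args) {
        final long runNanos = 75_000_000_000L;   // 75 seconds, as in the test
        final long thresholdNanos = 1_000;       // count pauses above 1 µs (assumption)
        long[] buckets = new long[64];           // bucket i = pauses in [2^i, 2^(i+1)) ns

        long start = System.nanoTime();
        long last = start;
        while (last - start < runNanos) {
            long now = System.nanoTime();
            long pause = now - last;
            if (pause > thresholdNanos) {
                buckets[63 - Long.numberOfLeadingZeros(pause)]++;
            }
            last = now;
        }
        for (int i = 0; i < buckets.length; i++) {
            if (buckets[i] > 0) {
                System.out.printf(">= %d ns: %d events%n", 1L << i, buckets[i]);
            }
        }
    }
}
```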
Chronicle Market Data Distributor (MDD) end-to-end latency
This test measured the end-to-end latency of messages exchanged between two threads via a Chronicle MDD. The test sampled 2 million 16-byte market data messages written at a rate of 40K msgs/s. The MDD is backed by a Linux virtual-memory-backed filesystem (tmpfs) and is configured with 4 slots per ring and up to 12 keys. Chronicle Tune optimizations reduced the 99.999th percentile tail latency by 66% on the c6in.metal instance and 72% on the m5zn.metal instance.
Figure 3: Market Data Distributor (MDD) end-to-end latency test results on m5zn and c6in instances
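"Backed by tmpfs" means the MDD's store lives on a RAM-backed filesystem such as /dev/shm, so message traffic never touches disk. The generic sketch below shows what memory-mapping a tmpfs-backed file looks like in Java; the path and size are illustrative assumptions, and this is not the MDD's internal layout:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: memory-map a file on a tmpfs mount (/dev/shm) so the mapping is
// RAM-backed. This illustrates what "backed by tmpfs" means for the MDD and
// Queue stores in these tests; it is not the Chronicle implementation.
public final class TmpfsMappedStore {
    public static void main(String[] args) throws Exception {
        long size = 64L * 1024 * 1024; // 64 MiB, illustrative
        try (RandomAccessFile raf = new RandomAccessFile("/dev/shm/mdd-demo.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            map.putLong(0, System.nanoTime()); // writes stay in RAM-backed page cache
            System.out.println("Mapped " + size + " bytes on tmpfs; first word: " + map.getLong(0));
        }
    }
}
```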
Chronicle Queue end-to-end latency
This test sampled 10 million 256-byte messages at 100K msgs/s, written to a Chronicle C++ Queue with a 256MB block size. The test measured the end-to-end latency from the point a message was added by one thread to the point it was visible to, and read by, another thread. The Queue was also backed by tmpfs. The results show that the optimizations applied by Chronicle Tune reduced the 99.999th percentile tail latency by 94% on the c6in.metal instance and 95% on the m5zn.metal instance.
Figure 4: Queue end-to-end latency results on m5zn and c6in instances
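The test above used the Chronicle C++ Queue. The same measurement pattern can be sketched with the open-source Java Chronicle Queue (net.openhft:chronicle-queue): the writer embeds a nanosecond timestamp in each message and the reader computes the delta on receipt. The /dev/shm path and the single-threaded loop below are simplifications for illustration, not the benchmark harness:

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

// Sketch of the end-to-end latency pattern using the open-source Java
// Chronicle Queue: the writer embeds a nanosecond timestamp in each message
// and the reader computes the delta on receipt. The blog's test used the
// Chronicle C++ Queue with separate pinned threads; this is a simplified,
// single-process illustration with an assumed tmpfs path.
public final class QueueLatencySketch {
    public static void main(String[] args) {
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("/dev/shm/queue-demo").build()) {
            ExcerptAppender appender = queue.acquireAppender();
            ExcerptTailer tailer = queue.createTailer();

            for (int i = 0; i < 10; i++) {
                appender.writeText(Long.toString(System.nanoTime()));
                String msg = tailer.readText();
                if (msg != null) {
                    long latencyNanos = System.nanoTime() - Long.parseLong(msg);
                    System.out.println("end-to-end latency: " + latencyNanos + " ns");
                }
            }
        }
    }
}
```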
Chronicle RingZero end-to-end latency
This test measured the end-to-end latency of 15 million 256-byte messages exchanged between two threads using a 1024-slot Chronicle RingZero backed by tmpfs at 1 million msgs/s. The results show that the optimizations applied by Chronicle Tune reduced the 99.999th percentile tail latency by 77% on the c6in.metal instance and 89% on the m5zn.metal instance.
Figure 5: RingZero end-to-end latency results on m5zn and c6in instances
Network test setup and methodology
To evaluate network performance, two pairs of identical c6in.metal instances were launched in the Tokyo Region: one pair within an EC2 Cluster Placement Group (CPG) and one pair outside the CPG.
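For readers reproducing a similar setup, a cluster placement group can be created and targeted at launch with the AWS SDK for Java v2, as in the sketch below. The group name, AMI ID and key details are placeholders, and the same steps can be performed from the console or the AWS CLI:

```java
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.CreatePlacementGroupRequest;
import software.amazon.awssdk.services.ec2.model.Placement;
import software.amazon.awssdk.services.ec2.model.PlacementStrategy;
import software.amazon.awssdk.services.ec2.model.RunInstancesRequest;

// Sketch: create a cluster placement group and launch an instance into it
// using the AWS SDK for Java v2. The group name and AMI ID are placeholders,
// and the call assumes credentials and a default region are configured.
public final class ClusterPlacementGroupExample {
    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            ec2.createPlacementGroup(CreatePlacementGroupRequest.builder()
                    .groupName("hft-cpg-demo")                 // placeholder name
                    .strategy(PlacementStrategy.CLUSTER)
                    .build());

            ec2.runInstances(RunInstancesRequest.builder()
                    .imageId("ami-0123456789abcdef0")          // placeholder AMI
                    .instanceType("c6in.metal")
                    .minCount(1)
                    .maxCount(1)
                    .placement(Placement.builder().groupName("hft-cpg-demo").build())
                    .build());
        }
    }
}
```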
We used Chronicle Tune in our network tests to establish a reliable baseline and accurately identify performance differences arising from different network setups.
The network tests focused on differences in EC2 placement strategy, combined with the influence of interrupt request (IRQ) steering and application thread pinning and placement.
In the following performance tests, we compared the network round-trip latency using batches of 256-byte messages at rates of 5K, 10K and 25K msgs/s. For each message rate, the test took the median of the statistics across ten separate runs, each lasting ten seconds.
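At its core, a round-trip test of this kind is a ping-pong exchange in which the client timestamps each message and measures the time until the echoed response returns. The minimal TCP sketch below illustrates the pattern on loopback; the port, message count and use of plain sockets are assumptions, and it is not the Chronicle benchmark harness used for these results:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;

// Sketch of a TCP ping-pong round-trip measurement: the client sends a
// 256-byte message, the echo server returns it, and the client records the
// round trip in nanoseconds. Loopback, port 9000 and the iteration count are
// illustrative; the blog's tests ran a Chronicle harness across two
// c6in.metal instances, not this code.
public final class PingPongRtt {
    public static void main(String[] args) throws Exception {
        Thread server = new Thread(() -> {
            try (ServerSocket ss = new ServerSocket(9000);
                 Socket s = ss.accept()) {
                s.setTcpNoDelay(true);
                byte[] buf = new byte[256];
                InputStream in = s.getInputStream();
                OutputStream out = s.getOutputStream();
                while (readFully(in, buf)) {
                    out.write(buf);          // echo the message straight back
                    out.flush();
                }
            } catch (Exception ignored) { }
        }, "echo-server");
        server.setDaemon(true);
        server.start();
        Thread.sleep(200);                   // crude wait for the server to bind

        try (Socket s = new Socket("127.0.0.1", 9000)) {
            s.setTcpNoDelay(true);
            byte[] msg = new byte[256];
            byte[] reply = new byte[256];
            InputStream in = s.getInputStream();
            OutputStream out = s.getOutputStream();
            long[] rtts = new long[10_000];
            for (int i = 0; i < rtts.length; i++) {
                long t0 = System.nanoTime();
                out.write(msg);
                out.flush();
                readFully(in, reply);
                rtts[i] = System.nanoTime() - t0;
            }
            Arrays.sort(rtts);
            System.out.println("median RTT: " + rtts[rtts.length / 2] + " ns");
            System.out.println("p99.9  RTT: " + rtts[(int) (rtts.length * 0.999)] + " ns");
        }
    }

    private static boolean readFully(InputStream in, byte[] buf) throws Exception {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) return false;
            off += n;
        }
        return true;
    }
}
```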
Network test results
Standard vs. CPG
As the message rate increases, we observed an improvement in the latency profile of the CPG pair, particularly at 25K msgs/s, where CPG achieved an impressive 67% reduction in tail latency compared to the standard configuration.
Figure 6: Cluster Placement Group strategy test results at rates of 5K, 10K and 25K msgs/s
Elastic Network Interface IRQ steering and application topology test setup and methodology
An Elastic Network Interface (ENI) is the core networking component of an EC2 instance. Understanding how ENI interrupt configuration and placement interact with the application topology is crucial for optimizing network performance.
To evaluate this factor, two c6in.metal instances were launched within a CPG. The ENI of each instance was attached to socket 0 (cores 0-31). Across various message rates, the relative performance of the following ENI and application topologies was compared:
- Baseline: ENI IRQs and application threads were pinned to socket 0. In this ideal scenario, IRQs are serviced on the same socket as the ENI attachment, allowing access to the shared L3 CPU cache with no Intel QPI overhead.
- Crossed 1: ENI IRQs were pinned to socket 0 (favourable), and application threads to socket 1 (less favourable).
- Inverted: ENI IRQs and application threads were both pinned to socket 1, the less favourable socket.
- Crossed 2: ENI IRQs were pinned to socket 1 (less favourable), and application threads pinned to socket 0 (favourable).
These tests compared the relative impact of steering IRQs to a non-native socket (socket 1) and the overhead of exchanging traffic between sockets using Intel QPI or similar interconnects.
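On Linux, this kind of IRQ steering is ultimately controlled through /proc/irq/<irq>/smp_affinity_list. The sketch below shows that underlying mechanism for a hypothetical set of ENA queue IRQ numbers (these can be found by inspecting /proc/interrupts on the instance); writing these files requires root, and in these tests the configuration was applied by Chronicle Tune rather than by hand:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the Linux mechanism behind ENI IRQ steering: each IRQ's
// /proc/irq/<n>/smp_affinity_list file holds the CPU list allowed to service
// it. The IRQ numbers below are hypothetical placeholders (locate the real
// ones via the ENA queue entries in /proc/interrupts); writing these files
// requires root. In the blog's tests this was handled by Chronicle Tune.
public final class IrqSteeringSketch {
    public static void main(String[] args) throws Exception {
        List<Integer> enaQueueIrqs = List.of(120, 121, 122, 123); // placeholders
        String socket0Cores = "0-31"; // cores local to the ENI in these tests
        for (int irq : enaQueueIrqs) {
            Path affinity = Path.of("/proc/irq/" + irq + "/smp_affinity_list");
            Files.writeString(affinity, socket0Cores);
            System.out.println("IRQ " + irq + " -> CPUs " + socket0Cores);
        }
    }
}
```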
The ENI and application thread configuration for these tests was controlled using Chronicle Tune, as shown in the following illustration:
Figure 7: ENI IRQ steering and application topology
ENI IRQ steering and application topology test results
The results for these ENI and application topology tests were similar across the range of message rates. The graph below, illustrating the results at 10K msgs/s, captures the key points:
Figure 8: ENI IRQ steering and application topology test results
These test results highlight the relatively large tail latency impact experienced when running with crossed IRQ and application configurations (IRQs on one socket, application threads on another). This demonstrates the performance penalty of moving data between sockets across the processor interconnect, and the configuration also introduces appreciable jitter at the higher percentiles. Expressed as a percentage increase from the baseline, tail latency rose by 148%-160% in the crossed scenarios.
In contrast, steering ENI IRQs to the less favourable socket (away from the ENI-attached socket) introduces only marginal overhead, and its importance should not be overstated during tuning. The tail latency uplift between the “Baseline” and “Inverted” test configurations was consistent, but small. Ensuring that both IRQs and application threads operate within the same socket (that is, the same NUMA node) matters far more for performance.
Conclusion
The Chronicle Tune product, available on AWS Marketplace, provides a quick and automated way to tune operating systems deployed on Amazon EC2 instances for low-latency applications. Our testing highlighted latency improvements ranging from 66% to 95% when Chronicle Tune was applied. EC2 Cluster Placement Groups provided a reduction in latency of between 30% and 67% in network round-trip tests. The impact of running application threads on the opposite socket from the IRQ handlers is relatively high, with tail latency increasing by 148%-160%. In contrast, the impact of pinning IRQs to the less favourable socket, on the EC2 instances under test, appears to be well contained.
Chronicle Software — AWS Partner Spotlight
Chronicle Software delivers low-latency, high-performance Java and C++ software for the financial services industry.
Contact Chronicle Software | Partner Overview | AWS Marketplace