Low-Latency Trading Infrastructure 101:

Using an FPGA to Shatter the PCIe Bus Latency Floor

Move logic to the network side of the PCIe bus

Given that the latency of the Peripheral Component Interconnect Express (PCIe) bus imposes a hard floor on how low the latency of a software solution can go, is there some way to get around it?

The answer is to move logic onto the Network Interface Card (NIC), so that the double traversal of the PCIe bus (market data in to the CPU, then the order back out to the NIC) is no longer required. This is where a Field Programmable Gate Array (FPGA), as part of a NIC, comes in.

FPGA - A simplified explanation

An FPGA can be thought of as a large array of logic elements whose functions and interconnections can be configured, and reconfigured, to form a digital circuit. The idea is programmable hardware: you design what you want the hardware to be, rather than simply what you want it to do.

For example, the design for a given Central Processing Unit (CPU), such as MIPS, can be loaded onto an FPGA which then functions as a MIPS CPU. An alternative CPU architecture, such as RISC-V, could then be loaded onto the same FPGA to replace it. The FPGA would then work as a RISC-V CPU. There is a sense in which hardware is now “soft” in that the FPGA can be configured as different hardware.
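To make "configuring what the hardware is" a little more concrete, here is a minimal software sketch of a lookup table (LUT), the basic configurable logic element in most FPGAs. The `Lut4` type and the constant names are hypothetical, chosen for illustration only. A real FPGA combines a very large number of such elements with flip-flops and programmable routing, and is configured by a bitstream rather than by C++.

```cpp
#include <cassert>
#include <cstdint>

// Software model of a 4-input lookup table (LUT).
// The 16-bit truth table plays the role of configuration data:
// changing it changes which logic function the element implements.
struct Lut4 {
    std::uint16_t truth_table;  // bit i holds the output for input pattern i

    // inputs: the four input bits packed into the low nibble (0..15).
    bool eval(unsigned inputs) const {
        assert(inputs < 16);
        return ((truth_table >> inputs) & 1u) != 0;
    }
};

// Two "configurations" for the same element.
// 4-input AND: only input pattern 1111 (index 15) yields 1.
constexpr std::uint16_t kAnd4 = 0x8000;
// 4-input XOR (odd parity): output is 1 when an odd number of inputs are 1.
constexpr std::uint16_t kXor4 = 0x6996;
```

Loading `kAnd4` or `kXor4` into the same `Lut4` gives two different gates from one physical element, which is the same idea, at a vastly smaller scale, as swapping one CPU design for another on the same FPGA.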

Using an FPGA as a programmable Network Interface Card

An FPGA, attached to a circuit board containing SFP+ cages used to connect 10G network cables, can be configured (FPGA-speak for programmed) with the digital logic required to make a NIC.

This is essentially how the Cisco Nexus K3P-S (formerly known as the ExaNIC X25) was made, using an AMD (formerly Xilinx) XCKU3P FPGA.

The FPGA can also be configured to perform additional custom work: for example, constructing an order book as market data packets are received, calculating a market data signal, and even sending an order. Because the order is sent from the NIC without involving the CPU, it does not need to cross the PCIe bus. To achieve this, an order is primed onto the NIC with an instruction to send it when a particular market data signal threshold is crossed. This send condition is known as triggering.
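The triggering idea can be sketched as a small software model. This is only an illustration of the control flow, not how any particular NIC implements it: the names `PrimedOrder`, `TriggerEngine`, `prime` and `on_signal` are hypothetical, the signal is assumed to be a plain integer, and the one-shot behaviour (a primed order fires at most once) is an assumption of this sketch. On a real card the comparison would be digital logic sitting next to the packet parser.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a pre-built order held on the NIC.
struct PrimedOrder {
    std::vector<std::uint8_t> wire_bytes;  // fully formed order packet
    std::int64_t threshold;                // fire when signal >= threshold
};

// Hypothetical model of the trigger logic.
class TriggerEngine {
public:
    // Host software primes an order before the market data of interest
    // arrives, so this step is not part of the reaction time.
    void prime(const PrimedOrder& order) {
        assert(!order.wire_bytes.empty());
        order_ = order;
        armed_ = true;
    }

    // Called for every signal value computed on the card.
    // Returns true and copies the packet into `out` if the trigger fired.
    bool on_signal(std::int64_t signal, std::vector<std::uint8_t>& out) {
        if (!armed_ || signal < order_.threshold) {
            return false;
        }
        out = order_.wire_bytes;
        armed_ = false;  // assumption: one-shot, re-prime to fire again
        return true;
    }

    bool armed() const { return armed_; }

private:
    PrimedOrder order_{};
    bool armed_ = false;
};
```

In this model the only work done when the signal arrives is a comparison and a copy of bytes that were prepared in advance, which is why the send can happen without a trip to the CPU.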

FPGA versus CPU

With a slower maximum clock speed than a CPU, an FPGA performs consecutive operations more slowly.

However, an FPGA can perform operations in parallel to a much greater degree than a CPU. Despite its slower clock, an FPGA can still win at a task such as multiplying matrices, delivering lower latency when a parallelised algorithm is used.
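A rough cycle-count model shows how a slower clock can still yield lower latency. It uses a dot product, the building block of matrix multiplication, and is deliberately simplified: it assumes one multiply-accumulate per cycle for the sequential case and, for the parallel case, one multiplier per element followed by a balanced adder tree with one level per cycle. It ignores CPU vector instructions, memory access and FPGA routing delays, so treat it as an illustration of the trade-off rather than a performance prediction. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycles for a strictly sequential dot product of n elements,
// assuming one multiply-accumulate completes per clock cycle.
constexpr std::uint64_t sequential_cycles(std::uint64_t n) { return n; }

// Depth of a balanced binary adder tree over n values: ceil(log2(n)).
inline std::uint64_t adder_tree_depth(std::uint64_t n) {
    assert(n > 0);
    std::uint64_t depth = 0;
    std::uint64_t width = 1;
    while (width < n) {
        width *= 2;
        ++depth;
    }
    return depth;
}

// Cycles for a fully parallel dot product: all n multiplies in one cycle
// (assumes n multipliers are available), then one cycle per tree level.
inline std::uint64_t parallel_cycles(std::uint64_t n) {
    return 1 + adder_tree_depth(n);
}

// Convert a cycle count to nanoseconds at a given clock frequency in MHz.
inline double latency_ns(std::uint64_t cycles, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return static_cast<double>(cycles) * 1000.0 / clock_mhz;
}
```

With illustrative clocks of 4000 MHz for the sequential case and 400 MHz for the parallel case, and n = 256, the model gives 64 ns against 22.5 ns. These figures are outputs of the model under the stated assumptions, not measurements. For small n the faster clock wins instead, which is why the parallel structure of the algorithm matters.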

With FPGA latency come development cost trade-offs

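Using an FPGA to bypass the PCIe bus comes with increased development costs because it requires a very different programming paradigm to that of a language such as C: the developer describes a circuit rather than a sequence of instructions.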

Programming the FPGA is usually done in a hardware description language (HDL) such as Verilog or VHDL, and is significantly more laborious than coding for a CPU.

High-Level Synthesis (HLS), where C++ is used for elements of the FPGA design, can reduce development time.
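As a rough idea of what HLS-style C++ can look like, here is a minimal sketch of a fixed-size dot product. The function name `dot_product` and the length `kN` are hypothetical. The `#pragma HLS UNROLL` directive follows the style used by AMD's Vitis HLS tool, which is an assumption about the toolchain; other HLS tools use different directives. An ordinary C++ compiler ignores the pragma, possibly with a warning, so the same function can be tested in software before synthesis.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kN = 8;  // hypothetical fixed vector length

// Fixed-size integer dot product written in an HLS-friendly style:
// static loop bounds, no dynamic memory, no recursion.
std::int32_t dot_product(const std::int16_t a[kN], const std::int16_t b[kN]) {
    std::int32_t acc = 0;
    for (int i = 0; i < kN; ++i) {
#pragma HLS UNROLL
        // With the loop fully unrolled, an HLS tool can instantiate kN
        // multipliers working in parallel instead of reusing one kN times.
        acc += static_cast<std::int32_t>(a[i]) * b[i];
    }
    return acc;
}
```

The C++ reads like ordinary software, but the pragma is a request about circuit structure rather than execution order, so getting good results still depends on understanding the hardware being described.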