ClickNP stands for “Click Network Processor”. Its programming model resembles that of the Click modular router: a flow graph of elements connected via channels.
Not yet, but we are exploring the possibility of shipping it.
We plan to build an open network-processing platform together with the networking community (e.g., NetFPGA) and release the ClickNP toolchain as part of that platform.
The ClickNP compiler cannot be used in standalone form. It is a system with many components, including the Shell and BSP on the FPGA, and the PCIe device driver and library on the host CPU. Some of these are internal to Microsoft, so we need to collaborate with the community to create an open platform.
The challenges of packet processing are achieving low latency and high throughput when each task is typically small.
Other acceleration frameworks, such as CUDA and OpenCL, are designed around a batch-processing model: tasks are processed in batches and communicate via shared memory. Batching introduces non-trivial latency, and shared memory becomes the throughput bottleneck for small tasks.
The advantage of FPGA comes from parallelism: not only data parallelism, as in GPUs, but also pipeline parallelism. To leverage pipeline parallelism, a stream-processing model is more suitable. ClickNP proposes a modular stream-processing model and implements many optimizations for it.
The OpenCL programming model is derived from GPU programming and shoehorned onto FPGA; it is NOT really suitable for FPGA. Many OpenCL features are useless or inefficient on FPGA. Additionally, OpenCL does not support efficient communication between the CPU and the FPGA.
We have extended OpenCL in several important ways to improve its support for concurrent processing and to efficiently exploit the massive parallelism in FPGA. For example, we prefer stream processing to batch processing, and message-based coordination to shared memory.
ClickNP uses the OpenCL SDK simply as an HLS tool; it can work with other HLS tools, such as Vivado HLS, as well.
The performance of Click2NetFPGA is two orders of magnitude lower than what we report in the ClickNP paper. There are several bottlenecks in their system design (e.g., memory and packet I/O), and they also miss several important optimizations needed to ensure fully pipelined processing.
Additionally, Click2NetFPGA does not support FPGA/CPU joint processing and thus cannot update the configuration or read state while the data plane is running.
There are already many domain-specific languages for packet processing on FPGA, and ClickNP is complementary to this line of work. These languages mostly focus on packet switching and routing, while ClickNP targets general network functions. Furthermore, ClickNP provides efficient FPGA/CPU joint processing.
Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages.
Yes, there is a rich body of literature on improving FPGA programmability by providing high-level programming abstractions. The ClickNP paper cites several works in this direction. In addition, there is a new language called DHDL, presented at ISCA’16 and ASPLOS’16.
However, none of these languages has a mature open-source implementation. ClickNP is designed to be a practical platform for packet processing, so we need to rely on a mature high-level synthesis tool.
The limitations of HLS do not complicate programming in ClickNP very much. Because the computational complexity per packet is typically low and packet length has an upper bound, we can simply unroll all the loops in an element, eliminate memory dependencies, and achieve fully pipelined packet processing.
ClickNP is for general network functions: in ClickNP we can implement arbitrary packet parsers, states, and actions. In contrast, P4 is more constrained; it assumes a pipeline of match-action tables and is better suited to programmable switching chips. It would be interesting to implement P4 on top of ClickNP.
This overhead is large for small elements. However, for complicated elements, the overall overhead is less than a factor of two.
Furthermore, the capacity of FPGAs has been growing quickly in recent years. For example, the Altera Arria 10 FPGA has 2.5x the capacity of the FPGA we currently use. We believe area cost will be of less concern in the future.
See Table 2: our optimized FPGA implementation achieves a 21~696x speedup over a software implementation on one CPU core. This performance gain comes from the ability to exploit the vast parallelism in FPGA.
Considering the power footprint of the FPGA (~30 W) and the CPU (5 W per core), ClickNP elements are 4~120x more power efficient than the CPU.
FPGAs have become inexpensive in the last few years because the volume of the FPGA market is increasing rapidly, driven by data-center acceleration.
The processing speed of a single element pipeline is constrained by the clock frequency. However, we can duplicate the pipeline and load-balance among the replicas.
Our FPGA only has two 40G Ethernet ports, which is the throughput bottleneck. If we scale the number of Ethernet ports, I think the major constraint is the area cost of the Ethernet MAC and PHY: each 40G port costs about 10% of the logic resources on our FPGA.
Actually, there are switches built with FPGAs, e.g. the Corsa DP 6440, which supports an aggregate capacity of 640 Gbps. However, this is still one order of magnitude slower than programmable switches.
ClickNP targets virtualized network functions instead of switches. Forwarding, routing, and scheduling in switches are typically simpler and less flexible than network functions, so they are better implemented in ASICs.
We have not built an IDS/IPS, but in principle it should not be a problem. IDS rules are typically compiled into NFAs or DFAs, which can be implemented efficiently on FPGA.
The advantage of FPGA comes from parallelism, and not all tasks are suitable for it. Algorithms that are naturally sequential, and processing with a very large memory footprint and low memory locality, are better handled by the CPU.
Additionally, FPGA has a strict area constraint: you cannot fit arbitrarily large logic into a chip.
Therefore, we should support fine-grained processing separation between CPU and FPGA, which is one of our contributions.
FPGA is more suitable for regular and frequent tasks. CPU is more suitable for irregular and infrequent tasks. In terms of SDN, FPGA is more suitable for data-plane processing and CPU is more suitable for control plane.
For the “use registers” optimization, the programmer can use the “register” keyword to explicitly instruct the compiler to place an array in registers instead of memory. By default, ClickNP uses a heuristic to determine whether an array is placed in registers or memory.
For delayed write, ClickNP automatically generates the delayed write code whenever possible.
For memory scattering, the optimization is also automatic.
For loop unrolling, ClickNP provides a “.unroll” directive. ClickNP can also unroll loops whose iteration count is unknown at compile time but has an upper bound specified by the programmer.
Currently, the programmer needs to manually move the slow branch into another element.
Most memory dependencies in network functions are single read-write, which can be resolved with the delayed-write technique.
When there are multiple dependent memory accesses to one array in one iteration, I recommend creating a state machine and breaking the code into multiple stages, so that there are at most two dependent memory accesses to each array per iteration.
Current HLS tools only support loop unrolling for a known number of iterations. ClickNP extends this capability to unroll loops whose iteration count is unknown at compile time but bounded. The programmer specifies the upper bound and can use “continue” and “break” as in C.
HLS tools generate an optimization report that shows the dependencies among the operations in each element.
The global memory API in OpenCL is designed to transfer huge amounts of data, for example multiple gigabytes. This suits applications with large data sets, but not network functions, which require strong stream-processing capability. Since OpenCL is not optimized for stream processing, the communication latency between the host CPU and the FPGA is as high as 1 ms.
Transferring one message over the PCIe I/O channel needs several PCIe round trips: ringing the doorbell, transferring the data via DMA, and writing the completion status register. There can be only one in-flight message per slot, so when the batch size is small, throughput is bottlenecked by PCIe latency.
To amortize this overhead, we batch aggressively: data in the send buffer is batched automatically. At a 16 KB batch size we already achieve peak throughput, and the round-trip latency is 9 microseconds. This latency can be further reduced by using multiple slots and multiple host elements.
The Verilog project is synthesized at a high target clock frequency (300 MHz), and we run the timing-analysis tool to find the delay on the critical path of the application logic. We then calculate the maximum clock frequency and update it in the PLL HEX file. Dual-clock FIFOs sit between the Shell, which has a fixed clock frequency, and the application logic, whose clock frequency is determined after synthesis and fitting.
HLS tools, including Altera OpenCL and Vivado HLS, support simulation on the CPU. ClickNP also supports simulation of the PCIe channel and host elements, so you can use printf and familiar debugging tools on the simulation program. HLS tools claim to guarantee equivalence between the high-level program semantics and the on-board behavior; in our experience, we have not seen a violation of C-language semantics.
Of course, not all bugs can be found in simulation, because simulation cannot enumerate all possible inputs or all asynchronous interleavings of element execution. A programmer must take care when designing elements and their connections to avoid deadlocks.
HLS tools generate a resource utilization report in an early compilation stage, which contains a conservative estimate of logic, memory, and DSP blocks. When this estimate is below 90% of total resources, fitting will usually succeed.
Typically one to two hours, depending on the logic resources used; it may take several hours when logic utilization is high. This is one limitation of FPGA programming. In the long term, HDL synthesis tools should be optimized to greatly shorten compilation time.
HLS tools generate an optimization report that includes the initiation interval of each element, which is the number of cycles between starting subsequent iterations. When there is no memory dependency and the element is fully pipelined, the element can process one iteration every clock cycle.
The latency, in clock cycles, can be extracted from the generated Verilog pipeline.
HLS tools can give an estimate of the maximum frequency of an element. However, for the accurate clock frequency, we need to embed the element in a ClickNP project and obtain the frequency from the timing-analysis report after full compilation.
Because these elements are computationally intensive. A CPU core has limited parallelism, while FPGA can perform these computations in parallel. Furthermore, FPGA is more efficient at bit operations, which dominate AES encryption and the hash computation in HashTCAM.
Another reason is that we did not optimize the elements specifically for the CPU. The CPU numbers would look better with some architectural optimizations, but the order-of-magnitude gap would remain.
Some elements use more resources because they have a higher level of parallelism; for example, HashTCAM has 16 independent hash tables that operate in parallel.
The BRAM usage grows linearly with the number of entries in a flow table or cache. This number is configurable.
We did not use the FPGA to generate TCP flows in the pFabric evaluation. Instead, we modified a software TCP flow generator and placed the flow priority, which is the total size of the flow, in the packet payload.
In the pFabric experiment, there is only one ingress port and one egress port, where does the queue come from?
We restrict the egress port to 10 Gbps using a RateLimit element, while the ingress port is 40 Gbps, so a queue builds up at the egress port.