Internship on tight integration of smart networking and compute offload on FPGA
Internships at the Xilinx Research Labs in Dublin, Ireland
Xilinx Research Labs is a small, diverse and dynamic part of Xilinx. Through customization and tailored solutions, we investigate how programmable logic and FPGAs can help make data centers faster, cheaper and greener by accelerating common applications and reducing the energy consumption of a given workload. Our team conducts cutting-edge research in topics such as machine learning, HPC and video processing to push the performance envelope of what’s possible with today’s devices and to help shape the next big thing in computing. In particular, the team in Dublin is focused on deep neural networks, including training paradigms and techniques, hardware-friendly novel topologies, quantization techniques, and custom hardware architectures that help support the enormous computational workloads associated with the roll-out of AI, even for energy-constrained compute environments. Fulfilling this goal requires top talent, and thus we are looking to enrich our team with the finest engineers with bold, collaborative and creative personalities from top universities worldwide.
ACCL is an implementation of MPI-like communication utilizing direct FPGA-to-FPGA Ethernet. It supports send and receive operations, as well as collective operations, which include:
• Broadcast: sending one data structure to all other nodes
• Scatter: divide up a data structure and send each piece to a different node
• (All)Gather: reverse of scatter
• (All)Reduce: gather from other nodes and compute the elementwise sum of data received
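The semantics of these collectives can be modeled in plain Python. This is an illustrative sketch only, not the ACCL API: the function names and the list-of-lists representation of per-node buffers are assumptions made for clarity.

```python
# Pure-Python model of the collective semantics listed above.
# Each element of `buffers` is one node's local data structure.

def broadcast(buffers, root):
    """Copy the root node's data structure to every node."""
    return [list(buffers[root]) for _ in buffers]

def scatter(buffers, root):
    """Divide the root's buffer and deliver one piece to each node."""
    data, n = buffers[root], len(buffers)
    chunk = len(data) // n
    return [data[i * chunk:(i + 1) * chunk] for i in range(n)]

def allgather(buffers):
    """Reverse of scatter: every node receives all pieces, concatenated."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]

def allreduce(buffers):
    """Every node receives the elementwise sum of all nodes' buffers."""
    summed = [sum(vals) for vals in zip(*buffers)]
    return [list(summed) for _ in buffers]
```

For example, `allreduce([[1, 2], [3, 4]])` yields `[[4, 6], [4, 6]]`: both nodes end up holding the elementwise sum.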
FPGA offloading of MPI enables communication between FPGA-resident compute kernels without CPU intervention, reducing the latency of communication. On FPGA accelerator boards equipped with network interfaces (e.g. Alveo), ACCL utilizes these interfaces to eliminate the latency of moving data across the PCIe bus to the host NIC. Furthermore, ACCL can be configured to support custom datatypes for arithmetic collectives (reduce/allreduce) if applications require them.
The performance of MPI communication is key to the scalability of many HPC applications, including DNN training. An FPGA can be utilized as a smart NIC to implement, for example, accelerated collectives with ACCL for a DNN training flow, but the FPGA can also execute compute kernels tightly integrated with the networking fabric. This creates the possibility of re-partitioning the DNN training flow between a GPU and an FPGA, and research questions arise on the topics of identifying an optimal partitioning, the achievable speed-up through GPU-FPGA co-execution, and the required modifications to ACCL.
Existing research identifies the pre-processing pipeline and the DNN optimizer as good candidates for offloading to the FPGA, as their compute patterns are either too specialized (e.g. JPEG decompression) or not intensive enough (elementwise operations in the optimizer) to execute efficiently on a GPU.
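To make the optimizer example concrete, a minimal sketch of the elementwise pattern is shown below (plain SGD with momentum). The function name and structure are illustrative assumptions, not code from any framework; the point is that each output element depends only on the corresponding input elements, so the computation streams through with no data reuse and low arithmetic intensity.

```python
# Illustrative sketch of the elementwise optimizer pattern: SGD with
# momentum. Every output element depends only on the matching input
# elements, so a kernel can stream this through the FPGA -- a pattern
# that is memory-bound rather than compute-bound on a GPU.

def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """Return updated (weights, velocity) after one optimizer step."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```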
Description of Work
The internship is focused on:
• identifying opportunities to execute parts of a DNN distributed training pipeline in the FPGA (in addition to the collectives offload with ACCL), including but not limited to the preprocessing and optimizer
• identifying and implementing mechanisms for low-latency, tight integration between compute kernels executing in the FPGA and ACCL, allowing the kernels to request ACCL communication services without host intervention. This may involve modifications to ACCL and potentially the implementation of an HLS library for kernel-to-ACCL communication
• implementing a demonstrator system with at least one part of the DNN training pipeline executing in the FPGA and requesting communication services from ACCL directly.
Internship Duration: 6-9 Months
• Deliver an interface library for HLS components to request services from ACCL (i.e. an HLS driver for ACCL)
• Quantify the value of FPGA-GPU co-execution for various parts of the DNN training pipeline
• Demonstrate a working system with FPGA offload and ACCL-directed communication.
The outcome of the project is an enhanced ACCL and a set of FPGA compute kernels, which together can increase the efficiency of a DNN distributed training pipeline.
Skills and Tools
The work requires (and builds) experience with Python and C++/HLS, as well as working with DNN training frameworks (PyTorch) and FPGA application design frameworks (Vitis). The internship also requires some knowledge of FPGA design (Vivado IP integrator, high-level synthesis). The project does not require networking knowledge.