Internship on applications of FPGA smart networking for DNN training acceleration
Internships at the Xilinx Research Labs in Dublin, Ireland
Xilinx Research Labs is a small, diverse and dynamic part of Xilinx. Through customization and tailored solutions, we investigate how programmable logic and FPGAs can help make data centers faster, cheaper and greener by accelerating common applications and reducing the energy consumption of a given workload. Our team conducts cutting-edge research in topics such as machine learning, HPC and video processing, both to push the performance envelope of today's devices and to help shape the next big thing in computing. The team in Dublin focuses in particular on deep neural networks, including training paradigms and techniques, novel hardware-friendly topologies, quantization techniques, and custom hardware architectures that support the enormous computational workloads associated with the roll-out of AI, even in energy-constrained compute environments. Fulfilling this goal requires top talent, so we are looking to enrich our team with outstanding engineers with bold, collaborative and creative personalities from top universities worldwide.
ACCL is an implementation of MPI-like communication utilizing direct FPGA-to-FPGA Ethernet. It supports send and receive operations, as well as collective operations (illustrated with standard MPI in the sketch after this list), which include:
• Broadcast: send one data structure to all other nodes
• Scatter: divide a data structure and send each piece to a different node
• (All)Gather: the reverse of scatter; the All variant delivers the gathered result to every node
• (All)Reduce: gather data from other nodes and compute the elementwise sum of the received data; the All variant delivers the result to every node
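Since ACCL mirrors MPI semantics, the sketch below illustrates these collectives using mpi4py against standard MPI. It shows only the communication pattern ACCL implements, not ACCL's own API (its Python and C++ bindings differ).

    # Illustration of MPI collective semantics via mpi4py, not ACCL's own bindings
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Broadcast: rank 0 sends the same buffer to all ranks
    data = np.arange(4, dtype=np.float32) if rank == 0 else np.empty(4, dtype=np.float32)
    comm.Bcast(data, root=0)

    # Scatter: rank 0 splits a buffer and sends one piece to each rank
    sendbuf = np.arange(size * 4, dtype=np.float32) if rank == 0 else None
    piece = np.empty(4, dtype=np.float32)
    comm.Scatter(sendbuf, piece, root=0)

    # Gather: the reverse of scatter; rank 0 collects one piece from each rank
    gathered = np.empty(size * 4, dtype=np.float32) if rank == 0 else None
    comm.Gather(piece, gathered, root=0)

    # Allreduce: elementwise sum across all ranks, result delivered to every rank
    total = np.empty_like(piece)
    comm.Allreduce(piece, total, op=MPI.SUM)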
FPGA offloading of MPI enables communication between FPGA-resident compute kernels without CPU intervention, reducing communication latency. On FPGA accelerator boards equipped with network interfaces (e.g. Alveo), ACCL utilizes these interfaces directly, eliminating the latency of moving data across the PCIe bus to the host NIC. Furthermore, ACCL can be configured to support custom datatypes for arithmetic collectives (reduce/allreduce) where applications require them.
The performance of MPI communication is key to the scalability of many HPC applications, including DNN training. Because of this, several software frameworks have emerged to support distributed training, including PyTorch DDP and Horovod. These frameworks abstract away the details of collective communication and implement distribution in DNN training transparently and efficiently. As such, they provide an opportunity for new communication libraries such as ACCL to be seamlessly inserted into a DNN training flow. Some frameworks, such as Horovod, also support network compression, i.e. casting data to a lower precision before transmission to reduce communication latency.
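As an illustration, below is a minimal sketch of how Horovod exposes compression in a PyTorch training script; the model and hyperparameters are placeholders.

    import torch
    import horovod.torch as hvd

    hvd.init()

    model = torch.nn.Linear(128, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are cast to fp16 before the allreduce
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        compression=hvd.Compression.fp16,
    )

    # Keep all workers' initial weights in sync
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)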
Description of Work
The internship is focused on:
• Identifying opportunities to integrate ACCL with a high-level distributed DNN training framework, potentially Horovod given its existing support for compression, though other frameworks are not excluded
• Implementing this integration and evaluating its performance
Internship Duration: 6-9 Months

Over the course of the internship, the intern will:
• Compare and contrast distributed training frameworks from the perspective of their ability to integrate with ACCL
• Implement the integration, making sure to expose key advantages of ACCL, including support for custom datatypes and compression
• Identify a suitable DNN training benchmark for comparing ACCL-based training against other communication backends. The benchmark will cover several methods of distribution, including data-parallel (DP), model/pipeline-parallel (MP), and Facebook's Fully Sharded Data Parallel (FSDP); a minimal data-parallel step is sketched after this list
• Evaluate and compare ACCL with other backends at scale, and publish the results
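The sketch below shows the shape of a data-parallel benchmark step using stock PyTorch DDP. The single-process gloo setup, model, and batch size are illustrative stand-ins for a real multi-node run with the backend under test.

    import os, time
    import torch
    import torch.nn.functional as F
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Single-process gloo group for illustration; a real benchmark runs multi-node
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(1024, 1024))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(32, 1024), torch.randn(32, 1024)

    start = time.time()
    for _ in range(10):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()  # DDP performs the gradient allreduce here
        opt.step()
    print(f"avg step time: {(time.time() - start) / 10:.4f}s")

    dist.destroy_process_group()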
The outcome of the project is a fork (and an evaluation thereof) of one popular distributed training framework that utilizes ACCL as its communication backend, thereby enabling transparent distributed training of DNNs with FPGA cards acting as network interfaces.
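One possible integration point is PyTorch's pluggable process-group mechanism, which lets a third-party backend be registered by name; in the sketch below, ACCLProcessGroup is a hypothetical placeholder, not an existing class.

    import torch.distributed as dist

    def _create_accl_process_group(store, rank, size, timeout):
        # A real implementation would return a ProcessGroup subclass
        # backed by ACCL's C++/Python bindings (placeholder below)
        return ACCLProcessGroup(store, rank, size, timeout)

    dist.Backend.register_backend("accl", _create_accl_process_group)

    # The backend could then be selected like any built-in one:
    # dist.init_process_group(backend="accl", rank=0, world_size=2)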
Skills and Tools
The work requires (and builds) experience with Python and C++, as well as with DNN training frameworks (PyTorch) and FPGA application design frameworks (Vitis). The internship does not require low-level FPGA design or networking skills, as ACCL provides convenient Python and C++ bindings.