
The 1st NTU-Imperial Workshop on Future Cloud Systems

The workshop brings together researchers from NTU Singapore, Imperial College London, and other academic and industry organizations to exchange ideas and spearhead advances in cloud computing systems.

Hybrid mode: in person and on Zoom
Registration deadline: June 5th (Wednesday)
Workshop date: June 10th


Venues
NTU: LHN-TR+12 (The Arc, level B2)
Imperial: Huxley 217


Topics

  • Serverless cloud architectures and virtualization
  • Heterogeneous architectures and accelerators
  • Networking infrastructure and disaggregated systems
  • System support for machine learning and large language models
  • Edge and real-time systems

Schedule

The emerging Compute Express Link (CXL) technology is quickly garnering widespread adoption in the industry. CXL is well-positioned to revolutionize the way server systems are built and deployed, as it enables new capabilities in memory system design. CXL-centric or CXL-augmented memory systems bear characteristics that cater well to the growing memory capacity and bandwidth demands of modern workloads. In this talk, we will focus on two specific use cases we have identified for CXL-centric memory systems in the context of (i) bandwidth-intensive server workloads and (ii) memory pooling in large-scale NUMA systems. We will additionally briefly discuss ongoing research directions in the same space.
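
As a rough, back-of-the-envelope illustration (not taken from the talk), the Python sketch below estimates the aggregate bandwidth a bandwidth-bound workload could see when its memory traffic is interleaved across local DDR channels and a CXL-attached expander; the bandwidth figures are assumed purely for illustration.

```python
# Toy model (hypothetical numbers): estimate aggregate memory bandwidth when
# traffic is interleaved across local DDR and a CXL-attached memory expander.

def best_split(bw_local_gbs: float, bw_cxl_gbs: float) -> tuple[float, float]:
    """Interleave traffic in proportion to each tier's bandwidth so both tiers
    saturate together; returns (fraction_to_cxl, aggregate_bandwidth)."""
    total = bw_local_gbs + bw_cxl_gbs
    return bw_cxl_gbs / total, total

if __name__ == "__main__":
    # Assumed figures for illustration only: multi-channel DDR5 vs. one CXL x8 link.
    frac_cxl, aggregate = best_split(bw_local_gbs=300.0, bw_cxl_gbs=50.0)
    print(f"send {frac_cxl:.0%} of traffic to CXL -> ~{aggregate:.0f} GB/s aggregate")
```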

Today’s data centers consist of thousands of network-connected hosts, each with CPUs and accelerators such as GPUs and FPGAs. These hosts also contain network interface cards (NICs), operating at speeds of 100 Gb/s or higher, that are used to communicate with each other. In this talk, we will present RecoNIC, an FPGA-based, RDMA-enabled SmartNIC platform designed for compute acceleration. By bringing network data as close to computation as possible, RecoNIC minimizes the data-copy overhead of traditional CPU-centric accelerator systems. Since RDMA is the de-facto transport-layer protocol for improved communication in data center workloads, RecoNIC includes an RDMA offload engine for high-throughput, low-latency data transfers. Developers have the flexibility to design their accelerators using RTL, HLS, or Vitis Networking P4 within RecoNIC’s programmable compute blocks. These compute blocks can access host memory as well as memory in remote peers through the RDMA offload engine. Furthermore, the RDMA offload engine is shared by both the host and the compute blocks, which makes RecoNIC a highly flexible platform. We have open-sourced RecoNIC to enable the research community to experiment with RDMA-based applications and use cases.
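
As a hedged illustration of why near-NIC compute can help (a toy model, not RecoNIC's actual API or measured behaviour), the sketch below counts how many payload bytes cross host memory in a CPU-centric accelerator path versus a SmartNIC path where the RDMA engine delivers data directly to on-NIC compute blocks; the message and result sizes are assumptions.

```python
# Toy comparison (illustrative only): bytes crossing host memory for a
# CPU-centric accelerator path vs. a SmartNIC path with on-NIC compute.

def cpu_centric_host_bytes(payload_bytes: int) -> int:
    # NIC -> host DRAM, then host DRAM -> accelerator: the payload crosses
    # host memory twice before any computation happens.
    return 2 * payload_bytes

def smartnic_host_bytes(payload_bytes: int) -> int:
    # The payload stays on the NIC and is processed by the compute block;
    # only a small result (assumed 64 bytes) is written back to the host.
    result_bytes = 64
    return result_bytes

if __name__ == "__main__":
    payload = 1 << 20  # 1 MiB message, assumed for illustration
    print("CPU-centric host-memory traffic:", cpu_centric_host_bytes(payload), "bytes")
    print("SmartNIC host-memory traffic:   ", smartnic_host_bytes(payload), "bytes")
```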

Datacenter congestion management protocols must navigate the throughput-latency buffering trade-off in the presence of growing constraints due to switching hardware trends, oversubscribed topologies, and varying network configurability and features. In this context, receiver-driven protocols, which schedule packet transmissions instead of reacting to congestion, have shown great promise and work exceptionally well when the bottleneck lies at the ToR-to-receiver link. However, independent receiver schedules may collide if a shared link is the bottleneck instead.
In this talk, I will present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled while shared links should be managed through traditional congestion control algorithms. The approach achieves the best of both worlds by allowing precise control of the most common bottleneck and robust bandwidth sharing for shared bottlenecks. SIRD is implemented by end hosts and does not depend on Ethernet priorities or extensive network configuration.
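
To make the receiver-driven idea concrete, here is a minimal sketch under our own simplifying assumptions (it is not SIRD's actual algorithm): the receiver explicitly schedules its own downlink by handing out credits to pending senders, while senders fall back to a simple AIMD window for shared upstream bottlenecks.

```python
# Minimal sketch (assumptions, not SIRD's algorithm): receiver-driven credit
# scheduling for the single-owner downlink, AIMD for shared links.

from collections import deque

class Receiver:
    def __init__(self, link_bytes_per_tick: int):
        self.link_bytes_per_tick = link_bytes_per_tick
        self.pending = deque()                 # senders with data for this receiver

    def grant_credits(self) -> dict:
        """Split one tick of downlink capacity across pending senders, round-robin."""
        grants, budget = {}, self.link_bytes_per_tick
        while self.pending and budget > 0:
            sender = self.pending.popleft()
            chunk = min(budget, 1500)          # one MTU per grant, for illustration
            grants[sender] = grants.get(sender, 0) + chunk
            budget -= chunk
            self.pending.append(sender)
        return grants

class Sender:
    def __init__(self):
        self.cwnd = 10 * 1500                  # window used only on shared links

    def on_congestion_signal(self):
        self.cwnd = max(1500, self.cwnd // 2)  # multiplicative decrease

    def on_ack(self):
        self.cwnd += 1500                      # additive increase

if __name__ == "__main__":
    rx = Receiver(link_bytes_per_tick=12_000)
    rx.pending.extend(["senderA", "senderB"])
    print(rx.grant_credits())                  # e.g. {'senderA': 6000, 'senderB': 6000}
```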

Mobile edge computing (MEC) has emerged as a promising paradigm for enabling IoT devices to handle computation-intensive applications. Due to the limited communication resources of the wireless network and the limited computation resources of MEC servers, efficient job mapping and resource management strategies become crucial, especially for time-critical applications. In this presentation, we will introduce a resource allocation problem and a resource scheduling problem for deadline-constrained jobs in MEC with both communication and computation contention. Because these problems are NP-hard, we focus on designing approximation algorithms. In particular, we will answer two questions: (1) How do the resource allocation and scheduling problems in MEC differ from traditional resource allocation (i.e., the knapsack problem and its variants) and scheduling (i.e., flow shop) problems? (2) Can we approximate these problems in MEC, and how good are the resulting approximations?
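
As a purely illustrative example of the flavour of such problems (not the algorithms presented in the talk), the sketch below greedily admits deadline-constrained jobs that each need a communication phase on a shared wireless link followed by a computation phase on one of several edge servers.

```python
# Illustrative sketch only: greedy EDF-style admission of deadline-constrained
# jobs that contend for one shared wireless link and a pool of edge servers.

from dataclasses import dataclass

@dataclass
class Job:
    comm: int       # transmission time on the shared link
    comp: int       # processing time on an edge server
    deadline: int

def greedy_admit(jobs: list[Job], num_servers: int) -> list[Job]:
    admitted = []
    link_free = 0                        # when the shared link is next idle
    server_free = [0] * num_servers      # when each server is next idle
    for job in sorted(jobs, key=lambda j: j.deadline):   # earliest deadline first
        finish_comm = link_free + job.comm
        s = min(range(num_servers), key=lambda i: server_free[i])
        finish_comp = max(finish_comm, server_free[s]) + job.comp
        if finish_comp <= job.deadline:  # admit only if the deadline is met
            admitted.append(job)
            link_free = finish_comm
            server_free[s] = finish_comp
    return admitted

if __name__ == "__main__":
    jobs = [Job(2, 5, 12), Job(1, 3, 6), Job(3, 4, 9)]
    print(f"admitted {len(greedy_admit(jobs, num_servers=2))} of {len(jobs)} jobs")
```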

Reinforcement learning (RL) is a key technology for solving hard decision-making problems, surpassing human game play, and enabling conversational AI bots such as ChatGPT. Supporting RL workloads poses new challenges: RL jobs exhibit complex execution and communication patterns, and rely on large amounts of generated training data. Current RL systems fail to support RL algorithms efficiently on GPU clusters: they either hard-code algorithm-specific strategies for parallelisation and distribution, or they accelerate only parts of the computation on GPUs (e.g., DNN policy updates).

In this talk, I will argue that current RL systems lack an abstraction that decouples the definition of an RL algorithm from its strategy for distributed execution. I will describe our work on MSRL, a distributed RL training system that uses the new abstraction of a fragmented dataflow graph (FDG) to execute RL algorithms in a flexible way. An FDG maps functions from the RL training loop to independent parallel dataflow fragments. Fragments can execute on different devices through a low-level dataflow implementation, e.g., an operator graph of a DNN engine, a CUDA GPU kernel, or a multi-threaded CPU process. Our experiments show that MSRL exposes trade-offs between different execution strategies, while surpassing the performance of existing RL systems with fixed strategies.
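
The following is a conceptual sketch of what a fragmented dataflow graph could look like, with invented function and device names; it is a simplified illustration rather than MSRL's real API.

```python
# Conceptual sketch (assumed, simplified): functions of an RL training loop are
# wrapped as fragments, each pinned to a device and connected by explicit inputs.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Fragment:
    name: str
    fn: Callable                                   # piece of the training loop this fragment runs
    device: str                                    # e.g. "cpu:0", "gpu:0"
    inputs: list = field(default_factory=list)     # upstream fragment names

def act(policy_params, env_state):
    ...   # roll out the environment with the current policy

def store(trajectories):
    ...   # append trajectories to the training buffer

def learn(batch):
    ...   # compute gradients and update the policy DNN

fdg = [
    Fragment("actors",  act,   device="cpu:0"),
    Fragment("buffer",  store, device="cpu:1", inputs=["actors"]),
    Fragment("learner", learn, device="gpu:0", inputs=["buffer"]),
]
# A scheduler is then free to replicate "actors" across machines or co-locate
# "buffer" with "learner" without changing the algorithm definition.
```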

Serverless computing has revolutionized cloud architecture by offloading resource management to cloud providers, thereby speeding up the deployment of commercial services and broadening their adoption. As the demand for Generative AI (GAI) applications grows, existing CPU-centric cloud architectures fall short of meeting the needs of large-scale GAI applications. In particular, existing cloud systems lack support for the elastic management of heterogeneous clusters with CPU/GPU/xPU hardware and the high-speed communication fabrics essential for large-scale GAI deployments. In this talk, I will discuss the state of serverless cloud systems and introduce the serverless cloud systems research ecosystem we have been building with colleagues from the University of Edinburgh and ETH Zurich. I will then discuss our latest work on providing elasticity in serverless systems for modern and emerging cloud applications, including large-scale LLM inference, and highlight the future research directions we are exploring at the HyScale lab at NTU Singapore.
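
As a toy illustration of elastic capacity management for LLM inference (an assumption-laden sketch, not the HyScale systems discussed in the talk), the snippet below sizes a pool of GPU-backed replicas from the observed request rate, per-replica throughput, and a target utilization headroom.

```python
# Toy autoscaling policy (illustrative assumptions only): pick the number of
# GPU-backed inference replicas needed to serve the current token demand.

import math

def replicas_needed(req_per_s: float, tokens_per_req: float,
                    replica_tokens_per_s: float, target_util: float = 0.7,
                    min_replicas: int = 1) -> int:
    demand = req_per_s * tokens_per_req                  # tokens/s to be served
    capacity_per_replica = replica_tokens_per_s * target_util
    return max(min_replicas, math.ceil(demand / capacity_per_replica))

if __name__ == "__main__":
    # Assumed numbers for illustration only.
    print(replicas_needed(req_per_s=40, tokens_per_req=256, replica_tokens_per_s=2500))
```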

With the rapid development of deep learning technology, it is imperative for IT companies and research institutes to set up large-scale GPU datacenters for training and deploying deep learning-based applications. Efficiently scheduling these workloads and managing the valuable underlying resources is an important yet challenging problem. In this talk, I will introduce several novel scheduling systems from multiple perspectives. (1) For users, we design a novel scheduling framework to satisfy different requirements and demands for model training. (2) For datacenter operators, we leverage deep learning interpretability to achieve transparent and convenient scheduling.

Disaggregated heterogeneous data centers promise higher efficiency, lower total cost of ownership, and more flexibility for data center operators. However, current software stacks can levy a high tax on application performance. Applications and OSes are designed for systems where local PCIe-connected devices are centrally managed by CPUs, but this centralization introduces unnecessary messages through the shared data center network in a disaggregated system.

In this talk, I will present our work on FractOS, a distributed OS designed to minimize the overheads of disaggregation in heterogeneous data centers. FractOS elevates devices to first-class citizens in the system, enabling direct peer-to-peer data transfers and task invocations among them, without centralized application and OS control. Accelerating a heterogeneous application with FractOS yields 47% better performance while reducing network traffic by 3x.
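
As a rough way to see where the savings come from (a toy message-count model, not FractOS's actual protocol), the sketch below compares a pipeline of N devices that returns to the host between every stage with one where devices invoke each other directly.

```python
# Toy message-count model (illustration only): host-mediated vs. peer-to-peer
# invocation of a pipeline of accelerator stages.

def cpu_centric_msgs(num_stages: int) -> int:
    # host -> device and device -> host for every stage
    return 2 * num_stages

def peer_to_peer_msgs(num_stages: int) -> int:
    # host starts the chain, devices hand off directly, last device replies
    return num_stages + 1

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "stages:", cpu_centric_msgs(n), "msgs vs", peer_to_peer_msgs(n), "msgs")
```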

Organizers

Dmitrii Ustiugov (NTU), Marios Kogias (Imperial), Arvind Easwaran (NTU)

© Copyright 2024 NTU-Imperial workshop organizers - All Rights Reserved
