Anyone have the time to masticate this bolus of a thesis? Perhaps just Chapter 11...
Autonomous and Predictive Systems to Enhance the Performance of Resilient Networks
Chapter 11 AKIDA: Accelerating In-Network, Transient Compute Elasticity using SmartNICs
11.1 Introduction
The proliferation of the Internet of Things (IoT) and the success of rich cloud services have pushed the horizon of a new computing paradigm, edge computing, that requires faster data processing at the network's edge. The edge computing market is expected to reach close to $9 billion by 2025 [254]. As a specific example, the significant factors driving the growth of IoT in the manufacturing market include growing demand for industrial automation, a rising need for centralized monitoring and predictive maintenance of resources, and a rise in the number of cost-effective, intelligent connected devices and sensors, among others. To keep up with this demand, there has been a shift to serverless frameworks for computation [255, 256]. Serverless frameworks allow IoT applications to be deployed within minutes, with the flexibility to reduce or expand operations seamlessly. For instance, serverless functions provide authentication and encryption capabilities on-site instead of uploading the data over a vulnerable network to the cloud. This requires efficient scale-up (or scale-down) of edge computing infrastructure for transient spikes in serverless workloads.
However, managing edge compute capacity on the fly, i.e., transient compute elasticity, carries specific challenges [257, 258]. First, expanding edge deployments is time- and resource-intensive. A typical solution is to over-allocate resources for possible future demand. However, over-allocation leads to under-utilization most of the time and is economically undesirable. Edge computing requires careful in-situ planning of the available resources to achieve its primary objective of faster processing and reduced latencies.
Second, and most importantly, sudden spikes in demand for processing could create compute bottlenecks, leading to service level agreement (SLA) violations. An SLA comprises the agreed-upon Quality of Service (QoS) attributes, which are monitored regularly; failing to meet them can attract hefty penalties.
In this context, we ask the following research question: How could we design an architecture that can handle sudden spikes in demand, address transient elasticity, and allocate compute resources efficiently?
We propose AKIDA, a new edge computing platform that leverages heterogeneous computing nodes (including domain-specific accelerators like SmartNICs) to dynamically allocate computation requirements for workload spikes with minimal cold start latency. We use SoC-based SmartNICs to predict and intelligently load-balance containerized serverless workloads across the heterogeneous compute resources. AKIDA uses untapped general-purpose compute on SmartNICs for in-network application processing when demand escalation is imminent. SmartNICs are ideal candidates for application offload because: (i) they are closer to the data ingress pipeline, which enables them to bypass the network stack overhead at the host server, (ii) they offer SoC-based onboard compute in close proximity for application processing [259, 260], (iii) they are a feasible alternative to traditional servers for short-term compute, and (iv) unused compute cycles on the SmartNICs can be re-purposed for workloads.
To our knowledge, this is the first study to propose containerized application offload to SoC-based SmartNICs. Although prior works have studied the applicability of offloading specific parts of applications, e.g., using P4 programmability, the actor-programming paradigm, etc. [261, 262, 263], those studies are limited to particular applications and require code modification to offload other types of applications. In contrast, AKIDA is designed to offload a network of containers onto the SmartNIC, making it truly application-agnostic and scalable.
Our platform has three unique elements: (i) a workload predictor, (ii) a traffic distributor, and (iii) an orchestrator. The workload predictor estimates the potential change in demand for the next time horizon by extracting fine-grained input features from historical time-series data. The traffic distributor distributes the traffic based on the transient spikes and the CPU load on each cluster node. Finally, the orchestrator sets the threshold levels for intelligent traffic distribution to cluster nodes and manages the end-to-end pipeline for application processing. It can also reallocate workloads to the SmartNICs on the fly if the incoming requests for an application suddenly change. AKIDA's orchestrator can be generalized for scaling edge deployments across multiple servers and different kinds of SmartNICs. Stated otherwise, our system can be scaled to offload applications across different dimensions of heterogeneity (for instance, if the cluster introduces additional compute nodes). This approach enables us to secure a competitive advantage compared to legacy edge architectures and deployments.
This chapter makes the following key contributions:
• Design of a novel architecture that leverages heterogeneous computing nodes (SmartNICs and host server) to facilitate efficient handling of transient spikes at the edge;
• Development and characterization of a workload predictor and an orchestrator that work in tandem to reduce SLA violations, efficiently handle spikes in demand, and reduce cold start latency;
• Characterization of the competitive advantages of our architecture through an in-depth analysis of capital expense costs and overhead savings from minimizing SLA violations.
Our investigation reveals that capital expenditure (CAPEX) can be reduced by 1.5×, while operational expenditure (OPEX) can be decreased by 3.5×. In addition, our architecture demonstrably reduces SLA violations by as much as 20% in real-world deployments.
11.2 Background
This section provides an overview of multicore SoC-based SmartNICs and how they are integrated into the edge computing platform. In addition, we briefly discuss the edge computing architecture and explore some SLA violations that are common in this context.
11.2.1 SmartNICs
There are broadly three categories of network accelerators, or SmartNICs: ASIC-based, FPGA-based, and SoC-based [264, 262]. In this study, we focus on SoC-based SmartNICs only. Multicore SoC-based SmartNICs use embedded CPU cores to process packets, trading some performance for substantially better programmability than ASIC-based designs (e.g., DPDK-style code can be run directly in a familiar Linux environment). For instance, Mellanox Bluefield [259] uses general-purpose CPU cores (ARM64), while others, like Netronome [265], have specific cores for network processing.
SoC-based SmartNICs (e.g., Mellanox) have two modes of operation: embedded and separated. In embedded mode, the interfaces are mapped to the host OS network stack, and the kernel routes packets from the host. In separated mode, the host OS and the SmartNIC have separate, independent network stacks to process packets. While we observe slightly better tail latencies from packet processing in embedded mode, the difference from separated mode is negligible. For AKIDA, we adopt the separated mode due to its programmable flexibility and the ability to run containers directly on the SmartNIC's ARM64 OS.
11.2.2 Edge Computing
The adoption of cloud computing platforms is increasing rapidly. However, efficiently processing the data produced at the edge of the network is a challenging task. Data-driven applications are increasingly deployed at the edge and will consequently benefit from edge computing, which we explore here.
Networking bottlenecks: While cloud-based processing speed keeps developing quickly, network bandwidth has reached a standstill. With the growing quantity of data generated at the edge, the rate of data transportation is becoming the bottleneck for the cloud-based computing paradigm. For instance, we expect autonomous vehicles to output a vast amount of data per hour that needs real-time processing. In this instance, edge computing is preferable to cloud computing because of the significant savings in latency overheads. Additionally, scaling these pipelines to multiple vehicles would require computation at the edge, not in the cloud.
Explosion of IoT: Almost all kinds of electrical devices will become part of IoT and will play the role of both data producers and consumers, such as air quality sensors, LED bars, streetlights, and even an Internet-connected microwave oven.
Reports suggest that the number of IoT devices at the edge will grow into the billions within a few years [266]. Thus, the raw data produced will be enormous, making conventional cloud computing not efficient enough to handle all this data; application processing at the edge could absorb this surge in demand.
Data producers: In the cloud computing paradigm, the end devices at the edge are typically data consumers, for example a smartphone consuming on-demand video streams. However, vast amounts of data are now produced by those same consumers. Changing from a data consumer to a data producer requires placing more functionality at the edge.
11.2.3 SLA Violations
Service Level Agreements are critical when applications are deployed in a Service Oriented Architecture (SOA). SLAs are commonly adopted in cloud computing and, more recently, at the edge. An SLA defines the level of service the consumer expects, based on metrics that the application provider lays out. It comprises the metrics by which the service is measured, such as the monitored Quality of Service (QoS) attributes [267, 268], and the remedies or penalties that apply if a measured metric does not meet the agreed-on service level, which is termed an SLA violation. Two of the most common QoS attributes in an SLA are response time and throughput; we primarily focus on response time.
In edge computing, where resources are limited, response time suffers from high tail latency when the application receives multiple queries at scale. This problem is exacerbated when the host OS carries additional background workload for other applications or maintains the edge infrastructure for its network and storage needs. This leads to SLA violations and a poor application Quality of Experience (QoE) for the consumer. We use the response time metric in Sec. 11.4 to evaluate the penalty with and without additional processing units such as SmartNICs.
11.2.4 Need for Accelerators
There has recently been a lot of industry research on using SmartNICs in cloud data center servers to boost performance by offloading network datapath processing from the servers. This section explains why SmartNICs are essential in the new generation of high-performance computing servers.
The cost of building an interconnection network for a large cluster can significantly affect design decisions. With increasing network interface bandwidths, the gap between network performance and compute performance is widening. This has resulted in increased adoption and deployment of SmartNICs. If SmartNICs were leveraged to offload only network functionalities, they would add 30% more computational capacity to current servers [269]. Typically, SoC-based SmartNICs are priced at 25-30% of the cost of data center servers. Therefore, adding a SmartNIC to perform only network functions is already a wise decision.
However, SmartNICs can do more than network functions. As per our initial analysis, the compute capacity of an SoC-based SmartNIC is generally around 40-50% of server compute capacity. If the additional compute required falls within this range, exploiting the total capacity of SmartNICs to manage workload spikes, instead of adding servers, is the more economical decision. However, the compute available on SmartNICs is currently used primarily for offloading network functions and services. In most cases, that is a severe under-utilization of the available compute power on SmartNICs. It is this under-utilized compute that AKIDA aims to harvest and make available to applications.
11.3 System Overview
We begin by providing an overview of AKIDA, an intelligent fabric software framework that can be deployed on any system whose operating system supports container orchestration, such as servers or server racks, network switches, or edge systems. Figure 11.1 shows the various components of the AKIDA framework. A server can host as many SmartNICs as it has PCIe buses available. We use Kubernetes as the container orchestration system, running on both the host and the SmartNIC OS; this specialized architecture works only on SoC-based SmartNICs [259]. The major components of our core solution are (i) a traffic distributor module that distributes the traffic based on the service time and CPU load of each server and SmartNIC, (ii) a workload prediction module that uses the history of the workload in a window to predict workload spikes, and (iii) the AKIDA orchestrator module, which manages the workload spikes based on the load on the servers and SmartNICs. In the following, we describe our solution for each module.
[Figure 11.1: System overview. Recoverable labels: serverless control plane (collect workload history and predict future spikes; manage the workload spikes using SmartNICs); service gateway receiving serverless requests; heterogeneous data and compute plane with serverless functions on the ARM OS SmartNICs and on the x86 host server; Solution 1: configure traffic distribution; Solution 2: workload prediction; Solution 3: spike management (spike detection and threshold estimation).]
[Figure 11.2: Traffic distributor.]
11.3.1 Traffic Distributor
The current serverless computing design assumes that all computing resource nodes are homogeneous and have the same service time and the same amount of load. In this chapter, we show that this assumption degrades the performance of workloads running on multiple nodes, especially when one of the compute nodes gets overloaded or takes more time to service requests. To clarify the problem, consider two serverless functions A and B that take 2 and 10 seconds, respectively, to run on the SmartNIC and 1 and 5 seconds, respectively, to run on the host OS. When the host OS gets overloaded with other workloads, its response times change to 3 and 8 seconds for functions A and B, respectively (footnote 1). In this example, it is better to run the two functions on the host OS when the host OS is not overloaded; when it gets overloaded, function A can be offloaded to the SmartNIC, where it now finishes faster (2 seconds versus 3), while B still runs faster on the host (8 seconds versus 10).
Footnote 1: We note that these numbers are subject to change from time to time depending on the workload burst and resource congestion on the SmartNICs and host OS servers.
11.3.2 Traffic Distributor
In our design, the queries first arrive at the API gateway of the scheduler within the SmartNIC OS, where our traffic distributor distributes the traffic according to the service time of each SmartNIC's ARM cores or the host OS's cores within a server. We note that the service time of each function is subject to change depending on the workload spikes. Assume that requests arrive with arrival rate $\lambda$, that host OS or SmartNIC $i$ has service rate $\mu_i$, and that each server is modeled as an M/M/1 queue. Since the mean sojourn time of an M/M/1 queue is $1/(\mu_i - \lambda_i)$, the optimal traffic distributor, which makes the sojourn time equal for each queue, satisfies:
$$\lambda_1 - \mu_1 = \lambda_2 - \mu_2 = \dots = \lambda_N - \mu_N \tag{11.1}$$
In other words, the optimal traffic distribution over $N$ servers is as follows:
$$\lambda_i = \mu_i + \frac{\lambda - \sum_{j=1}^{N} \mu_j}{N}, \qquad i = 1, \dots, N \tag{11.2}$$
In the evaluation, we use a heuristic approach and try to avoid distributing traffic to a cluster node whose service time is very high due to workload spikes. The queries are then redirected to the appropriate containerized application pods running on either the host or the SmartNIC OS.
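As a rough illustration of the split in Eq. 11.2, the following Python sketch computes the per-node arrival rates. The function name `split_traffic` and the rule of dropping any node whose share would be negative and re-solving over the rest are assumptions made for this example; the chapter only says that its evaluation uses a heuristic that avoids very slow nodes.

```python
from typing import List


def split_traffic(total_rate: float, service_rates: List[float]) -> List[float]:
    """Per-node arrival rates from Eq. 11.2: lambda_i = mu_i + (lambda - sum(mu)) / N.

    Assumption (not from the chapter): a node whose share would be negative
    receives no traffic, and the split is recomputed over the remaining nodes.
    """
    active = list(range(len(service_rates)))
    shares = [0.0] * len(service_rates)
    while active:
        # Common offset lambda_i - mu_i shared by every active node (Eq. 11.1).
        offset = (total_rate - sum(service_rates[i] for i in active)) / len(active)
        negative = [i for i in active if service_rates[i] + offset < 0]
        if not negative:
            for i in active:
                shares[i] = service_rates[i] + offset
            break
        active = [i for i in active if i not in negative]
    return shares


# Hypothetical rates: a host serving 10 req/s and a SmartNIC serving 5 req/s,
# with 6 req/s arriving in total.
print(split_traffic(6.0, [10.0, 5.0]))
```

With these hypothetical numbers both nodes end up with the same spare capacity (service rate minus assigned rate), which is what equalizes their M/M/1 sojourn times.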
11.3.3 Workload Prediction
To provision for workload spikes proactively and meet the required Service Level Agreements (SLAs), we predict future workload demand ahead of time. We propose a support vector regression (SVR) prediction model that predicts workload bursts in order to trigger the traffic distribution module and to mitigate the impact of containers' cold start latency [270, 271, 272, 273], which can otherwise lead to longer response times for application queries. Our prediction model is based on past observations of the workload over a window of W time units. We change the window size dynamically based on the workload variation over time: we increase the training window size if the workload variation over the current window is less than 10%, and decrease it once the workload variation exceeds 20%.
11.3.4 AKIDA's Orchestrator
AKIDA's orchestrator consists of a resource monitoring module and exploits the output of the prediction module. The resource monitoring module periodically monitors each node's CPU utilization, memory utilization, and service rate in the serverless platform. If the CPU utilization rises above a specified threshold D, or if the service time of application X on one of the nodes in the cluster rises above the specified SLA, we redistribute the workload to dampen the spikes. We use the output of the workload prediction module to predict future spikes ahead of time and perform proactive spike management. Proactive spike management that exploits the prediction module has two benefits: (i) we can redistribute the traffic based on the predicted future workload, which keeps specific server nodes from getting congested, and (ii) it mitigates the containers' cold start latency by starting new containers before the actual load arrives.
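The orchestrator's trigger condition can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `NodeStats`, `overloaded_nodes`, and `should_redistribute` are hypothetical, the per-node SLA check is read here as mean service time exceeding the SLA, and the proactive check is modeled as the predicted arrival rate exceeding the current one by more than a spike threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    """One monitoring sample for a cluster node (host or SmartNIC)."""
    name: str
    cpu_utilization: float  # fraction in [0, 1]
    service_time: float     # mean seconds per request


def overloaded_nodes(nodes: List[NodeStats], cpu_threshold: float,
                     sla_seconds: float) -> List[str]:
    """Names of nodes whose CPU utilization or service time breaks its limit."""
    return [n.name for n in nodes
            if n.cpu_utilization > cpu_threshold or n.service_time > sla_seconds]


def should_redistribute(nodes: List[NodeStats], cpu_threshold: float,
                        sla_seconds: float, current_rate: float,
                        predicted_rate: float, spike_threshold: float) -> bool:
    """True if a node is already overloaded (reactive) or the predictor
    forecasts a rise larger than spike_threshold (proactive)."""
    reactive = bool(overloaded_nodes(nodes, cpu_threshold, sla_seconds))
    proactive = (predicted_rate - current_rate) > spike_threshold
    return reactive or proactive


# Hypothetical samples: the host is at 85% CPU, the SmartNIC is lightly loaded.
nodes = [NodeStats("host", 0.85, 0.4), NodeStats("smartnic-0", 0.30, 0.2)]
print(overloaded_nodes(nodes, cpu_threshold=0.8, sla_seconds=0.5))
```

Keeping the reactive and proactive checks separate makes it easy to start containers early on a forecast alone, which is how the proactive path would help with cold start latency.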
The spike management module updates the service rate $\mu_i$ of each node in the cluster and the request arrival rates in the traffic distributor module, and triggers a new traffic distribution command if the spikes are higher than a specified threshold or if the mean service time of a node in the cluster increases and violates the specified SLA metric.
11.3.5 Auto-scaler
After splitting the traffic between multiple queues, we scale the number of replicas at each queue up or down. Our auto-scaling algorithm is based on the arrival rate of the predicted workload at time $t$ (i.e., $\lambda_t$), the current number of replicas $r_t$, and the current service rate of the replicas at each server/SmartNIC ($\mu_t$). We derive the system utilization as follows:
$$\rho_t = \frac{\lambda_t}{r_t \mu_t} \tag{11.3}$$
Then we calculate the probability that the queue is idle as follows:
$$P_0 = \left[ \sum_{n=0}^{r_t - 1} \frac{(r_t \rho_t)^n}{n!} + \frac{(r_t \rho_t)^{r_t}}{r_t!\,(1 - \rho_t)} \right]^{-1} \tag{11.4}$$
The queue length is
$$L_q = \frac{r_t^{r_t}\, \rho_t^{r_t + 1}}{r_t!\,(1 - \rho_t)^2}\, P_0 \tag{11.5}$$
and the expected waiting time in the queue is $T_q = L_q / \lambda_t$. Given the current number of replicas and the system's service time $T_s$, we calculate the system's latency $T_q + T_s + 2d$ (where $2d$ accounts for the auto-scaling startup latency). If this latency is larger than the target SLA, we increase the number of replicas and find the optimal number of replicas using a binary search algorithm. If $T_q + T_s + d$ is smaller than the target SLA latency, we scale down and find the optimal number of replicas using a binary search algorithm.
[Figure 11.3: Real world experimental setup. Recoverable labels: two DL360 Gen9 servers, a network switch, external and internal links.]
11.4 Competitive Advantages
We set up the AKIDA testbed using DL380 Gen9 servers and two Mellanox Bluefield [259] SmartNICs per server, as shown in Figure 11.3.
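As a concrete reading of the auto-scaler in Section 11.3.5, the Python sketch below evaluates Eqs. 11.3 to 11.5 and binary-searches for the smallest replica count that meets the SLA. Several details are assumptions for illustration only: the function names, taking the service time as $T_s = 1/\mu_t$, using the scale-up form $T_q + T_s + 2d$ for every candidate, and capping the search at a hypothetical `max_replicas`.

```python
import math
from typing import Optional


def queue_wait(arrival_rate: float, service_rate: float, replicas: int) -> float:
    """Expected queueing delay T_q = L_q / lambda_t from Eqs. 11.3-11.5."""
    if arrival_rate <= 0:
        return 0.0
    rho = arrival_rate / (replicas * service_rate)           # Eq. 11.3
    if rho >= 1.0:
        return math.inf                                      # unstable queue
    load = replicas * rho
    p0 = 1.0 / (sum(load ** n / math.factorial(n) for n in range(replicas))
                + load ** replicas / (math.factorial(replicas) * (1.0 - rho)))  # Eq. 11.4
    lq = (replicas ** replicas * rho ** (replicas + 1)
          / (math.factorial(replicas) * (1.0 - rho) ** 2)) * p0                 # Eq. 11.5
    return lq / arrival_rate


def min_replicas(arrival_rate: float, service_rate: float, sla: float,
                 startup_delay: float, max_replicas: int = 64) -> Optional[int]:
    """Smallest replica count whose latency T_q + T_s + 2d meets the SLA.

    Returns None if even max_replicas (an assumed cap) cannot meet it.
    Binary search is valid because T_q never increases as replicas are added.
    """
    def latency(r: int) -> float:
        return queue_wait(arrival_rate, service_rate, r) + 1.0 / service_rate + 2.0 * startup_delay

    if latency(max_replicas) > sla:
        return None
    lo, hi = 1, max_replicas
    while lo < hi:
        mid = (lo + hi) // 2
        if latency(mid) <= sla:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical load: 5 req/s arriving, each replica serving 10 req/s.
print(min_replicas(arrival_rate=5.0, service_rate=10.0, sla=0.15, startup_delay=0.0))
```

For a single replica the formulas reduce to the familiar M/M/1 waiting time $\rho/(\mu - \lambda)$, which is a convenient sanity check on the reconstruction of Eqs. 11.4 and 11.5.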
We deployed a Kubernetes cluster over both the server and the SmartNIC OS to obtain heterogeneous multicore cluster nodes. We implemented a prototype based on the OpenFaaS serverless infrastructure and evaluated it on three popular serverless workloads: (i) a CPU-intensive Fibonacci function, (ii) a latency-sensitive key-value store, and (iii) a sentiment analysis function that uses machine learning to perform natural language processing. We built the functions to run on a multi-architecture platform, including the x86 host OS and the SmartNICs' ARM cores.
We first ran initial experiments to find the compute capacity of the SmartNICs by running Fibonacci functions on the SmartNICs and the host. We observe compute capacity close to that of the host's resources. Figure 11.4(a) shows the execution time of the Fibonacci function on the host OS and the SmartNIC as we increase the Fibonacci number to compute. We observe that SmartNICs have compute capacity comparable to x86-64 hosts, which assures us that the SmartNICs are capable of running workloads while processing incoming packet traffic.
We also ran initial experiments on an online prediction model that predicts future workloads ahead of time, to narrow down the best-performing algorithm for our solution. We used 10,000 data points from real serverless workloads that provide an appropriate workload for a ride-sharing application handling ride requests [274, 275]. Figure 11.4(b) shows the workload prediction using the RBF and linear kernels in the SVR prediction model when we train the model over a window size of 100 seconds and predict the future workload d seconds ahead of time. As shown, the RBF kernel performs better than the linear kernel. In the following sub-sections, we investigate the different design choices data centers have for managing load spikes.
[Figure 11.4: Experimental results on the real world testbed. a. Response time of the SmartNIC and host OS. b. Predicting the workload d seconds ahead of time, where d = 10.]
11.4.1 Performance Benefits
To evaluate the performance benefit of using SmartNICs in the cluster under high CPU load, we perform a set of experiments on the three serverless functions in our testbed, using the OpenFaaS serverless platform with the hey HTTP(S) load generator [276], and emulate transient spikes using a stress tool [277]. Figure 11.5 shows the response time distribution for the different functions. The SLA threshold is specified by the application and exposed to the scheduler.
[Figure 11.5: Response time distribution of different functions. a. Fibonacci. b. Key-value store. c. Sentiment analysis.]
We first run the default OpenFaaS scheduler on one server. We introduce stress on the host server and increase the average CPU utilization to 80% by running a background serverless workload averaging 200 queries per second (Case 1: one server with background workload). The tail latency increases when the host OS has a high load, leading to SLA violations. Adding another server with uniform traffic distribution (the default Kubernetes scheduler) to the baseline (two servers, one with and one without background workload) does not solve the problem, since half of the queries are still routed to the overloaded host. Next, we run the workload on two servers with load-aware proportional traffic distribution (Case 2: two servers with proportional traffic distribution similar to AKIDA's traffic distributor). In AKIDA, we detect the overloaded node in the cluster and avoid routing traffic to that node. We run AKIDA in two configurations: with one SmartNIC and with two SmartNICs on the same server. Although the SmartNICs have lower computational power than the host, when a transient spike overloads the host CPU, AKIDA leverages the SmartNICs' compute capacity to reduce SLA violations.
11.4.2 Cost Benefits
In this section, we perform a cost analysis of the cluster design choices, based on the actual CPU utilization dataset in [278], to compare two network designs: over-provisioning servers to meet the SLA during workload spikes, and using SmartNICs to manage the spikes. We assume a SmartNIC costs about 15-20% as much as a server. We calculate CAPEX and OPEX for three resource deployment scenarios that accommodate the spikes at each edge node: (i) two servers, (ii) one server and one SmartNIC, and (iii) one server and two SmartNICs. The x-axis shows the number of edge nodes in each case (i.e., 1 edge node + 1 extra server, 1 edge node + 1 SmartNIC, and 1 edge node + 2 SmartNICs).
Figure 11.6(a) shows the capital expenses for building a cluster in scenarios (i), (ii), and (iii). As shown, the SmartNICs provide extra computational capacity to the cluster at a much lower cost. The total cost of the cluster is reduced by a factor of 1.5 and 1.55 when using one or two SmartNICs at each x86 host, respectively. This section's CAPEX and OPEX calculations are based on rough numbers available for the cost and maximum energy consumption of the servers and SmartNICs in our testbed.
Figure 11.6(b) shows operational expenses by tracking the maximum power (one of the main contributors to OPEX) used in the cluster for scenarios (i), (ii), and (iii). The SmartNICs used in our testbed are 3.5× more energy efficient than the host server. Figure 11.6(b) shows that the maximum power usage of the cluster is reduced by a factor of 1.5 and 1.27 when having one or two SmartNICs at each server, respectively.
[Figure 11.6: Operational performance as the cluster size increases. a. Capital expenditure. b. Operational expenditure.]
https://www.bing.com/videos/search?...&view=detail&FORM=VIRE&form=VDRVRV&ajaxhist=0
11.3 System Overview
We begin by providing an overview of AKIDA, an intelligent fabric software framework that can be deployed on any container orchestration supporting Operating Systems such as Servers/Server Racks, Network Switches, or Edge-systems. Figure 11.1 shows the various components of AKIDA framework. The server can host any number of SmartNICs as the number of PCIe buses available. We use Kubernetes as the container orchestration system that runs on the host and SmartNIC OS, and this specialized architecture works only on SoC-based SmartNIC architecture.
This is the referenced patent:
US11436054B1 Directing queries to nodes of a cluster of a container orchestration platform distributed across a host system and a hardware accelerator of the host system
Example implementations relate to edge acceleration by offloading network dependent applications to a hardware accelerator. According to one embodiment, queries are received at a cluster of a container orchestration platform.
The cluster includes a host system and a hardware accelerator, each serving as individual worker machines of the cluster. The cluster further includes multiple worker nodes and a master node executing on the host system or the hardware accelerator.
A first worker node executes on the hardware accelerator and runs a first instance of an application.
A distribution of the queries is determined among the worker machines based on a queuing model that takes into consideration the respective compute capacities of the worker machines.
Responsive to receipt of the queries by the host system or the hardware accelerator, the queries are directed to the master node or one of the worker nodes in accordance with the distribution.
The use of the word AKIDA should be brought to the attention of BrainChip's patent attorneys to head off a potential trade mark infringement.