Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms

Abstract

This paper studies Microsoft Azure's workload in detail, focusing on the variables that matter for workload prediction in a virtual machine (VM) environment. It analyses the characteristics of Azure's production VMs.

The main idea is to study these characteristics and use the insights to help the provider's resource management system. E.g. VM behaviour is consistent over time.

Based on this observation, they built Resource Central (RC), which learns from historical data, predicts VM behaviour online, and serves the predictions to resource managers through a client library. E.g. the scheduler uses predictions to oversubscribe servers with oversubscribable VM types while retaining VM performance.

They used VM traces to show that the predictions increase utilisation and prevent physical resource exhaustion.

Introduction

Motivation

Cloud computing is widely used today, and providers such as Microsoft Azure, AWS, and GCP are booming

Offering more features with better performance is the key to competing

New features have to be added without raising datacenter costs too much

The provider runs different types of workloads, from internal services and external customers, on the same shared datacenter

Delivering a feature with good performance, availability, reliability, and scalability, without a complex architecture or high cost, is challenging

The true characteristics of workloads in a cloud environment are still unclear

No prior study covers the lifetime (time between creation and termination) or the resource consumption distribution of these providers' production VMs

Research to date has not focused on VMs in large cloud providers

Idea 1: Accurately predicting resource utilisation at VM deployment time would allow resource-contention-aware placement and would help evaluate the need for VM migration

Idea 2: If VM lifetime can be predicted accurately, the health management system can plan server maintenance without VM migration or downtime

The main idea is prediction-based resource management

Our work

Analysis of the Azure VM workload, including the distributions of VM size, lifetime, resource consumption, utilisation pattern, and deployment size.
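To make two of these metrics concrete, here is a minimal sketch that computes a VM's lifetime (time between creation and termination) and a high-percentile CPU utilisation from trace records. The record layout, the sample values, and the nearest-rank percentile method are assumptions for illustration; they are not the paper's tooling or the schema of the public Azure dataset.

```python
import math

# Hypothetical trace records: (vm_id, created_s, deleted_s, cpu_readings_percent)
trace = [
    ("vm-a", 0, 3600, [10, 12, 15, 90, 11]),
    ("vm-b", 600, 87000, [40, 45, 50, 55, 60]),
]

def lifetime_seconds(created_s, deleted_s):
    """Lifetime = time between VM creation and termination."""
    return deleted_s - created_s

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# One summary row per VM: its lifetime and its 95th-percentile CPU reading.
summary = {
    vm_id: {
        "lifetime_s": lifetime_seconds(created, deleted),
        "p95_cpu": percentile(cpu, 95),
    }
    for vm_id, created, deleted, cpu in trace
}
```

Collecting these per-VM summaries over a whole trace gives the distributions that the analysis looks at.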

Learning: VM behaviour is consistent across multiple lifetimes, so historical data can be used to predict future behaviour

ML algorithms are used to predict the values online
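The "consistent behaviour" insight can be illustrated with a very simple history-based baseline: predict that a customer's next VM will fall into the utilisation bucket that the same customer's earlier VMs most often fell into. This is only a sketch of the idea, not the ML models used by RC; the class name, the customer key, and the four 25-point buckets are assumptions made for the example.

```python
from collections import Counter, defaultdict

def bucket(p95_cpu):
    """Map a P95 CPU utilisation (0-100) to one of four illustrative buckets."""
    return min(int(p95_cpu // 25), 3)

class HistoryBaseline:
    """Predicts the most frequent past bucket for each customer (hypothetical)."""

    def __init__(self):
        self._history = defaultdict(Counter)

    def observe(self, customer_id, p95_cpu):
        # Record the bucket of a VM that has finished running.
        self._history[customer_id][bucket(p95_cpu)] += 1

    def predict(self, customer_id):
        # Return (bucket, confidence); (None, 0.0) when there is no history.
        counts = self._history.get(customer_id)
        if not counts:
            return None, 0.0
        label, hits = counts.most_common(1)[0]
        return label, hits / sum(counts.values())

baseline = HistoryBaseline()
for reading in (10, 20, 60):
    baseline.observe("customer-1", reading)
prediction, confidence = baseline.predict("customer-1")
```

Returning a confidence alongside the bucket lets a caller ignore weak predictions, and returning `None` for unseen customers forces the caller to pick a safe default.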

Related works

Characterisation of VM workloads: earlier studies covered user patterns and the economics of the public cloud, focusing on resource demand volatility and pricing

Google's studies cover container workloads running on bare metal

Google's previous research is quite different because it does not cover a public cloud. Compared with containers, VMs are likely to live longer, produce lower resource utilisation, and be deployed in smaller numbers. Public cloud providers must encapsulate their customers' workloads in VMs, which adds more overhead.

For example, take a MapReduce job. With VMs, a group of VMs is created for all the tasks, and the VMs can only be removed once all the tasks are completed. With containers, each map and reduce task gets its own container, which can be sized accurately for that task
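The MapReduce example can be worked through with a little arithmetic. The sketch below assumes one VM or container per task, all tasks starting at the same time, and made-up task durations; none of these numbers come from the paper.

```python
# Hypothetical task durations in minutes for one MapReduce job.
task_minutes = [5, 7, 6, 30]

# VM model from the example: one VM per task, every VM kept until the job ends.
job_length = max(task_minutes)
vm_minutes = job_length * len(task_minutes)

# Container model: each container is released when its own task finishes.
container_minutes = sum(task_minutes)

# Share of the reserved VM time that actually runs a task.
useful_fraction_vm = container_minutes / vm_minutes
```

With these numbers the VMs are reserved for 120 VM-minutes while only 48 minutes of work are done, which is why the VM-based job shows longer lifetimes and lower utilisation than the container-based one.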

Machine learning and prediction-serving systems: Many machine learning frameworks are available. This paper relies on TLC, an internal Microsoft framework that offers many learning algorithms

An add-on here is that RC caches the prediction results, models, and feature data on the client side. This approach lets the system keep operating even when the data store or the connectivity to it is unavailable
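A minimal sketch of the client-side caching idea follows. It caches prediction results only (not models or feature data), and the class name, the `fetch` callable, and the time-to-live parameter are hypothetical; this is not RC's actual client library.

```python
import time

class CachingPredictionClient:
    """Caches predictions locally and serves stale ones if the store is down."""

    def __init__(self, fetch, ttl_s=300.0, clock=time.monotonic):
        self._fetch = fetch      # callable(key) -> prediction; may raise ConnectionError
        self._ttl_s = ttl_s      # how long a cached prediction counts as fresh
        self._clock = clock
        self._cache = {}         # key -> (prediction, stored_at)

    def predict(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]      # fresh cache hit, no remote call
        try:
            value = self._fetch(key)
        except ConnectionError:
            # Store unreachable: serve the stale value if there is one,
            # otherwise return None so the caller can use a default policy.
            return entry[0] if entry is not None else None
        self._cache[key] = (value, now)
        return value
```

The design choice illustrated here is that a stale prediction is preferred over no prediction, so a resource manager never blocks on the data store.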

Predicting cloud workloads: Earlier work predicts workload behaviours such as resource demand, resource utilisation, and job/task length for provisioning or scheduling purposes

This paper covers a broader set of VM behaviours (lifetimes, maximum deployment sizes, and workload classes)

It also targets more purposes, such as health management and power capping

Prediction-based scheduling: Task/container and VM scheduling is a tricky research area. There is a lot of existing research, much of it on resource usage or performance interference, but it is not practical for a large provider

The challenges are listed below. This paper's approach is to propose changes to Azure's VM scheduler that leverage predictions of long-term, high-percentile resource usage to implement oversubscription in a safe and practical manner

Characterising cloud VM workloads

VM Type

Virtual resource usage

VM size

Maximum deployment Size

VM lifetime

Workload class

VM inter-arrival times

Correlation between metrics

Resource Central

We have seen that several VM behaviours and metrics are relevant to resource management, and that predicting them accurately has potential benefits. In other words, the VM characteristics give us a set of features, each of which contributes to different use cases. If we can predict each of them, the predictions can serve multiple use cases.

At this point RC predicts VM behaviour, but it could also be used to learn and predict server effects and hardware failures.

RC use cases

SMART VM SCHEDULING: When a new VM arrives, the scheduler does not have to select a server blindly. Instead, it can consult RC, which predicts the VM's resource utilisation. With this information it can decide how much of a resource (e.g. disk) to allocate, so that resource exhaustion is unlikely on oversubscribed servers
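The scheduling idea can be sketched as an admission check on each candidate server. The example below uses CPU cores; the rule (a cap on allocated cores plus a cap on the sum of predicted peak usage), the 1.25 oversubscription factor, and all names are assumptions for illustration and not Azure's actual scheduler logic.

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    physical_cores: int
    max_oversubscription: float = 1.25        # assumed cap on allocated/physical cores
    vms: list = field(default_factory=list)   # (allocated_cores, predicted_p95_fraction)

    def fits(self, cores, predicted_p95):
        # Allocated cores may exceed physical cores, up to the assumed cap.
        allocated = sum(c for c, _ in self.vms) + cores
        # Predicted peak usage must still fit in the physical cores.
        expected_peak = sum(c * u for c, u in self.vms) + cores * predicted_p95
        return (allocated <= self.physical_cores * self.max_oversubscription
                and expected_peak <= self.physical_cores)

def place(servers, cores, predicted_p95):
    """First-fit placement; returns the chosen server name or None."""
    for server in servers:
        if server.fits(cores, predicted_p95):
            server.vms.append((cores, predicted_p95))
            return server.name
    return None

servers = [
    Server("s1", 8, vms=[(4, 0.9), (4, 0.5)]),
    Server("s2", 8),
]
first = place(servers, 2, 0.5)    # s1 is fully allocated but predicted peak leaves room
second = place(servers, 2, 0.2)   # s1 would exceed the allocation cap
```

Here `predicted_p95` stands for the predicted high-percentile utilisation of the new VM, expressed as a fraction of its allocated cores.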

SMART CLUSTER SELECTION: The cluster selector can use the predicted maximum deployment size to pick a cluster that has enough resources for the whole deployment

SMART POWER OVERSUBSCRIPTION AND CAPPING: When there is a power emergency, predictions of VM workload interactivity can be used to distribute the available power across the servers more effectively

SCHEDULING THE SERVER MAINTENANCE: Predict the lifetimes of the VMs on a server and use those values to plan when the server can be taken down for maintenance.
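One way to use lifetime predictions for maintenance is to pick the servers whose VMs are all expected to have terminated by the maintenance time, so that no migration is needed. The function name and data layout below are hypothetical, and the rule is a sketch of the idea rather than the paper's method.

```python
def servers_ready_for_maintenance(servers, maintenance_at_s):
    """Return servers whose VMs are all predicted to end before maintenance.

    servers: dict of server name -> list of (created_s, predicted_lifetime_s).
    A server with no VMs is always ready.
    """
    ready = []
    for name, vms in servers.items():
        if all(created + lifetime <= maintenance_at_s for created, lifetime in vms):
            ready.append(name)
    return ready

fleet = {
    "s1": [(0, 1000), (200, 500)],   # both VMs predicted to end by t=1000
    "s2": [(0, 5000)],               # predicted to still be running at t=2000
    "s3": [],
}
ready = servers_ready_for_maintenance(fleet, maintenance_at_s=2000)
```

In practice a predicted lifetime is uncertain, so a real system would need a fallback (such as migration) for VMs that outlive their prediction.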

RECOMMENDING VM AND DEPLOYMENT SIZES: The cloud platform could provide a service to its customers that recommends the appropriate VM size and number of VMs at the time of each deployment, based on predictions of workload class and resource utilisation

Only the first use case is presented as a case study in this paper, with accuracy in the range 0.8-0.9

Conclusion

The proposed RC is a well-put-together system for generating, storing, and efficiently using predictions of these characteristics. It could also make use of more machine learning algorithms to become a more efficient prediction-serving system

⛔

Challenges Addressed:
- Unavailability of VM workload characteristics data
- Offline profiling is infeasible: data is only available once the VM is in production
- Online profiling: hard to determine an arbitrary VM's representative behaviour
- Application-level monitoring: needs input from the application
- Live migration: holds contended resources until they become available, and also causes bursts of traffic
- Guaranteeing within-server performance isolation across VMs needs a mechanism for interference detection (hardware counters)

💡

Research Questions Addressed:
- Analysis of VM workload characteristics
- Predicting behaviour from historical data (Resource Central)
- Making sure that performance is not compromised by the new techniques
- Creating new techniques without adding too much datacenter overhead

🛠

Implementation Methods:
- Random Forest
- Extreme Gradient Boosting Tree

⚙

Evaluation Methods:
- Accuracy
- Precision
- Recall
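For reference, the three metrics can be computed from true and predicted labels as in the sketch below. The labels and the choice of "long" as the positive class are made up for illustration; they are not results from the paper.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical lifetime-class labels for four VMs.
y_true = ["short", "short", "long", "long"]
y_pred = ["short", "long", "long", "long"]
accuracy, precision, recall = classification_metrics(y_true, y_pred, positive="long")
```

Accuracy is the share of all predictions that are correct, precision is the share of predicted positives that are truly positive, and recall is the share of true positives that were found.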

📖

Datasets:
- Microsoft Azure: VM workloads for 3 months, including 3rd party VMs: https://github.com/Azure/AzurePublicDataset
- Google cloud: a month-long trace of 12k bare-metal servers running first-party container-based workloads: https://github.com/google/cluster-data

Topic I have to revisit

Some links to refer