Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms
Abstract
This paper is based on Microsoft Azure and gives a detailed study of the variables involved in workload prediction, especially in a virtual machine (VM) environment. It analyses the characteristics of Azure's production VM workload.
The main idea is to study these characteristics and use the insights to help the provider's resource management system. For example, many VMs behave consistently over time.
Based on this observation, the authors built Resource Central (RC), a system that learns VM behaviour from historical data, predicts it online, and serves the predictions to resource managers through a client-side library. For example, predictions let the scheduler oversubscribe servers with oversubscribable VM types while retaining VM performance.
They used VM traces to show that prediction-informed scheduling increases utilisation and prevents physical resource exhaustion.
Introduction
Motivation
- Cloud computing is now widely used, and service providers such as Microsoft Azure, AWS, and GCP are booming
- Offering more features with better performance is the key to competing
- New features must be added without greatly increasing datacenter costs
- The provider runs different types of workloads, internal services and external customers, on the same shared datacenter
- Delivering a feature with good performance, availability, reliability, and scalability without a complex architecture or high cost is challenging
- The true characteristics of workloads in cloud environments are still unclear
- No prior study covers the lifetime (time between creation and termination) or the resource consumption distribution of these providers' production VMs
- Research to date has not focused on VMs in large cloud providers
- Idea 1: Accurately predicting resource utilisation at VM deployment time would allow resource-contention-aware placement and would help evaluate the need for VM migration
- Idea 2: If VM lifetimes can be predicted accurately at runtime, the health management system can schedule server maintenance without VM migration or downtime
- Main idea: prediction-based resource management
Our work
- Analysis of the Azure VM workload, including the distributions of VM size, lifetime, resource consumption, utilisation pattern, and deployment size
- Learning: VM behaviour is consistent over multiple lifetimes, hence old data can be used to predict future behaviour
- ML algorithms are used to predict these values online
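The consistency observation above suggests a simple history-based predictor. The sketch below is only an illustration of that idea, not RC's actual model: it assumes that past VMs are grouped under a hypothetical `subscription_id` key, that lifetimes fall into hypothetical buckets, and that the most frequent past bucket is a reasonable guess for the next VM.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical lifetime bucket boundaries in hours (an assumption for this sketch).
BUCKET_EDGES_HOURS = [0.25, 1.0, 24.0]
BUCKET_LABELS = ["<=15min", "15-60min", "1-24h", ">24h"]


def lifetime_bucket(hours):
    """Map a lifetime in hours to its bucket label (upper edges are inclusive)."""
    return BUCKET_LABELS[bisect_left(BUCKET_EDGES_HOURS, hours)]


class HistoryPredictor:
    """Predicts the lifetime bucket of a new VM from its owner's past VMs."""

    def __init__(self):
        # subscription_id -> Counter of observed lifetime buckets
        self._history = defaultdict(Counter)

    def observe(self, subscription_id, lifetime_hours):
        """Record the lifetime of a VM that has terminated."""
        self._history[subscription_id][lifetime_bucket(lifetime_hours)] += 1

    def predict(self, subscription_id):
        """Return (bucket, confidence), or None when there is no history."""
        counts = self._history.get(subscription_id)
        if not counts:
            return None
        bucket, hits = counts.most_common(1)[0]
        return bucket, hits / sum(counts.values())
```

Returning a confidence alongside the bucket lets a caller ignore weak predictions and fall back to a default policy.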
Related works
- Characterisation of VM workloads: Prior studies looked at user patterns and the economics of public clouds, focusing on resource demand volatility and pricing
- Google's cluster workloads on bare-metal containers:
- Google's earlier work is very different because it does not cover a public cloud's VM workload. Compared with containers, VMs are likely to live longer, produce lower resource utilisation, and be deployed in smaller numbers, because public cloud providers must encapsulate their customers' workloads in VMs, which carry more overhead.
- For example, take a MapReduce job. With VMs, the customer creates a set of VMs for all the tasks and can remove them only once all the tasks have completed. With containers, each map and reduce task gets its own container, which can be sized accurately for that task
- Machine learning and prediction-serving systems: Many machine learning frameworks are available. RC is built on TLC, a Microsoft-internal framework that implements many learning algorithms
- In addition, RC caches the prediction results, models, and feature data on the client side. This enables the system to operate even when the data store or the connectivity to it is unavailable
- Predicting cloud workloads: Prior works predict workload behaviours such as resource demand, resource utilisation, or job/task length for provisioning or scheduling purposes. This paper predicts:
- A broader set of VM behaviours (lifetimes, maximum deployment sizes, and workload classes)
- For a broader set of purposes (health management and power capping)
- Prediction-based scheduling: There is a lot of existing research on task, container, and VM scheduling. Much of it relies on predictions of resource usage or performance interference and is not practical for a large provider
- There are also challenges, discussed below. The paper's approach is to propose changes to Azure's VM scheduler that leverage predictions of long-term, high-percentile resource usage to implement oversubscription in a safe and practical manner
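The client-side caching mentioned in the list above can be illustrated with a minimal sketch. It is not RC's real client library: the class name, the `fetch` callable, and the choice to signal "no prediction" with `None` are all assumptions made for this example.

```python
from typing import Callable, Dict, Optional


class CachingPredictionClient:
    """Keeps prediction results locally so lookups survive a store outage."""

    def __init__(self, fetch: Callable[[str], float]):
        # `fetch` is a hypothetical callable that queries the remote data
        # store and raises ConnectionError when the store is unreachable.
        self._fetch = fetch
        self._cache: Dict[str, float] = {}

    def predict(self, key: str) -> Optional[float]:
        # Serve from the local cache first; this path needs no connectivity.
        if key in self._cache:
            return self._cache[key]
        try:
            value = self._fetch(key)
        except ConnectionError:
            # No cached result and no store: report "no prediction" so the
            # caller can fall back to its default behaviour.
            return None
        self._cache[key] = value
        return value
```

A caller that receives `None` would proceed as if no prediction service existed, which keeps the resource manager functional during an outage.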
Characterising cloud VM workloads
VM Type
Virtual resource usage
VM size
Maximum deployment Size
VM lifetime
Workload class
VM inter-arrival times
Correlation between metrics
Resource Central
We have seen that several VM behaviours and metrics are relevant to resource management, and that predicting them accurately has potential benefits. In other words, VM characteristics yield a set of features, each of which contributes to different use cases, so predicting each of them serves multiple use cases.
At this point RC predicts VM behaviour, but it could also be used to predict or learn other behaviours, such as server effects and hardware failures.
RC use cases
- SMART VM SCHEDULING: When a new VM request arrives, the scheduler does not have to select a server blindly. Instead it can consult RC, which predicts the VM's resource utilisation. With this information the scheduler can decide how much of a resource, such as disk, to allocate so that oversubscribed servers do not run out of physical resources
- SMART CLUSTER SELECTION: The cluster selector can use the prediction of maximum deployment size to pick a cluster with enough resources for the whole deployment
- SMART POWER OVERSUBSCRIPTION AND CAPPING: During a power emergency, predictions of VM workload interactivity can be used to distribute the available power across the servers more effectively
- SCHEDULING SERVER MAINTENANCE: Predict the lifetimes of the VMs on a server and use them to decide when maintenance can be scheduled
- RECOMMENDING VM AND DEPLOYMENT SIZES: The cloud platform could provide a service to its customers that recommends the appropriate VM size and number of VMs at the time of each deployment, based on predictions of workload class and resource utilisation
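As an illustration of the server maintenance use case in the list above, the sketch below decides between waiting for a server to drain and migrating its VMs. The function name, its inputs, and the simple "wait or migrate" rule are assumptions for this example, not the paper's health management logic.

```python
from typing import Iterable


def maintenance_plan(predicted_remaining_hours: Iterable[float],
                     deadline_hours: float) -> str:
    """Return "wait" if every VM is predicted to end before the deadline,
    otherwise "migrate"."""
    # The server is expected to be empty once its longest-lived VM ends.
    drain_time = max(predicted_remaining_hours, default=0.0)
    return "wait" if drain_time <= deadline_hours else "migrate"
```

For instance, `maintenance_plan([1.0, 3.5], deadline_hours=4.0)` returns `"wait"`, because the longest predicted remaining lifetime is within the deadline.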
The paper develops only the first use case as a case study and reports accuracy in the range 0.8 to 0.9.
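The smart VM scheduling case can be sketched as a placement check that uses predicted high-percentile utilisation. This is a simplified illustration under stated assumptions, not Azure's scheduler: the `VM` and `Server` classes, the `max_oversub` and `max_util` parameters, and their default values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VM:
    cores: int                 # virtual cores requested
    predicted_p95_util: float  # predicted high-percentile CPU utilisation, 0..1


@dataclass
class Server:
    physical_cores: int
    vms: List[VM] = field(default_factory=list)

    def fits(self, vm: VM, max_oversub: float = 1.25,
             max_util: float = 1.0) -> bool:
        """Check whether `vm` can be placed without risking CPU exhaustion."""
        candidates = self.vms + [vm]
        # Cap on how many virtual cores may be allocated per physical core.
        allocated = sum(v.cores for v in candidates)
        # Cores expected to be busy if every VM hits its predicted percentile.
        expected = sum(v.cores * v.predicted_p95_util for v in candidates)
        return (allocated <= max_oversub * self.physical_cores
                and expected <= max_util * self.physical_cores)
```

The two conditions play different roles: the first bounds the oversubscription ratio regardless of predictions, and the second uses the predictions to keep expected demand within physical capacity.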
Conclusion
The proposed RC is a well-designed system for generating, storing, and efficiently using predictions of these characteristics. It could also make use of more machine learning algorithms to make prediction serving more efficient.
Some links to refer

