Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms

Abstract

This paper is based on Microsoft Azure. It gives a detailed study of the variables involved in workload prediction, particularly for virtual machines (VMs), by analysing the characteristics of Azure's production VM workload.

The main idea is to study these characteristics and use the insights to help the provider's resource management system. E.g. many VMs behave consistently over time, and the system described next is based on that observation.

Resource Central (RC) is the system they created. It learns from historical data offline, predicts VM behaviour online, and serves those predictions to its clients (the platform's resource managers) through a client library. E.g. predictions let the scheduler oversubscribe servers with oversubscribable VM types while retaining VM performance.
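The offline-learning and online-prediction split described above can be sketched roughly as follows. This is only an illustration: `train_offline`, `PredictionClient`, the subscription IDs, and the "low"/"high" buckets are hypothetical names, and the frequency count is a simple stand-in for the real models, not RC's actual method.

```python
from collections import Counter, defaultdict


def train_offline(history):
    """Offline step: count which bucket each subscription's past VMs fell into.

    `history` is a list of (subscription_id, bucket) pairs (assumed format).
    """
    model = defaultdict(Counter)
    for subscription_id, bucket in history:
        model[subscription_id][bucket] += 1
    return dict(model)


class PredictionClient:
    """Hypothetical client library holding a local copy of the trained model."""

    def __init__(self, model):
        self._model = model

    def predict(self, subscription_id):
        """Online step: return (bucket, confidence), or None without history."""
        counts = self._model.get(subscription_id)
        if not counts:
            return None  # caller must fall back to a prediction-free policy
        bucket, hits = counts.most_common(1)[0]
        return bucket, hits / sum(counts.values())


# Made-up history: utilisation bucket of earlier VMs per subscription.
history = [
    ("sub-a", "low"),
    ("sub-a", "low"),
    ("sub-a", "high"),
    ("sub-b", "high"),
]
client = PredictionClient(train_offline(history))
prediction = client.predict("sub-a")
unknown = client.predict("sub-c")
```

Returning a confidence alongside the bucket lets the caller ignore weak predictions, and returning `None` for an unseen subscription forces the caller to handle the no-prediction case explicitly.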

They used VM traces to show that using the predictions increases utilisation while preventing physical resource exhaustion.

Introduction

Motivation

Our work

Related works

Characterisation of VM workloads: earlier studies looked at user patterns and the economics of the public cloud, focusing on resource demand volatility and pricing.

Google cloud: studies of first-party container workloads running on bare-metal servers.

Characterising cloud VM workloads

VM Type

Virtual resource usage

VM size

Maximum deployment size

VM lifetime

Workload class

VM inter-arrival times

Correlation between metrics

Resource Central

The characterisation shows that several VM behaviours and metrics matter for resource management, and that predicting them accurately has potential benefits. In other words, the VM characteristics give a set of predictable features, each of which is relevant to different use cases. If each feature can be predicted, the predictions can be reused across multiple use cases.

At this point RC predicts VM behaviour, but it could also be used to predict or learn other things, such as server effects and hardware failures.

RC use cases

  1. SMART VM SCHEDULING: When a new VM arrives, the scheduler does not have to select a server blindly. Instead it can consult RC, which predicts the VM's resource utilisation. With this information the scheduler can decide where to place the VM so that resource exhaustion is unlikely on oversubscribed servers.
  2. SMART CLUSTER SELECTION: Use the prediction of maximum deployment size to pick a cluster that will have enough resources for the whole deployment.
  3. SMART POWER OVERSUBSCRIPTION AND CAPPING: During a power emergency, use the prediction of VM workload interactivity to distribute the available power across servers more effectively.
  4. SCHEDULING SERVER MAINTENANCE: Predict the lifetimes of the VMs running on a server and use them to decide when maintenance can be scheduled.
  5. RECOMMENDING VM AND DEPLOYMENT SIZES: The cloud platform could provide a service to its customers that recommends the appropriate VM size and number of VMs at the time of each deployment, based on predictions of workload class and resource utilisation.

In this paper only the first use case is developed as a case study, and the reported prediction accuracy is in the range 0.8 to 0.9.
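A minimal sketch of how predicted utilisation could inform placement on oversubscribed servers, as in the first use case. Everything here is an assumption for illustration: the `Server` class, `pick_server`, the first-fit policy, and the `max_load_fraction` parameter are hypothetical and are not taken from the paper's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    physical_cores: int
    predicted_load: float = 0.0  # sum of predicted core usage of placed VMs


def pick_server(servers, vm_cores, predicted_util=None, max_load_fraction=1.0):
    """First-fit placement using predicted rather than allocated cores.

    `predicted_util` is the predicted fraction of its cores the VM will use.
    Without a prediction we conservatively assume full utilisation (1.0).
    """
    util = 1.0 if predicted_util is None else predicted_util
    demand = vm_cores * util
    for server in servers:
        limit = server.physical_cores * max_load_fraction
        if server.predicted_load + demand <= limit:
            server.predicted_load += demand
            return server.name
    return None  # no server can take the VM without risking exhaustion


servers = [Server("s1", physical_cores=8), Server("s2", physical_cores=8)]

# Five 4-core VMs, each predicted to use half of its cores.
placements = [pick_server(servers, vm_cores=4, predicted_util=0.5) for _ in range(5)]
```

In this toy run the first server ends up hosting 16 allocated cores on 8 physical cores, because the predicted demand of each VM is only 2 cores; the fifth VM spills over to the second server once the predicted load reaches the limit.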

Conclusion

The proposed RC is a well-put-together system for generating, storing, and efficiently using predictions of these characteristics. It could also make use of more machine learning algorithms to become a more efficient prediction-serving system.

Challenges Addressed:
- Unavailability of VM workload characteristics data
- Offline profiling is infeasible: data only becomes available once the VM is in production
- Online profiling: it is hard to determine an arbitrary VM's representative behaviour
- Application-level monitoring: it needs input from the application
- Live migration: contends for resources until they are available, and also causes bursts of workload traffic
- To guarantee within-server performance isolation across VMs, we need a mechanism for interference detection (hardware counters)
Research Questions Addressed:
- Analysis of VM workload characteristics
- Predicting behaviour from historical data (Resource Central)
- Making sure that performance is not compromised by the new techniques
- Creating new techniques without adding too much data centre overhead
Implementation Methods:
- Random Forest
- Extreme Gradient Boosting Tree

Evaluation Methods:
- Accuracy
- Precision
- Recall
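For reference, the three evaluation metrics named above can be computed for a bucketed prediction as in the sketch below. The "short"/"long" lifetime labels and the sample data are made up for illustration; they are not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical lifetime buckets for five VMs.
y_true = ["short", "short", "long", "long", "short"]
y_pred = ["short", "long", "long", "long", "short"]

acc = accuracy(y_true, y_pred)
long_precision, long_recall = precision_recall(y_true, y_pred, "long")
short_precision, short_recall = precision_recall(y_true, y_pred, "short")
```

Here four of five predictions are correct. The "long" bucket has perfect recall but lower precision because one short-lived VM was predicted as long, while the "short" bucket shows the opposite pattern.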
Datasets:
- Microsoft Azure: VM workloads for 3 months, including 3rd party VMs: https://github.com/Azure/AzurePublicDataset
- Google cloud: a month-long trace of 12k bare-metal servers running first-party container-based workloads: https://github.com/google/cluster-data

Some links to refer

Roundup Of Cloud Computing Forecasts And Market Estimates, 2015
https://www.forbes.com/sites/louiscolumbus/2015/01/24/roundup-of-cloud-computing-forecasts-and-market-estimates-2015/?sh=11c0d47fdb7a