AutoML Benchmark:  AWS vs. Azure

Posted by Scott Phillips on 27th Feb 2023

We recently created two missions for Cloud Astronauts to introduce users to the Automated Machine Learning services on AWS and Azure. On AWS, the service is SageMaker Autopilot. On Azure, it is Automated ML in the Azure Machine Learning Studio. We compared the two services on three measures: 1) ease of use for the non-engineer, 2) cost to create the model, and 3) quality of predictions based on standard metrics like Accuracy, Precision, and Recall.

There were some surprises along the way and areas that both services could improve on.  Let’s start with ease of use.

Ease of use

AWS makes it fast and easy to set up and launch an automated ML model, with minimal configuration required. You load data to an S3 bucket, create a SageMaker Domain, and then launch the SageMaker Suite. There is an option for AutoML; select it and you are in the SageMaker Autopilot setup wizard. Step through the wizard, and building, training, and testing a model is quick and easy.

Azure’s interface has a very pleasing design and is also easy to set up and use. Data is uploaded to a Storage account. Then, you create a machine learning workspace and launch Azure’s Machine Learning Studio. You select the Automated ML option and step through the wizard. One difference from AWS is that Azure wants to create a Datastore, which requires 5-6 extra steps and takes more time. AWS has reduced these steps to simply pointing at the dataset in the S3 bucket, with the rest handled behind the scenes. We wish Azure had a simplified data upload setup like AWS. This was a differentiator on ease of use.

Selecting a compute instance is easier and more transparent on Azure. It is not at all transparent on AWS, which configures the compute behind the scenes and provisions an overly large instance (see our note on cost below). We prefer Azure’s transparency and choice.

Cleanup after you have created a model is easier on Azure: you shut down the compute instance and then delete the resource group. You do have to know to do these steps, though, or costs may continue to accrue.

AWS leaves it ambiguous how long the service will run the larger instances that it deploys, but once you are familiar with it, it is easy to shut down. Cleanup steps take a little longer because there is no single concept like a Resource Group that will remove all of the artifacts created across AWS.


Advantage: AWS, but this was a narrow lead and some features are better on Azure (like selecting a compute instance and the general user interface).


Cost

Cost is a differentiator: there is a big gap between the two services in what it costs to create the same model with the same dataset.

On Azure, we stepped through the benchmark test with a modest Compute Instance (you can select Standard_DS11_v2, an instance with 2 cores, 14GB of RAM, and 28GB of storage). Building, training, and testing the model under normal conditions cost a total of $0.32 to run AutoML for one experiment and get benchmark results.

On AWS, using the same dataset and completing cleanup in a similar amount of time resulted in a cost of $6.34. AWS gives the user no control over the instance being provisioned, and it defaults to a large and expensive compute instance (the documentation says ml.r5.16xlarge instances). AWS provisions two large instances by default, and the combination runs at $4.83/hour.
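The gap between the two bills can be put in numbers with a few lines of Python. This is only a sketch built on the figures reported above; the variable and function names are ours, and the "implied hours" figure assumes the entire AWS bill came from the two default instances at the combined hourly rate, which we did not verify against the itemized invoice.

```python
# Figures reported in this benchmark (USD).
AZURE_TOTAL_USD = 0.32   # one AutoML experiment on Standard_DS11_v2
AWS_TOTAL_USD = 6.34     # same dataset on SageMaker Autopilot
AWS_HOURLY_USD = 4.83    # combined rate for the two default instances


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive run cost than the cheap one."""
    return expensive / cheap


def implied_hours(total: float, hourly_rate: float) -> float:
    """Billed hours implied by a total, ASSUMING the whole bill is instance time."""
    return total / hourly_rate


ratio = cost_ratio(AWS_TOTAL_USD, AZURE_TOTAL_USD)
hours = implied_hours(AWS_TOTAL_USD, AWS_HOURLY_USD)

print(f"AWS cost about {ratio:.1f}x the Azure run")
print(f"Implied AWS instance time: about {hours:.2f} hours")
```

On these numbers, the AWS run cost roughly 20 times the Azure run.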

NOTE: AWS advertises its Free Tier and indicates that you get a compute instance on it. We were surprised and not pleased to find that Autopilot uses a specific type of instance that is NOT covered by Free Tier. AWS is not transparent about this; you have to dive into the details to find it. We find this lack of cost transparency on AWS disappointing.

If cost is your most important variable, don’t go with AWS. This may matter more to hobbyists and students than to the Enterprise user, given that the cost is modest for a business, but it is still a factor.


Advantage:  Azure (strong advantage)


Model quality

The quality of the machine learning model is the bottom-line differentiator, and on key measures SageMaker Autopilot created a higher-quality model.

In binary classification (predicting whether the class is 0 or 1), Precision measures how many predictions of the positive class (1) were correct. For example, if a model predicts that 1,000 data rows are the positive class and 750 of them actually have the value 1 while 250 are actually 0, then Precision is 75% (750/(750+250)). Recall measures how many of the actual positive values (1) in the total data were predicted correctly. If 750 positives are predicted correctly but 750 other rows with a value of 1 were predicted as 0, Recall is 50% (750/1500).
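The worked example above can be reproduced in a short Python sketch. The function names are ours for illustration and are not taken from either service; the counts are the ones used in the example (true positives, false positives, and false negatives).

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of positive predictions that were actually positive."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of actual positives that the model predicted as positive."""
    return true_pos / (true_pos + false_neg)


# Counts from the example: 750 correct positive predictions,
# 250 rows predicted 1 that were really 0, and 750 rows that
# were really 1 but predicted 0.
example_precision = precision(true_pos=750, false_pos=250)
example_recall = recall(true_pos=750, false_neg=750)

print(f"Precision: {example_precision:.0%}")  # Precision: 75%
print(f"Recall: {example_recall:.0%}")        # Recall: 50%
```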

  • SageMaker Autopilot created a model with Precision of 85% and Recall of 82%.
  • Azure AutoML created a model with Precision of 77% and Recall of 63%.

The dataset was the same.
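One way to fold the two scores for each service into a single number is the F1 score, the harmonic mean of Precision and Recall. Neither F1 value below was read from the services' reports in this benchmark; this is a sketch that derives them from the Precision and Recall figures listed above, and the names used are ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall) as reported in this benchmark.
reported = {
    "SageMaker Autopilot": (0.85, 0.82),
    "Azure AutoML": (0.77, 0.63),
}

# F1 is DERIVED here from the reported figures, not measured directly.
derived_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, score in derived_f1.items():
    print(f"{name}: F1 of about {score:.2f}")
```

The derived F1 comes out at roughly 0.83 for SageMaker Autopilot and 0.69 for Azure AutoML, which matches the direction of the individual scores.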

AWS SageMaker Autopilot also produces a model quality PDF report that is downloadable along with a report on data quality.

Azure provides more information on explainability, but the user interface for getting metrics was less helpful than it could have been.

We did not test predictions in either case. We did not explore hyperparameters or other customizations in depth.


Advantage:  AWS.  

This was entirely due to Precision and Recall scores that were significantly higher on AWS than on Azure.


Conclusion

We rate AWS SageMaker Autopilot higher based on model quality. Getting higher Precision and Recall scores is the basic test of whether you are getting a good model, and AWS beats Azure on this measure when using the exact same dataset.

AWS has only a slightly stronger offering on ease of use, but its cost is dramatically higher. If cost is important, a cost-sensitive user should default to Azure. However, a business user may wish to leverage AWS for the higher-quality models that it produces. Professionals are going to create Jupyter notebooks and set up pipelines for training and model deployment, so it is fair to ask who will be using these tools: business analysts at a company, or students and hobbyists. The latter are more likely to be cost-sensitive.

These are just the observed results and outputs from a simple head-to-head benchmarking test of two different services, using the exact same dataset and with standard Machine Learning metrics as the primary indicator.

To learn more:

  • Mission 9 - Mars AutoML Benchmark Testing on AWS (AWS products)
  • Mission 7 - Mars AutoML Benchmark Testing on Azure (Azure products)

Step-by-step user guides let you experiment and try this yourself. No technical skills are required to get started.