To GPU or not to GPU?

Hello! I’m Michael Struwig – a post-grad student from Stellenbosch University. I’ve had the pleasure of spending the last 6 weeks interning at Praelexis, and I’ve been asked to write a blog post or two chronicling some of the work I’ve done over the course of my stay. With that quick introduction out of the way and my internship drawing to a close, let’s jump into the topic we’ll be exploring today – GPUs.

Training machine learning models on GPUs has become increasingly popular over the last couple of years. With their massive number of computational cores, GPUs can offer huge performance increases for tasks involving repeated operations across large blocks of data. As it happens, this applies perfectly to training Neural Networks – something we spend a lot of time doing. So a number of questions naturally arise: with the increased availability of rentable GPU instances, is it worth shifting to a GPU-based workflow from a financial and performance viewpoint? And if you do want a GPU-based workflow, is it better to rent an instance or to purchase and maintain your own GPU-accelerated local system?

These questions sparked a small internal investigation that I decided to embark on, after I observed enormous speed increases on a mid-range GPU compared to our own local cluster when training an LSTM to generate music in the style of Bach (a story for another time). If GPUs were such a massive advantage over the CPU I had been using, would it be worth shifting our other tasks over to GPUs? I produced an internal report for Praelexis to answer exactly these questions, and this blog post serves as an extract of that report.

Regardless, with a cause to champion, I went in search of data.

Which platforms are out there?

I decided to look at a range of on-demand providers. The two big providers, Amazon AWS and Microsoft Azure, have virtually identical hardware and pricing, so I only included Amazon AWS in this investigation. At present, both only offer ageing Nvidia Tesla K80 GPU instances (they do have other options, but at ~10 times the price, they’re just too expensive to be worth considering at this point). It’s the middle-tier providers that have some interesting offers. Paperspace have Nvidia Quadro P5000 and Nvidia Quadro M4000 GPU instances available, while FloydHub offers the same Tesla K80 as the big players, but at half the price. The table below summarises the offering considered from each platform. For comparison’s sake, I’m also including Amazon’s c4.2xlarge compute-optimized CPU instance and FloydHub’s CPU instance.

There’s one system that’s off the list above, and that’s a blue-sky build of a local system powered by one of the current undisputed kings of GPU-accelerated model training: the (consumer grade!) Nvidia GTX1080 Ti. A complete build of this local system comes to about USD2500. Ouch! But as we’ll see later, once you spend long enough training models, that asking price isn’t necessarily so heavy.

In search of benchmarks

The first thing we need in order to make a decision is a set of performance benchmarks for the various GPUs out there. I was able to find benchmark posts from two different sources, and using both I pieced together a rough estimate of the performance of each platform relative to an enterprise-level Intel Xeon E5-2666 v3. To the hardcore scientific-method purists who are freaking out right now: I hear you. This is simply to get a rough idea of performance relative to an enterprise-level CPU. Later on we’ll disregard this performance metric completely when doing our assessment.

Recall that the above graph shows performance relative to a CPU. In other words, a Performance Factor of 1 means performance equivalent to running the task on the CPU. We see some astonishing theoretical performance gains from the GPU instances – at least 10x faster, or in the case of the GTX1080 Ti, over 30x faster.

And this potentially translates into a lot of saved time:


And I really mean a lot of saved time. If you’re training 24 hours a day for a month on a CPU (720 hours total) and you could get an Nvidia GTX1080 Ti to run at its theoretical best, the same task would take just over 20 hours. That’s too much potential to be ignored.
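If you want to play with that arithmetic yourself, it’s a one-liner. Here’s a minimal Python sketch – note that the ~34x performance factor is an assumed, illustrative figure, not a measured benchmark:

```python
# Back-of-the-envelope check of the time saved by a theoretical speed-up.
# The ~34x performance factor for the GTX1080 Ti is an assumed, illustrative value.

cpu_hours = 24 * 30          # a month of round-the-clock training on a CPU (720 hours)
performance_factor = 34      # assumed theoretical speed-up of the GTX1080 Ti vs. the CPU

gpu_hours = cpu_hours / performance_factor
print(f"{cpu_hours} CPU hours -> {gpu_hours:.1f} GPU hours")  # roughly 21 hours
```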

Which service offers the most (theoretical) performance per dollar?

What if we want to see which service offers the most theoretical value? We can simply find this by dividing the Performance Factor by the cost per hour, to get a feeling for “bang for buck”.

We can also phrase the question differently, and ask ourselves: “how much would it cost for each of the platforms to run a task that would take our aforementioned CPUs one hour to complete, assuming the full theoretical speed-up?” We can easily calculate this “effective cost” per hour of CPU time, giving us the following graph:

Paperspace really seems to be the clear winner here, handily blowing the competition out of the water. What’s interesting to note is exactly how expensive Amazon AWS is (and by extension, Azure). Another observation I enjoyed was the (perhaps intentional) fairness of FloydHub: their GPU and CPU instances offer exactly the same theoretical value for money. So if you have time on your hands and aren’t seeing the full promised speed-up, simply run the task on the CPU instance (assuming it can fit into the admittedly limited 7GB of RAM).
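For the curious, both metrics above boil down to a single division each. Here’s a minimal Python sketch – the price and performance factor are placeholder values purely to show the calculation, not the actual figures behind the graphs:

```python
# "Bang for buck" and effective cost per CPU-hour for a single platform.
# Both inputs are placeholder values, purely to show the calculation.

performance_factor = 10.0   # assumed speed-up relative to the reference CPU
cost_per_hour = 0.90        # assumed instance price in USD per hour

bang_for_buck = performance_factor / cost_per_hour                 # performance per dollar
effective_cost_per_cpu_hour = cost_per_hour / performance_factor   # cost of one CPU-hour of work

print(f"Bang for buck:  {bang_for_buck:.2f} performance per USD")
print(f"Effective cost: ${effective_cost_per_cpu_hour:.3f} per CPU-hour of work")
```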

Throwing theoretical performance out the window

Time to focus on the finer-grained details and to dump the assumption that we’ll be seeing theoretical performance. We would all do well to remember that “in theory, there is no difference between theory and practice. But, in practice, there is.” Here at Praelexis we do an extremely wide range of work, and this naturally extends to an enormous variation in the models we design. It would be entirely unrealistic to expect our models to match the benchmarks we looked at previously. So it’s time to change the questions we’re asking, to make them applicable to what a data scientist needs to know in order to make an informed decision on whether or not to shift to a GPU-accelerated workflow.

How much faster must your code benchmark on a GPU (vs. CPU) in order to save money?

A key question. Let’s jump right in with the following graph, which compares the effective cost per hour of the GPU instances to that of the Amazon AWS c4.2xlarge CPU instance:

Effective cost of each GPU instance compared to the Amazon AWS c4.2xlarge CPU instance.

It’s clear that even for an incredibly conservative speed-up from the GPU, it is drastically more affordable to train on a GPU.

But what happens when we ask the same question for a CPU instance that’s around 20x cheaper? The same graph is shown again, but this time for the FloydHub CPU instance:

Effective cost of each GPU instance compared to the FloydHub CPU instance.

Suddenly the situation changes. You’ll need to see really significant speed-ups on your GPU instance in order to actually save money. In short: if you have lots of time (often a critically finite resource in machine learning) and you only care about cost, the likelihood is extremely high that a CPU instance is the cheaper option.
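The break-even condition at play here is simple: a GPU instance only saves you money once your observed speed-up exceeds the ratio of the GPU and CPU hourly prices. A quick sketch, with placeholder prices:

```python
# Break-even speed-up: the point at which a GPU instance becomes cheaper than a CPU instance.
# Both prices are placeholder values for illustration.

gpu_cost_per_hour = 0.90    # assumed GPU instance price (USD/hour)
cpu_cost_per_hour = 0.10    # assumed cheap CPU instance price (USD/hour)

break_even_speedup = gpu_cost_per_hour / cpu_cost_per_hour
print(f"The GPU instance only saves money if your task runs more than "
      f"{break_even_speedup:.0f}x faster on it.")
```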

I want a GPU-enabled platform. When is it better to buy or rent based on how long my tasks take to run?

But what happens if you’re willing to pay more because time is critical, or you’d like more agility and flexibility to iterate on your model design? In that case you have no choice but to use a GPU-accelerated platform, and this raises the question above: if you’re going to go GPU-accelerated anyway, should you be buying or renting? Let’s ignore speed-up for the time being, focus solely on time spent training, and try to answer that question with the graph below.


There’s a lot going on here, so let’s step through it steadily to make sense of it all. The y-axis shows the number of days it takes for each GPU-accelerated instance to accumulate a total cost of USD2500 – equivalent to the cost of a blue-sky Nvidia GTX1080 Ti local system. The x-axis shows the number of hours we spend training per day. If we assume a one, two or three-year upgrade cycle (shown on the plot), we can see, depending on our daily usage in hours, at which point it makes more financial sense to purchase a blue-sky local system.

This graph was surprising to me the first time I saw it. USD2500 is not an insignificant amount of money, but if you’re spending more than 4-8 hours per day training models using a GPU and you have an extremely reasonable 2-year upgrade cycle, you’re basically better off purchasing the blue-sky local system. This explains why in academia you still see virtually everyone using these enormously powerful local systems despite the large initial investment. For rapid model iteration where you spend multiple hours a day training, it still is cheaper to buy local. For now, at least, the massive hype surrounding GPU-accelerated instances seems overblown for larger, frequent tasks such as the ones often seen in research.
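The arithmetic behind the plot is straightforward: divide the local build cost by the instance’s daily rental cost at a given usage level. Here’s a minimal sketch, using an assumed hourly rate rather than any particular platform’s price:

```python
# Days of rental needed for a GPU instance to cost as much as a USD2500 local build.
# The hourly rate is an assumed placeholder; substitute the rate of your chosen instance.

local_build_cost = 2500.0   # USD, blue-sky GTX1080 Ti system
gpu_cost_per_hour = 0.90    # assumed instance price (USD/hour)

for hours_per_day in (2, 4, 8, 12, 24):
    days_to_break_even = local_build_cost / (gpu_cost_per_hour * hours_per_day)
    print(f"{hours_per_day:>2} h/day -> matches the local build cost after {days_to_break_even:.0f} days")
```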

How much faster should my model tasks benchmark on a GPU instance (vs. CPU) in order to warrant NOT investing in a local GPU-powered system?

Crucially, however, what happens when we throw performance back into the mix instead of simply looking at total time spent training? Can we find out if our model tasks run fast enough on a GPU-accelerated instance to warrant not purchasing a blue-sky local system?

This question is really important for two main reasons. Firstly, the answer depends on actual, observable speed-up – you just run a benchmark on the GPU-accelerated instance of your choice. Secondly, because you’re using an instance, this speed-up can be determined extremely cheaply as a once-off run, which means you can skirt the need to purchase a blue-sky system just to answer the question empirically over time.

The next series of graphs show the effective cost per day for each platform, for tasks of various size, as a function of speedup due to the extra grunt provided by the GPU. This can then be compared to the effective cost per day for the blue-sky Nvidia GTX1080 Ti system for a one, two and three-year upgrade cycle.


The first thing you should notice is how the answer to the previous question completely changes when we start factoring in performance. For a two-year local-system upgrade cycle, you only require your task to speed up by a factor of 2.5–5x in order to justify renting instead of buying in many cases. That’s an absolutely incredible result, and it shows how critical it is to consider the same problem from multiple angles.
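For reference, my reading of the arithmetic behind these graphs looks roughly like the sketch below. All of the input values are placeholders (and the task size is simply treated as hours of work per day at a 1x baseline), so the break-even factor it prints won’t match the graphs exactly:

```python
# Rent vs. buy once speed-up is factored in.
# Task size is taken as hours of work per day at a 1x baseline; all values are placeholders.

local_build_cost = 2500.0       # USD, blue-sky GTX1080 Ti system
upgrade_cycle_years = 2
local_cost_per_day = local_build_cost / (upgrade_cycle_years * 365)

gpu_cost_per_hour = 0.90        # assumed instance price (USD/hour)
baseline_hours_per_day = 8      # assumed daily workload at a 1x (no speed-up) baseline

# Renting beats buying once the observed speed-up exceeds this factor:
required_speedup = (gpu_cost_per_hour * baseline_hours_per_day) / local_cost_per_day
print(f"Renting beats buying if the instance speeds the task up by more than {required_speedup:.1f}x")
```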

And the winner is?

Well, as with most things, it (annoyingly) really depends.

If you’re spending less than 4 hours per day performing model tasks, or if you’ve got lots of time, stick with a CPU – you’re going to require unrealistic GPU speed-ups in order to save money. On the other hand, if you’re low on time, or you want increased speed and agility in order to iterate, and you’re seeing significant speed-up on a GPU instance, it’s better to rent a GPU instance. Lastly, if you’re spending around 8 hours or more per day training on a GPU instance, or you want to iterate as quickly as possible, I’d strongly recommend buying a blue-sky local system.

I hope you’ve enjoyed the trip with me to GPU-land. As you can see, this has the potential to be a fairly contentious issue in the future, particularly as Amazon and Azure are set to shake up the landscape with a (hopefully lower-cost) refresh of their ageing K80 GPU offerings over the next few months.

Till next time,
Michael.
