Deep Learning on MacOS with AMD eGPU?

2019-01-07

I recently sold my Nvidia GTX 1080 eGPU¹ after waiting two months in vain for a compatible Nvidia video driver for MacOS 10.14 (Mojave). Whether that is Apple’s fault or Nvidia’s, I don’t care any more. Right away, I ordered an AMD Radeon RX Vega 64 on Newegg. The card arrived two days later and it looked sexy at first sight. It was plug-and-play as expected and performed just as well as its predecessor in serious gaming, video editing, or whatever else I threw at it. I would have given it a 9.5/10 had I not found another issue a couple of days later: wow, there is no CUDA on this card!

Of course there isn’t. CUDA was developed by Nvidia, which has put great effort into building a more user-friendly deep-learning environment. AMD (yes!), by comparison, used to deliberately avoid head-to-head competition with the world’s largest GPU vendor and instead kept making gaming cards with better price-to-performance ratios. ROCm, an open-source HPC/hyperscale-class platform for GPU computing that supports cards other than Nvidia’s, does make this gap much narrower than before. However, ROCm still does not officially support MacOS, so you have to boot into Linux to use the computational power of your AMD card, even though you can already game smoothly on your Mac. Sad indeed, AMD 😰.

There are, however, several solutions if, like me, you really have to run your code on a Mac and would like to cut those painfully long training times with a GPU. The approach I adopted uses a framework called PlaidML, and I’d like to walk you through how I installed it and configured my GPU with it.

pip3 install plaidml-keras plaidbench

After installation, we can set up the intended device for computing by running:

plaidml-setup

PlaidML Setup (0.3.5)

Thanks for using PlaidML!

Some Notes:
  * Bugs and other issues: https://github.com/plaidml/plaidml
  * Questions: https://stackoverflow.com/questions/tagged/plaidml
  * Say hello: https://groups.google.com/forum/#!forum/plaidml-dev
  * PlaidML is licensed under the GNU AGPLv3
 
Default Config Devices:
   No devices.

Experimental Config Devices:
   llvm_cpu.0 : CPU (LLVM)
   opencl_intel_intel(r)_iris(tm)_plus_graphics_655.0 : Intel Inc. Intel(R) Iris(TM) Plus Graphics 655 (OpenCL)
   opencl_cpu.0 : Intel CPU (OpenCL)
   opencl_amd_amd_radeon_rx_vega_64_compute_engine.0 : AMD AMD Radeon RX Vega 64 Compute Engine (OpenCL)
   metal_intel(r)_iris(tm)_plus_graphics_655.0 : Intel(R) Iris(TM) Plus Graphics 655 (Metal)
   metal_amd_radeon_rx_vega_64.0 : AMD Radeon RX Vega 64 (Metal)

Using experimental devices can cause poor performance, crashes, and other nastiness.

Enable experimental device support? (y,n)[n]:

Of course we enter y. Before choosing device 4 (OpenCL with AMD) or 6 (Metal with AMD), I’d like to benchmark the default device, CPU (LLVM), as a baseline. The test command (using MobileNet as an example) is

plaidbench keras mobilenet

and the result shows²

Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "llvm_cpu.0"
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.6/mobilenet_1_0_224_tf.h5
17227776/17225924 [==============================] - 2s 0us/step
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 3.0688607692718506 (compile), 61.17863607406616 (execution), 0.059744761791080236 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 6.556510925292969e-07, fail_ratio: 0.0
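If you want to compare runs without eyeballing the logs, the "Example finished" summary line can be pulled apart with a regular expression. This is only a minimal sketch: `parse_summary` is a hypothetical helper name, the pattern is inferred from the output shown here rather than from any documented plaidbench format, and the timings are assumed to be in seconds.

```python
import re

# Pattern inferred from the summary line printed above (an assumption,
# not a documented plaidbench format).
SUMMARY = re.compile(
    r"elapsed: (?P<compile>[\d.]+) \(compile\), "
    r"(?P<execution>[\d.]+) \(execution\), "
    r"(?P<per_example>[\d.]+) \(execution per example\)"
)


def parse_summary(line):
    """Return the three timings from a summary line as floats, or None."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


# The summary line from the CPU (LLVM) run above.
cpu_line = (
    "Example finished, elapsed: 3.0688607692718506 (compile), "
    "61.17863607406616 (execution), "
    "0.059744761791080236 (execution per example)"
)
timings = parse_summary(cpu_line)
print(timings)
```

Lines that are not a summary, such as "Model loaded.", simply return `None`, so the helper can be applied to every line of a captured log.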

Now we run plaidml-setup again and choose device 4 (OpenCL with AMD). The result is

Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "opencl_amd_amd_radeon_rx_vega_64_compute_engine.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.6935510635375977 (compile), 13.741217851638794 (execution), 0.01341915805824101 (execution per example)
Correctness: PASS, max_error: 1.7511049009044655e-05, max_abs_error: 1.1995434761047363e-06, fail_ratio: 0.0

Finally, we run the test against what I expect to be the most powerful device, device 6 (Metal with AMD).

Running 1024 examples with mobilenet, batch size 1
INFO:plaidml:Opening device "metal_amd_radeon_rx_vega_64.0"
Model loaded.
Compiling network...
Warming up ...
Main timing
Example finished, elapsed: 2.243159055709839 (compile), 7.515545129776001 (execution), 0.007339399540796876 (execution per example)
Correctness: PASS, max_error: 1.7974503862205893e-05, max_abs_error: 1.0952353477478027e-06, fail_ratio: 0.0
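To put the three runs side by side, here is a small worked calculation. The execution times are copied from the logs above (assumed to be seconds); the dictionary keys are just shorthand labels I made up for the three devices.

```python
# Execution times copied from the three plaidbench runs above.
# Units are assumed to be seconds; the keys are shorthand labels.
execution = {
    "llvm_cpu": 61.17863607406616,
    "opencl_amd": 13.741217851638794,
    "metal_amd": 7.515545129776001,
}

baseline = execution["llvm_cpu"]
for device, seconds in execution.items():
    speedup = baseline / seconds               # how many times faster than the CPU
    reduction = (1 - seconds / baseline) * 100  # percent of CPU time saved
    print(f"{device:<11} {seconds:6.2f} s  {speedup:4.2f}x  {reduction:5.1f}% less time")
```

Relative to the CPU baseline, that works out to roughly 4.5x for OpenCL and 8.1x for Metal on the Vega 64.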

In conclusion, by running on the external AMD GPU through Metal, the benchmark’s execution time dropped by roughly 87.7% compared with the CPU baseline (from about 61.2 to 7.5 seconds), and I’m personally quite satisfied with that.
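plaidbench only runs a bundled benchmark. To run your own Keras code on the device picked in plaidml-setup, PlaidML has to be selected as the Keras backend before Keras is imported. Below is a minimal sketch, assuming the standalone keras package (not tf.keras) that gets installed alongside plaidml-keras; `build_mobilenet` is a hypothetical helper name of mine, not part of PlaidML.

```python
import os

# Select PlaidML as the Keras backend. This must happen before
# Keras is imported anywhere in the process.
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"


def build_mobilenet():
    """Load MobileNet through Keras (hypothetical helper).

    Keras is imported lazily so that the backend setting above is
    already in place. Assumes standalone keras and plaidml-keras
    are installed.
    """
    from keras.applications import MobileNet

    return MobileNet(weights="imagenet")
```

Calling `build_mobilenet()` and then `predict` on a batch of 224x224 RGB images should go through whichever device was chosen during setup. An alternative to the environment variable is calling `plaidml.keras.install_backend()` before importing Keras.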

¹ eGPU = external GPU.
² You might have noticed an extra difference: the first run also downloads the pretrained MobileNet weights. It doesn’t really matter, though.