Performance of our models: parallelism

How will model performance improve if we move the model to <insert hardware name here>? Questions like this are rather popular in our support channel when it comes to actual production deployment.

Luckily, there’s a limited number of factors affecting model performance: the operations and algorithms used in the model, data dimensionality, and the hardware features available. There’s also one mega-factor: the potential benefit we can gain from highly parallel execution, e.g. on a multi-core, multi-CPU box, on CUDA, or in an MPI-like environment where our model will be serving us.

In this post we’ll take a quick look at it from the parallelism standpoint.

Operation-level parallelism:

Our DL models operate on N-dimensional arrays (often referred to as tensors). Some operations apply to each array element independently of the other elements. E.g. `sin(x)` applies `sin()` to each element of array `x`, and the result for element `x[i]` does not depend on `x[i-1]`. And obviously, there are operations that are not independent: `cumsum(x)` needs the running total of all previous elements, so it won’t be as scalable as `sin(x)`.
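A minimal sketch of that difference, using plain Python lists instead of real tensors (the chunk size and input values are arbitrary assumptions picked for illustration). Splitting the input into chunks mimics handing pieces of the array to different cores:

```python
import math
from itertools import accumulate

x = [0.5 * i for i in range(8)]
chunks = [x[i:i + 2] for i in range(0, len(x), 2)]

# Element-wise: every chunk can be handled by a different core, in any order.
sin_chunked = [math.sin(v) for c in chunks for v in c]
assert sin_chunked == [math.sin(v) for v in x]

# Prefix sum: chunks processed in isolation give the wrong answer...
naive = [s for c in chunks for s in accumulate(c)]
assert naive != list(accumulate(x))

# ...until the running total of all earlier chunks is carried forward.
offsets = [0.0] + list(accumulate(sum(c) for c in chunks[:-1]))
fixed = [s + o for c, o in zip(chunks, offsets) for s in accumulate(c)]
assert fixed == list(accumulate(x))
```

The extra pass that computes `offsets` is itself sequential across chunks, which is the part that limits how well a prefix sum scales compared to an element-wise operation.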

But then operand length (as in “number of elements in the array”) comes into play. The same operation applied to operands of different lengths might offer different parallelism capacity.

You can think of convolutional models here, and their heart: convolution, where you have lots of relatively small image patches to be multiplied by weights. There are definitely benefits available on highly parallel systems, simply because there are lots of array elements to multiply, so we can keep more cores busy.

On the other side of the spectrum we have relatively simple models, like MLP. Such models use the same matrix-matrix multiplication as a convolutional model, but the number of features they use is usually relatively small. That significantly reduces the potential gemm benefits from advanced hardware: the matrices used as operands here are typically much smaller.
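To get a feel for the gap, here is a back-of-the-envelope sketch that counts the arithmetic in a single matrix product. The layer shapes are hypothetical, chosen only for illustration, and the convolution is assumed to be lowered to one matrix product (im2col-style), which is one common approach but not the only one:

```python
def gemm_flops(m, k, n):
    """Operations in an (m x k) @ (k x n) product, counting a multiply-add as 2."""
    return 2 * m * k * n

# Hypothetical conv layer: 56x56 output positions, 3x3 kernel,
# 64 input channels, 128 output channels, batch of 1.
conv = gemm_flops(m=56 * 56, k=3 * 3 * 64, n=128)

# Hypothetical MLP layer: 1 sample, 100 input features, 64 hidden units.
mlp = gemm_flops(m=1, k=100, n=64)

print(f"conv: {conv:,} ops, mlp: {mlp:,} ops, ratio: {conv / mlp:,.0f}x")
```

With these assumed shapes the convolution has tens of thousands of times more work to spread across cores, while the MLP layer may finish before extra hardware has a chance to matter.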

Branch-level parallelism:

Since a graph is quite a common model representation these days, it might be possible to execute graph operations in parallel, or out of order.

If the computational graph is complex enough (or even written with parallel execution in mind), chances are you have some branches that can be executed independently of other branches, or out of order. Obviously, that depends on the model and framework you’re using, but parallel execution of such branches might be a good option if they are long enough to justify data transfers across the system and some cache spills along the way.
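As a rough illustration of the dependency structure, the sketch below runs two independent branches of a toy graph concurrently and joins them afterwards. The functions `branch_a`, `branch_b` and `merge` are hypothetical stand-ins, not part of any framework:

```python
from concurrent.futures import ThreadPoolExecutor

def branch_a(x):
    # Hypothetical branch: depends only on the graph input.
    return [v * 2.0 for v in x]

def branch_b(x):
    # Hypothetical branch: also depends only on the graph input.
    return [v + 1.0 for v in x]

def merge(a, b):
    # Join node: cannot start until both branches are done.
    return [p + q for p, q in zip(a, b)]

x = [1.0, 2.0, 3.0]
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(branch_a, x)  # both branches are submitted
    fb = pool.submit(branch_b, x)  # before either result is awaited
    y = merge(fa.result(), fb.result())
```

In standard CPython, pure-Python branches like these will not actually run faster in threads; a real framework would schedule native kernels instead. The point is only that `branch_a` and `branch_b` share no data dependency, and with branches this short the scheduling overhead would easily outweigh any gain.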

However, if the model is sequential, or the independent branches are short, we won’t have this option, and hardware features will be the main factor for possible improvements: available instruction sets, CPU and RAM frequency, amount of RAM and CPU cache, PCIe mode etc., which I’ll cover in the next post.

Data parallelism:

That’s something we’re almost always able to do when it comes to production deployment and inference: just launch “independent” copies of the model. There’s lots of freedom here:

  • you can have a pool of workers that exclusively use dedicated hardware resources (CPUs, GPUs, TPUs, whatever!)
  • you can do dynamic batching if there’s a sudden spike of requests, and your model (and framework) allows that
  • you can use a load balancer to manage the resources of multiple inference servers

Hence, when it comes to model serving, data parallelism is your natural friend.
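A toy sketch of the first two options combined: pending requests are grouped into batches, and a small pool of workers processes the batches. The `model` function is a stand-in, not a real model, and each thread stands in for a model copy on its own device; the batch size and worker count are arbitrary assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def model(batch):
    """Stand-in for a real model: produces one output per input row."""
    return [sum(row) for row in batch]

def make_batches(requests, max_batch):
    """Group pending requests into batches of at most max_batch items."""
    return [requests[i:i + max_batch] for i in range(0, len(requests), max_batch)]

requests = [[i, i + 1] for i in range(10)]      # 10 pending requests
batches = make_batches(requests, max_batch=4)   # batch sizes: 4, 4, 2

# Each worker stands in for an independent copy of the model.
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in submission order, so outputs line up
    # with the original requests.
    results = [out for outs in pool.map(model, batches) for out in outs]
```

Because requests do not depend on each other, adding more workers (or more servers behind a load balancer) is all the coordination this kind of scaling needs.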

— — —

Takeaway: in order to predict your model’s performance on different hardware, you should know a few things about the model:

  • What are the building blocks of your model? How do they scale individually?
  • What operand sizes will you have at each step within your model? Will the data be big enough to keep additional cores or a new GPU busy?
  • What are the specifications of your hardware? Both the new and the old specs matter here!

Without answering questions like these first, there’s no sense in trying to answer the original question quoted at the beginning of this post. And vice versa: if you know what hardware is going to be used to run the model, it’s possible to design the model in a way that benefits most from the target hardware.

Deep Learning Developer.