Performance of our models: hardware

This post is a continuation of the “Performance of our models: parallelism” post. Today we’ll talk about how hardware properties affect the performance of our models, and what we can choose for better perf. In a general performance talk we could discuss network IO performance, RAID configurations for better throughput, and so on, but since we’re discussing the performance of ML/DL models, we’re mostly limited to math performance.

The main hardware factors affecting performance are:

• Instruction sets & compiler optimizations available

• Memory bandwidth

Memory bandwidth:

These days you don’t really have much choice here. Once you’ve decided which CPU will be used, you automatically get the chipset that goes with it, and with it the supported memory type. So you can’t get the latest Intel Xeon with DDR2. The same applies to GPUs: you can’t buy an nVidia Tesla V100 with GDDR5, just as you can’t buy a GTX 1070 with HBM2 memory.

So, basically, the more money you pay, the more bandwidth you get.
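One rough way to see why bandwidth matters is a roofline-style estimate: an operation can’t finish faster than its arithmetic allows, nor faster than its data can be moved. The sketch below is only an illustration. The roofline model is brought in here as an assumption, and the peak throughput and bandwidth figures are made-up round numbers, not the specs of any real device.

```python
def estimated_seconds(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Lower bound on run time: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    return max(compute_time, memory_time)


def vector_add_cost(n, element_bytes=4):
    """Cost of c[i] = a[i] + b[i]: one add, two loads and one store per element."""
    return n, 3 * n * element_bytes


# Hypothetical device: 1 TFLOP/s of compute, 100 GB/s of memory bandwidth.
PEAK_FLOPS = 1e12
BANDWIDTH = 100e9

flops, bytes_moved = vector_add_cost(1_000_000_000)
slow = estimated_seconds(flops, bytes_moved, PEAK_FLOPS, BANDWIDTH)
fast = estimated_seconds(flops, bytes_moved, PEAK_FLOPS, 2 * BANDWIDTH)
print(f"100 GB/s: {slow:.3f} s, 200 GB/s: {fast:.3f} s")
```

For an element-wise operation like this one the memory term dominates, so in this toy model doubling the bandwidth halves the estimate while extra compute changes nothing.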

A few words about instruction sets:

FP16, AVX, FMA and so on: all these words are very important to the performance of your model on any hardware, be it desktop, server or smartphone. Let’s take a quick tour.

FMA instructions:

This is a family of Fused Multiply-Add operations implemented as CPU instructions. The set is small, so let’s list the forms:

// FMA3 instructions accept 3 operands

a = a * c + b;

a = b * a + c;

a = b * c + a;

// and 4 operand FMA4 instructions

z = a * b + c;

This set of instructions is basically the heart of BLAS. And FMA4, as you can see, matches linear regression 🙂
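“Fused” means the multiply and the add are rounded once, together, instead of once each. Python has no portable FMA call across versions, so the sketch below emulates one with exact rational arithmetic. The helper names `fused_multiply_add` and `dot` are made up for this illustration, and real BLAS kernels use hardware instructions, not anything like this.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a * b + c exactly, then round to a float once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def dot(xs, ys):
    """Dot product written as a chain of fused multiply-adds."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fused_multiply_add(x, y, acc)
    return acc


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = a * b + c                  # the product is rounded before the add
fused = fused_multiply_add(a, b, c)  # a single rounding at the very end
print(unfused, fused)
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

With these inputs the unfused version loses the tiny product term to rounding, while the fused one keeps it. Besides saving an instruction, that extra accuracy is part of why FMA matters for long accumulations such as dot products.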

AVX instructions:

Advanced Vector Extensions are the most important instruction sets for the SIMD concept on CPUs, and a major source of linear algebra performance gains. This family of instructions comes in 3 major parts:

AVX: the original set of instructions, introduced in Intel Sandy Bridge and AMD Bulldozer CPUs back in 2011.

AVX2: an additional set of instructions, introduced in Intel Haswell CPUs in 2013 and in AMD Excavator in 2015.

AVX-512: the latest set of instructions, first implemented in the Intel Xeon Phi x200 (Knights Landing) generation, launched in 2016, and now also available in Intel Skylake-X and the latest Intel Xeon CPUs.

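Unlike FMA instructions, AVX as a whole isn’t tied to one particular algebraic operation. It is mostly about wide vector registers and the instructions that load, store and process many values in them at once, and that is exactly what makes it crucial for SIMD performance.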
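To make the register width concrete: AVX and AVX2 work on 256-bit registers and AVX-512 on 512-bit ones, so the number of values handled per instruction depends on the element size. The sketch below is plain arithmetic with hypothetical helper names. It ignores latency, throughput and memory effects entirely.

```python
# Vector register widths in bits.
REGISTER_BITS = {"AVX": 256, "AVX2": 256, "AVX-512": 512}


def lanes(register_bits, element_bits):
    """How many elements of the given size fit in one vector register."""
    return register_bits // element_bits


def vector_instructions_needed(n_elements, register_bits, element_bits):
    """Instructions needed for one pass over n_elements (ceiling division)."""
    per_instruction = lanes(register_bits, element_bits)
    return -(-n_elements // per_instruction)


for name, bits in REGISTER_BITS.items():
    count = vector_instructions_needed(1024, bits, 32)
    print(f"{name}: {lanes(bits, 32)} float32 lanes, {count} instructions for 1024 floats")
```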

FP16 instructions:

Precision matters! Most of the time.

But there are algorithms that can afford reduced precision in exchange for better performance. So now, in addition to the 32-bit and 64-bit floating point types, we also have a 16-bit type, also known as half precision. This data type is widely used on smartphones and GPUs, especially for tasks like image recognition. For example, the latest nVidia performance boost, marketed as Tensor Cores, is actually a parallel mixed-precision FMA engine: the input matrices are FP16, the multiply-accumulate can be carried out in FP32, and the FMA is applied to small matrices at once rather than to single values. Recent revisions of the ultra-popular mobile ARM architecture also support arithmetic instructions with FP16 operands. However, math performance on mobile devices is a good candidate for a separate post.
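To get a feel for what half precision costs, the standard library can round values to FP16 via the `struct` format code `e`. The sketch below is an emulation under stated assumptions: the helper names are hypothetical, and the accumulator is a regular Python float (64-bit) standing in for FP32, so it only mimics the “narrow inputs, wide accumulator” idea, not the exact arithmetic of any GPU.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mixed_precision_dot(xs, ys):
    """FP16 inputs, wide accumulator (a Python float stands in for FP32 here)."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += to_half(x) * to_half(y)
    return acc


print(to_half(0.1))               # 0.1 does not fit in 10 mantissa bits
print(to_half(1.0 + 2.0 ** -12))  # increment too small for FP16, collapses to 1.0
print(mixed_precision_dot([0.5, 0.25], [2.0, 4.0]))
```

Whether this loss of precision is acceptable depends on the algorithm, which is why the wide accumulator in mixed-precision schemes matters: rounding errors from the inputs don’t compound further during the summation.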

So, lots of hardware factors usually stay behind the scenes, but luckily this kind of stuff almost always moves forward, and backward compatibility is largely guaranteed. In practice we can’t buy a modern CPU or GPU without support for the instruction sets of the previous generation. This simplifies the decision process for us: recent hardware will almost always perform better than older hardware on the same task. Exceptions to this rule are pretty rare, and usually show up at the integration level.

Now, the funny thing is that even with all that said, the answer to the original question, what to choose for better performance, might still be different for different use cases.
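If you want to check which of these instruction sets a given x86 Linux machine actually reports, the `flags` line of `/proc/cpuinfo` lists them (`avx`, `avx2`, `avx512f`, `fma`, `f16c`). The parser below is a small sketch: the function name and the sample text are made up for illustration, and other operating systems or CPU architectures expose this information differently.

```python
def supported_features(cpuinfo_text, wanted=("avx", "avx2", "avx512f", "fma", "f16c")):
    """Map each wanted flag to True/False based on the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return {name: name in present for name in wanted}
    return {name: False for name in wanted}


# Hypothetical excerpt; on a real Linux box, read the text from /proc/cpuinfo instead.
SAMPLE = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 avx avx2 fma f16c\n"
print(supported_features(SAMPLE))
```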

GPU selection:

If the original question is about GPUs (no matter who the manufacturer is), the answer is usually trivial: the more expensive the GPU is, the more performance you’ll get out of it.

This happens mostly because manufacturers fit out each GPU according to its target market niche. Top-of-the-line GPUs get loads of HBM or HBM2 memory, while bottom-of-the-line ones get something like GDDR5 or DDR4.

Both major GPU manufacturers also have specialized “professional” GPUs available: nVidia Tesla, nVidia Quadro, AMD Radeon Pro. But it’s absolutely the same story there: the more money you pay, the more performance you get.

Now there are also offers for specialized “inference” devices. But those are pretty new products, and we’re yet to see how popular they’ll be in practice.

CPU selection:

When it comes to CPUs, there’s one counter-intuitive issue hidden under the price tag. In desktop CPU model lines, price goes up together with frequency. So, say you have to choose between the i7 8700 and i7 8700K CPUs. The choice is trivial: the more you pay, the more you get. Same instruction sets, same number of cores, but the 8700K has higher frequencies. Easy choice.

But when it comes to server-grade CPUs, price grows together with the number of cores, and frequency goes down at the same time! So it’s possible to end up in a situation where a cheaper CPU serves your model faster than a far more expensive one. For example, take 6 cores at 3.2 GHz vs 22 cores at 2.2 GHz (base frequency in both cases). If the data dimensionality in our model isn’t big and there’s no room for data parallelism, the 6 cores with roughly 45% higher frequency will simply be faster, even though that CPU is about 3 times cheaper than the 22-core one.
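The trade-off between clock speed and core count can be sketched with Amdahl’s law. This is a simplification brought in for illustration: it assumes both CPUs do the same work per clock cycle, uses base frequencies only, and treats the parallel fraction of the workload as a known number, which in practice you would have to measure.

```python
def relative_throughput(base_ghz, cores, parallel_fraction):
    """Clock speed times the Amdahl's-law speedup for the given core count."""
    serial = 1.0 - parallel_fraction
    speedup = 1.0 / (serial + parallel_fraction / cores)
    return base_ghz * speedup


for p in (0.5, 0.95):
    small = relative_throughput(3.2, 6, p)
    big = relative_throughput(2.2, 22, p)
    winner = "6 cores @ 3.2 GHz" if small > big else "22 cores @ 2.2 GHz"
    print(f"parallel fraction {p:.2f}: {small:.2f} vs {big:.2f} -> {winner}")
```

Under these assumptions the 6-core part wins when only half of the work parallelizes, and the 22-core part wins once almost all of it does.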

Takeaway: for better performance we should always keep in mind how exactly our model is going to be used in production. Will it be a data-parallel environment? Can we expect 10 or 20 cores to be available? Can we use half precision to speed things up? Can we use GPUs in a data-parallel environment? What SLA do we need to provide? Ask yourself these questions, and design your model with the answers in mind.

Deep Learning Developer.