Choosing the Right Intel Workstation Processor for TensorFlow Inference and Development

With the increasing number of data scientists using TensorFlow, it might be a good time to discuss which workstation processor to choose from Intel’s lineup. There are several options:

  • Intel Core processors, with i5, i7, and i9 being the most popular
  • Intel Xeon W processors, which are optimized for workstation workloads
  • Intel Xeon Scalable processors (SP), which are optimized for server workloads and 24/7 operation

The next logical question would be: which processor should you choose if TensorFlow inference performance is critical? The first thing we need to do is look at where the performance in the TensorFlow library comes from. One of the main influences on TensorFlow performance (and that of many other machine learning libraries) is Advanced Vector Extensions (AVX), specifically Intel AVX2 and Intel AVX-512. Intel’s runtime libraries use AVX to power TensorFlow performance on Intel processors via the oneAPI Deep Neural Network Library (oneDNN). Other specialized instruction sets, such as the Vector Neural Network Instructions (VNNI) from Intel Deep Learning Boost, are also called by oneDNN.
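
Whether a given machine actually exposes these instruction sets is easy to check before you start tuning. The short sketch below simply lists which of the relevant CPU flags are reported; it assumes the third-party py-cpuinfo package (pip install py-cpuinfo), which is not part of TensorFlow or oneDNN, and uses the common /proc/cpuinfo flag spellings.

    # Minimal sketch: report which SIMD instruction sets this CPU exposes.
    # Assumes the third-party py-cpuinfo package (pip install py-cpuinfo).
    import cpuinfo

    flags = set(cpuinfo.get_cpu_info().get("flags", []))

    # avx2        -> Intel AVX2
    # avx512f     -> Intel AVX-512 foundation instructions
    # avx512_vnni -> Vector Neural Network Instructions (Intel DL Boost)
    for feature in ("avx2", "avx512f", "avx512_vnni"):
        print(f"{feature:12s} supported: {feature in flags}")

A processor that reports avx512_vnni can take the INT8 fast path discussed later in this article.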

What other factors matter? Does the number of cores matter? Base clock speeds? Let’s benchmark a few Intel processors to get a better understanding. For this test, we have five configurations in workstation chassis (Table 1).

Table 1. Benchmarking systems

We are using the ResNet-50 model with the ImageNet data set, tested at different batch sizes for inference throughput and latency. Figure 1 shows how many images the inference model can handle per second. The 18-core systems consistently deliver better throughput. What you’re seeing in these TensorFlow benchmarks is how machine learning (ML) and deep learning (DL) translate from framework to algorithm, and then from algorithm to hardware. At the end of the day, there is a limit to how well many AI algorithms parallelize. Many ML and DL algorithms aren’t naturally parallel, and in a workstation configuration, where the power envelope is defined by the wall socket’s maximum current, a balance must be struck between core count and core frequency.
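
For context on how numbers like these are obtained: the results in this article come from the benchmarking scripts in the IntelAI/models repo described at the end, but the underlying measurement is simple. The sketch below times batched ResNet-50 inference using the stock Keras model and random data; treat it as an illustration of how images-per-second and per-batch latency are derived, not as the harness used for Figures 1 and 2.

    # Illustrative throughput/latency timing with the stock Keras ResNet50.
    # Random weights and random inputs are sufficient for timing purposes.
    import time
    import numpy as np
    import tensorflow as tf

    batch_size = 128
    model = tf.keras.applications.ResNet50(weights=None)
    images = np.random.rand(batch_size, 224, 224, 3).astype("float32")

    model.predict(images, batch_size=batch_size)  # warm-up (graph tracing, allocations)

    runs = 20
    start = time.time()
    for _ in range(runs):
        model.predict(images, batch_size=batch_size)
    elapsed = time.time() - start

    print(f"throughput: {runs * batch_size / elapsed:.1f} images/sec")
    print(f"latency:    {1000 * elapsed / runs:.1f} ms per batch of {batch_size}")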

Figure 1. TensorFlow inference throughput on the benchmarking systems

Let’s take a deeper look at Figure 1. If we compare the dual-socket Intel Xeon 6258R to the single-socket 6240L, the results show that an 18-core processor with slightly higher frequencies is better for TensorFlow inference than a dual-socket configuration with more than three times as many cores. The lesson here is that many ML and DL algorithms don’t scale well, so more cores may not always be better.

Figure 2 shows the inference latency on the benchmarking systems. This is the time it takes an inference model loaded in memory to make a prediction based on new data. Inference latency is important for time-sensitive or real-time applications. The dual-socket system has slightly higher latency in FP32 but the lowest latency in INT8. The 18-core systems have similar latencies and exhibit performance in line with the throughput performance rankings in Figure 1.

Figure 2. TensorFlow inference latency on the benchmarking systems

The Intel Xeon W-2295 performs the best in most of the tests, but why is that? It has to do with the Intel AVX-512 base and turbo frequencies. The Intel Xeon W processor series is clocked higher than the Intel Xeon SP series when running AVX instructions. Under AVX instructions, the processor shifts to lower clock speeds to offset the additional power draw, and since the vast majority of ML and DL code uses AVX-512, the higher base and turbo frequencies of the Intel Xeon W give it faster throughput than the comparable Intel Xeon SP processor. Additionally, 18 cores appears to be the best balance between core count and AVX-512 frequency in these tests: going above 18 cores sacrifices AVX frequency and increases latency, while fewer cores reduce throughput and also increase latency.

Why is there such an advantage in INT8 batch inference with the Intel Xeon processors over the Intel Core i9? What you are seeing there is the use of the VNNI instructions by oneDNN, which reduce convolution operations from three instructions to one. The Intel Xeon processors used in these benchmarks support VNNI for INT8, but the Intel Core processor does not. The performance difference is quite noticeable in the previous charts.

Finally, let’s talk about how to choose the Intel processor to best fit your TensorFlow requirements:

  • Do you need large memory to load the data set? Do you need the ability to administer your workstation remotely? If so, get a workstation with the Intel Xeon Gold 6240L, which can be configured with up to 3.3 TB of memory using a mix of Intel Optane DC Persistent Memory and DRAM.

  • Need the best all-rounder with Intel Xeon features and moderate system memory? Use the Intel Xeon W-2295. If you can forgo some server-class features like Intel Optane DCPMM and 24/7 operation, you get equivalent inference performance at half the cost of the Intel Xeon SP configurations and with over 30% less power.

  • Need a budget-friendly option? An Intel Core processor such as the i9-10900K fits the bill.

  • Have additional inference needs on the workstation beyond the CPU? We have products such as Intel Movidius and purpose-built AI processors from Intel’s Habana product line that can help fit those needs.

With the performance attributes of TensorFlow detailed above, picking the right workstation CPU should be a bit easier.

If you want to reproduce these tests to evaluate your own TensorFlow needs, use the following instructions. First, download the GitHub repo (https://github.com/IntelAI/models) and configure the Conda environment (channel: Intel, Python=3.7.7) and runtime environment:

  • Set OMP_NUM_THREADS to the number of cores

  • KMP_BLOCKTIME=0

  • intra_op_parallelism_threads=<cores>

  • inter_op_parallelism_threads=2

  • Prepend numactl --cpunodebind=0 --membind=0 to the command below for systems with two or more sockets

Finally, run the following command: python launch_benchmark.py --in-graph <built model> --model-name resnet50 --framework tensorflow --precision <fp32 or int8> --mode inference --batch-size=128 --socket-id 0 --data-location <synthetic or real dataset>
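
These settings configure the environment that the benchmark runs in. If you are benchmarking your own TensorFlow script rather than the launch_benchmark.py harness, the same settings from the list above can be applied directly in Python; a rough sketch (the value 16 is only an illustrative core count, substitute your own):

    # Apply the threading settings listed above in a standalone TensorFlow script.
    import os

    os.environ["OMP_NUM_THREADS"] = "16"  # number of physical cores (example value)
    os.environ["KMP_BLOCKTIME"] = "0"

    import tensorflow as tf  # import after the environment variables are set

    tf.config.threading.set_intra_op_parallelism_threads(16)  # intra_op_parallelism_threads
    tf.config.threading.set_inter_op_parallelism_threads(2)   # inter_op_parallelism_threads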

For more information or to learn more about Intel products, please visit www.intel.com.

Translated from: https://medium.com/intel-analytics-software/choosing-the-right-intel-workstation-processor-for-tensorflow-inference-and-development-4afeec41b2a9
