PyPy 17x faster than Python. Can Python be sped up?

Updated: 2024-10-26 22:26:12
Question

Solving a recent Advent of Code problem, I found my default Python was ~40x slower than PyPy. I was able to get that down to about 17x with this code by limiting calls to len and limiting global lookups by running it in a function.

Right now, e.py runs in 5.162 seconds on Python 3.6.3 and 0.297 seconds on PyPy on my machine.

My question is: is this the irreducible speedup of JIT, or is there some way to speed up the CPython answer? (short of extreme means: I could go to Cython/Numba or something?) How do I convince myself that there's nothing more I can do?
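The two tricks mentioned (limiting calls to len and limiting global lookups by working inside a function) can be sketched like this. This is a minimal illustration of the technique, not the poster's actual code, and the exact speedup varies by machine and CPython version:

```python
import timeit

data = list(range(10_000))

def globals_version():
    """len() is called and the global `data` is looked up on every iteration."""
    total = 0
    i = 0
    while i < len(data):
        total += data[i]
        i += 1
    return total

def locals_version():
    """Bind the global to a local and call len() once, outside the loop."""
    total = 0
    i = 0
    d = data
    n = len(d)
    while i < n:
        total += d[i]
        i += 1
    return total

assert globals_version() == locals_version()
print(timeit.timeit(globals_version, number=100))
print(timeit.timeit(locals_version, number=100))   # usually noticeably faster
```

In CPython, local variable access is an array index into the frame, while each global access is a dict lookup, which is why the second version tends to win.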

See the gist for the list of numbers input file.

As described in the problem statement, they represent jump offsets. position += offsets[current], and increment the current offset by 1. You're done when a jump takes you outside the list.

Here's the example given (the full input that takes 5 seconds is much longer, and has larger numbers):

```
(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately,
                   the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified.
                   The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.
```
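The trace above follows the basic rule where every used offset is incremented (the question's code uses the modified rule that decrements offsets >= 3). A short simulation of that basic rule, matching the trace:

```python
def steps_to_escape(offsets):
    """Count jumps until the position leaves the list (always-increment rule)."""
    offsets = list(offsets)        # work on a copy, don't mutate the caller's list
    position = 0
    counter = 0
    while 0 <= position < len(offsets):
        jump = offsets[position]
        offsets[position] += 1     # increment the offset we just used
        position += jump
        counter += 1
    return counter

print(steps_to_escape([0, 3, 0, 1, -3]))   # 5, matching the walkthrough
```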

Code:

```python
def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            if location < 0:
                print(counter)
                break
            counter += 1
        except:
            print(counter)
            break

if __name__ == "__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)
```

edit: I compiled and ran the code with Cython, that dropped runtime to 2.53s, but that's still almost an order of magnitude slower than PyPy.

edit: Numba gets me within 2x.

edit: The best Cython I could write got down to 1.32s, a little over 4x PyPy.

edit: Moving the cmd variable into a cdef, as suggested by @viraptor, got the Cython down to .157 seconds! Nearly a full order of magnitude faster, and not that far from regular python. Still, I come away from this extremely impressed with the PyPy JIT, which did all this for free!

Answer

As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython 3.6.2 with the code in the question, on a Skylake x86, so even your faster Python versions are still a couple of factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job.)

Letting a good JIT or ahead-of-time compiler see the loop logic is key for performance here. The problem logic is inherently serial, so there's no scope for getting Python to run already-compiled C to do something over the whole array (like NumPy), because there won't be compiled C for this specific problem unless you use Cython or something. Having each step of the problem go back to the CPython interpreter is death for performance, when its slowness isn't hidden by memory bottlenecks or anything.

Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an add from the critical path loop-carried dependency chain, bringing it down to just the 4 cycle L1D load-use latency for a simple addressing mode (when the pointer comes from another load), not 6c = 5c + 1c for an indexed addressing mode + add latency).

But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585 element array was still only 18kiB so it fits in the 32kiB L1D cache. Like the Linux x32 ABI does, or the AArch64 ILP32 ABI.)

Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (Commented out, with the original perf stat output left in this answer, because it's interesting to see the effect of branchless vs. branchy with mispredicts.)

```cpp
unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;
    do {
        //location += offset[location]++;    // simple version
        // >=3 conditional version below
        int off = offset[location];
        offset[location] += (off>=3) ? -1 : 1;         // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.
        location += off;
        counter++;
    } while (location < size);
    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ios::sync_with_stdio(false);   // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}
```

With clang4.0.1 -O3 -march=skylake, the inner loop is branchless; it uses a conditional-move for the >=3 condition. I used ? : because that's what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer

```asm
.LBB1_4:                               # =>This Inner Loop Header: Depth=1
    mov     ebx, edi                   ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d                   ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi                   ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi                   ; location += off (original value)
    add     edx, 1                     ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                    ; unsigned compare against array size
```

Here's the output of perf stat ./a.out < input.txt (for the clang version), on my i7-6700k Skylake CPU:

```
21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               119      page-faults               #    0.003 M/sec
       143,680,934      cycles                    #    3.900 GHz
       245,059,492      instructions              #    1.71  insn per cycle
        22,654,670      branches                  #  614.890 M/sec
            20,171      branch-misses             #    0.09% of all branches

       0.036953258 seconds time elapsed
```

The average instructions-per-clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load+add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.

Changing from int to short has no performance downside (as expected; movsx has the same latency as mov on Skylake), but halves the memory consumption so a larger array could fit in L1D cache if needed.
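As a rough CPython analogy for the int → short change (my addition, using only the stdlib array module), the per-element footprint halves the same way:

```python
from array import array

offsets_int = array('i', [0, 3, 0, 1, -3])    # C int storage, 4 bytes each on typical platforms
offsets_short = array('h', [0, 3, 0, 1, -3])  # C short storage, 2 bytes each

print(offsets_int.itemsize, offsets_short.itemsize)   # typically: 4 2
print(list(offsets_int) == list(offsets_short))       # True: same values, half the storage
```

The halving matters far more in the C++ version, where the point is keeping the whole array inside the 32 KiB L1D cache; in CPython, interpreter overhead dwarfs any cache effect.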

I tried compiling the array into the program (as int offsets[] = { file contents with commas added };) so it doesn't have to be read and parsed. It also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to max clock speed in well under a millisecond thanks to hardware P-state management in Skylake.)

As described earlier, pointer-chasing with a simple addressing mode like [rdi] instead of [rdi + rdx*4] has 1c lower latency, and avoids the add (index += offset becomes current = target). Intel since IvyBridge has zero-latency integer mov between registers so that doesn't lengthen the critical path. Here's the source (with comments) + asm for this hacky optimization. A typical run (with text parsing into a std::vector): 23.26 +- 0.05 ms, 90.725 M cycles (3.900 GHz), 288.724 M instructions (3.18 IPC). Interestingly it's actually more total instructions, but runs much faster due to the lower latency of the loop-carried dependency chain.

gcc uses a branch and it's about 2x slower. (14% of branches are mispredicted according to perf stat on the whole program. It's only branching as part of updating the values, but branch misses stall the pipeline so they affect the critical path too, in a way that data dependencies don't here. This seems like a poor decision by the optimizer.)

Rewriting the conditional as offset[location] = (off>=3) ? off-1 : off+1; convinces gcc to go branchless with asm that looks good.

gcc7.1.1 -O3 -march=skylake (for the original source that compiles with a branch, using (off>=3) ? -1 : 1):

```
 Performance counter stats for './ec-gcc':

         70.032162      task-clock (msec)         #    0.998 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               118      page-faults               #    0.002 M/sec
       273,115,485      cycles                    #    3.900 GHz
       255,088,412      instructions              #    0.93  insn per cycle
        44,382,466      branches                  #  633.744 M/sec
         6,230,137      branch-misses             #   14.04% of all branches

       0.070181924 seconds time elapsed
```

vs. CPython (Python 3.6.2 on Arch Linux):

```
perf stat python ./orig-v2.e.py
21841249

 Performance counter stats for 'python ./orig-v2.e.py':

       3046.703831      task-clock (msec)         #    1.000 CPUs utilized
                10      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               923      page-faults               #    0.303 K/sec
    11,880,130,860      cycles                    #    3.899 GHz
    38,731,286,195      instructions              #    3.26  insn per cycle
     8,489,399,768      branches                  # 2786.421 M/sec
        18,666,459      branch-misses             #    0.22% of all branches

       3.046819579 seconds time elapsed
```

I don't have PyPy or other Python implementations installed, sorry.
