Java 8流不可预测的性能下降没有明显的原因

本文介绍了Java 8流不可预测的性能下降没有明显的原因的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我正在使用Java 8流来迭代包含子列表的列表。外部列表大小在100到1000之间变化（不同的测试运行），内部列表大小始终为5.

I am using Java 8 streams to iterate over a list with sublists. The outer list size varies between 100 to 1000 (different test runs) and the inner list size is always 5.

有2个基准运行显示意外的性能偏差。

There are 2 benchmark runs which show unexpected performance deviations.

package benchmark; import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.infra.Blackhole; import java.io.IOException; import java.util.concurrent.ThreadLocalRandom; import java.util.*; import java.util.function.*; import java.util.stream.*; @Threads(32) @Warmup(iterations = 25) @Measurement(iterations = 5) @State(Scope.Benchmark) @Fork(1) @BenchmarkMode(Mode.Throughput) public class StreamBenchmark { @Param({"700", "600", "500", "400", "300", "200", "100"}) int outerListSizeParam; final static int INNER_LIST_SIZE = 5; List<List<Integer>> list; Random rand() { return ThreadLocalRandom.current(); } final BinaryOperator<Integer> reducer = (val1, val2) -> val1 + val2; final Supplier<List<Integer>> supplier = () -> IntStream .range(0, INNER_LIST_SIZE) .mapToObj(ptr -> rand().nextInt(100)) .collect(Collectors.toList()); @Setup public void init() throws IOException { list = IntStream .range(0, outerListSizeParam) .mapToObj(i -> supplier.get()) .collect(Collectors.toList()); } @Benchmark public void loop(Blackhole bh) throws Exception { List<List<Integer>> res = new ArrayList<>(); for (List<Integer> innerList : list) { if (innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) { res.add(innerList); } } bh.consume(res); } @Benchmark public void stream(Blackhole bh) throws Exception { List<List<Integer>> res = list .stream() .filter(innerList -> innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) .collect(Collectors.toList()); bh.consume(res); } }

运行1

Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 700 thrpt 5 22488.601 ? 1128.543 ops/s StreamBenchmark.loop 600 thrpt 5 26010.430 ? 1161.854 ops/s StreamBenchmark.loop 500 thrpt 5 361837.395 ? 12777.016 ops/s StreamBenchmark.loop 400 thrpt 5 451774.457 ? 22517.801 ops/s StreamBenchmark.loop 300 thrpt 5 744677.723 ? 23456.913 ops/s StreamBenchmark.loop 200 thrpt 5 1102075.707 ? 38678.994 ops/s StreamBenchmark.loop 100 thrpt 5 2334981.090 ? 100973.551 ops/s StreamBenchmark.stream 700 thrpt 5 22320.346 ? 496.432 ops/s StreamBenchmark.stream 600 thrpt 5 26091.609 ? 1044.868 ops/s StreamBenchmark.stream 500 thrpt 5 31961.096 ? 497.854 ops/s StreamBenchmark.stream 400 thrpt 5 377701.859 ? 11115.990 ops/s StreamBenchmark.stream 300 thrpt 5 53887.652 ? 1228.245 ops/s StreamBenchmark.stream 200 thrpt 5 78754.294 ? 2173.316 ops/s StreamBenchmark.stream 100 thrpt 5 1564899.788 ? 47369.698 ops/s

运行2

Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 10 16179.702 ? 260.134 ops/s StreamBenchmark.loop 700 thrpt 10 22924.319 ? 329.134 ops/s StreamBenchmark.loop 600 thrpt 10 26871.267 ? 416.464 ops/s StreamBenchmark.loop 500 thrpt 10 353043.221 ? 6628.980 ops/s StreamBenchmark.loop 300 thrpt 10 772234.261 ? 10075.536 ops/s StreamBenchmark.loop 100 thrpt 10 2357125.442 ? 30824.834 ops/s StreamBenchmark.stream 1000 thrpt 10 15526.423 ? 147.454 ops/s StreamBenchmark.stream 700 thrpt 10 22347.898 ? 117.360 ops/s StreamBenchmark.stream 600 thrpt 10 26172.790 ? 229.745 ops/s StreamBenchmark.stream 500 thrpt 10 31643.518 ? 428.680 ops/s StreamBenchmark.stream 300 thrpt 10 536037.041 ? 6176.192 ops/s StreamBenchmark.stream 100 thrpt 10 153619.054 ? 1450.839 ops/s

我有两个问题：

为什么在两次测试运行中，循环+ 500和循环+ 600之间存在一致的显着性能差异？

为什么在Run1 stream + 400和Run2 stream + 300是否存在显着但不一致的性能偏差？

看起来JIT有时会做出次优的优化决策，导致巨大的性能下降。

It looks like the JIT sometimes makes suboptimal optimization decisions, causing a huge performance drop.

测试机器有128GB RAM和32个CPU核心：

The test machine has 128GB RAM and 32 CPU cores:

java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 62 Model name: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz Stepping: 4 CPU MHz: 1201.078 CPU max MHz: 3400.0000 CPU min MHz: 1200.0000 BogoMIPS: 5201.67 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 20480K NUMA node0 CPU(s): 0-7,16-23 NUMA node1 CPU(s): 8-15,24-31

PS没有流添加基准。这些测试（循环+流+ pureLoop）让我觉得使用流和lambdas需要大量的微优化工作，并且无论如何都不能保证一致的性能。

P.S. Added benchmark with no stream. These tests (loop + stream + pureLoop) make me think that using streams and lambdas would require a lot of micro optimisation efforts and does not guarantee consistent performance anyway.

@Benchmark public void pureLoop(Blackhole bh) throws Exception { List<List<Integer>> res = new ArrayList<>(); for (List<Integer> innerList : list) { int sum = 0; for (Integer i : innerList) { sum += i; } if (sum == rand().nextInt(2000)) res.add(innerList); } bh.consume(res); }

运行3（纯循环）

Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 5 15848.277 ? 445.624 ops/s StreamBenchmark.loop 700 thrpt 5 22330.289 ? 484.554 ops/s StreamBenchmark.loop 600 thrpt 5 26353.565 ? 631.421 ops/s StreamBenchmark.loop 500 thrpt 5 358144.956 ? 8273.981 ops/s StreamBenchmark.loop 400 thrpt 5 591471.382 ? 17725.212 ops/s StreamBenchmark.loop 300 thrpt 5 785458.022 ? 23775.650 ops/s StreamBenchmark.loop 200 thrpt 5 1192328.880 ? 40006.056 ops/s StreamBenchmark.loop 100 thrpt 5 2330555.766 ? 73143.081 ops/s StreamBenchmark.pureLoop 1000 thrpt 5 1024629.128 ? 4387.106 ops/s StreamBenchmark.pureLoop 700 thrpt 5 1495365.029 ? 31659.941 ops/s StreamBenchmark.pureLoop 600 thrpt 5 1787432.825 ? 16611.868 ops/s StreamBenchmark.pureLoop 500 thrpt 5 2087093.023 ? 20143.165 ops/s StreamBenchmark.pureLoop 400 thrpt 5 2662946.999 ? 33326.079 ops/s StreamBenchmark.pureLoop 300 thrpt 5 3657830.227 ? 55020.775 ops/s StreamBenchmark.pureLoop 200 thrpt 5 5365706.786 ? 64404.783 ops/s StreamBenchmark.pureLoop 100 thrpt 5 10477430.730 ? 187641.413 ops/s StreamBenchmark.stream 1000 thrpt 5 15576.304 ? 250.620 ops/s StreamBenchmark.stream 700 thrpt 5 22286.965 ? 1153.734 ops/s StreamBenchmark.stream 600 thrpt 5 26109.258 ? 296.382 ops/s StreamBenchmark.stream 500 thrpt 5 31343.986 ? 1270.210 ops/s StreamBenchmark.stream 400 thrpt 5 39696.775 ? 1812.355 ops/s StreamBenchmark.stream 300 thrpt 5 536932.353 ? 41249.909 ops/s StreamBenchmark.stream 200 thrpt 5 77797.301 ? 976.641 ops/s StreamBenchmark.stream 100 thrpt 5 155387.348 ? 3182.841 ops/s

解决方案：按照apangin的建议禁用分层编译使JIT结果稳定。

Solution: as recommended by apangin disabling tiered compilation made JIT results stable.

java -XX:-TieredCompilation -jar test-jmh.jar Benchmark (outerListSizeParam) Mode Cnt Score Error Units StreamBenchmark.loop 1000 thrpt 5 160410.288 ? 4426.320 ops/s StreamBenchmark.loop 700 thrpt 5 230524.018 ? 4426.740 ops/s StreamBenchmark.loop 600 thrpt 5 266266.663 ? 9078.827 ops/s StreamBenchmark.loop 500 thrpt 5 324182.307 ? 8452.368 ops/s StreamBenchmark.loop 400 thrpt 5 400793.677 ? 12526.475 ops/s StreamBenchmark.loop 300 thrpt 5 534618.231 ? 25616.352 ops/s StreamBenchmark.loop 200 thrpt 5 803314.614 ? 33108.005 ops/s StreamBenchmark.loop 100 thrpt 5 1827400.764 ? 13868.253 ops/s StreamBenchmark.pureLoop 1000 thrpt 5 1126873.129 ? 33307.600 ops/s StreamBenchmark.pureLoop 700 thrpt 5 1560200.150 ? 150146.319 ops/s StreamBenchmark.pureLoop 600 thrpt 5 1848113.823 ? 16195.103 ops/s StreamBenchmark.pureLoop 500 thrpt 5 2250201.116 ? 130995.240 ops/s StreamBenchmark.pureLoop 400 thrpt 5 2839212.063 ? 142008.523 ops/s StreamBenchmark.pureLoop 300 thrpt 5 3807436.825 ? 140612.798 ops/s StreamBenchmark.pureLoop 200 thrpt 5 5724311.256 ? 77031.417 ops/s StreamBenchmark.pureLoop 100 thrpt 5 11718427.224 ? 101424.952 ops/s StreamBenchmark.stream 1000 thrpt 5 16186.121 ? 249.806 ops/s StreamBenchmark.stream 700 thrpt 5 22071.884 ? 703.729 ops/s StreamBenchmark.stream 600 thrpt 5 25546.378 ? 472.804 ops/s StreamBenchmark.stream 500 thrpt 5 32271.659 ? 437.048 ops/s StreamBenchmark.stream 400 thrpt 5 39755.841 ? 506.207 ops/s StreamBenchmark.stream 300 thrpt 5 52309.706 ? 1271.206 ops/s StreamBenchmark.stream 200 thrpt 5 79277.532 ? 2040.740 ops/s StreamBenchmark.stream 100 thrpt 5 161244.347 ? 3882.619 ops/s

推荐答案

此效果是由类型配置文件污染。让我解释一下简化的基准：

This effect is caused by Type Profile Pollution. Let me explain on a simplified benchmark:

@State(Scope.Benchmark) public class Streams { @Param({"500", "520"}) int iterations; @Setup public void init() { for (int i = 0; i < iterations; i++) { Stream.empty().reduce((x, y) -> x); } } @Benchmark public long loop() { return Stream.empty().count(); } }

虽然迭代这里的参数变化非常小，并且它不会影响主基准测试循环，结果会导致非常令人惊讶的2.5倍性能下降：

Though iteration parameter here changes very slightly and it does not affect the main benchmark loop, the results expose very surprising 2.5x performance degradation:

Benchmark (iterations) Mode Cnt Score Error Units Streams.loop 500 thrpt 5 29491,039 ± 240,953 ops/ms Streams.loop 520 thrpt 5 11867,860 ± 344,779 ops/ms

现在让我们用 -prof perfasm 选项运行JMH看到最热门的代码区域：

Now let's run JMH with -prof perfasm option to see the hottest code regions:

快速案例（迭代= 500）：

Fast case (iterations = 500):

....[Hottest Methods (after inlining)].................................. 48,66% bench.generated.Streams_loop::loop_thrpt_jmhStub 23,14% <unknown> 2,99% java.util.stream.Sink$ChainedReference::<init> 1,98% org.openjdk.jmh.infra.Blackhole::consume 1,68% java.util.Objects::requireNonNull 0,65% java.util.stream.AbstractPipeline::evaluate

慢速情况（迭代= 520）：

Slow case (iterations = 520):

....[Hottest Methods (after inlining)].................................. 40,09% java.util.stream.ReduceOps$ReduceOp::evaluateSequential 22,02% <unknown> 17,61% bench.generated.Streams_loop::loop_thrpt_jmhStub 1,25% org.openjdk.jmh.infra.Blackhole::consume 0,74% java.util.stream.AbstractPipeline::evaluate

看起来慢的情况花在 ReduceOp上的时间最多.evaluateSequential 未内联的方法。此外，如果我们研究这种方法的汇编代码，我们会发现最长的操作是 checkcast 。

Looks like the slow case spends the most time in ReduceOp.evaluateSequential method that is not inlined. Furthermore, if we study the assembly code for this method we'll find that the longest operation is checkcast.

您知道HotSpot编译器的工作原理：在JIT启动之前，一个方法在解释器中执行一段时间以收集配置文件数据，例如调用什么方法，看到什么类，采用什么分支等。使用分层编译，配置文件也在C1编译代码中收集。然后使用该配置文件生成C2优化代码。但是，如果应用程序在中间更改执行模式，则生成的代码可能不适合修改后的行为。

You know how HotSpot compiler works: before the JIT starts, a method is executed in interpreter for some time to collect the profile data, e.g. what methods are called, what classes are seen, what branches are taken etc. With Tiered compilation the profile is also collected in C1-compiled code. The profile is then used to generate C2-optimizied code. However if the application changes execution pattern in the middle, the generated code may be not optimal for the modified behavior.

让我们使用 -XX：+ PrintMethodData （在调试JVM中可用）比较执行配置文件：

Let's use -XX:+PrintMethodData (available in debug JVM) to compare the execution profiles:

----- Fast case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; interpreter_invocation_count: 13382 invocation_counter: 13382 backedge_counter: 0 mdo size: 552 bytes 0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 0 bci: 2 VirtualCallData count(0) entries(1) 'java/util/stream/ReduceOps$8'(12870 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 48 bci: 6 VirtualCallData count(0) entries(1) 'java/util/stream/ReferencePipeline$5'(12870 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink> 96 bci: 9 ReceiverTypeData count(0) entries(1) 'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 144 bci: 12 VirtualCallData count(0) entries(1) 'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00) 17 areturn ----- Slow case ----- java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; interpreter_invocation_count: 54751 invocation_counter: 54751 backedge_counter: 0 mdo size: 552 bytes 0 aload_1 1 fast_aload_0 2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 0 bci: 2 VirtualCallData count(0) entries(2) 'java/util/stream/ReduceOps$2'(16 0.00) 'java/util/stream/ReduceOps$8'(54223 1.00) 5 aload_2 6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 48 bci: 6 VirtualCallData count(0) entries(2) 'java/util/stream/ReferencePipeline$Head'(16 0.00) 'java/util/stream/ReferencePipeline$5'(54223 1.00) 9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink> 96 bci: 9 ReceiverTypeData count(0) entries(2) 'java/util/stream/ReduceOps$2ReducingSink'(16 0.00) 'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 144 bci: 12 VirtualCallData count(0) entries(2) 'java/util/stream/ReduceOps$2ReducingSink'(16 0.00) 'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00) 17 areturn

你看，初始化循环运行时间过长其统计信息出现在执行配置文件中：所有虚拟方法都有两个实现，checkcast还有两个不同的条目。在快速的情况下，配置文件没有被污染：所有站点都是单态的，JIT可以轻松地内联和优化它们。

You see, the initialization loop ran too long that its statistics appeared in the execution profile: all virtual methods have two implementations and checkcast has also two different entries. In the fast case the profile is not polluted: all sites are monomorphic, and JIT can easily inline and optimize them.

原始基准测试也是如此：更长的流 init（）方法中的操作污染了配置文件。如果您使用配置文件和分层编译选项，结果可能会大不相同。例如，尝试

The same is true for your original benchmark: longer stream operations in init() method polluted the profile. If you play with profile and tiered compilation options, the results can be quite different. For example, try

-XX：-ProfileInterpreter

-XX：Tier3InvocationThreshold = 1000

-XX：-TieredCompilation

-XX:-ProfileInterpreter

-XX:Tier3InvocationThreshold=1000

-XX:-TieredCompilation

最后，这个问题不是唯一的。由于配置文件污染，已经有多个与性能回归相关的JVM错误： JDK-8015416 ， JDK-8015417 ， JDK-8059879 ...希望这将在Java 9中得到改进。

Finally, this problem is not unique. There are already multiple JVM bugs related to performance regressions due to profile pollution: JDK-8015416, JDK-8015417, JDK-8059879... Hope this will be improved in Java 9.

更多推荐

Java 8流不可预测的性能下降没有明显的原因

Java 8流不可预测的性能下降没有明显的原因

发布评论取消回复

最近发表

热门文章

标签列表