Performance issues with Dataflow batch loads using Apache Beam

Updated: 2024-10-26 12:31:53
Problem description

I was benchmarking Dataflow batch loads and found them far slower than the same loads run through the BigQuery command-line tool.

The file was around 20 MB with millions of records. I tried different machine types and got the best load performance on n1-highmem-4, where loading the target BigQuery table took approximately 8 minutes.

When I ran the same table load with the bq command-line utility, it took barely 2 minutes to process and load the same volume of data. Are there any insights into this poor load performance with Dataflow jobs? How can I improve the performance to make it comparable to the bq command-line utility?
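
For reference, the command-line load mentioned above might look roughly like the sketch below. The question does not show the actual command, so the table name, the Cloud Storage path, and the CSV format are hypothetical placeholders; only `bq load`, `--source_format`, and `--autodetect` are standard bq options. The helper only builds the command and does not run it.

```python
import shlex


def build_bq_load_command(table, source, source_format="CSV", autodetect=True):
    """Build the argument list for a `bq load` invocation without running it."""
    cmd = ["bq", "load", f"--source_format={source_format}"]
    if autodetect:
        # Let BigQuery infer the schema instead of passing one explicitly.
        cmd.append("--autodetect")
    cmd += [table, source]
    return cmd


# Hypothetical table and file names, used for illustration only.
command = build_bq_load_command("my_dataset.my_table", "gs://my-bucket/data.csv")
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would execute the load, provided the bq tool is installed and authenticated.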

Recommended answer

Most likely, a few minutes are being spent starting up and shutting down VMs. If what you are doing can be done directly with the BQ CLI, then using Dataflow for it is likely overkill. However, you can update your question with more details (e.g. your code and the Dataflow job ID); there may be something else inefficient going on.
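
The fixed-overhead argument can be illustrated with a little arithmetic. This is a minimal sketch: the 8-minute and 2-minute totals come from the question, while the candidate overhead values are hypothetical, since the answer only says "a few minutes" and the real figure would have to be read from the job's own timings.

```python
def processing_minutes(total_minutes, overhead_minutes):
    """Time left for actual work once fixed VM startup/shutdown time is removed."""
    if overhead_minutes > total_minutes:
        raise ValueError("overhead cannot exceed the total job time")
    return total_minutes - overhead_minutes


DATAFLOW_TOTAL = 8.0  # minutes, reported in the question
BQ_TOTAL = 2.0        # minutes, reported in the question

# Hypothetical overhead values; the real startup/shutdown time is not given.
for overhead in (2.0, 4.0, 6.0):
    remaining = processing_minutes(DATAFLOW_TOTAL, overhead)
    print(f"overhead={overhead} min -> processing={remaining} min "
          f"(bq total: {BQ_TOTAL} min)")
```

Under this simple model, the overhead would need to be about 6 minutes to account for the whole gap by itself, so a smaller measured overhead would point to some other inefficiency in the pipeline.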

Published: 2023-10-26 10:04:40
Article link: https://www.elefans.com/category/jswz/34/1529819.html