How to speed up bulk import into Google Cloud Datastore with multiple workers?


Problem Description



I have an apache-beam based Dataflow job that reads from a single text file (stored in Google Cloud Storage) using the VCF source, transforms the text lines into Datastore entities, and writes them to the Datastore sink. The workflow works fine, but the downsides I noticed are:

The write speed into Datastore is at most around 25-30 entities per second. I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (in the Dataflow monitoring graph, the target number of workers briefly increased to 2 but was reduced to 1 "based on the ability to parallelize the work in the currently running step").
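
For reference, here is a minimal sketch of how those worker settings can be passed to the Python SDK; note that the Python option names use underscores (--num_workers etc.), and the project id and bucket below are placeholders:

# A minimal sketch, assuming the Dataflow runner; the project id and temp
# bucket are placeholders, and the exact option set depends on the Beam version.
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--num_workers=10',
    '--max_num_workers=100',
])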

I did not use an ancestor path for the keys; all the entities are of the same kind.

The pipeline code looks like this:

# Note: import paths depend on the Beam version; vcfio and the v1 Datastore
# connector shown here match the (older) SDK era the question refers to.
import apache_beam as beam
from apache_beam.io import vcfio
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore


def write_to_datastore(project, user_options, pipeline_options):
  """Creates a pipeline that writes entities to Cloud Datastore."""
  with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
     | 'Create my entity' >> beam.ParDo(ToEntityFn(), user_options.kind)
     | 'Write to datastore' >> WriteToDatastore(project))
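
For completeness, ToEntityFn is not shown in the question. A hypothetical version, written in the style of Beam's old datastore_wordcount example against the same v1 connector, might look roughly like this; the property names and the uuid-based key name are assumptions for illustration only:

# Hypothetical sketch only: assumes the same (older) v1 Datastore connector as
# WriteToDatastore above; the properties taken from the VCF record are made up.
import uuid

import apache_beam as beam
from google.cloud.proto.datastore.v1 import entity_pb2
from googledatastore import helper as datastore_helper


class ToEntityFn(beam.DoFn):
  def process(self, record, kind):
    entity = entity_pb2.Entity()
    # No ancestor path: each key is just (kind, random name), so every entity
    # is its own entity group and writes can proceed in parallel.
    datastore_helper.add_key_path(entity.key, kind, str(uuid.uuid4()))
    datastore_helper.add_properties(entity, {
        'reference_name': record.reference_name,  # assumed vcfio.Variant field
        'start': record.start,                    # assumed vcfio.Variant field
    })
    yield entity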

Because I have millions of rows to write into Datastore, writing at 30 entities/sec would take far too long.

Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the import faster? Am I missing something in the num_workers setup? Thanks!

Solution

I looked into the design of vcfio. I suspect (if I understand it correctly) that the reason I always get one worker when the input is a single file is due to the limits of _VcfSource and the constraints of the VCF format: the format has a header section that defines how to interpret the non-header lines, so each worker that reads the source has to process the entire file. When I split the single file into 5 separate files that share the same header, I successfully got up to 5 workers (but no more than that, probably for the same reason).
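
To illustrate the workaround, here is a rough sketch of pre-splitting an (already decompressed) VCF file into shards that each repeat the header; the shard count and file naming are arbitrary:

# Rough sketch: split a local, uncompressed VCF into n shards that all carry
# the same header lines (lines starting with '#'). For files with millions of
# records you would stream the records instead of holding them all in memory.
def split_vcf(path, n_shards=5):
  header, records = [], []
  with open(path) as f:
    for line in f:
      (header if line.startswith('#') else records).append(line)
  shard_size = len(records) // n_shards + 1
  for i in range(n_shards):
    with open('%s.shard-%d.vcf' % (path, i), 'w') as out:
      out.writelines(header)
      out.writelines(records[i * shard_size:(i + 1) * shard_size])

The resulting shards then need to be uploaded to Cloud Storage (e.g. with gsutil cp) before the pipeline can read them.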

One thing I don't understand is why, if the number of workers that read can be limited to 5 (in this case), the number of workers that write is also limited to 5. Anyway, I think I have found an alternative way to trigger multiple workers with the Beam Dataflow runner (use pre-split VCF files). There is also a related approach in the gcp-variant-transforms project, in which vcfio has been significantly extended; it seems to support multiple workers with a single input VCF file. I hope the changes in that project can be merged into the Beam project too.
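
With the pre-split shards in place, the read step only needs a file pattern instead of a single path. Reusing the names from the snippet above (the GCS prefix is a placeholder):

# Sketch only: ReadFromVcf accepts a file pattern, so each matching shard can
# be picked up by a different worker; the bucket path is a placeholder.
with beam.Pipeline(options=pipeline_options) as p:
  (p
   | 'Read vcf shards' >> vcfio.ReadFromVcf('gs://my-bucket/vcf-shards/*.vcf')
   | 'Create my entity' >> beam.ParDo(ToEntityFn(), user_options.kind)
   | 'Write to datastore' >> WriteToDatastore(project))

Only the input pattern changes; the rest of the pipeline stays as before.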
