How to perform blocking IO in an Apache Spark job?

Updated: 2024-10-27 04:38:02

Question

What if, while traversing an RDD, I need to compute values in the dataset by calling an external (blocking) service? How do you think that could be achieved?

val values: Future[RDD[Double]] = Future sequence tasks


I've tried to create a list of Futures, but since an RDD is not Traversable, Future.sequence is not suitable.
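A small sketch of the mismatch the questioner ran into: `Future.sequence` works over standard Scala collections (which provide the required implicit builders), but there is no such instance for `RDD`, so the same call does not compile against an `RDD[Future[Double]]`. This is plain Scala without Spark, for illustration only.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object SequenceDemo extends App {
  // Future.sequence flips a collection of futures into a future of a
  // collection. It needs a standard collection type such as List;
  // an RDD does not satisfy the required implicits.
  val futures: List[Future[Double]] = List(1.0, 2.0, 3.0).map(x => Future(x * 2))
  val combined: Future[List[Double]] = Future.sequence(futures)
  println(Await.result(combined, 5.seconds)) // List(2.0, 4.0, 6.0)
}
```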

I just wonder if anyone has had such a problem, and how did you solve it? What I'm trying to achieve is parallelism on a single worker node, so that I can call that external service 3000 times per second.
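The kind of single-worker fan-out the questioner is after can be sketched in plain Scala: inside one partition, turn each item into a future that performs the blocking call, then sequence and await the batch. `callBlockingService` and the pool size are illustrative assumptions, not part of the original question; with ~10 ms per call, a 32-thread pool yields on the order of a few thousand calls per second on one worker.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PartitionFanOut extends App {
  // Hypothetical stand-in for the external blocking service.
  def callBlockingService(item: String): Int = {
    Thread.sleep(10) // simulate network latency
    item.length
  }

  // A dedicated fixed-size pool for blocking calls; the default
  // global pool is sized for CPU-bound work, not blocking IO.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))

  // The same pattern later used inside mapPartitions:
  // fan out one future per item, sequence them, await the result.
  val partition: Iterator[String] = Iterator("a", "bb", "ccc")
  val inFlight: Iterator[Future[Int]] =
    partition.map(item => Future(callBlockingService(item)))
  val results: Iterator[Int] = Await.result(Future.sequence(inFlight), 30.seconds)
  println(results.toList) // List(1, 2, 3)
}
```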

Perhaps there is another solution more suitable for Spark, such as having multiple worker nodes on a single host.

It would be interesting to know how you cope with such a challenge. Thanks.

Answer

Here is the answer to my own question:

val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
  future {
    // call native code
  }
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future sequence f
  Await result (searchFuture, JOB_TIMEOUT)
}

The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest piece of work. Each piece of work contains data that can be processed by calling native code and sending that data on.

The 'values' collection contains the data returned from the native code, and that work is done across the cluster.
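One caveat with the pattern above: sequencing every future in a partition at once puts the whole partition in flight simultaneously, which can overwhelm the external service. A minimal sketch of one way to cap concurrency, processing the partition iterator one fixed-size batch at a time; `mapBounded` and `slowDouble` are hypothetical names introduced here for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedBatches extends App {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  // Stand-in for a blocking call to the external service.
  def slowDouble(x: Int): Int = { Thread.sleep(5); x * 2 }

  // Cap in-flight calls by sequencing one fixed-size batch at a time,
  // instead of materializing a future per element up front. Usable as
  // the body of mapPartitions, since it maps Iterator to Iterator.
  def mapBounded[A, B](it: Iterator[A], batchSize: Int)(f: A => B): Iterator[B] =
    it.grouped(batchSize).flatMap { batch =>
      Await.result(Future.sequence(batch.map(a => Future(f(a)))), 1.minute)
    }

  println(mapBounded(Iterator.range(0, 10), 4)(slowDouble).toList)
  // List(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)
}
```

Tuning `batchSize` together with the thread-pool size and the number of partitions is what ultimately bounds the cluster-wide request rate against the external service.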

Published: 2023-11-24 03:14:30