Spark Streaming + Kinesis: Initial record consumption



I'm feeding Spark Streaming from a Kinesis stream. My project uses 1-second batches. During the first batches (the queue contains a few million items, and the task is told to start from the beginning of the stream), Spark Streaming consumes batches of 10K records, but only once every 10-20 seconds.

i.e.:

t0  -> records: 0
t1  -> records: 0
...
t10 -> records: 10,000  (total process time is 0.8s, lower than the batch interval)
t11 -> records: 0
...
t15 -> records: 0
...
t20 -> records: 10,000

This behaviour continues until Spark catches up with the tip of the stream. After that, every batch processes elements every second.

It feels like, from the starting point, it should consistently process a number of records per batch, instead of having that high number of batches processing no records.

Is there a setting I'm overlooking? Is the described behaviour expected?
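To make the symptom concrete, here is a toy simulation (plain Python, not Spark code, and purely hypothetical) of a receiver that buffers incoming records and only hands a block to the streaming engine once a fixed block size has accumulated. It reproduces the burst pattern in the timeline above.

```python
# Hypothetical toy model of the burst pattern described in the question:
# records arrive steadily, but the receiver only releases them to the
# streaming engine in fixed-size blocks (e.g. 10,000 records at a time).
def simulate_batches(arrival_rate_per_s, block_size, n_batches):
    """Return the number of records delivered in each 1-second batch."""
    buffered = 0
    delivered = []
    for _ in range(n_batches):
        buffered += arrival_rate_per_s  # records arriving during this second
        batch = 0
        # The engine only sees records once a full block has accumulated.
        while buffered >= block_size:
            batch += block_size
            buffered -= block_size
        delivered.append(batch)
    return delivered

# With ~1,000 records/s arriving and 10,000-record blocks, most batches
# see 0 records and only every 10th batch sees 10,000 records.
print(simulate_batches(1000, 10000, 20))
```

This is only an illustration of the observed behaviour, not of Spark's internals; the block-release mechanism here stands in for whatever the Kinesis receiver is actually doing.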

Accepted answer


The cause of this issue is this bug in the spark-kinesis consumer, which does not set maxRate correctly: https://issues.apache.org/jira/browse/SPARK-18620
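For reference, the standard rate-limiting knobs in Spark Streaming are `spark.streaming.receiver.maxRate` and `spark.streaming.backpressure.enabled` (both are real Spark configuration properties). A sketch of how they would be passed, assuming a hypothetical job script name:

```shell
# Standard Spark Streaming rate-limit settings. Note: per the bug above,
# the Kinesis receiver may ignore maxRate on affected Spark versions, so
# upgrading to a release containing the fix is the actual remedy.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.receiver.maxRate=2000 \
  my_streaming_job.py   # hypothetical job script
```

Since the linked bug is precisely that the Kinesis receiver does not honor maxRate, these settings alone may not change the behaviour until the fix lands.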
