I'm feeding Spark Streaming with a Kinesis stream. My project uses 1s batches. During the first batches (the queue contains a few million items, and the task is told to start from the beginning of the stream), Spark Streaming starts consuming batches of 10K records, but only once every 10-20s.
i.e:
t0  -> records: 0
t1  -> records: 0
...
t10 -> records: 10,000 -> total process time is 0.8s (lower than the batch interval)
t11 -> records: 0
...
t15 -> records: 0
...
t20 -> records: 10,000
This behaviour continues until Spark catches up with the top of the stream. After that, every batch processes elements every second.
It feels like from the starting point it should consistently process a number of records per batch, instead of producing that many batches that process no records at all.
Is there a setting I'm overlooking? Is the described behaviour expected?
Best answer
The cause of this issue is this bug in the spark-kinesis consumer: https://issues.apache.org/jira/browse/SPARK-18620 . It does not set maxRate correctly, so the receiver ingests records in large, widely spaced bursts instead of being throttled to a steady per-second rate.
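For reference, these are the standard Spark Streaming rate-control settings one would normally tune for this symptom. This is a workaround sketch, not a guaranteed fix: per SPARK-18620, the Kinesis receiver in affected Spark versions may ignore `maxRate`, so upgrading to a release that contains the fix is the reliable option. The application jar name below is a placeholder.

```shell
# Throttle receiver ingestion (sketch; the Kinesis receiver in versions
# affected by SPARK-18620 may not honor maxRate):
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.receiver.maxRate=10000 \
  your-streaming-app.jar   # placeholder application jar
```

With backpressure enabled, Spark adjusts the ingestion rate based on observed batch processing times, which is usually preferable to a fixed `maxRate` once the receiver honors these settings.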