We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.
Going forward, we also want to write the processed data to Kafka. Is it possible to do both from the same streaming query? (Spark version 2.1.1)
In the logs I see the streaming query progress output, and I have a sample durations JSON from the log. Can someone please clarify the difference between addBatch and getBatch?
triggerExecution - is it the time taken to both process the fetched data and write to the sink?
"durationMs" : {
  "addBatch" : 2263426,
  "getBatch" : 12,
  "getOffset" : 273,
  "queryPlanning" : 13,
  "triggerExecution" : 2264288,
  "walCommit" : 552
},
Yes.
In Spark 2.1.1 you can use writeStream.foreach to write your data into Kafka. There is an example in this blog: databricks/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
Alternatively, you can use Spark 2.2.0, which adds a Kafka sink with official support for writing to Kafka.
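For the Spark 2.2.0 route, the built-in sink is selected with format("kafka"). A minimal PySpark sketch, assuming an existing streaming DataFrame df and a SparkSession; the broker address, topic name, and checkpoint path are placeholders, and this only runs against a live Spark 2.2+ session with a reachable Kafka cluster:

```python
# Sketch only: `df`, the broker "host:9092", the topic "processed", and the
# checkpoint path are all placeholders, not values from the question.
query = (df.selectExpr("CAST(key AS STRING) AS key",
                       "CAST(value AS STRING) AS value")
           .writeStream
           .format("kafka")                                # built-in sink since Spark 2.2.0
           .option("kafka.bootstrap.servers", "host:9092") # placeholder broker
           .option("topic", "processed")                   # placeholder output topic
           .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
           .start())
```

Note that in 2.1.1 a single streaming query still drives a single sink, so writing to both S3 and Kafka means either the foreach workaround above alongside the S3 query, or two queries reading from the same source.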
getBatch measures how long it takes to create a DataFrame from the source. This is usually very fast. addBatch measures how long it takes to run that DataFrame in the sink.
triggerExecution measures how long it takes to run one whole trigger execution, and is usually almost the same as getOffset + getBatch + addBatch.
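You can check that against the durationMs JSON from the question with plain Python (no Spark needed); the small remainder is trigger bookkeeping that is not itemized, and here addBatch, i.e. the sink write, dominates the trigger:

```python
import json

# Duration metrics copied verbatim from the question's progress log.
duration_ms = json.loads("""
{ "addBatch" : 2263426, "getBatch" : 12, "getOffset" : 273,
  "queryPlanning" : 13, "triggerExecution" : 2264288, "walCommit" : 552 }
""")

# triggerExecution covers the whole trigger: offset discovery, batch
# construction, query planning, WAL commit, and running the batch in the sink.
components = ["getOffset", "getBatch", "queryPlanning", "walCommit", "addBatch"]
accounted = sum(duration_ms[k] for k in components)
overhead = duration_ms["triggerExecution"] - accounted  # un-itemized bookkeeping

share = duration_ms["addBatch"] / duration_ms["triggerExecution"]
print(f"accounted: {accounted} ms, overhead: {overhead} ms")
print(f"addBatch share of triggerExecution: {share:.2%}")
```

So almost the entire 2264288 ms trigger is the 2263426 ms addBatch, which points the investigation at sink throughput rather than at reading from Kafka.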