Kafka Storm HDFS / S3 data flow


It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.

I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency of Storm for the raw data storage.

Is this possible? Are you aware of any documentation/examples/implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?

Thanks -- I appreciate it!

Best answer


Kafka actually retains events for a configurable period of time; events are not purged immediately upon consumption, as they are in other message queue systems. This allows you to have multiple consumers that each read from Kafka independently, either from the beginning (within the configurable retention window) or from a specific offset.
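As a rough sketch of that fan-out, the snippet below uses the current Kafka Java consumer client (newer than what existed when this question was asked): each process subscribes to the same topic under its own group.id, so its offsets are tracked independently of any other consumer, such as a Storm spout. The broker address, topic name ("events"), and group id ("hdfs-archiver") are placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ArchiveConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            // A distinct group.id gives this consumer its own committed offsets,
            // independent of the Storm spout's consumer group reading the same topic.
            props.put("group.id", "hdfs-archiver");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // With no committed position yet, start from the earliest retained offset.
            props.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Write the raw record out to HDFS/S3 here (batching/rotation omitted).
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }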

For the use case described, you would use Camus to batch-load events into Hadoop, and Storm to read events off the same Kafka topic. Just make sure both processes read new events before the configurable retention window expires.
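For reference, that retention window is controlled by broker and topic settings along these lines; the value shown is only a placeholder, not a recommendation:

    # server.properties -- broker-wide default: keep log segments for 7 days
    log.retention.hours=168
    # A per-topic override can be applied instead via the topic-level retention.ms setting.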

Regarding Camus, ggupta1612 answered this aspect best:

A scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.

If one run launches before the previous one finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.

As for Camus with S3, I don't believe that is in place at the moment.
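If you drive Camus from cron rather than a scheduler such as Azkaban, a simple way to avoid the overlapping-run case described above is to wrap the job in a file lock. A hypothetical crontab entry might look like the following; the jar name, properties path, and lock file are placeholders, and the main class follows Camus's documented invocation:

    # Run Camus every 15 minutes; flock -n exits immediately (skipping the run)
    # if a previous invocation still holds the lock.
    */15 * * * * flock -n /tmp/camus.lock hadoop jar camus-example-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /etc/camus/camus.properties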
