Processing Requests with Spark

Problem Description

I would like to understand if the following would be a correct use case for Spark.

Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.

Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.

We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.

I am envisaging that it would work like this:

  • A batch of requests would be loaded into Spark as an RDD (requests received on the message queue might use Spark Streaming).
  • Separate Scala functions would be written for the filtering, validation, reference data lookup, and calculations.
  • The first function would be passed the RDD and would return a new RDD.
  • The next function would then operate on the RDD output of the previous function.
  • Once all the functions have completed, a for comprehension would be run against the final RDD to send each modified request to the downstream system (a rough sketch of this chain follows the list).
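A minimal Scala sketch of the envisioned chain, under the assumption of a simple Request case class and stub processing stages (the type, file path, and function bodies below are placeholders for illustration, not details from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RequestPipeline {
  // Placeholder request type; the real one would carry the parsed fields.
  case class Request(id: String, payload: String)

  // Stub stages standing in for the real filtering, validation,
  // reference-data lookup, and calculation logic.
  def isValid(r: Request): Boolean   = r.payload.nonEmpty
  def enrich(r: Request): Request    = r
  def calculate(r: Request): Request = r

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RequestPipeline"))

    // Load a batch of requests as an RDD (here from a hypothetical file).
    val requests: RDD[Request] =
      sc.textFile("requests.txt").map(line => Request(line.take(8), line))

    // Each transformation returns a new RDD, chained as the list describes.
    val processed = requests.filter(isValid).map(enrich).map(calculate)

    // Send each processed request downstream; foreachPartition lets one
    // downstream connection be opened per partition instead of per record.
    processed.foreachPartition { part =>
      part.foreach(r => println(s"send downstream: ${r.id}")) // stand-in for a real client
    }
    sc.stop()
  }
}
```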

Does the above sound correct, or would this not be the right way to use Spark?

Thanks

Recommended Answer

We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes, and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time-series database. We defined the batch time as 1 second; processing time was around 300ms and RAM usage was on the order of 100s of KB.

A few concerns with streaming: make sure your downstream system is asynchronous so you won't run into memory issues. It is true that Spark supports backpressure, but you will need to enable it yourself. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly as your input grows; this is extremely important for your system's scalability.
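As a rough illustration of those two points, here is a minimal DStream sketch with backpressure switched on (spark.streaming.backpressure.enabled is a real Spark configuration key; the socket source and println are placeholders standing in for the MQTT receiver and an asynchronous downstream client):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RequestStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("RequestStream")
      // Lets Spark throttle the ingest rate when batches start taking
      // longer than the batch interval, as the answer recommends.
      .set("spark.streaming.backpressure.enabled", "true")

    // 1-second batches, the batch time quoted for the IoT project.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: a socket stream stands in for the MQTT receiver.
    val messages = ssc.socketTextStream("localhost", 9999)

    // Keep the pipeline stateless: process each batch and hand it straight
    // to the downstream client rather than accumulating state in the job.
    messages.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach(msg => println(s"send downstream: $msg")))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```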

What impressed me the most is how easily you can scale with Spark: with each node we added, the rate of messages we could handle grew linearly.

I hope this helps a little. Good luck.
