Concepts:Timely Stream Processing

编程入门 行业动态 更新时间:2024-10-09 15:16:45

Concepts:<a href=https://www.elefans.com/category/jswz/34/1749221.html style=Timely Stream Processing"/>

Concepts:Timely Stream Processing

Timely Stream Processing 实时流处理

Introduction 简介

Timely stream processing is an extension of stateful stream processing in which time plays some role in the computation. Among other things, this is the case when you do time series analysis, when doing aggregations based on certain time periods (typically called windows), or when you do event processing where the time when an event occurred is important.
实时流处理是有状态流处理的扩展,其中时间在计算中会发挥一定作用。例如当您进行时间序列分析、基于特定时间段(通常称为窗口)进行聚合时,或者在事件发生时间很重要的情况下进行事件处理时。

In the following sections we will highlight some of the topics that you should consider when working with timely Flink Applications.
在以下几节中,我们将重点介绍使用实时Flink应用程序时应考虑的一些事项。

Notions of Time: Event Time and Processing Time 时间概念:事件时间和处理时间

When referring to time in a streaming program (for example to define windows), one can refer to different notions of time:
当在流式应用程序中提及时间时(例如,定义窗口),可以表示不同的时间概念:

  • Processing time: Processing time refers to the system time of the machine that is executing the respective operation.
    When a streaming program runs on processing time, all time-based operations (like time windows) will use the system clock of the machines that run the respective operator. An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
    Processing time is the simplest notion of time and requires no coordination between streams and machines. It provides the best performance and the lowest latency. However, in distributed and asynchronous environments processing time does not provide determinism, because it is susceptible to the speed at which records arrive in the system (for example from the message queue), to the speed at which the records flow between operators inside the system, and to outages (scheduled, or otherwise).
    处理时间:处理时间是指执行相应操作的机器的系统时间。
    当流式程序以处理时间运行时,所有基于时间的操作(如时间窗口)将使用运行相应operator的机器的系统时钟。每小时处理时间窗口将包括在系统时钟指示的整小时之间到达特定operator的所有记录。例如,如果应用程序在上午9:15开始运行,第一个小时处理时间窗口将包括在上午9点15分至上午10点之间处理的事件,下一个窗口将包括上午10点至上午11点之间处理的事件,依此类推。
    处理时间是最简单的时间概念,不需要流和机器之间的协调。它提供了最佳性能和最低延迟。然而,在分布式和异步环境中,处理时间不具有确定性,因为它容易受到记录到达系统的速度(例如从消息队列)、记录在系统内算子之间流动的速度以及中断(调度或其他)的影响。

  • Event time: Event time is the time that each individual event occurred on its producing device. This time is typically embedded within the records before they enter Flink, and that event timestamp can be extracted from each record. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time. This watermarking mechanism is described in a later section, below.
    In a perfect world, event time processing would yield completely consistent and deterministic results, regardless of when events arrive, or their ordering. However, unless the events are known to arrive in-order (by timestamp), event time processing incurs some latency while waiting for out-of-order events. As it is only possible to wait for a finite period of time, this places a limit on how deterministic event time applications can be.
    Assuming all of the data has arrived, event time operations will behave as expected, and produce correct and consistent results even when working with out-of-order or late events, or when reprocessing historic data. For example, an hourly event time window will contain all records that carry an event timestamp that falls into that hour, regardless of the order in which they arrive, or when they are processed. (See the section on lateness for more information.)
    Note that sometimes when event time programs are processing live data in real-time, they will use some processing time operations in order to guarantee that they are progressing in a timely fashion.
    事件时间:事件时间是每个事件在其生成设备上发生的时间。该时间通常在记录进入Flink之前被嵌入到记录中,并且可以从每个记录中提取该事件时间戳。在事件时间中,时间的进度取决于数据,而不是任何时钟。事件时间程序必须指定如何生成事件时间水印,这是事件时间进度的信号机制。该水印机制将在下面的一节中描述。
    理想情况下,无论事件何时到达,或其顺序如何,事件时间处理将产生完全一致和确定性的结果。但是,除非已知事件按顺序(按时间戳)到达,否则事件时间处理会在等待无序事件时产生一些延迟。由于只能等待有限的一段时间,这就限制了其可用性。
    假设所有数据都已到达,事件时间操作将按预期进行,即使处理无序或延迟的事件,或重新处理历史数据,也会产生正确且一致的结果。例如,每小时事件时间窗口将包含事件时间戳属于该小时的所有记录,而不管它们到达的顺序或处理时间。(有关更多信息,请参阅“迟到”一节)
    请注意,有时,当事件时间程序实时处理实时数据时,它们会使用一些处理时间操作,以确保它们能够及时进行。

Event Time and Watermarks 事件时间和水印

Note: Flink implements many techniques from the Dataflow Model. For a good introduction to event time and watermarks, have a look at the articles below.
注意:Flink实现了Dataflow模型中的许多技术。有关事件时间和水印的详细介绍,请参阅下面的文章。

  • Streaming 101 by Tyler Akidau
  • The Dataflow Model paper

A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.
支持事件时间的流处理需要一种测量事件时间进度的方法。例如,当事件时间超过每小时时,需要通知构建每小时窗口的窗口operator,以便operator可以关闭正在进行的窗口。

Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).
事件时间可以独立于处理时间(通过系统时钟测量)进行。例如,在一个程序中,当以相同的速度进行时,operator的当前事件时间可能略微落后于处理时间(考虑到接收事件的延迟)。另一方面,其他流式程序可能只需几秒钟的处理就可以搞定事件时间跨度数周的数据(通过快速转发kafka topic(或其他消息队列)中已缓存的一些历史数据)。

The mechanism in Flink to measure progress in event time is watermarks. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements from the stream with a timestamp t’ <= t (i.e. events with timestamps older or equal to the watermark).
Flink中测量事件时间进度的机制是水印。水印作为数据流的一部分流动,并带有时间戳t。水印(t)声明该流中的事件时间已达到时间t,这意味着流中不应再有时间戳t’<=t的元素(即时间戳早于或等于水印的事件)。

The figure below shows a stream of events with (logical) timestamps, and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), meaning that the watermarks are simply periodic markers in the stream.
下图显示了一个事件流,构成其的事件具有时间戳并且水印在其中流动。在本例中,事件是有序的(对应时间戳),这意味着水印只是流中的周期性标记。

Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
水印对于无序流是至关重要的,如下图所示,其中事件不按时间戳排序。一般来说,水印是一种声明:该点之前,某个时间戳之前的所有事件都应该已经到达。一旦水印到达operator,operator可以将其内部事件时钟提前到水印的值。

Note that event time is inherited by a freshly created stream element (or elements) from either the event that produced them or from watermark that triggered creation of those elements.
请注意,新创建的流元素从生成它们的事件或触发这些元素被创建的水印继承事件时间。

Watermarks in Parallel Streams 并行流中的水印

Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.
水印是在source函数处生成或在source函数后立刻生成的。source函数的每个并行子任务通常独立生成其水印,这些水印定义了事件时间。

As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.
当水印在流式程序中流动时,它们会提前所到达operators的事件时间。每当operators提前其事件时间时,它都会为后续operators生成一个新的下游水印。

Some operators consume multiple input streams; a union, for example, or operators following a keyBy(…) or partition(…) function. Such an operator’s current event time is the minimum of its input streams’ event times. As its input streams update their event times, so does the operator.
一些operators使用多个输入流:例如union, 或在keyBy(…)或partition(…)函数之后的operators。此类operators的当前事件时间是其输入流事件时间的最小值。当其输入流更新其事件时间时,该operator也会更新。

The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
下图显示了流经并行流的事件和水印的示例,以及跟踪事件时间的operators。

Lateness 延迟

It is possible that certain elements will violate the watermark condition, meaning that even after the Watermark(t) has occurred, more elements with timestamp t’ <= t will occur. In fact, in many real world setups, certain elements can be arbitrarily delayed, making it impossible to specify a time by which all elements of a certain event timestamp will have occurred. Furthermore, even if the lateness can be bounded, delaying the watermarks by too much is often not desirable, because it causes too much delay in the evaluation of event time windows.
某些元素可能会违反水印条件,这意味着即使在水印(t)出现之后,也会出现时间戳为t’<=t的元素。事实上,在许多实际情况中,某些元素可以任意延迟,从而无法指定具有特定事件时间戳的所有元素已经发生的时间。此外,即使延迟是有界的,延迟过多的水印通常也是不可取的,因为它会导致事件时间窗口的评估延迟过多。

For this reason, streaming programs may explicitly expect some late elements. Late elements are elements that arrive after the system’s event time clock (as signaled by the watermarks) has already passed the time of the late element’s timestamp. See Allowed Lateness for more information on how to work with late elements in event time windows.
出于这个原因,流式程序可能会显式地期待一些延迟元素。延迟元素是指在系统的事件时钟(如水印所示)已经超过延迟元素的时间戳之后到达的元素。有关如何在事件时间窗口中使用延迟元素的更多信息,请参见Allowed Lateness。

Windowing 开窗

Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because streams are in general infinite (unbounded). Instead, aggregates on streams (counts, sums, etc), are scoped by windows, such as “count over the last 5 minutes”, or “sum of the last 100 elements”.
聚合事件(例如计数、求和)在流上的工作方式与在批处理中的工作方式不同。例如,不可能计算流中的所有元素,因为流通常是无限的(无界的)。相反,流上的聚合(计数、总和等)由窗口确定范围,例如“过去5分钟内的计数”或“最后100个元素的总和”。

Windows can be time driven (example: every 30 seconds) or data driven (example: every 100 elements). One typically distinguishes different types of windows, such as tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by a gap of inactivity).
窗口可以是时间驱动的(例如:每30秒一次)或数据驱动的(示例:每100个元素一次)。人们通常会区分不同类型的窗口,例如滚动窗口(无重叠)、滑动窗口(有重叠)和会话窗口(以不活跃间隙为分隔)。

Please check out this blog post for additional examples of windows or take a look a window documentation of the DataStream API.
有关窗口的其他示例,请查看此博客文章,或查看DataStream API的窗口文档。

更多推荐

Concepts:Timely Stream Processing

本文发布于:2024-02-06 11:57:49,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1748809.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:Timely   Concepts   Processing   Stream

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!