Flink checkpoint failure

Updated: 2024-10-28 08:18:17

Problem description

We get one or two checkpoint failures every day while processing data. The data volume is low (under 10k), and our checkpoint interval is set to 2 minutes. Processing is very slow because, at the end of the Flink job, we sink the data to another API endpoint that takes some time to respond, so the total time is streaming the data plus sinking it to the external API endpoint.

The root issue: checkpoints time out after 10 minutes. Processing the data takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data grows we would have to increase it again, so we would rather not go that way.
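To see why a slow sink stretches checkpoints, here is a deliberately simplified model (not Flink code): with aligned checkpoints, a barrier reaches the sink only after every record queued ahead of it has been sunk, so the checkpoint duration is roughly the backlog times the per-record sink latency. The class name `BarrierBacklogModel`, the 70 ms per-record latency, and the 10,000-record backlog are hypothetical; only the 10-minute timeout comes from the question.

```java
import java.time.Duration;

// Simplified, hypothetical model (not Flink code): the checkpoint barrier waits in the
// input queue behind every record that arrived before it, so the sink can acknowledge
// the checkpoint only after that whole backlog has been written to the external API.
class BarrierBacklogModel {

    // Estimated time from barrier injection until the sink can acknowledge it.
    static Duration estimatedCheckpointDuration(long queuedRecords, Duration perRecordSinkLatency) {
        return perRecordSinkLatency.multipliedBy(queuedRecords);
    }

    // True if the estimate exceeds the configured checkpoint timeout.
    static boolean timesOut(long queuedRecords, Duration perRecordSinkLatency, Duration timeout) {
        return estimatedCheckpointDuration(queuedRecords, perRecordSinkLatency).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(10);   // timeout reported in the question
        Duration latency = Duration.ofMillis(70);    // hypothetical per-record API call time
        long backlog = 10_000;                       // hypothetical records queued ahead of the barrier
        System.out.println(estimatedCheckpointDuration(backlog, latency)); // PT11M40S
        System.out.println(timesOut(backlog, latency, timeout));           // true
    }
}
```

In this model, splitting the backlog across more parallel subtasks shortens the estimate proportionally, which is why the parallelism workaround mentioned in the question helps but would have to keep growing with the data.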

Suggested solution: I saw someone suggest setting a pause between the old and the new checkpoint. My question is: if I set a pause, will the new checkpoint miss the state from the pause period?

Aim: How can we avoid this issue and record the correct state without missing any data?

Failed checkpoints: [screenshot]

Completed checkpoints: [screenshot]

Subtasks not responding: [screenshot]

Thanks

Recommended answer

There are several related configuration settings you can use, such as the checkpoint interval, the minimum pause between checkpoints, and the maximum number of concurrent checkpoints. No combination of these settings will cause data to be skipped by checkpointing.
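Why no setting can make a checkpoint skip data can be illustrated with a toy model (an illustration of the idea, not Flink's implementation): each completed snapshot stores the operator state together with the source position it corresponds to, and recovery restores that state and replays the source from that position. Where the checkpoints fall only changes how much gets replayed, not the final result. The names `ReplayModel` and `Snapshot` and the running-sum state are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not Flink code): the state is a running sum, the source is a
// replayable array of records, and a snapshot pairs the state with the source offset
// at which the checkpoint barrier was injected.
class ReplayModel {

    // A completed checkpoint: the barrier's source offset plus the state covering
    // every record before that offset.
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    // Processes records[0..failAt), completing a snapshot at every offset in checkpointAt,
    // then simulates a crash, restores the latest snapshot, and replays the rest of the
    // source from that snapshot's offset. Returns the recovered final sum.
    static long runWithFailure(int[] records, List<Integer> checkpointAt, int failAt) {
        List<Snapshot> completed = new ArrayList<>();
        completed.add(new Snapshot(0, 0));               // initial empty state
        long sum = 0;
        for (int i = 0; i < failAt; i++) {
            sum += records[i];
            if (checkpointAt.contains(i + 1)) {
                completed.add(new Snapshot(i + 1, sum)); // state covers records[0..i]
            }
        }
        // Crash: the in-memory sum is lost; restore the latest completed snapshot.
        Snapshot last = completed.get(completed.size() - 1);
        long restored = last.sum;
        for (int i = last.offset; i < records.length; i++) {
            restored += records[i];                      // replay from the snapshot's offset
        }
        return restored;
    }

    public static void main(String[] args) {
        int[] records = {5, 3, 8, 1, 9, 2, 7};
        // Frequent vs. sparse checkpoints and different failure points: same result.
        System.out.println(runWithFailure(records, List.of(2, 4, 6), 5)); // 35
        System.out.println(runWithFailure(records, List.of(1), 6));       // 35
    }
}
```

The model only tracks operator state: with replay, an external API sink can receive some records twice unless the sink itself is idempotent or transactional.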

Setting a minimum pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the previous checkpoint completed (or failed), but this has no effect on the timeout.
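The behaviour described above, where the next checkpoint waits for a pause after the previous one finishes while the timeout is unaffected, can be sketched with a simplified model. This is not Flink's scheduler; it assumes one checkpoint at a time, and the method names and the 5-minute pause are hypothetical. All times are in milliseconds.

```java
// Simplified model (hypothetical, not Flink's scheduler), assuming one checkpoint at a
// time; all times are in milliseconds.
class CheckpointScheduleModel {

    // Earliest start of the next checkpoint: one interval after the previous trigger,
    // and at least minPause after the previous checkpoint completed or failed.
    static long nextTrigger(long prevTrigger, long prevEnd, long interval, long minPause) {
        return Math.max(prevTrigger + interval, prevEnd + minPause);
    }

    // A checkpoint fails once it runs longer than the timeout, whatever the pause is.
    static boolean timedOut(long duration, long timeout) {
        return duration > timeout;
    }

    public static void main(String[] args) {
        long interval = 2 * 60_000;   // 2-minute interval from the question
        long minPause = 5 * 60_000;   // hypothetical pause
        long timeout = 10 * 60_000;   // 10-minute timeout from the question
        long needed = 12 * 60_000;    // hypothetical time the checkpoint would need
        // Triggered at t=0; it would need 12 minutes, so it is aborted at the timeout.
        long prevEnd = Math.min(needed, timeout);
        System.out.println(nextTrigger(0, prevEnd, interval, minPause)); // 900000 (t = 15 min)
        System.out.println(timedOut(needed, timeout));                   // true
    }
}
```

In this model a larger pause only delays the next trigger; a checkpoint that needs 12 minutes still fails against a 10-minute timeout.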

It sounds like you should extend the timeout, which you can do like this:

env.getCheckpointConfig().setCheckpointTimeout(n);

where n is in milliseconds. See the section of the Flink documentation on enabling and configuring checkpointing for more details.
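Because n is in milliseconds, deriving it from a `java.time.Duration` avoids unit slips. The sketch below is a hypothetical helper, not part of Flink, and the idea of scaling the longest expected checkpoint time by a safety factor is an assumption for illustration, not a Flink recommendation; the 12-minute figure and factor of 2 are likewise made up.

```java
import java.time.Duration;

// Hypothetical helper (not part of Flink): converts a human-readable duration into the
// millisecond value expected by setCheckpointTimeout(n), scaled by a user-chosen
// safety factor (an assumption for illustration).
class CheckpointTimeoutValue {

    static long toTimeoutMillis(Duration longestExpectedCheckpoint, long safetyFactor) {
        return longestExpectedCheckpoint.multipliedBy(safetyFactor).toMillis();
    }

    public static void main(String[] args) {
        // Hypothetical: checkpoints expected to take up to 12 minutes, with 2x headroom.
        long n = toTimeoutMillis(Duration.ofMinutes(12), 2);
        System.out.println(n); // 1440000
        // In the Flink job, n would then be passed as:
        // env.getCheckpointConfig().setCheckpointTimeout(n);
    }
}
```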

Published: 2023-07-27 23:20:36
Source: https://www.elefans.com/category/jswz/34/1225258.html