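We have a Flink streaming job that was upgraded to a new version last week. When I checked on it this morning, I found that it had restarted at around 4 a.m. The exception log shown in the Flink UI is below.
Exception log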
2020-08-10 04:07:23
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:826)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
at java.lang.Thread.run(Thread.java:748)
Key information
2020-08-10 04:07:23
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. This might indicate that the remote task manager was lost.
My first guess was that something had gone wrong on the machine 9.150.12.175.
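When the same exception shows up across many restarts, it can help to pull the remote task manager address out of the log text automatically instead of reading it by eye. The following is only a minimal sketch: the function name lost_task_managers and the regular expression are my own illustration, written against the message format shown in the log above, and are not part of Flink.

```python
import re

# Matches the address in "... closed by remote task manager '<host>/<ip>:<port>'".
# The part before the slash is empty in the log above, so it is optional here.
REMOTE_TM = re.compile(
    r"remote task manager '[^/']*/(?P<host>[^:']+):(?P<port>\d+)'"
)


def lost_task_managers(log_text):
    """Return (host, port) for every remote task manager named in log_text."""
    return [
        (m.group("host"), int(m.group("port")))
        for m in REMOTE_TM.finditer(log_text)
    ]


sample = (
    "org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: "
    "Connection unexpectedly closed by remote task manager '/9.150.12.175:39365'. "
    "This might indicate that the remote task manager was lost."
)
print(lost_task_managers(sample))  # [('9.150.12.175', 39365)]
```

Counting how often each host appears over a few days of logs would show whether one node is repeatedly involved.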
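The YARN resource management UI further pointed to a problem with that machine.
The usual causes are insufficient memory or insufficient disk space, though other problems are possible.
I logged in to the affected machine and ran jps to list the Java processes. Only the YARN NodeManager was still running, and its start time was long ago, so it had not been restarted. All the other tasks on the machine had been killed.
The YARN NodeManager log reported that disk usage was above 90%.
I then checked the current disk usage.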
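The same check can be scripted so it is easy to repeat on every node. This is a rough sketch using only the Python standard library. The helper names and the /data path are hypothetical, and the 90% default simply mirrors the figure in the NodeManager log. I believe the relevant YARN setting is yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, but verify that against your own yarn-site.xml. The percentage computed here may differ by a point or so from what df prints, because of reserved blocks.

```python
import shutil


def percent_used(total_bytes, free_bytes):
    """Share of the disk that is not available to ordinary users, in percent."""
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def disk_used_percent(path):
    """Usage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return percent_used(usage.total, usage.free)


def over_threshold(path, threshold_percent=90.0):
    """True when usage is above the threshold (90% is taken from the log above)."""
    return disk_used_percent(path) > threshold_percent


if __name__ == "__main__":
    # "/data" is a placeholder; substitute the NodeManager local and log dirs.
    for directory in ["/", "/data"]:
        try:
            print(directory, round(disk_used_percent(directory), 1),
                  over_threshold(directory))
        except OSError as err:
            print(directory, "could not be checked:", err)
```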
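The result matched the YARN log: disk usage was above the threshold configured in YARN. Looking through the log directories, I found large log files left over from earlier runs. After I cleaned up the expired logs and restarted, tasks were assigned to the machine again and everything returned to normal. I also asked the operations team to add disk monitoring to every node in the cluster, with an alert when usage reaches 85%.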
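The cleanup and the alert rule can be sketched in a few lines. Everything below is illustrative rather than what was actually run: the function names, the 1 GiB size cutoff, and the 7-day age cutoff are assumptions. Only the 85% and 90% figures come from the text above. The script only lists candidates and deletes nothing, so the list can be reviewed first.

```python
import os
import time


def stale_large_logs(root, min_bytes=1024 ** 3, max_age_days=7, now=None):
    """List (path, size_bytes) for files under root that are both large and old.

    Results are sorted largest first. No file is removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_size >= min_bytes and st.st_mtime < cutoff:
                found.append((path, st.st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)


def alert_level(used_percent, warn_at=85.0, unhealthy_at=90.0):
    """Map a usage percentage to a level: alert at 85%, unhealthy above 90%."""
    if used_percent > unhealthy_at:
        return "unhealthy"
    if used_percent >= warn_at:
        return "warn"
    return "ok"
```

Alerting at 85% leaves a margin below the 90% mark at which the node ran into trouble, so the logs can be cleaned before any task is killed.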
Copyright notice: this article, "Analysis of the cause of a Flink job restart", was contributed by a site user and reflects only the author's own views. To repost it, contact the author and cite the source: https://www.elefans.com/dianzi/1727247174a1104769.html