Spring Boot reports an error when querying Doris

ERROR [http-nio-10020-exec-12] [http-nio-10020-exec-12raceId] [] [5] @@GlobalExceptionAdvice@@ | server error 
org.springframework.dao.RecoverableDataAccessException: 
### Error querying database.  Cause: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 426 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
; Communications link failure

The last packet successfully received from the server was 426 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.; nested exception is com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 426 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.

The scheduled Doris insert into select job also fails

ERROR 2013 (HY000) at line 7: Lost connection to MySQL server during query

Analysis

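The first suspect was slow queries.
Slow queries were putting heavy pressure on the cluster.
Several slow queries ran for 120s to 400s, which is more than the Doris cluster can bear. The global query_timeout parameter is 60 (seconds), so queries running this long suggest that someone's job sets the session variable to 600s or higher.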

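I asked the developers to take the slow-query jobs offline and tune the SQL.
Once the jobs with 100+ second queries were offline, things returned to normal.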

A while later, however, the Spring Boot service raised alerts again and the same error was back.

Doris parameters

interactive_timeout=3880000

wait_timeout=3880000

Doris FE node warning log

2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) [ConnectContext.checkTimeout():365] kill wait timeout connection, remote: 1.1.1.1:57399, wait timeout: 3880000
2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) [ConnectContext.kill():339] kill timeout query, 1.1.1.1:57399, kill connection: true
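
To see how often FE kills connections for exceeding the wait timeout, the warning lines can be pulled out of the FE log with a small script. This is only a sketch: the regular expression is derived from the two log lines quoted above, the function name `find_wait_timeout_kills` is made up for illustration, and other Doris versions may word the message differently.

```python
import re

# Pattern derived from the "kill wait timeout connection" line quoted above.
# Assumption: other Doris versions may phrase this message differently.
KILL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARN .*"
    r"kill wait timeout connection, remote: (?P<remote>[^,\s]+), "
    r"wait timeout: (?P<timeout>\d+)"
)


def find_wait_timeout_kills(lines):
    """Return (timestamp, remote address, wait timeout in seconds) per match."""
    hits = []
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            hits.append((m.group("ts"), m.group("remote"), int(m.group("timeout"))))
    return hits


# The two warning lines from the FE log above, used as sample input.
sample = [
    "2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) "
    "[ConnectContext.checkTimeout():365] kill wait timeout connection, "
    "remote: 1.1.1.1:57399, wait timeout: 3880000",
    "2021-06-03 16:00:08,398 WARN (Connect-Scheduler-Check-Timer-0|79) "
    "[ConnectContext.kill():339] kill timeout query, 1.1.1.1:57399, "
    "kill connection: true",
]

kills = find_wait_timeout_kills(sample)
for ts, remote, timeout in kills:
    print(ts, remote, timeout)
```

Only the first line matches, because the second line reports the kill itself rather than the timeout check. For a real log, pass an open file object instead of `sample`.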

Doris monitoring

The monitoring graph shows that the connection count dropped sharply at 15:44.

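ELK logs
The ELK logs likewise show that the Spring Boot service's errors when querying Doris started at 15:44.
So what operation or variable change at 15:44 affected the cluster?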

Going by the warning log,
wait_timeout is 3880000s, roughly 44.9 days, whereas the default in the source code is 28800s.

interactive_timeout=3880000

wait_timeout=3880000
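
As a quick sanity check on those two figures, the conversion from seconds can be worked out with the standard library. The helper name `describe_seconds` is made up for this sketch.

```python
from datetime import timedelta


def describe_seconds(seconds):
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    td = timedelta(seconds=seconds)
    hours, rem = divmod(td.seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return td.days, hours, minutes, secs


# Value found on the cluster vs. the default mentioned above.
print(describe_seconds(3880000))  # (44, 21, 46, 40)
print(describe_seconds(28800))    # (0, 8, 0, 0)
```

So the modified value is just under 45 days, while the default of 28800s is 8 hours.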

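Nobody had deployed anything, no cutover had taken place, and I hold the cluster admin account and had not changed any parameter, so I still could not explain why the values had changed. I went through the operation records in the fe.audit audit log, and sure enough:
someone (an insider) using DataGrip version 2020.2.3 had run set GLOBAL at 15:44 and changed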

interactive_timeout=3880000

wait_timeout=3880000
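
The audit-log search described above can be approximated with a short script that looks for `set global` statements. This is a sketch under assumptions: it presumes each audit record sits on one line and contains the statement text, the function name `find_global_changes` is hypothetical, and the sample records below are invented for illustration (real fe.audit lines carry more fields).

```python
import re

# Matches "set global <name> = <value>" anywhere in a line, case-insensitively.
# Only the first assignment of a multi-assignment statement is captured.
SET_GLOBAL_RE = re.compile(
    r"\bset\s+global\s+(?P<name>\w+)\s*=\s*(?P<value>[^;\s|]+)",
    re.IGNORECASE,
)


def find_global_changes(lines):
    """Return (line number, variable name, new value) for each SET GLOBAL found."""
    changes = []
    for lineno, line in enumerate(lines, start=1):
        for m in SET_GLOBAL_RE.finditer(line):
            changes.append((lineno, m.group("name").lower(), m.group("value")))
    return changes


# Invented records for illustration only; not the real fe.audit format.
sample = [
    "15:43:59 select count(*) from t",
    "15:44:02 SET GLOBAL wait_timeout = 3880000",
    "15:44:02 set global interactive_timeout=3880000;",
]

for lineno, name, value in find_global_changes(sample):
    print(lineno, name, value)
```

Matching on the statement text rather than on fixed field positions keeps the sketch independent of the exact audit-log layout.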

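After both parameters were set back to 28800s, the cluster's connection count recovered immediately.
One point to note: according to discussion with the community, only wait_timeout has any effect in Doris; interactive_timeout exists purely for MySQL compatibility and does nothing.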

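Open question: why does a very large wait_timeout in Doris cause connections to fail with Communications link failure,
while lowering it restores normal behavior? Answering that will require tracing the logic in the code…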
