前言
线上环境,客户提出来一个紧急问题,是报表数据生产失败的问题,经过排查得知是feign调用超时问题,现在讲本次线上问题排查思路梳理如下:
一.问题还原
A服务 调用B服务 接口,传递参数有 租户id
,开始时间
,结束时间
,B服务会查询租户的数据返回,但是从A服务看,等到60秒后会超时,如下所示:
2024-04-02 10:36:04.456 ERROR 28305 --- [http-nio-5678-exec-4] c.d.x.c.b.e.GlobalExceptionHandler : [ControllerErrorHandler系统错误], errorStack: com.netflix.hystrix.exception.HystrixRuntimeException: CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available. at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:822) at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:807) at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140) at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87) at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87) at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1472) at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1397) at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87) at rx.observers.Subscribers$5.onError(Subscribers.java:230) at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44) at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51) at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41) at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48) at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30) at rx.Observable.unsafeSubscribe(Observable.java:10327) at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:142) at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87) at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87) at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$1.run(AbstractCommand.java:1142) at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:41) at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:37) at com.alibaba.ttl.TtlCallable.call(TtlCallable.java:65) at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable.run(HystrixContextRunnable.java:57) at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.tick(AbstractCommand.java:1159) at com.netflix.hystrix.util.HystrixTimer$1.run(HystrixTimer.java:99) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.concurrent.TimeoutException at com.netflix.hystrix.AbstractCommand.handleTimeoutViaFallback(AbstractCommand.java:997) at com.netflix.hystrix.AbstractCommand.access$500(AbstractCommand.java:60) at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:609) at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:601) at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140) ... 16 more
经过排查,系统中feign调用的超时配置的确实是60秒,使用trace方法进一步验证A服务的接口超时问题:
[arthas@28305]$ trace com.xx.CommonFeignClient getHostListReport Press Q or Ctrl+C to abort. Affect(class count: 2 , method count: 1) cost in 3086 ms, listenerId: 1 `---ts=2024-04-02 10:41:07;thread_name=http-nio-5678-exec-5;id=d6;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60008.446404ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.] `---ts=2024-04-02 10:41:22;thread_name=http-nio-5678-exec-7;id=d8;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60005.811154ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.] `---ts=2024-04-02 10:43:24;thread_name=http-nio-5678-exec-8;id=d9;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60006.744468ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.] `---ts=2024-04-02 10:44:40;thread_name=http-nio-5678-exec-10;id=db;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60008.893705ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.] `---ts=2024-04-02 10:53:06;thread_name=http-nio-5678-exec-3;id=d4;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60006.06974ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.] `---ts=2024-04-02 10:57:26;thread_name=http-nio-5678-exec-6;id=d7;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c `---[60005.968829ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception] `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
能看出每次接口超过60秒就会报超时,接下来,就需要排查B服务是否收到请求
2.B服务排查
经过查看日志,B服务这个接口是没有添加任何日志的,导致如果只是简单的查询日志,是看不到任何请求的,只能通过arthas排查,使用trace命令查看接口的请求耗时
`---ts=2024-04-02 10:48:27;thread_name=http-nio-5668-exec-59;id=3385c;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@7c2b6087 `---[191015.262468ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl$$EnhancerBySpringCGLIB$$4ebc7a2b:getHostListReport() `---[191015.224267ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() `---[191015.188767ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostListReport() +---[0.00905ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:<init>() #2273 +---[0.00713ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2276 +---[0.00642ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2277 +---[0.228765ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2278 +---[10.255409ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2280 +---[0.00865ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setNewInstallCount() #2283 +---[0.0069ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setStarteDate() #2286 +---[0.006421ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setEndDate() #2287 +---[0.0064ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2288 +---[0.00662ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2289 +---[0.006791ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2289 +---[0.037051ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2290 +---[9.937732ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2291 +---[0.00788ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setTotalCount() #2292 +---[0.016691ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2295 +---[0.00705ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2295 +---[0.0067ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2296 +---[0.00676ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2296 +---[0.048871ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2297 +---[8.386791ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2298 +---[0.00827ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOnLineCount() #2299 +---[0.00656ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2302 +---[0.00607ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2302 +---[0.00549ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2303 +---[0.00588ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2303 +---[0.056201ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2304 +---[7.170546ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2305 +---[0.02316ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOutLineCount() #2306 +---[0.00639ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2309 +---[0.00558ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2309 +---[0.005ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2310 +---[0.034811ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2311 +---[5.88444ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2312 +---[0.00755ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setUnloadCount() #2313 +---[0.00567ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2315 +---[0.03573ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2316 +---[115.476815ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2318 `---[0.01645ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setHostList() #2341
从这个里面看到这个接口是有请求进来的,显示这个方法耗时191015.262468ms
,也就是说耗时190秒,但是通过这个方法下面的参数,最长的一个也就耗时 115.476815ms(0.115秒)
,从这里看到这上面的时间相加,也就一两秒,为啥这里会耗时190秒呢? 继续分析
从这个图上看,执行到最后一步之前,一共耗时也就1秒左右,但是执行到最后一步后,却耗时190秒,所以很大概率是最后一步在setList的时候,有问题,所以接着看代码:
这个人前面会查出来几千条数据,然后通过for循环去查询这个租户的数据,本地数据库里面3041条,就是要调用3千多次......而且这些数据是一样的,What the fuck ~~~~
这还是一个做了十几年java的人写的代码3.再次验证B服务返回的结果
watch com.xxx.ConsoleHostController getHostListReport "{params,returnObj,throwExp}" -x 3
method=com.xxx.ConsoleHostController.getHostListReport location=AtExit ts=2024-04-02 11:00:44; [cost=198116.839392ms] result=@ArrayList[ @Object[][ @HostReportQueryDTO[ starteDate=null, endDate=null, tenantId=@Long[160890167693314], vmType=@String[av], agentStatus=null, hostStatus=null, ], ], @MsspResponse[ data=@HostReportResDTO[ totalCount=@Integer[2864], newInstallCount=@Integer[4], activeCount=null, onLineCount=@Integer[836], outLineCount=@Integer[2028], unloadCount=@Integer[177], HostList=@ArrayList[isEmpty=false;size=3041], ], msg=@String[操作成功], debugMsg=null, code=@Integer[0], ], null, ]
从返回的结果 HostList=@ArrayList[isEmpty=false;size=3041],
,也能证明本次返回的数据是3041条,然后不停的循环去调用feign来获取同一条数据........
二.总结
如果线上环境,代码中没有手动打印任何日志的情况,合理使用arthas能进行快速的问题排查