前言

线上环境,客户提出来一个紧急问题,是报表数据生产失败的问题,经过排查得知是feign调用超时问题,现在讲本次线上问题排查思路梳理如下:

一.问题还原

A服务 调用B服务 接口,传递参数有 租户id,开始时间,结束时间,B服务会查询租户的数据返回,但是从A服务看,等到60秒后会超时,如下所示:

2024-04-02 10:36:04.456 ERROR 28305 --- [http-nio-5678-exec-4] c.d.x.c.b.e.GlobalExceptionHandler       : [ControllerErrorHandler系统错误], errorStack: com.netflix.hystrix.exception.HystrixRuntimeException: CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.
        at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:822)
        at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:807)
        at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
        at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
        at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
        at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1472)
        at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1397)
        at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
        at rx.observers.Subscribers$5.onError(Subscribers.java:230)
        at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44)
        at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
        at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
        at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:142)
        at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
        at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
        at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$1.run(AbstractCommand.java:1142)
        at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:41)
        at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:37)
        at com.alibaba.ttl.TtlCallable.call(TtlCallable.java:65)
        at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable.run(HystrixContextRunnable.java:57)
        at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.tick(AbstractCommand.java:1159)
        at com.netflix.hystrix.util.HystrixTimer$1.run(HystrixTimer.java:99)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at com.netflix.hystrix.AbstractCommand.handleTimeoutViaFallback(AbstractCommand.java:997)
        at com.netflix.hystrix.AbstractCommand.access$500(AbstractCommand.java:60)
        at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:609)
        at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:601)
        at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
        ... 16 more

经过排查,系统中feign调用的超时配置的确实是60秒,使用trace方法进一步验证A服务的接口超时问题:

[arthas@28305]$ trace com.xx.CommonFeignClient getHostListReport
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 1) cost in 3086 ms, listenerId: 1
`---ts=2024-04-02 10:41:07;thread_name=http-nio-5678-exec-5;id=d6;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60008.446404ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

`---ts=2024-04-02 10:41:22;thread_name=http-nio-5678-exec-7;id=d8;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60005.811154ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

`---ts=2024-04-02 10:43:24;thread_name=http-nio-5678-exec-8;id=d9;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60006.744468ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

`---ts=2024-04-02 10:44:40;thread_name=http-nio-5678-exec-10;id=db;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60008.893705ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

`---ts=2024-04-02 10:53:06;thread_name=http-nio-5678-exec-3;id=d4;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60006.06974ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

`---ts=2024-04-02 10:57:26;thread_name=http-nio-5678-exec-6;id=d7;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
    `---[60005.968829ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
        `---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]

能看出每次接口超过60秒就会报超时,接下来,就需要排查B服务是否收到请求

2.B服务排查

经过查看日志,B服务这个接口是没有添加任何日志的,导致如果只是简单的查询日志,是看不到任何请求的,只能通过arthas排查,使用trace命令查看接口的请求耗时

`---ts=2024-04-02 10:48:27;thread_name=http-nio-5668-exec-59;id=3385c;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@7c2b6087
    `---[191015.262468ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl$$EnhancerBySpringCGLIB$$4ebc7a2b:getHostListReport()
        `---[191015.224267ms] org.springframework.cglib.proxy.MethodInterceptor:intercept()
            `---[191015.188767ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostListReport()
                +---[0.00905ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:<init>() #2273
                +---[0.00713ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2276
                +---[0.00642ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2277
                +---[0.228765ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2278
                +---[10.255409ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2280
                +---[0.00865ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setNewInstallCount() #2283
                +---[0.0069ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setStarteDate() #2286
                +---[0.006421ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setEndDate() #2287
                +---[0.0064ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2288
                +---[0.00662ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2289
                +---[0.006791ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2289
                +---[0.037051ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2290
                +---[9.937732ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2291
                +---[0.00788ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setTotalCount() #2292
                +---[0.016691ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2295
                +---[0.00705ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2295
                +---[0.0067ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2296
                +---[0.00676ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2296
                +---[0.048871ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2297
                +---[8.386791ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2298
                +---[0.00827ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOnLineCount() #2299
                +---[0.00656ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2302
                +---[0.00607ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2302
                +---[0.00549ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2303
                +---[0.00588ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2303
                +---[0.056201ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2304
                +---[7.170546ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2305
                +---[0.02316ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOutLineCount() #2306
                +---[0.00639ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2309
                +---[0.00558ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2309
                +---[0.005ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2310
                +---[0.034811ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2311
                +---[5.88444ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2312
                +---[0.00755ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setUnloadCount() #2313
                +---[0.00567ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2315
                +---[0.03573ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2316
                +---[115.476815ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2318
                `---[0.01645ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setHostList() #2341

从这个里面看到这个接口是有请求进来的,显示这个方法耗时191015.262468ms,也就是说耗时190秒,但是通过这个方法下面的参数,最长的一个也就耗时 115.476815ms(0.115秒),从这里看到这上面的时间相加,也就一两秒,为啥这里会耗时190秒呢? 继续分析

image.png

从这个图上看,执行到最后一步之前,一共耗时也就1秒左右,但是执行到最后一步后,却耗时190秒,所以很大概率是最后一步在setList的时候,有问题,所以接着看代码:

这个人前面会查出来几千条数据,然后通过for循环去查询这个租户的数据,本地数据库里面3041条,就是要调用3千多次......而且这些数据是一样的,What the fuck ~~~~ 这还是一个做了十几年java的人写的代码3.再次验证B服务返回的结果

watch com.xxx.ConsoleHostController getHostListReport "{params,returnObj,throwExp}" -x 3

method=com.xxx.ConsoleHostController.getHostListReport location=AtExit
ts=2024-04-02 11:00:44; [cost=198116.839392ms] result=@ArrayList[
    @Object[][
        @HostReportQueryDTO[
            starteDate=null,
            endDate=null,
            tenantId=@Long[160890167693314],
            vmType=@String[av],
            agentStatus=null,
            hostStatus=null,
        ],
    ],
    @MsspResponse[
        data=@HostReportResDTO[
            totalCount=@Integer[2864],
            newInstallCount=@Integer[4],
            activeCount=null,
            onLineCount=@Integer[836],
            outLineCount=@Integer[2028],
            unloadCount=@Integer[177],
            HostList=@ArrayList[isEmpty=false;size=3041],
        ],
        msg=@String[操作成功],
        debugMsg=null,
        code=@Integer[0],
    ],
    null,
]

从返回的结果     HostList=@ArrayList[isEmpty=false;size=3041],,也能证明本次返回的数据是3041条,然后不停的循环去调用feign来获取同一条数据........

二.总结

如果线上环境,代码中没有手动打印任何日志的情况,合理使用arthas能进行快速的问题排查