Troubleshooting Production Issues with Arthas

Posted on 2019-08-17, updated on 2025-07-07, filed under Java

Generally speaking, troubleshooting proceeds through these levels:

Check server metrics
Check the logs
Review the code
Debug

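Debugging is the last resort. You normally avoid it unless nothing else works: it is time-consuming, and once the service is live, who gets to open a debug port on a production server? In my experience, when a really thorny problem came up, the only option was to review the code and add logging at key points. The root cause is that you cannot see the in-memory state of the running process. Arthas was created to solve exactly this problem.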

Getting started with Arthas

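# For convenience, download the jar directly:
wget https://alibaba.github.io/arthas/arthas-boot.jar
# Switch to the user that owns the process, e.g. if the worker user started it:
sudo su worker
# Run the jar; it lists the Java processes it finds, and you enter the number of the target:
java -jar arthas-boot.jar
# Intercept calls to a method and print the return value
watch demo.Main test returnObj
# Leave the current session (quit and exit are equivalent). The Arthas instance attached to the target process keeps running and its port stays open, so the next connection attaches right away.
quit
exit
# Stop the Arthas server itself (recent Arthas releases name this command stop)
shutdown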

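Notes

The watch command cannot capture recursive calls of a method.
After using Arthas in production, it is best to stop it with the shutdown command; otherwise it keeps consuming resources of the server process.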

Troubleshooting Arthas itself

If Arthas itself misbehaves, check its own log:

less ~/logs/arthas/arthas.log

More commands (help)

# Show the startup options
java -jar arthas-boot.jar -h
# List the available commands
help

Comparing code (jad)

# Decompile a class to check whether the bytecode actually running matches your local code
jad demo.Main

Printing thread stacks (thread)

Print the stack of a given thread. If a thread hangs, this shows which method call it is blocked on.

$ vim A.java
package com.tallate.localcache;

import java.util.HashMap;
import java.util.Map;

public class A {

    private static int x = 1;
    private static Map<Integer, Integer> m = new HashMap<>();

    public static class Param {

        private int a1;
        private int a2;

        public int getA1() {
            return a1;
        }

        public Param setA1(int a1) {
            this.a1 = a1;
            return this;
        }

        public int getA2() {
            return a2;
        }

        public Param setA2(int a2) {
            this.a2 = a2;
            return this;
        }
    }

    public static void main(String[] args) {
        m.put(1, 1);
        m.put(2, 2);
        while (true) {
            try {
                Thread.sleep(1000);
            } catch (Exception e) {
                e.printStackTrace();
            }
            test(new Param().setA1(1).setA2(2), new Param().setA1(3).setA2(4));
        }
    }

    public static void test(Param param1, Param param2) {
        System.out.println("abc");
    }
}

# Show an overview of the current JVM process
dashboard
$ thread 1 | grep 'main('
$ thread 1 
"main" Id=1 TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at A.main(A.java:5)

Affect(row-cnt:0) cost in 12 ms.
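
For comparison, the kind of information that thread prints can also be read from inside any JVM through the standard java.lang.management API. The following is only a minimal sketch: the class name ThreadDumpSketch and its dump method are made up for illustration, and nothing here claims to be how Arthas produces its output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Arthas: formats one thread's state and stack.
final class ThreadDumpSketch {

    // Returns a text dump similar in spirit to `thread <id>`, or null if no live thread has that id.
    static String dump(long threadId) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        ThreadInfo info = bean.getThreadInfo(threadId, Integer.MAX_VALUE);
        if (info == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(info.getThreadName()).append("\" Id=")
          .append(info.getThreadId()).append(' ')
          .append(info.getThreadState()).append('\n');
        for (StackTraceElement frame : info.getStackTrace()) {
            sb.append("    at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Dump the thread that is running this method.
        System.out.print(dump(Thread.currentThread().getId()));
    }
}
```

The advantage of Arthas is that it gives you this view of a process that is already running, without adding code like the above and redeploying.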

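Inspecting method arguments and return values (watch)

By default watch prints the result of every call, which quickly becomes a mess. You can filter with OGNL expressions; in the source code, the core object that expressions are evaluated against is the Advice object.
The examples below reuse the class shown above: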

# Show the first argument
watch com.tallate.localcache.A test "params[0]"
# Call a method on an argument
watch com.tallate.localcache.A test "params[0].getA1()"
# When an argument or the return value is a nested type, look inside the object
watch com.tallate.localcache.A test "params[0].{ #this.a1 }"
# Handle overloads (if the target method is overloaded, the commands above also intercept calls to the other overloads); the second expression is the condition
watch com.tallate.localcache.A test "{params, returnObj}" "params.length==1"
watch com.tallate.localcache.A test "{params, returnObj}" "params[1] instanceof Integer"
# Filter arguments by condition
watch com.tallate.localcache.A test "params[0].{? #this.a1 == 1 }" -x 2
watch com.tallate.localcache.A test "params[0].{? #this.a1 == null }" -x 2
watch com.tallate.localcache.A test "params[0].{? #this.a1 != null }" -x 2
watch com.tallate.localcache.A test "{params[0], params[1], returnObj}" "params[0] == 'filter-value'"
watch com.tallate.localcache.A test "{params[0], params[1], returnObj}" | grep "filter-value"
# Count after filtering; note that the result of {? expr } is an ArrayList
watch com.tallate.localcache.A test "params[0].{? #this.a1 != null }.size()" -x 2
# Evaluate a subexpression
watch com.tallate.localcache.A test "params[0].{? #this.a1 < 10 }.size().(#this >= 2 ? #this - 10 : 'other condition')" -x 2
# Select the first match; note that the example method takes two arguments
watch com.tallate.localcache.A test "params.{^ #this.a1 != null }" -x 2
# Select the last match
watch com.tallate.localcache.A test "params.{$ #this.a1 != null }" -x 2
# Access a static field
watch com.tallate.localcache.A test "@com.tallate.localcache.A@x"
# The form above is subject to classloader restrictions and is not recommended
# Use the newer getstatic command instead: with -c to specify the classloader it can read any static field, and it also supports OGNL expressions
getstatic com.tallate.localcache.A x
getstatic com.tallate.localcache.A m 'entrySet().iterator.{? #this.key == 1 }'
getstatic com.tallate.localcache.A m 'entrySet().iterator.{? #this.key == "1" }'
# Call a static method
watch com.tallate.localcache.A test "@java.lang.Thread@currentThread()"
watch com.tallate.localcache.A test "@java.lang.Thread@currentThread().getContextClassLoader()"
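
If the OGNL selection operators used above ({? }, {^ } and {$ }) are unfamiliar, it may help to see roughly what they correspond to in plain Java. This is only an analogy written with streams, not OGNL itself: the class SelectionSketch, its simplified Param type and the limit parameter are invented for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Plain-Java analogue of the OGNL selection operators; all names are illustrative only.
final class SelectionSketch {

    // Simplified stand-in for the Param class of the example program.
    static final class Param {
        final int a1;

        Param(int a1) {
            this.a1 = a1;
        }
    }

    // Roughly "params.{? #this.a1 < limit }": every element that matches.
    static List<Param> selectAll(Param[] params, int limit) {
        return Arrays.stream(params)
                .filter(p -> p.a1 < limit)
                .collect(Collectors.toList());
    }

    // Roughly "params.{^ #this.a1 < limit }": the first match.
    // OGNL returns it wrapped in a list; an Optional is used here instead.
    static Optional<Param> selectFirst(Param[] params, int limit) {
        return Arrays.stream(params).filter(p -> p.a1 < limit).findFirst();
    }

    // Roughly "params.{$ #this.a1 < limit }": the last match.
    static Optional<Param> selectLast(Param[] params, int limit) {
        List<Param> all = selectAll(params, limit);
        return all.isEmpty() ? Optional.empty() : Optional.of(all.get(all.size() - 1));
    }

    public static void main(String[] args) {
        Param[] params = {new Param(1), new Param(3)};
        System.out.println(selectAll(params, 10).size()); // prints 2
    }
}
```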

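Analyzing the call path of a method (trace)

trace com.tallate.localcache.A test
# Skip JDK methods
trace -j com.tallate.localcache.A test
# Filter by execution time (to keep the example simple this keeps call paths slower than 0.01 ms, which is effectively no filter)
trace *A test '#cost > 0.01'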

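Notes on some special output:
[0,0,0ms,11]xxx:yyy() [throws Exception]: identical method calls inside the traced method are merged. 0,0,0ms,11 is the call cost as min,max,total,count; throws Exception indicates that some of those calls returned by throwing an exception.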
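
To make the merged [min,max,total,count] summary concrete, here is a small sketch of how repeated calls to the same method could be folded into those four numbers. The CallStats class is hypothetical and is not taken from the Arthas source; the output format only imitates the one described above.

```java
import java.util.Locale;

// Hypothetical accumulator mirroring the [min,max,total,count] summary; not Arthas source code.
final class CallStats {

    private double min = Double.MAX_VALUE;
    private double max = 0;
    private double total = 0;
    private int count = 0;

    // Folds the cost of one more call (in milliseconds) into the summary.
    void record(double costMs) {
        min = Math.min(min, costMs);
        max = Math.max(max, costMs);
        total += costMs;
        count++;
    }

    // Renders the summary as min,max,total,count.
    String summary() {
        double shownMin = count == 0 ? 0 : min;
        return String.format(Locale.ROOT, "[%.2f,%.2f,%.2fms,%d]", shownMin, max, total, count);
    }

    public static void main(String[] args) {
        CallStats stats = new CallStats();
        stats.record(1.0);
        stats.record(3.0);
        stats.record(2.0);
        System.out.println(stats.summary()); // prints [1.00,3.00,6.00ms,3]
    }
}
```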

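Some limitations of trace:

It only prints one level of calls, because printing everything would be very noisy. If you need the full picture, consider a distributed tracing tool such as pinpoint; for a complex call path, it is best to first narrow down roughly where the bottleneck is.

trace itself adds some overhead, so the numbers are slightly off, but this overhead rarely changes the conclusion.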


# Filter by thread and match several classes and methods with regular expressions: trace -E com.test.ClassA|org.test.ClassB method1|method2|method3
trace -E 'io\.netty\.channel\.nio\.NioEventLoop|io\.netty\.util\.concurrent\.SingleThreadEventExecutor' 'select|processSelectedKeys|runAllTasks' '@Thread@currentThread().getName().contains("IO-HTTP-WORKER-IOPool") && #cost>500'

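Recording snapshots of method invocations (tt)

# Show the options and examples of the tt command
tt -h
# Record every invocation
tt -t com.tallate.localcache.A test
# Stop recording after 3 invocations; otherwise, if the method is called very frequently, the records can quickly exhaust JVM memory
tt -t -n 3 com.tallate.localcache.A test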

 INDEX   TIMESTAMP            COST(ms)  IS-RET  IS-EXP  OBJECT          CLASS                         METHOD                        
------------------------------------------------------------------------------------------------------------------------------------
 1000    2019-08-17 18:05:45  0.546613  true    false   NULL            A                             test

Leaving with exit or quit does not clear the recorded invocations; you can still query them with tt:

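# Search for the invocations of a method; the core object of the search expression is again the Advice object
tt -s 'method.name=="test"'
# Look up the invocation with a given index
tt -i 1003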

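Because tt stores the full context of each invocation, you can even replay an invocation by its index:

# -p: replay the invocation; --replay-times: number of replays; --replay-interval: interval between replays in ms, default 1000 ms
tt -i 1003 -p --replay-times 1 --replay-interval 1000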

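There are a few things to watch out for when using tt:

ThreadLocal
The replay runs on a different thread, so any data the program keeps in a ThreadLocal is lost when tt re-executes the call.
Referenced objects
tt only stores references to the objects in the recorded context. If the method modified its arguments, or the returned object was changed afterwards, the replay is not accurate either. In that case the only option is to use watch to see the state at the time of the call.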
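
Both caveats are ordinary Java behavior and can be reproduced without Arthas. The sketch below is a hypothetical demo (the ReplayCaveats class and its method names are invented here): it reads a ThreadLocal from a second thread, and it keeps only a reference to an argument that is modified later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the two caveats; it does not use or imitate Arthas internals.
final class ReplayCaveats {

    static final ThreadLocal<String> USER = new ThreadLocal<>();

    // Reads USER from a freshly started thread, the way a replay on another thread would see it.
    static String readFromOtherThread() {
        String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = USER.get());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    // Keeps only the reference to the argument; no copy is made.
    static List<String> recordReference(List<String> arg) {
        return arg;
    }

    public static void main(String[] args) {
        USER.set("alice");
        System.out.println(USER.get());            // prints alice
        System.out.println(readFromOtherThread()); // prints null

        List<String> arg = new ArrayList<>();
        arg.add("original");
        List<String> recorded = recordReference(arg);
        arg.add("mutated later");
        System.out.println(recorded);              // prints [original, mutated later]
    }
}
```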

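Viewing class loading information (sc)

sc stands for search-class; it finds every class that has been loaded into the JVM.

# Wildcard search
sc demo.*
# Print the details of a class, including which jar it was loaded from
# If a method of a class cannot be found (NoSuchMethodException), the wrong class may have been loaded; use -d to check
sc -d demo.Main
# Print the details of a class together with its fields
sc -d -f demo.Main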

How Arthas works

Arthas is built on the Java Agent mechanism.
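
As a rough idea of what a Java Agent looks like, here is a minimal skeleton using the standard java.lang.instrument API. It is a sketch only: SketchAgent and CountingTransformer are hypothetical names, the transformer merely counts the classes it is shown, and the real Arthas agent is considerably more involved.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical agent skeleton; it only illustrates the mechanism.
final class SketchAgent {

    // Number of classes the transformer has been shown so far.
    static final AtomicInteger SEEN = new AtomicInteger();

    // Observes classes as they are loaded or retransformed, without changing their bytecode.
    static final class CountingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain, byte[] classfileBuffer) {
            SEEN.incrementAndGet();
            return null; // returning null leaves the class unchanged
        }
    }

    static final CountingTransformer COUNTING = new CountingTransformer();

    // Entry point the JVM calls when the agent is attached to an already running process.
    public static void agentmain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(COUNTING, true);
    }
}
```

To be loadable into a running JVM, such a class has to be packaged in a jar whose manifest names it in the Agent-Class attribute. A tool that wants to rewrite method bodies, as watch and trace need to, would return modified bytecode from transform instead of null.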
