[PanWeiDB] Running "show events;" on version 3.0 causes a CoreDump on the cluster primary node

Anonymous / 2024-11-11 / Original article

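Background

At the Jiangxi Mobile site, a PanWeiDB cluster was being prepared for cutover to production. During migration testing the application side needed to create scheduled jobs, so the DBA ran test cases that created scheduled jobs.

When other colleagues then inspected the database objects, the cluster primary node core dumped.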

I. Environment

Database: PanWeiDB V2.0-S3.0.0_B01
Compatibility: B (MySQL)
Architecture: Intel + x86_64
Operating system: BCLinux-for-Euler-21.10
Kernel: 4.19.90-2107.6.0.0192.8.oe1.bclinux.x86_64

II. Failure scenario

1. The problem reproduces reliably in the customer environment

gsql -r
show events;
\c bomcdb;          --------  name of the business database
show events;        --------  the database core dumps

2. Failure screenshot

3. Inspect the scheduled jobs in the database

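III. R&D analysis

1. Request the "symbol table" internally within the company. Once obtained, upload it to the database node host and configure it by following "How to Use gdb to Analyze Database Instance Crashes (Document ID 10381.1)".
2. Locating the error

gdb stack trace: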
[Current thread is 1 (LWP 4164310)]
(gdb) bt
#0  0x0000000001b60b16 in heap_form_minimal_tuple (tupleDescriptor=0x14fd369c1c50, values=0x14fd3cc15680, isnull=0x14fd3cc15720, inTuple=0x0) at heaptuple.cpp:2047
#1  0x000000000129b54e in tuplestore_puttupleslot (state=0x14fd36b14050, slot=<optimized out>, need_transform_anyarray=<optimized out>) at tuplestore.cpp:778
#2  0x0000000001914670 in do_tup_output (tstate=tstate@entry=0x14fd3cc15530, values=values@entry=0x14fd1e9c1eb0, values_len=values_len@entry=10, is_null=is_null@entry=0x14fd1e9c1ea6,
    is_null_len=is_null_len@entry=10) at execTuples.cpp:1220
#3  0x000000000174f878 in ShowEventCommand (stmt=stmt@entry=0x14fd3cc60cc8, dest=dest@entry=0x14fd3cc15490) at eventcmds.cpp:913
#4  0x000000000186a5b9 in standard_ProcessUtility (processutility_cxt=<optimized out>, dest=0x14fd3cc15490, sent_to_remote=<optimized out>, completion_tag=0x14fd1e9c27a0 "",
    context=PROCESS_UTILITY_TOPLEVEL, isCTAS=<optimized out>) at utility.cpp:3793
#5  0x000014ffef1cf18f in pgss_ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=0x14fd3cc15490, sentToRemote=<optimized out>, completionTag=0x14fd1e9c27a0 "",
    context=PROCESS_UTILITY_TOPLEVEL, isCTAS=<optimized out>) at pg_stat_statements.cpp:787
#6  0x000000000187429b in pgaudit_ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=<optimized out>, sentToRemote=<optimized out>, completionTag=<optimized out>,
    context=<optimized out>, isCTAS=<optimized out>) at auditfuncs.cpp:1532
#7  0x000000000186e78a in ProcessUtility (processutility_cxt=0x14fd1e9c2730, dest=0x14fd3cc15490, sent_to_remote=false, completion_tag=0x14fd1e9c27a0 "", context=<optimized out>,
    isCTAS=<optimized out>) at utility.cpp:1664
#8  0x0000000001860823 in PortalRunUtility (portal=portal@entry=0x14fd36b1a050, utilityStmt=0x14fd3cc60cc8, isTopLevel=isTopLevel@entry=true, dest=dest@entry=0x14fd3cc15490,
    completionTag=completionTag@entry=0x14fd1e9c27a0 "") at pquery.cpp:1777
#9  0x00000000018616f3 in FillPortalStore (portal=portal@entry=0x14fd36b1a050, isTopLevel=isTopLevel@entry=true) at pquery.cpp:1571
#10 0x0000000001862ad5 in PortalRun (portal=portal@entry=0x14fd36b1a050, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, dest=dest@entry=0x14fd3cc60d78,
    altdest=altdest@entry=0x14fd3cc60d78, completionTag=completionTag@entry=0x14fd1e9c2a80 "") at pquery.cpp:1174
#11 0x0000000001856c1c in exec_simple_query (query_string=<optimized out>, query_string@entry=0x14fd3cc60050 "show events ;", msg=msg@entry=0x14fd1e9c2bf0, messageType=QUERY_MESSAGE)
    at postgres.cpp:3399
#12 0x000000000185ccb2 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x14fd3bc39d90, dbname=<optimized out>, username=<optimized out>) at postgres.cpp:9894
#13 0x00000000017ae2a6 in BackendRun (port=port@entry=0x14fd1e9c3170) at postmaster.cpp:10046
#14 0x00000000017d60ec in GaussDbThreadMain<(knl_thread_role)1> (arg=0x14fdb14f2a60) at postmaster.cpp:14871
#15 0x00000000017ae331 in InternalThreadFunc (args=<optimized out>) at postmaster.cpp:15520
#16 0x000014ffdd17df1b in ?? () from /usr/lib64/libpthread.so.0
#17 0x000014ffdd0b333f in clone () from /usr/lib64/libc.so.6

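IV. Conclusion

The cause of the crash was analyzed as follows:
When the "show events" command runs, the values array contains an empty value.
In the system database, the command runs normally.
In the business database, column 8 (counting from 0) holds a datum equal to 0.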

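Root cause: the crash is tied to the data in column 8, failure_msg. When NULL values are interleaved among non-NULL values in that column, the instance kernel core dumps.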
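The hazard described above can be modeled in a few lines. This is only an illustrative sketch of one plausible reading of the backtrace (the top frame is heap_form_minimal_tuple, and the analysis notes a datum equal to 0 in column 8): a column whose datum is 0 while its isnull flag is still false gets dereferenced. It is not taken from the PanWeiDB source code, and the names form_minimal_tuple and NullDatumDereference are hypothetical stand-ins.

```python
from typing import List, Optional, Sequence


class NullDatumDereference(Exception):
    """Stands in for the segmentation fault a real kernel would hit."""


def form_minimal_tuple(values: Sequence[Optional[str]],
                       isnull: Sequence[bool]) -> List[Optional[str]]:
    """Toy model (assumption): pack a row from parallel values/isnull arrays.

    A column is skipped only when its isnull flag is set. If the flag is
    false but the datum is missing (None here, 0 in the kernel), the model
    raises instead of crashing.
    """
    if len(values) != len(isnull):
        raise ValueError("values and isnull must have the same length")
    packed: List[Optional[str]] = []
    for col, (datum, null_flag) in enumerate(zip(values, isnull)):
        if null_flag:
            packed.append(None)      # properly flagged NULL: safe
            continue
        if datum is None:
            raise NullDatumDereference(
                f"column {col}: datum is 0 but isnull is false")
        packed.append(datum)
    return packed


# Ten output columns, matching values_len=10 in the backtrace.
row = ["x"] * 10
row[8] = None                        # empty failure_msg
flags = [False] * 10                 # flag for column 8 was never set
```

Setting flags[8] = True before the call lets the same row pack cleanly in this model, which is the behavior a correctly flagged NULL would be expected to have.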

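The data in pg_job is faulty: under normal circumstances job_name, end_date, and enable should never be NULL.
However, no NOT NULL restriction was applied to these fields when the jobs were created.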

Temporary workaround:
Use PKG_SERVICE.job_cancel to delete the test jobs from pg_job.
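Before cancelling anything, it helps to know which jobs are affected. The sketch below assumes the pg_job rows have already been fetched by some client into dictionaries keyed by column name. The job_id key and the function name find_incomplete_jobs are assumptions made for illustration. The three checked fields are the ones named in the analysis above.

```python
from typing import Any, Dict, Iterable, List

# Fields the analysis says should never be NULL in pg_job.
REQUIRED_FIELDS = ("job_name", "end_date", "enable")


def find_incomplete_jobs(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one entry per row that has a NULL (None) in a required field.

    Each entry carries the row's job_id (an assumed identifier column) and
    the list of required fields that are missing.
    """
    suspects: List[Dict[str, Any]] = []
    for row in rows:
        missing = [name for name in REQUIRED_FIELDS if row.get(name) is None]
        if missing:
            suspects.append({"job_id": row.get("job_id"), "missing": missing})
    return suspects


# Hypothetical sample data, not taken from the customer environment.
sample_rows = [
    {"job_id": 1, "job_name": "nightly", "end_date": "2099-01-01", "enable": True},
    {"job_id": 2, "job_name": None, "end_date": None, "enable": None},
]
suspects = find_incomplete_jobs(sample_rows)
```

The job identifiers reported this way are only candidates. Each one should be reviewed before it is passed to PKG_SERVICE.job_cancel.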

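Permanent fix:
When PKG_SERVICE.JOB_SUBMIT creates a job and any of these fields has not been assigned a value, it raises an error stating that the field must not be NULL.
The fix ships in PanWeiDB_V2.0-S3.0.2_B01.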