部署repmgr PG14集群<无autofailover>

Jeona / 2023-05-07 / 原文

系统环境

OS: CentOS7
192.168.56.11 node1
192.168.56.12 node2

部署操作流程

1. 系统环境初始化


    echo redhat | passwd root --stdin 

    echo -e "192.168.56.11  node1 " >>/etc/hosts

    echo -e "192.168.56.12  node2 " >>/etc/hosts

    sed -i 's/vault.centos.org/mirrors.aliyun.com/g' /etc/yum.repos.d/CentOS*
 

2. 所有节点安装PG14 <root执行>

    sudo yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

    sudo yum -y install epel-release yum-utils
  
    sudo yum-config-manager --enable pgdg14

    sudo yum install postgresql14-server postgresql14 -y

    echo -e " export PATH=/usr/pgsql-14/bin/:$PATH" >>/var/lib/pgsql/.bash_profile


3. 所有节点安装repmgr<root执行>

## on all nodes
sudo yum -y  install repmgr_14*

4. 关闭防火墙或者放行PG端口(默认5432)<root执行>

systemctl stop firewalld

5. 主节点创建repmgr账户 , pg_hba.cfg 以及postgresql.conf 参数调整 , repmgr.conf文件配置 <PG owner执行>

5.1 创建repmgr

createuser repmgr -P -s 

createdb  repmgr -O repmgr;

5.2 调整pg_hba.cfg 以及postgresql.conf参数

##  -- postgresql.conf
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

##  -- pg_hba.cfg
--- 这里采用了trust模式,因此无需.pgpass
local   replication   repmgr                              trust
host    replication   repmgr      127.0.0.1/32            trust
host    replication   repmgr      192.168.56.11/32       trust
host    replication   repmgr      192.168.56.12/32       trust

local   repmgr        repmgr                              trust
host    repmgr        repmgr      127.0.0.1/32            trust
host    repmgr        repmgr      192.168.56.11/32       trust
host    repmgr        repmgr      192.168.56.12/32       trust

若有需要 .pgpass ,则需要如下配置:su - postgres
cd /var/lib/pgsql
touch .pgpass
chmod 0600 .pgpass
echo -e "192.168.56.11:5432:repmgr:repmgr:passwd_123456">>.pgpass
echo -e "192.168.56.12:5432:repmgr:repmgr:passwd_123456">>.pgpass

5.3 repmgr.conf文件配置

#-- 主库

node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'

#-- 从库
node_id=2
node_name=node2
conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'

6. 启动主节点PG 并 注册为primary server <PG owner执行>

# -- 首先启动主库PG
sudo /usr/pgsql-14/bin/postgresql-14-setup initdb

sudo systemctl enable postgresql-14

sudo systemctl start postgresql-14

# -- 先模拟运行注册
su - postgres
repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run

# -- 将上面错误全部解决后,执行注册
repmgr -f /etc/repmgr/14/repmgr.conf primary register 


7. 清空备节点PG data目录, 搭建/clone备节点 <PG owner执行>

# - 确保从库PG以及PG 所有data目录是空的,如果不是,则手动删除即可

systemctl stop postgresql-14


rm -rf /var/lib/pgsql/14/data

8. 将备节点注册成standby server 并启动PG实例 <PG owner执行>

repmgr -f /etc/repmgr/14/repmgr.conf standby register  --force

9. 校验集群信息

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=node2 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2

10. 常见错误

10.1 connection to server at "192.168.56.11", port 5432 failed: Connection refused

# -- 明明PG 实例都启动了,但是死活显示如题,最后排查到 默认情况下 postgresql只监听local,因此修改参数listen-addresses ='*'
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "192.168.56.11", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path=

#-- solution 修改如下配置即可
查看 postgresql.conf 
listen-addresses ='*'

10.2.ERROR: "repmgr" is not a superuser and no superuser name supplied

# 需要确保做repmgr的账户拥有超级权限
postgres=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of
-----------+------------------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 repmgr    | Superuser                                                  | {}

10.3. Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

# 系统级别的问题, 配置passwordless时,在节点node1上跑 ssh node2 date会直接拒绝

# - solution
修改如下 PasswordAuthentication = yes
PasswordAuthentication  yes

10.4.网络有问题

-- 各种查看,发现没有问题


-bash-4.2$ repmgr -hnode1 -p5432 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 port=5432 user=repmgr dbname=repmgr
ERROR: connection to database failed
DETAIL:
connection to server at "node1" (192.168.56.11), port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

-- 最后发现是防火墙需要关掉 - systemctl stop firewalld 或者通过firewall-cmd放行PG 实例端口