High Availability Clustering with Zabbix on Debian



Edmunds Vesmanis gave a presentation at Zabbix Summit 2019 about Zabbix HA setups (video on YouTube), and he also wrote a post on the Zabbix blog titled “High Availability cluster building with Zabbix for continued service”. His setup has 3 database servers, 3 Zabbix servers and 3 Zabbix frontend servers. The configuration was based on RHEL or CentOS, and the commands are somewhat different on Debian 10 Buster. Thus, here is my version of Edmunds’s setup on Debian.

You still need to see the original post or the video (linked above) for more information about the background and the idea, as I’m mostly going to show just the commands here. I also changed some bits here and there, such as using GTID for replication.

Versions used (everything from Debian repo except Zabbix):

IP addresses and hostnames to be used and appended to /etc/hosts on every server:

# VIPs
192.168.7.87 zabbix-ha-app
192.168.7.88 zabbix-ha-fe
192.168.7.89 zabbix-ha-db

# Front-end nodes
192.168.7.90 zabbix-ha-fe1
192.168.7.91 zabbix-ha-fe2
192.168.7.92 zabbix-ha-fe3

# Zabbix server nodes
192.168.7.93 zabbix-ha-srv1
192.168.7.94 zabbix-ha-srv2
192.168.7.95 zabbix-ha-srv3

# Database nodes
192.168.7.96 zabbix-ha-db1
192.168.7.97 zabbix-ha-db2
192.168.7.99 zabbix-ha-db3
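
If you would rather not edit /etc/hosts by hand on nine servers, the list above can be appended with a small script. This is only a sketch: the add_cluster_hosts function name and the marker line are my own inventions for illustration, and the marker exists only so that running the script twice does not duplicate the entries.

```shell
# Marker line used to detect an earlier run (hypothetical convention, not from the original setup)
MARKER="# zabbix-ha cluster hosts"

# Append the cluster host entries to the file given as $1, but only once.
add_cluster_hosts() {
    touch "$1"
    # If the marker is already in the file, do nothing
    if grep -qxF -- "$MARKER" "$1"; then
        return 0
    fi
    cat >> "$1" <<EOF
$MARKER
# VIPs
192.168.7.87 zabbix-ha-app
192.168.7.88 zabbix-ha-fe
192.168.7.89 zabbix-ha-db
# Front-end nodes
192.168.7.90 zabbix-ha-fe1
192.168.7.91 zabbix-ha-fe2
192.168.7.92 zabbix-ha-fe3
# Zabbix server nodes
192.168.7.93 zabbix-ha-srv1
192.168.7.94 zabbix-ha-srv2
192.168.7.95 zabbix-ha-srv3
# Database nodes
192.168.7.96 zabbix-ha-db1
192.168.7.97 zabbix-ha-db2
192.168.7.99 zabbix-ha-db3
EOF
}

# Usage on a real node (as root):
#   add_cluster_hosts /etc/hosts
```

Try it against a scratch file first to see what it writes before pointing it at the real /etc/hosts.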

Setting up the database servers

On every database server:

sudo -i
vi /etc/hosts    # set the hosts file as mentioned above
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
# Debian ships with a default cluster configuration; move it out of the way:
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig
apt install mariadb-server
systemctl stop mariadb

On every database server, create a configuration file /etc/mysql/mariadb.conf.d/90-zabbix.cnf:

[mysqld]
skip_name_resolve
bind_address = 0.0.0.0
log_slave_updates
max_binlog_size = 1G
expire_logs_days = 5
innodb_buffer_pool_size = 1G # 70-80% of total RAM
innodb_buffer_pool_instances = 1 # each instance should be at least 1GB
innodb_flush_log_at_trx_commit = 2 # default = 1
innodb_flush_method = O_DIRECT # default = fsync
innodb_io_capacity = 500 # HDD = 500-800, SSD = 2000
query_cache_size = 0

# Change the following values for each server accordingly!
log_basename = zabbix-ha-db1
log_bin = zabbix-ha-db1-bin
server_id = 96 # The last number of the server IP address
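
The three per-server values follow a simple pattern (host name, host name plus "-bin", last octet of the IP address), so they can be generated instead of typed. A minimal sketch, assuming you pass in the host name and IP address yourself; node_settings is a hypothetical helper name:

```shell
# Print the three per-node MariaDB settings.
# $1 = host name of the node, $2 = its IPv4 address
node_settings() {
    host="$1"
    ip="$2"
    last_octet="${ip##*.}"    # strip everything up to and including the final dot
    printf 'log_basename = %s\n' "$host"
    printf 'log_bin = %s-bin\n' "$host"
    printf 'server_id = %s\n' "$last_octet"
}

# Example: the values for the second database server
node_settings zabbix-ha-db2 192.168.7.97
```

The output can be pasted at the end of 90-zabbix.cnf on the matching server, in place of the zabbix-ha-db1 values shown above.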

Start MariaDB again on every server:

systemctl start mariadb

On the first server only:

pcs host auth zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_db_cluster zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.7.89 op monitor interval=5s --group zabbix_db_cluster

On the second and third server, enable the cluster services:

systemctl enable corosync pacemaker

Check the cluster status with the pcs status command. Example output:

root@zabbix-ha-db1:~# pcs status
Cluster name: zabbix_db_cluster
Stack: corosync
Current DC: zabbix-ha-db3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 19:03:44 2020
Last change: Sat Jan 11 18:48:10 2020 by root via cibadmin on zabbix-ha-db2

3 nodes configured
1 resource configured

Online: [ zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 ]

Full list of resources:

 Resource Group: zabbix_db_cluster
     virtual_ip (ocf::heartbeat:IPaddr2):       Started zabbix-ha-db1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
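
For scripting or monitoring it can be handy to pull just the active node out of that output. The following is a sketch that assumes the pcs status layout shown above (resource name first, node name last on the line); other pcs versions may format the resource lines differently, and resource_node is a name I made up for this example.

```shell
# Read "pcs status" output on stdin and print the node on which
# resource $1 is reported as Started. Returns 1 if it is not found.
resource_node() {
    awk -v res="$1" '
        NF >= 3 && $1 == res && $(NF-1) == "Started" { print $NF; found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a cluster node:
#   pcs status | resource_node virtual_ip
```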

Now let’s configure the database replication.

On the first server, start the mysql client and enter these commands:

stop slave;
grant replication slave on *.* to 'replicator'@'192.168.7.97' identified by 'Password456';
show global variables like 'gtid_current_pos';

Output example for the GTID position:

MariaDB [(none)]> show global variables like 'gtid_current_pos';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| gtid_current_pos | 0-96-1 |
+------------------+--------+
1 row in set (0.002 sec)

MariaDB [(none)]>

Make a note of the position (“0-96-1” in this example).

On the second server (zabbix-ha-db2), start the mysql client and enter these commands:

stop slave;
set global gtid_slave_pos = '0-96-1';   # The GTID you noted earlier
change master to master_host='192.168.7.96', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
grant replication slave on *.* to 'replicator'@'192.168.7.99' identified by 'Password456';
reset master;
start slave;
show slave status\G

Output example for the slave status:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.7.96
                   Master_User: replicator
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: zabbix-ha-db1-bin.000002
           Read_Master_Log_Pos: 350
                Relay_Log_File: zabbix-ha-db2-relay-bin.000002
                 Relay_Log_Pos: 657
         Relay_Master_Log_File: zabbix-ha-db1-bin.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 350
               Relay_Log_Space: 974
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 96
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-96-1
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

MariaDB [(none)]>
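
The two lines that matter most in that output are Slave_IO_Running and Slave_SQL_Running. A small check along these lines can save some scrolling; it is a sketch only, replication_ok is a hypothetical function name, and it looks at nothing beyond those two fields.

```shell
# Read "show slave status\G" output on stdin.
# Succeeds (exit 0) only if both replication threads report "Yes".
replication_ok() {
    awk -F': *' '
        { key = $1; gsub(/^[ \t]+/, "", key) }    # drop the right-alignment padding
        key == "Slave_IO_Running"  && $2 == "Yes" { io = 1 }
        key == "Slave_SQL_Running" && $2 == "Yes" { sql = 1 }
        END { exit (io && sql) ? 0 : 1 }
    '
}

# Usage on a database node (as root):
#   mysql -e 'show slave status\G' | replication_ok && echo "replication running"
```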

On the third server (zabbix-ha-db3), start the mysql client and enter these commands:

stop slave;
set global gtid_slave_pos = '0-96-1';    # The same as noted earlier
change master to master_host='192.168.7.97', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
grant replication slave on *.* to 'replicator'@'192.168.7.96' identified by 'Password456';
reset master;
start slave;
show slave status\G

Output example on zabbix-ha-db3:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.7.97
                   Master_User: replicator
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: zabbix-ha-db2-bin.000001
           Read_Master_Log_Pos: 336
                Relay_Log_File: zabbix-ha-db3-relay-bin.000002
                 Relay_Log_Pos: 643
         Relay_Master_Log_File: zabbix-ha-db2-bin.000001
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-96-1
...

On the first database server again (zabbix-ha-db1), to complete the ring replication, start the mysql client and enter these commands:

stop slave;
set global gtid_slave_pos = '0-96-1';
change master to master_host='192.168.7.99', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
start slave;
show slave status\G

Output example on the first server:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.7.99
                   Master_User: replicator
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: zabbix-ha-db3-bin.000001
           Read_Master_Log_Pos: 336
                Relay_Log_File: zabbix-ha-db1-relay-bin.000002
                 Relay_Log_Pos: 643
         Relay_Master_Log_File: zabbix-ha-db3-bin.000001
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-96-1
...
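
To double-check who replicates from whom in the ring, here is a tiny sketch that prints the master for each node. It assumes the nodes are given in ring order, with each node replicating from the one before it and the first node replicating from the last; ring_masters is just an illustrative name.

```shell
# Print "node <- master" for every node in a replication ring.
# Arguments: the nodes in ring order.
ring_masters() {
    prev=""
    for node in "$@"; do
        prev="$node"          # after this loop, prev holds the last node
    done
    for node in "$@"; do
        printf '%s <- %s\n' "$node" "$prev"
        prev="$node"
    done
}

ring_masters 192.168.7.96 192.168.7.97 192.168.7.99
```

With the three database addresses this prints 192.168.7.96 <- 192.168.7.99, 192.168.7.97 <- 192.168.7.96 and 192.168.7.99 <- 192.168.7.97, which matches the master_host values used in the change master commands above.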

Now, still on the first server, create the Zabbix database and user:

create database zabbix character set utf8 collate utf8_bin;
grant all privileges on zabbix.* to 'zabbix'@'192.168.7.%' identified by 'Password789';
quit;

We just created an empty Zabbix database, and we will import the schema later from the Zabbix application server.


Setting up the Zabbix servers

On every Zabbix server:

sudo -i
vi /etc/hosts    # set the hosts file as described at the start
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig

# This is the current release file, please check the latest at https://www.zabbix.com/download
wget https://repo.zabbix.com/zabbix/4.4/debian/pool/main/z/zabbix-release/zabbix-release_4.4-1+buster_all.deb
dpkg -i zabbix-release_4.4-1+buster_all.deb
apt update
# We skip MariaDB and snmpd (note the minus characters)
apt install zabbix-server-mysql mariadb-server-10.3- snmpd-

On every Zabbix server, edit /etc/zabbix/zabbix_server.conf:

SourceIP=192.168.7.87
DBHost=192.168.7.89
DBPassword=Password789
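
Since the same three settings go on every Zabbix server, they can also be applied with a script instead of an editor. A minimal sketch, assuming the usual KEY=value layout of zabbix_server.conf where unset options appear as commented lines; set_conf is a hypothetical helper, not something that comes with Zabbix.

```shell
# Set KEY=VALUE in a config file.
# Replaces an existing uncommented "KEY=" line, otherwise appends a new line.
# $1 = file, $2 = key, $3 = value
set_conf() {
    file="$1"; key="$2"; value="$3"
    if grep -q "^${key}=" "$file"; then
        tmp=$(mktemp) || return 1
        # Pass key and value through the environment so awk does not
        # interpret backslashes or other special characters in them
        CONF_KEY="$key" CONF_VALUE="$value" awk '
            index($0, ENVIRON["CONF_KEY"] "=") == 1 {
                print ENVIRON["CONF_KEY"] "=" ENVIRON["CONF_VALUE"]
                next
            }
            { print }
        ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the owner and mode of the original file
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on a Zabbix server node (as root):
#   set_conf /etc/zabbix/zabbix_server.conf SourceIP 192.168.7.87
#   set_conf /etc/zabbix/zabbix_server.conf DBHost 192.168.7.89
#   set_conf /etc/zabbix/zabbix_server.conf DBPassword Password789
```

Commented-out defaults are left alone, so the new value ends up at the end of the file unless an active line for the key already exists.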

On the first server only (zabbix-ha-srv1), prepare the Zabbix database. Enter the previously set zabbix user password (Password789) when asked:

zcat /usr/share/doc/zabbix-server-mysql*/create.sql.gz | mysql -h 192.168.7.96 -u zabbix -p zabbix

Note: If you want to implement Zabbix database table partitioning, this would be the time for that.

Next, still on the first Zabbix server, set up the cluster:

pcs host auth zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_server_cluster zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip_server ocf:heartbeat:IPaddr2 ip=192.168.7.87 op monitor interval=5s --group zabbix_server_cluster
pcs resource create ZabbixServer systemd:zabbix-server op monitor interval=10s --group zabbix_server_cluster
pcs constraint colocation add virtual_ip_server with ZabbixServer
pcs constraint order virtual_ip_server then ZabbixServer
# To edit the start/stop timeouts we need to delete them first
pcs resource op delete ZabbixServer start
pcs resource op delete ZabbixServer stop
pcs resource op add ZabbixServer start interval=0s timeout=60s
pcs resource op add ZabbixServer stop interval=0s timeout=120s

Check the cluster status with the pcs status command. Example output:

root@zabbix-ha-srv1:~# pcs status
Cluster name: zabbix_server_cluster
Stack: corosync
Current DC: zabbix-ha-srv2 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 19:25:31 2020
Last change: Sat Jan 11 19:21:09 2020 by root via cibadmin on zabbix-ha-srv1

3 nodes configured
2 resources configured

Online: [ zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 ]

Full list of resources:

 Resource Group: zabbix_server_cluster
     virtual_ip_server (ocf::heartbeat:IPaddr2): Started zabbix-ha-srv1
     ZabbixServer      (systemd:zabbix-server):  Started zabbix-ha-srv1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Note that we didn’t explicitly enable the zabbix-server service, and it is not enabled by default. The clustering service will take care of starting the service on the active node.


Setting up the frontend web servers

On every web server:

sudo -i
vi /etc/hosts    # set the hosts file as described at the start
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig

# This is the current release file, please check the latest at https://www.zabbix.com/download
wget https://repo.zabbix.com/zabbix/4.4/debian/pool/main/z/zabbix-release/zabbix-release_4.4-1+buster_all.deb
dpkg -i zabbix-release_4.4-1+buster_all.deb
apt update
apt install zabbix-frontend-php zabbix-apache-conf apache2
systemctl stop apache2
systemctl disable apache2

On every web server, edit /etc/zabbix/apache.conf to set the time zone in the PHP7 settings, for example:

...
    <IfModule mod_php7.c>
        ...
        php_value date.timezone Europe/Helsinki
    </IfModule>
...

On every web server, create /etc/apache2/conf-available/serverstatus.conf:

Listen 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>
        <Location /server-status>
                SetHandler server-status
                Require local
        </Location>
</VirtualHost>

On every web server, activate the server status configuration:

a2enconf serverstatus

On every web server, edit /etc/apache2/ports.conf, change Listen 80 to include the cluster IP address:

Listen 192.168.7.88:80

On the first web server, let’s now configure Zabbix frontend:

systemctl start apache2

Using a browser, go to http://192.168.7.90/zabbix/, and configure the Zabbix frontend as requested. Be sure to enter the database cluster IP address 192.168.7.89 when asked for the database host, and the Zabbix server cluster IP address 192.168.7.87 for the Zabbix server.

When the Zabbix frontend has been successfully configured, copy the resulting configuration file /etc/zabbix/web/zabbix.conf.php to the second and third web servers.

On the first web server, stop Apache, and configure the cluster:

systemctl stop apache2
pcs host auth zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_fe_cluster zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip_fe ocf:heartbeat:IPaddr2 ip=192.168.7.88 op monitor interval=5s --group zabbix_fe_cluster
pcs resource create zabbix_fe ocf:heartbeat:apache configfile=/etc/apache2/apache2.conf statusurl="http://localhost:8080/server-status" op monitor interval=30s --group zabbix_fe_cluster
pcs constraint colocation add virtual_ip_fe with zabbix_fe
pcs constraint order virtual_ip_fe then zabbix_fe
# To edit the start/stop timeouts we need to delete them first
pcs resource op delete zabbix_fe start
pcs resource op delete zabbix_fe stop
pcs resource op add zabbix_fe start interval=0s timeout=60s
pcs resource op add zabbix_fe stop interval=0s timeout=120s

Finally, on the second and third web servers, enable the cluster services:

systemctl enable corosync pacemaker

You can check the web cluster status with pcs status:

root@zabbix-ha-fe1:~# pcs status
Cluster name: zabbix_fe_cluster
Stack: corosync
Current DC: zabbix-ha-fe3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 20:26:42 2020
Last change: Sat Jan 11 19:33:19 2020 by root via cibadmin on zabbix-ha-fe1

3 nodes configured
2 resources configured

Online: [ zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 ]

Full list of resources:

 Resource Group: zabbix_fe_cluster
     virtual_ip_fe    (ocf::heartbeat:IPaddr2):  Started zabbix-ha-fe1
     zabbix_fe  (ocf::heartbeat:apache):        Started zabbix-ha-fe1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

You can also see which addresses and ports the active web server is listening on:

root@zabbix-ha-fe1:~# ss -ntul
Netid  State   Recv-Q  Send-Q   Local Address:Port    Peer Address:Port
udp    UNCONN  0       0         192.168.7.90:5405         0.0.0.0:*
tcp    LISTEN  0       128            0.0.0.0:22           0.0.0.0:*
tcp    LISTEN  0       128          127.0.0.1:8080         0.0.0.0:*
tcp    LISTEN  0       128       192.168.7.88:80           0.0.0.0:*
tcp    LISTEN  0       128            0.0.0.0:2224         0.0.0.0:*
tcp    LISTEN  0       128               [::]:22              [::]:*
tcp    LISTEN  0       128               [::]:2224            [::]:*

As you can see, the server status service (port 8080) is only listening on the localhost address 127.0.0.1, and the web server on port 80 is listening on the cluster IP address 192.168.7.88. It is left as an exercise for the reader to also enable IPv6 and/or TLS connectivity on the web server.
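
If you want to verify the listener from a script rather than by eye, something like the following sketch works on the ss -ntul output format shown above (with the Netid column first). The listens_on name is made up for this example.

```shell
# Read "ss -ntul" output on stdin and succeed if there is a TCP
# listener on the local address:port given as $1.
listens_on() {
    awk -v want="$1" '
        $1 == "tcp" && $2 == "LISTEN" && $5 == want { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage on the active web server:
#   ss -ntul | listens_on 192.168.7.88:80 && echo "frontend is bound to the cluster IP"
```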


That’s it

As Edmunds said, this was just the bare minimum, but a good start anyway.

Some cluster commands useful in troubleshooting or management:

man pcs
pcs config
pcs node standby    # see "pcs node --help"
pcs node unstandby
pcs quorum status

See also: ClusterLabs

2 Comments

  1. Hi, can you help me please? I’m getting the error below while running:

    # pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.7.89 op monitor interval=5s --group zabbix_db_cluster

    Error: When using 'op' you must specify an operation name and at least one option

    1. Markku Leiniö

      What pcs version are you using? You can see my version in the beginning of the post.
