Edmunds Vesmanis gave a presentation at Zabbix Summit 2019 about Zabbix HA setups (video on YouTube), and he also wrote a post on the Zabbix blog titled High Availability cluster building with Zabbix for continued service: 3 database servers, 3 Zabbix servers and 3 Zabbix frontend servers. The configuration was based on RHEL or CentOS, and the commands are somewhat different in Debian 10 Buster. Thus, here is my version of Edmunds’s setup on Debian.
You still need to see the original post or the video (linked above) for more information about the background and the idea, as I’m going to mostly show just commands here. I also changed some bits here and there, such as using GTID for replication.
Versions used (everything from Debian repo except Zabbix):
- Debian 10.2 Buster
- MariaDB 10.3.18-0+deb10u1
- Corosync 3.0.1-2
- Pacemaker 2.0.1-5
- pcs 0.10.1-2
- Apache 2.4.38
- Zabbix 4.4.4 (from Zabbix repo)
IP addresses and hostnames to be used and appended to /etc/hosts on every server:
# VIPs
192.168.7.87 zabbix-ha-app
192.168.7.88 zabbix-ha-fe
192.168.7.89 zabbix-ha-db
# Front-end nodes
192.168.7.90 zabbix-ha-fe1
192.168.7.91 zabbix-ha-fe2
192.168.7.92 zabbix-ha-fe3
# Zabbix server nodes
192.168.7.93 zabbix-ha-srv1
192.168.7.94 zabbix-ha-srv2
192.168.7.95 zabbix-ha-srv3
# Database nodes
192.168.7.96 zabbix-ha-db1
192.168.7.97 zabbix-ha-db2
192.168.7.99 zabbix-ha-db3
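If you repeat the hosts file step on nine servers, it may help to make it safe to run more than once. The following is only a sketch of that idea, not part of Edmunds's setup: the function name add_host_entry and its optional third argument (the file path, used here so that the function can be tried on a scratch file) are my own assumptions.

```shell
# Sketch: add "IP NAME" to a hosts file only if NAME is not already listed.
# The optional third argument is a hypothetical helper parameter for trying
# the function on a scratch file; it defaults to /etc/hosts.
add_host_entry() (
    ip=$1
    name=$2
    file=${3:-/etc/hosts}
    # awk exits 0 if a non-comment line already has NAME as one of its hostnames.
    if ! awk -v n="$name" '
            $1 ~ /^#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
)
```

For example, running add_host_entry 192.168.7.89 zabbix-ha-db twice as root leaves a single entry for zabbix-ha-db in /etc/hosts.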
Setting up the database servers
On every database server:
sudo -i
vi /etc/hosts # set the hosts file as mentioned above
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
# Debian has a cluster configured already, ignore the config:
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig
apt install mariadb-server
systemctl stop mariadb
On every database server, create a configuration file /etc/mysql/mariadb.conf.d/90-zabbix.cnf:
[mysqld]
skip_name_resolve
bind_address = 0.0.0.0
log_slave_updates
max_binlog_size = 1G
expire_logs_days = 5
innodb_buffer_pool_size = 1G # 70-80% of total RAM
innodb_buffer_pool_instances = 1 # each instance should be at least 1GB
innodb_flush_log_at_trx_commit = 2 # default = 1
innodb_flush_method = O_DIRECT # default = fsync
innodb_io_capacity = 500 # HDD = 500-800, SSD = 2000
query_cache_size = 0
# Change the following values for each server accordingly!
log_basename = zabbix-ha-db1
log_bin = zabbix-ha-db1-bin
server_id = 96 # The last number of the server IP address
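The last three settings are the only ones that differ between the database servers, and they all follow from the hostname and the IP address. As a small sketch (the function name node_settings is hypothetical, not something MariaDB provides), they could be generated like this:

```shell
# Sketch: print the three per-server settings for one database node.
# Arguments: hostname and IPv4 address of the node.
node_settings() {
    host=$1
    ip=$2
    printf 'log_basename = %s\n' "$host"
    printf 'log_bin = %s-bin\n' "$host"
    # ${ip##*.} strips everything up to the last dot, leaving the last octet.
    printf 'server_id = %s\n' "${ip##*.}"
}

node_settings zabbix-ha-db2 192.168.7.97
# prints:
# log_basename = zabbix-ha-db2
# log_bin = zabbix-ha-db2-bin
# server_id = 97
```

The important point is only that server_id must be unique on every server in the replication ring; using the last octet of the IP address is one easy way to guarantee that in this address plan.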
Start MariaDB again on every server:
systemctl start mariadb
On the first server only:
pcs host auth zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_db_cluster zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.7.89 op monitor interval=5s --group zabbix_db_cluster
On the second and third server, enable the cluster services:
systemctl enable corosync pacemaker
Check the cluster status with pcs status command, output example:
root@zabbix-ha-db1:~# pcs status
Cluster name: zabbix_db_cluster
Stack: corosync
Current DC: zabbix-ha-db3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 19:03:44 2020
Last change: Sat Jan 11 18:48:10 2020 by root via cibadmin on zabbix-ha-db2
3 nodes configured
1 resource configured
Online: [ zabbix-ha-db1 zabbix-ha-db2 zabbix-ha-db3 ]
Full list of resources:
 Resource Group: zabbix_db_cluster
     virtual_ip (ocf::heartbeat:IPaddr2): Started zabbix-ha-db1
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Now let’s configure the database replication.
On the first server, start mysql, and enter commands:
stop slave;
grant replication slave on *.* to 'replicator'@'192.168.7.97' identified by 'Password456';
show global variables like 'gtid_current_pos';
Output example for the GTID position:
MariaDB [(none)]> show global variables like 'gtid_current_pos';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| gtid_current_pos | 0-96-1 |
+------------------+--------+
1 row in set (0.002 sec)
MariaDB [(none)]>
Make a note of the position (“0-96-1” in this example).
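A MariaDB GTID consists of three numbers separated by dashes: the replication domain ID, the server ID and the sequence number. So “0-96-1” is the first transaction in domain 0, originating from the server with server_id 96. The sketch below just splits such a value for readability; the function name gtid_fields is my own, and it assumes a single GTID (gtid_current_pos can also be a comma-separated list if several domains are in use).

```shell
# Sketch: split one MariaDB GTID (domain-server-sequence) into named fields.
# The body runs in a subshell, so the IFS change does not leak to the caller.
gtid_fields() (
    set -f        # no filename expansion while splitting
    IFS=-         # split the argument on dashes
    set -- $1
    printf 'domain_id=%s server_id=%s seq_no=%s\n' "$1" "$2" "$3"
)

gtid_fields 0-96-1
# prints: domain_id=0 server_id=96 seq_no=1
```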
On the second server (zabbix-ha-db2), start mysql, and enter commands:
stop slave;
set global gtid_slave_pos = '0-96-1'; # The GTID you noted earlier
change master to master_host='192.168.7.96', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
grant replication slave on *.* to 'replicator'@'192.168.7.99' identified by 'Password456';
reset master;
start slave;
show slave status\G
Output example for the slave status:
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.7.96
Master_User: replicator
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: zabbix-ha-db1-bin.000002
Read_Master_Log_Pos: 350
Relay_Log_File: zabbix-ha-db2-relay-bin.000002
Relay_Log_Pos: 657
Relay_Master_Log_File: zabbix-ha-db1-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 350
Relay_Log_Space: 974
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 96
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-96-1
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0
1 row in set (0.000 sec)
MariaDB [(none)]>
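The two lines to look for in that output are Slave_IO_Running and Slave_SQL_Running: both must say Yes. If you want to check this from a script, something like the following sketch could work. The function name replication_ok is hypothetical; it only reads the status text from standard input and does not talk to the database itself.

```shell
# Sketch: succeed only if both replication threads report "Yes".
# Reads the text output of "show slave status\G" on standard input.
replication_ok() {
    awk '
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Example with canned input; on a database server you would pipe in
# the output of: mysql -e 'show slave status\G'
printf 'Slave_IO_Running: Yes\nSlave_SQL_Running: Yes\n' | replication_ok \
    && echo "replication threads are running"
```

The exact comparison of the first field is deliberate, so that the similarly named Slave_SQL_Running_State line is not picked up by mistake.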
On the third server (zabbix-ha-db3), start mysql, and enter commands:
stop slave;
set global gtid_slave_pos = '0-96-1'; # The same as noted earlier
change master to master_host='192.168.7.97', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
grant replication slave on *.* to 'replicator'@'192.168.7.96' identified by 'Password456';
reset master;
start slave;
show slave status\G
Output example on zabbix-ha-db3:
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.7.97
Master_User: replicator
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: zabbix-ha-db2-bin.000001
Read_Master_Log_Pos: 336
Relay_Log_File: zabbix-ha-db3-relay-bin.000002
Relay_Log_Pos: 643
Relay_Master_Log_File: zabbix-ha-db2-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-96-1
...
On the first database server again (zabbix-ha-db1), to complete the ring replication, start mysql, and enter commands:
stop slave;
set global gtid_slave_pos = '0-96-1';
change master to master_host='192.168.7.99', master_user='replicator', master_password='Password456', master_use_gtid=slave_pos;
start slave;
show slave status\G
Output example on the first server:
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.7.99
Master_User: replicator
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: zabbix-ha-db3-bin.000001
Read_Master_Log_Pos: 336
Relay_Log_File: zabbix-ha-db1-relay-bin.000002
Relay_Log_Pos: 643
Relay_Master_Log_File: zabbix-ha-db3-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-96-1
...
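To summarize the ring: every database server replicates from the previous one, and the first server replicates from the last one. The following sketch (print_ring is just an illustrative name) prints that mapping for any ordered list of nodes, which can be handy for double-checking the master_host values used above.

```shell
# Sketch: print which node each ring member replicates from.
# Each node's master is the previous node in the argument list;
# the first node's master is the last node.
print_ring() {
    for last in "$@"; do :; done   # after the loop, $last holds the final argument
    prev=$last
    for node in "$@"; do
        printf '%s replicates from %s\n' "$node" "$prev"
        prev=$node
    done
}

print_ring 192.168.7.96 192.168.7.97 192.168.7.99
# prints:
# 192.168.7.96 replicates from 192.168.7.99
# 192.168.7.97 replicates from 192.168.7.96
# 192.168.7.99 replicates from 192.168.7.97
```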
Now, still on the first server, create the Zabbix database and user:
create database zabbix character set utf8 collate utf8_bin;
grant all privileges on zabbix.* to 'zabbix'@'192.168.7.%' identified by 'Password789';
quit;
We just created an empty Zabbix database, and we will import the schema later from the Zabbix application server.
Setting up the Zabbix servers
On every Zabbix server:
sudo -i
vi /etc/hosts # set the hosts file as mentioned in the start
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig
# This is the current release file, please check the latest in https://www.zabbix.com/download
wget https://repo.zabbix.com/zabbix/4.4/debian/pool/main/z/zabbix-release/zabbix-release_4.4-1+buster_all.deb
dpkg -i zabbix-release_4.4-1+buster_all.deb
apt update
# We skip MariaDB and snmpd (note the minus characters)
apt install zabbix-server-mysql mariadb-server-10.3- snmpd-
On every Zabbix server, edit /etc/zabbix/zabbix_server.conf:
SourceIP=192.168.7.87
DBHost=192.168.7.89
DBPassword=Password789
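If you prefer to script this edit instead of using an editor, here is one possible sketch. The helper set_conf_value is hypothetical (it is not a Zabbix tool): it replaces an existing active KEY=... line, or appends the setting if only a commented-out default exists. It assumes simple values without backslashes, because awk interprets escape sequences in -v assignments.

```shell
# Sketch: set KEY=VALUE in a simple "Key=Value" style configuration file.
# Arguments: file, key, value. Commented-out lines ("# Key=...") are left alone.
set_conf_value() (
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || exit 1
    awk -v k="$key" -v v="$value" '
        BEGIN { prefix = k "=" }
        index($0, prefix) == 1 { print prefix v; done = 1; next }  # replace active line
        { print }
        END { if (!done) print prefix v }                          # otherwise append
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Usage would be, for example, set_conf_value /etc/zabbix/zabbix_server.conf DBHost 192.168.7.89. The result is written back with cat rather than mv so that the owner and permissions of the original file are kept.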
On the first server only (zabbix-ha-srv1), prepare the Zabbix database. Enter the previously set zabbix user password (Password789) when asked:
zcat /usr/share/doc/zabbix-server-mysql*/create.sql.gz | mysql -h 192.168.7.96 -u zabbix -p zabbix
Note: If you want to implement Zabbix database table partitioning, this would be the time for that.
Going on, still on the first Zabbix server, set up the cluster:
pcs host auth zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_server_cluster zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip_server ocf:heartbeat:IPaddr2 ip=192.168.7.87 op monitor interval=5s --group zabbix_server_cluster
pcs resource create ZabbixServer systemd:zabbix-server op monitor interval=10s --group zabbix_server_cluster
pcs constraint colocation add virtual_ip_server with ZabbixServer
pcs constraint order virtual_ip_server then ZabbixServer
# To edit the start/stop timeouts we need to delete them first
pcs resource op delete ZabbixServer start
pcs resource op delete ZabbixServer stop
pcs resource op add ZabbixServer start interval=0s timeout=60s
pcs resource op add ZabbixServer stop interval=0s timeout=120s
Check the cluster status with pcs status command, output example:
root@zabbix-ha-srv1:~# pcs status
Cluster name: zabbix_server_cluster
Stack: corosync
Current DC: zabbix-ha-srv2 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 19:25:31 2020
Last change: Sat Jan 11 19:21:09 2020 by root via cibadmin on zabbix-ha-srv1
3 nodes configured
2 resources configured
Online: [ zabbix-ha-srv1 zabbix-ha-srv2 zabbix-ha-srv3 ]
Full list of resources:
 Resource Group: zabbix_server_cluster
     virtual_ip_server (ocf::heartbeat:IPaddr2): Started zabbix-ha-srv1
     ZabbixServer (systemd:zabbix-server): Started zabbix-ha-srv1
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Note that we didn’t explicitly enable the zabbix-server service, and it is not enabled by default. The clustering service will take care of starting the service on the active node.
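If you want to find out from a script which node is currently running a resource, you can pick it from the pcs status output. This is only a sketch that assumes the output format shown in this post (pcs 0.10.1 with Pacemaker 2.0.1); other versions may print the resource lines differently. The function name resource_node is my own.

```shell
# Sketch: print the node on which the named resource is "Started".
# Reads "pcs status" text on standard input; the resource name is the argument.
# Expected line format: NAME (AGENT): Started NODE
resource_node() {
    awk -v r="$1" '$1 == r && $3 == "Started" { print $4 }'
}

# Example with a line copied from the output above; on a cluster node
# you would run: pcs status | resource_node ZabbixServer
printf '     ZabbixServer (systemd:zabbix-server): Started zabbix-ha-srv1\n' \
    | resource_node ZabbixServer
# prints: zabbix-ha-srv1
```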
Setting up the frontend web servers
On every web server:
sudo -i
vi /etc/hosts # set the hosts file as mentioned in the start
apt install corosync pacemaker pcs
echo hacluster:Zabbix123 | chpasswd
mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig
# This is the current release file, please check the latest in https://www.zabbix.com/download
wget https://repo.zabbix.com/zabbix/4.4/debian/pool/main/z/zabbix-release/zabbix-release_4.4-1+buster_all.deb
dpkg -i zabbix-release_4.4-1+buster_all.deb
apt update
apt install zabbix-frontend-php zabbix-apache-conf apache2
systemctl stop apache2
systemctl disable apache2
On every web server, edit /etc/zabbix/apache.conf to set the time zone in the PHP7 settings, for example:
...
<IfModule mod_php7.c>
    ...
    php_value date.timezone Europe/Helsinki
</IfModule>
...
On every web server, create /etc/apache2/conf-available/serverstatus.conf:
Listen 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>
    <Location /server-status>
        SetHandler server-status
        Require local
    </Location>
</VirtualHost>
On every web server, activate the server status configuration:
a2enconf serverstatus
On every web server, edit /etc/apache2/ports.conf, change Listen 80 to include the cluster IP address:
Listen 192.168.7.88:80
On the first web server, let’s now configure Zabbix frontend:
systemctl start apache2
Using a browser, go to http://192.168.7.90/zabbix/, and configure the Zabbix frontend as requested. Remember to enter the database cluster IP address 192.168.7.89 when asked, and the Zabbix server cluster IP address 192.168.7.87.
When the Zabbix frontend has been successfully configured, copy the resulting configuration file /etc/zabbix/web/zabbix.conf.php to the second and third web servers.
On the first web server, stop Apache, and configure the cluster:
systemctl stop apache2
pcs host auth zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 -u hacluster -p Zabbix123
pcs cluster setup zabbix_fe_cluster zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 --force
pcs cluster start --all
systemctl enable corosync pacemaker
pcs property set stonith-enabled=false
pcs resource defaults resource-stickiness=100
pcs resource create virtual_ip_fe ocf:heartbeat:IPaddr2 ip=192.168.7.88 op monitor interval=5s --group zabbix_fe_cluster
pcs resource create zabbix_fe ocf:heartbeat:apache configfile=/etc/apache2/apache2.conf statusurl="http://localhost:8080/server-status" op monitor interval=30s --group zabbix_fe_cluster
pcs constraint colocation add virtual_ip_fe with zabbix_fe
pcs constraint order virtual_ip_fe then zabbix_fe
# To edit the start/stop timeouts we need to delete them first
pcs resource op delete zabbix_fe start
pcs resource op delete zabbix_fe stop
pcs resource op add zabbix_fe start interval=0s timeout=60s
pcs resource op add zabbix_fe stop interval=0s timeout=120s
Finally, on the second and third web servers, enable the cluster services:
systemctl enable corosync pacemaker
You can check the web cluster status with pcs status:
root@zabbix-ha-fe1:~# pcs status
Cluster name: zabbix_fe_cluster
Stack: corosync
Current DC: zabbix-ha-fe3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Sat Jan 11 20:26:42 2020
Last change: Sat Jan 11 19:33:19 2020 by root via cibadmin on zabbix-ha-fe1
3 nodes configured
2 resources configured
Online: [ zabbix-ha-fe1 zabbix-ha-fe2 zabbix-ha-fe3 ]
Full list of resources:
 Resource Group: zabbix_fe_cluster
     virtual_ip_fe (ocf::heartbeat:IPaddr2): Started zabbix-ha-fe1
     zabbix_fe (ocf::heartbeat:apache): Started zabbix-ha-fe1
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
You can also see which addresses and ports the active web server is listening to:
root@zabbix-ha-fe1:~# ss -ntul
Netid  State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
udp    UNCONN  0       0       192.168.7.90:5405   0.0.0.0:*
tcp    LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
tcp    LISTEN  0       128     127.0.0.1:8080      0.0.0.0:*
tcp    LISTEN  0       128     192.168.7.88:80     0.0.0.0:*
tcp    LISTEN  0       128     0.0.0.0:2224        0.0.0.0:*
tcp    LISTEN  0       128     [::]:22             [::]:*
tcp    LISTEN  0       128     [::]:2224           [::]:*
As you can see, the server status service (port 8080) is only listening on the localhost address 127.0.0.1, and the web server on port 80 is listening on the cluster IP address 192.168.7.88. It is left as an exercise for the reader to also enable IPv6 and/or TLS connectivity on the web server.
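The same check can be scripted. The sketch below assumes the column layout of ss -ntul as shown above (with the Netid column first, so the local address is the fifth field); the function name is_listening is hypothetical.

```shell
# Sketch: succeed if a TCP socket is listening on the given ADDRESS:PORT.
# Reads "ss -ntul" text on standard input.
# Columns: Netid State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
is_listening() {
    awk -v a="$1" '
        $1 == "tcp" && $2 == "LISTEN" && $5 == a { found = 1 }
        END { exit !found }
    '
}

# Example with a canned line; on the active web server you would run:
#   ss -ntul | is_listening 192.168.7.88:80 && echo "frontend is listening on the cluster IP"
printf 'tcp LISTEN 0 128 192.168.7.88:80 0.0.0.0:*\n' | is_listening 192.168.7.88:80 \
    && echo "frontend is listening on the cluster IP"
```

On the passive nodes the same check is expected to fail, because the cluster keeps Apache stopped there.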
That’s it
As Edmunds said, this was just the bare minimum, but a good start anyway.
Some cluster commands useful in troubleshooting or management:
man pcs
pcs config
pcs node standby # see "pcs node --help"
pcs node unstandby
pcs quorum status
See also: ClusterLabs
Hi, can you help me please? I’m getting the error below while running:
# pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.7.89 op monitor interval=5s –group zabbix_db_cluster
Error: When using ‘op’ you must specify an operation name and at least one option
What pcs version are you using? You can see my version in the beginning of the post.
Hi, I cannot start the apache server after including the cluster IP into the listening port
Listen 192.168.7.88:80
The apache start failed with the error:
(99)Cannot assign requested address: AH00072: make_sock: could not bind to address 192.168.7.88:80
no listening sockets available, shutting down
Could you help? thanks!
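Hi, error 99 (“Cannot assign requested address”) means that the address 192.168.7.88 is not currently assigned to any interface on that server, so Apache cannot bind to it. The cluster IP address exists only on the node where the virtual_ip_fe resource is running, so check with “pcs status” where it is active. If the error is “Address already in use” instead, something is already listening on port 80: make sure that Apache is really not running (= stop it completely), check that your Apache configuration files don’t have any other Listen statements with port 80 (for example “Listen *:80”), and start Apache again. If you don’t know which software is using port 80, use “sudo ss -ntlp” to show the listening ports and their processes.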
Hi, Can I create 9 servers HA in docker?
Thanks for keeping these instructions up. They have been quite helpful. I have everything set up, but still have the dreaded “MySQL server has gone away” message. pcs status is showing everything operational and I am able to connect using mysql-client from every server to the database with the zabbix user. I have spent the past 4 hours reading every article on the Zabbix forums, and even reviewing everything in the MySQL documentation. I even went so far as to rebuild servers, but I always get to the same spot. FML.
But the error is all mine somewhere, and I will eventually figure it out. I at least have it working properly at the office. My only issue is my home setup.
Love the great instructions you provided.