
A problem with archiving
During a recent client issue, the customer reported to me as TAM that their database archiving was not working as expeted.
The customer could see the build up of WAL files in the pg_wal folder, and the archive errors listed in the Fujitsu Enterprise Postgres log file. The archive_command was using pgBackRest with archive-push. The pgBackRest repo-type was using an s3 interface to write to an IBM Spectrum Protect Storage Container (formerly known as Tivoli Storage Manager aka TSM).
The customer storage solution was considered highly specialized infrastructure, with a team of highly experienced engineers operating it, and who were reporting it as healthy, so they felt confident that any issues must be on the database side.
The errors seen in the database log file were:
archive command failed with exit code 42
and
ERROR: [042]: raised from local-1 protocol: timeout after 60000ms waiting for read from X.X.X.X:XXXX
The issue was that the WAL archive process was constantly timing out, and asking the customer to run the below SQL showed that the failed count was in high 1000s.SELECT pg_walfile_name(pg_current_wal_lsn()),last_archived_wal, last_archived_time::timestamp(0),
now()::timestamp(0) ,last_failed_wal, last_failed_time::timestamp(0),failed_count,
('x'||substring(pg_walfile_name(pg_current_wal_lsn()),9,8))::bit(32)::int*256 +
('x'||substring(pg_walfile_name(pg_current_wal_lsn()),17))::bit(32)::int -
('x'||substring(last_archived_wal,9,8))::bit(32)::int*256 -
('x'||substring(last_archived_wal,17))::bit(32)::int as lag
FROM pg_stat_archiver;
pg_walfile_name | last_archived_wal | last_archived_time | now | last_failed_wal | last_failed_time | failed# | lag
-------------------------+--------------------------+---------------------+---------------------+--------------------------+--------------------+---------+----
000000010000078200000067 | 000000010000078100000080 | 2024-01-22 13:36:56 | 2024-01-22 14:05:50 | 000000010000078100000081 |2024-01-22 14:06:39 | 8096 | 231
Next step: Tweaking pgBackRest configuration
To shed more light, I recommended increasing both the pgBackRest archive-timeout and the io- timeout options (both default to 60 seconds) to higher but different values, after which we were able to see a different error.
SONAR_AUDIT=1;TIMESTAMP="2024-01-23 08:29:50.839 CET";APPNAME="";USER="";DB="";
CLIENTHOSTPORT="";ERR="00000";SESSIONID="65af43e7.339e8";
SESSIONSTART="2024-01-23 05:43:19 CET";PID="211432";VTRANSID="";TRANSID="0";
LOG: archive command failed with exit code 101 (11263) SONAR_AUDIT=1;TIMESTAMP="2024-01-23 08:29:50.839 CET";APPNAME="";USER="";DB="";
CLIENTHOSTPORT="";ERR="00000";SESSIONID="65af43e7.339e8";
SESSIONSTART="2024-01-23 05:43:19 CET";PID="211432";VTRANSID="";TRANSID="0";
DETAIL: The failed archive command was: pgbackrest --stanza=stanza_prodd7dx_20104
--config=/var/lib/postgres_prodd7dx_20104/postgres_backup/pgbackrest/pgbackrest.conf
archive-push pg_wal/0000000100000781000000E3 (14792) pgbackrest: /opt/fsepv12server64/lib/libpq.so.5: no version information available (required by pgbackrest)
ERROR: [101]: raised from local-1 protocol: [507] Insufficient Storage
Although insufficient storage usually means that the storage volume is full, it can also indicate that storage is offline or unavailable. However, the information above gave us evidence that something was not right with the TSM storage, and we were able to request the TSM team to take another look.
So, what was the issue?
The storage team raised a ticket with the vendor, who advised that the tool they were using to calculate utilization was flawed and reporting significantly more free space than was there - they then recommended another tool that showed the TSM storage was in fact nearly full. Given that, additional storage was added, and the archiving flowed normally.
So after all, Fujitsu Enterprise Postgres archiving was operating as expected, which did not surprise me, considering how robust it is.
If you want to use pgBackRest on the same server as the database server, refer to Installation and Setup Guide for Server > Setup > pgBackRest
If you want to use pgBackRest on a different server, refer to Installation and Setup Guide for Client > Setup > pgBackRest.