Using pg_squeeze
Use `pg_squeeze` to remove unused space from a table and optionally sort tuples according to a particular index.
Understanding lock levels while using pg_squeeze
While using `pg_squeeze`, the extension performs a series of actions, and some of these actions place a lock on the squeezed table, as described in the following steps. The table being squeezed is available for both read and write operations by other transactions most of the time.
When squeezing a table, the extension:
1. Creates a transient table and copies the contents of the source table using the snapshot. `pg_squeeze` holds an `ACCESS SHARE` lock on the source table during this step.
2. Builds indexes on the transient table. `pg_squeeze` holds an `ACCESS SHARE` lock on the source table during this step.
3. Decodes and applies the concurrent changes that occurred on the source table while the initial load was in progress.
4. Applies any concurrent changes that take place while waiting for the lock and swaps the storage of the transient table and the source table. `pg_squeeze` holds an `ACCESS EXCLUSIVE` lock for a short period during this step.
5. Drops the transient table.
Registering a table for regular processing
Before using `pg_squeeze`, ensure that your table has either a primary key or a replica identity. This is necessary to process changes resulting from other transactions while `pg_squeeze` is doing its work.
To make the `pg_squeeze` extension aware of the table, you must insert a record into the `squeeze.tables` table. After it is added, table statistics are checked periodically. When the table meets the necessary criteria to be squeezed, a task is added to a queue. The tasks are processed sequentially, in the order they are created.
Here is an example of a simple registration:
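(The table name and schedule values below are illustrative; the schedule shown means 22:30 every Wednesday and Friday.)

```sql
-- Register public.foo with the default options; only the schema name,
-- table name, and schedule are given.
INSERT INTO squeeze.tables (tabschema, tabname, schedule)
VALUES ('public', 'foo', ('{30}', '{22}', NULL, NULL, '{3, 5}'));
```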
You can optionally specify additional columns, for example:
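(Again a sketch with illustrative values; this is the registration that the schedule discussion below refers to as `public.bar`.)

```sql
-- Register public.bar and override some defaults. The schedule again means
-- "22:30 every Wednesday (3) and Friday (5)".
INSERT INTO squeeze.tables
       (tabschema, tabname, schedule, free_space_extra, vacuum_max_age, max_retry)
VALUES ('public',
        'bar',
        ('{30}', '{22}', NULL, NULL, '{3, 5}'),
        30,          -- trigger at 30% extra free space instead of the default 50
        '2 hours',   -- consider the FSM stale after 2 hours instead of 1
        2);          -- retry a failed task up to 2 times
```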
The following list describes the table metadata:
- `tabschema` — Schema name.
- `tabname` — Table name.
- `schedule` — Specifies when the table should be checked and possibly squeezed. The schedule is described by a value of a composite data type that resembles a crontab entry (a sketch of the type follows this list). In this data type, `minutes` (0-59) and `hours` (0-23) determine the time when the check occurs within a day, while `days_of_month` (1-31), `months` (1-12), and `days_of_week` (0-7, where both 0 and 7 stand for Sunday) determine the day of the check. The check is performed if `minutes`, `hours`, and `months` all match the current timestamp. `NULL` means any minute, hour, and month, respectively. Either `days_of_month` or `days_of_week` must match the current timestamp, or both must be `NULL`, for the check to take place. For example, the entries in the sample registration specify to check the table `public.bar` every Wednesday and Friday at 22:30.
- `free_space_extra` — Minimum percentage of extra free space needed to trigger processing of the table. The "extra" adjective refers to the fact that free space derived from `fillfactor` isn't a reason to squeeze the table. For example, if `fillfactor` equals 60, then at least 40 percent of each page stays free during normal operation. If you want `pg_squeeze` to become interested in the table only when at least 70 percent of it is free, set `free_space_extra` to 30 (that is, the 70 percent required to be free minus the 40 percent free due to the `fillfactor`). The default value of `free_space_extra` is 50.
- `min_size` — Minimum disk space in megabytes that the table must occupy to be eligible for processing. The default value is 8.
- `vacuum_max_age` — Maximum time since the completion of the last `VACUUM` to consider the free space map (FSM) fresh. After this interval has elapsed, the portion of dead tuples might be significant, so evaluating the potential effect of `pg_squeeze` takes more effort than just checking the FSM. The default value is 1 hour.
- `max_retry` — Maximum number of extra attempts to squeeze a table if the first processing of the corresponding task fails. The typical reason to retry processing is that the table definition changed while the table was being squeezed. If the number of retries is reached, processing of the table is considered complete. The next task is created at the next scheduled time. The default value of `max_retry` is 0 (that is, don't retry).
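For reference, the shape of the schedule type looks roughly like the following sketch. The extension defines the real type itself (in the `squeeze` schema, with its own element types), so this is only an illustration of the field order used in the registration examples above:

```sql
-- Illustration only: pg_squeeze already ships this composite type.
-- Each field is an array, and NULL means "any".
CREATE TYPE schedule AS (
    minutes       smallint[],  -- 0-59
    hours         smallint[],  -- 0-23
    days_of_month smallint[],  -- 1-31
    months        smallint[],  -- 1-12
    days_of_week  smallint[]   -- 0-7, where both 0 and 7 stand for Sunday
);
```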
Note
The `squeeze.tables` table is the only table you should modify. If you want to change anything else, make sure you understand what you're doing.
Ad hoc processing for any table
It's possible to squeeze tables manually without registering them (that is, without inserting the corresponding record into `squeeze.tables`) and without prior checking of the actual bloat.
Function signature:
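A sketch of the signature follows; check `\df squeeze.squeeze_table` in your installation for the exact parameter names and defaults:

```sql
squeeze.squeeze_table(
    tabschema        name,    -- schema of the table to squeeze
    tabname          name,    -- name of the table to squeeze
    clustering_index name,    -- pass NULL to leave tuples unsorted
    rel_tablespace   name,    -- pass NULL to keep the table's tablespace
    ind_tablespaces  name[]   -- pass NULL to keep the index tablespaces
)
```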
The following list describes the function arguments beyond the schema and table name:
- `clustering_index` — Index of the processed table. After processing finishes, tuples of the table are physically sorted by the key of this index.
- `rel_tablespace` — Existing tablespace into which to move the table. `NULL` means to leave the table where it is.
- `ind_tablespaces` — Two-dimensional array in which each row specifies the tablespace mapping of an index. The first and second columns represent the index name and tablespace name, respectively. All indexes for which no mapping is specified remain in the original tablespace.

Note

If a tablespace is specified for the table but not for the indexes, the table gets moved to that tablespace, but the indexes remain in the original tablespace. In other words, the tablespace of the table isn't the default for its indexes.
Sample execution
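For example, to squeeze `public.bar` immediately and sort it by an index (the table and index names are illustrative):

```sql
-- Squeeze public.bar right away, sorting tuples by bar_pkey and leaving
-- the table and its indexes in their current tablespaces.
SELECT squeeze.squeeze_table('public', 'bar', 'bar_pkey', NULL, NULL);
```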
Enabling or disabling table processing
To enable processing of bloated tables, run this statement as superuser:
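(A sketch of the call; the worker-control functions are provided by the extension in the `squeeze` schema.)

```sql
-- Start the scheduler worker for the current database.
SELECT squeeze.start_worker();
```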
The function starts a background worker (the scheduler worker) that periodically determines which of the registered tables should be checked for bloat and creates a task for each. Another worker (the squeeze worker) is launched whenever a task exists for a particular database.
If the scheduler worker is already running for the current database, the function doesn't report any error, and the new worker exits immediately.
If the workers are running for the current database, you can use the following statement to stop them:
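(Again a sketch; this is the counterpart to the start function above.)

```sql
-- Stop the scheduler worker and any squeeze workers for the current database.
SELECT squeeze.stop_worker();
```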
Note
Only the functions mentioned in this documentation are considered part of the user interface. If you want to call any other function, make sure you understand what you're doing.
If you want the background workers to start during startup of the whole PostgreSQL cluster, add entries like the following to the `postgresql.conf` file:
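(A sketch of such entries; the database names are placeholders, and `squeeze.worker_role` must name a role with superuser privileges.)

```
# Load the extension's library and autostart workers for these databases.
shared_preload_libraries = 'pg_squeeze'
squeeze.worker_autostart = 'my_database your_database'
squeeze.worker_role = 'postgres'
```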
The next time you start the cluster, two or more workers (that is, one scheduler worker and one or more squeeze workers) are launched for `my_database`, and the same happens for `your_database`. If you take this approach, a worker doesn't start (or stops immediately without doing any work) if either:

- The `pg_squeeze` extension doesn't exist in the database.
- The `squeeze.worker_role` parameter specifies a role that doesn't have superuser privileges.
Although there are actually two kinds of workers, the functions and configuration variables described here use the singular form of the word worker. This is because only one worker existed in previous versions of `pg_squeeze`, which handled both scheduling and execution of the tasks. Keeping the singular form means this implementation change doesn't force users to adjust their configuration files during an upgrade.
Controlling impact on other backends
Although the table being squeezed is available for both read and write operations by other transactions most of the time, an exclusive lock is needed to finalize processing. If `pg_squeeze` occasionally seems to block access to tables, consider setting the `squeeze.max_xlock_time` GUC parameter. For example:
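(A sketch; the value is in milliseconds, matching the explanation that follows.)

```sql
-- Don't hold the final ACCESS EXCLUSIVE lock for more than 100 ms.
SET squeeze.max_xlock_time TO 100;
```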
This example specifies not to hold the exclusive lock for more than 0.1 second (100 milliseconds). If more time is needed for the final stage, `pg_squeeze` releases the exclusive lock, processes changes committed by other transactions in between, and tries the final stage again. An error is reported if the lock duration is exceeded a few more times. If that happens, either increase the setting or schedule processing of the problematic table for a different time, when write activity is lower.
Running multiple workers per database
If you think that a single squeeze worker doesn't cope with the load, consider setting the `squeeze.workers_per_database` configuration variable to a value higher than 1. Then the `pg_squeeze` extension can process multiple tables simultaneously: one table per squeeze worker.
However, be aware that this setting affects all databases in which you actively use the `pg_squeeze` extension. The total number of all the squeeze workers in the cluster (including the scheduler workers) can't exceed the in-core configuration variable `max_worker_processes`.
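A sketch of the setting in `postgresql.conf` (the value is illustrative):

```
# Allow up to 4 squeeze workers per database.
squeeze.workers_per_database = 4
```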
Monitoring
The `squeeze.log` table contains one entry per successfully squeezed table.
The columns `tabschema` and `tabname` identify the processed table. The columns `started` and `finished` report when the processing started and finished. `ins_initial` is the number of tuples inserted into the new table storage during the initial load stage, that is, the number of tuples present in the table before the processing started. On the other hand, `ins`, `upd`, and `del` are the numbers of tuples inserted, updated, and deleted by applications during the table processing. (These concurrent data changes must also be incorporated into the squeezed table. Otherwise they'd get lost.)
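For example, a quick look at the most recent operations (a sketch using the columns described above):

```sql
-- Most recently finished squeeze operations first.
SELECT tabschema, tabname, started, finished, ins_initial, ins, upd, del
FROM squeeze.log
ORDER BY finished DESC
LIMIT 10;
```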
The `squeeze.errors` table contains errors that happen during squeezing. A common problem is that someone changes the definition of the table whose processing is in progress (for example, by adding or removing a column).
The `squeeze.get_active_workers()` function returns a table of squeeze workers that are processing tables in the current database.
The `pid` column contains the system PID of the worker process. The other columns have the same meaning as their counterparts in the `squeeze.log` table. While the `squeeze.log` table shows information only on the completed squeeze operations, the `squeeze.get_active_workers()` function lets you check the progress during processing.
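For example:

```sql
-- Check the progress of the squeeze workers in the current database.
SELECT * FROM squeeze.get_active_workers();
```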
Unregistering a table
If a particular table is no longer subject to periodic squeezes, delete the corresponding row from the `squeeze.tables` table.
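For example (the table name is illustrative):

```sql
-- Stop periodic processing of public.bar.
DELETE FROM squeeze.tables
WHERE tabschema = 'public' AND tabname = 'bar';
```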
It's also a good practice to unregister a table that you're going to drop, although the background worker does unregister non-existing tables periodically.
Concurrency
The `pg_squeeze` extension doesn't prevent other transactions from altering a table at certain stages of the processing. If a disruptive command (that is, `ALTER TABLE`, `VACUUM FULL`, `CLUSTER`, or `TRUNCATE`) manages to commit before the squeeze finishes, the `squeeze_table()` function aborts and all changes made to the table are rolled back. The `max_retry` column of the `squeeze.tables` table determines how many times the squeeze worker retries. Changing your schedule might help to avoid these kinds of disruptions.
Like `pg_repack`, `pg_squeeze` also changes the visibility of rows and thus allows for the MVCC-unsafe behavior described in the first paragraph of Caveats for MVCC.
Disk space requirements
Performing a full-table squeeze requires twice as much free disk space as the target table and its indexes. For example, if the total size of the tables and indexes to be squeezed is 1GB, you need an additional 2GB of disk space.