PostgreSQL Reliability and Consistency: A Code Analysis

Introduction:

PostgreSQL's data reliability is built on the XLOG (write-ahead log): before any modified data block is written to disk, the REDO generated by the change must first be written to the XLOG, and that XLOG must be flushed to disk.
The flow is as follows:
1. Read the block to be modified from its data file into the shared buffer.
2. Modify the block's contents in the shared buffer.
3. Write the block's change into the XLOG; if this is the block's first modification since the last checkpoint, write a full page image (whether full pages are written is controlled by the full_page_writes parameter).
4. Before the bgwriter writes a dirty block from the shared buffer to the OS page cache, it ensures, via the dirty block's LSN, that the corresponding XLOG has already been flushed to disk (see the sketch after this list).
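
To make the ordering in step 4 concrete, here is a minimal, self-contained sketch of the WAL-before-data rule. All names (wal_flush, write_dirty_page, the Page struct) are illustrative stand-ins, not the actual PostgreSQL API:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct Page
{
    XLogRecPtr  lsn;            /* WAL position of the last change to this page */
    char        data[8192];
} Page;

static XLogRecPtr flushed_lsn = 0;  /* WAL known to be durable up to here */

/* Stand-in for XLogFlush(): make WAL durable through 'upto'. */
static void
wal_flush(XLogRecPtr upto)
{
    if (upto > flushed_lsn)
        flushed_lsn = upto;         /* the real code would write and fsync WAL here */
}

/* Stand-in for FlushBuffer(): the WAL-before-data rule in one place. */
static void
write_dirty_page(Page *page)
{
    wal_flush(page->lsn);           /* 1. the WAL describing the change goes first */
    /* 2. only now may the data block be handed to the kernel */
    printf("page written; WAL durable through %llu\n",
           (unsigned long long) flushed_lsn);
}

int
main(void)
{
    Page p = { .lsn = 42 };         /* pretend LSN 42 last touched this page */
    write_dirty_page(&p);
    return 0;
}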
So the question is: what happens if a user enables asynchronous commit, i.e. synchronous_commit=off?
Still nothing goes wrong, because step 4 always guarantees that the XLOG describing a dirty page reaches disk before the page itself does.
Turning on synchronous_commit=off can therefore only lose the most recently written XLOG (and the transactions it commits); it can never cause data inconsistency.
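
For reference, asynchronous commit can be switched on per session without risking consistency. A minimal libpq sketch, assuming a local connection string and a placeholder table t:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* the connection string is an assumption; adjust for your environment */
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* trade durability of the most recent commits for lower commit latency;
     * data-file consistency is still guaranteed by the WAL rule above */
    PGresult *res = PQexec(conn, "SET synchronous_commit = off");
    PQclear(res);

    res = PQexec(conn, "INSERT INTO t VALUES (1)");   /* t is a placeholder */
    PQclear(res);

    PQfinish(conn);
    return 0;
}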
The code that ensures reliability and consistency is shown below; the call chain is BackgroundWriterMain -> BgBufferSync -> SyncOneBuffer -> FlushBuffer -> XLogFlush -> XLogWrite.


/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
...
        /*
         * Do one cycle of dirty-buffer writing.
         */
        can_hibernate = BgBufferSync();

...



/*
 * BgBufferSync -- Write out some dirty buffers in the pool.
 *
 * This is called periodically by the background writer process.
 *
 * Returns true if it's appropriate for the bgwriter process to go into
 * low-power hibernation mode.  (This happens if the strategy clock sweep
 * has been "lapped" and no buffer allocations have occurred recently,
 * or if the bgwriter has been effectively disabled by setting
 * bgwriter_lru_maxpages to 0.)
 */
bool
BgBufferSync(void)
{
...

    /* Execute the LRU scan */
    while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
    {
        int            buffer_state = SyncOneBuffer(next_to_clean, true);

...


/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *    BUF_WRITTEN: we wrote the buffer.
 *    BUF_REUSABLE: buffer is available for replacement, ie, it has
 *        pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{

...

    FlushBuffer(bufHdr, NULL);
...


/*
 * FlushBuffer
 *        Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{

...

    /*
     * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
     * rule that log updates must hit disk before any of the data-file changes
     * they describe do.
     *
     * However, this rule does not apply to unlogged relations, which will be
     * lost after a crash anyway.  Most unlogged relation pages do not bear
     * LSNs since we never emit WAL records for them, and therefore flushing
     * up through the buffer LSN would be useless, but harmless.  However,
     * GiST indexes use LSNs internally to track page-splits, and therefore
     * unlogged GiST pages bear "fake" LSNs generated by
     * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
     * LSN counter could advance past the WAL insertion point; and if it did
     * happen, attempting to flush WAL through that location would fail, with
     * disastrous system-wide consequences.  To make sure that can't happen,
     * skip the flush if the buffer isn't permanent.
     */
    if (buf->flags & BM_PERMANENT)
        XLogFlush(recptr);

...




/*
 * Ensure that all XLOG data through the given position is flushed to disk.
 *
 * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
 * already held, and we try to avoid acquiring it if possible.
 */
void
XLogFlush(XLogRecPtr record)
{
    XLogRecPtr    WriteRqstPtr;
    XLogwrtRqst WriteRqst;

...
        XLogWrite(WriteRqst, false);

...



/*
 * Write and/or fsync the log at least as far as WriteRqst indicates.
 *
 * If flexible == TRUE, we don't have to write as far as WriteRqst, but
 * may stop at any convenient boundary (such as a cache or logfile boundary).
 * This option allows us to avoid uselessly issuing multiple writes when a
 * single one would do.
 *
 * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
 * must be called before grabbing the lock, to make sure the data is ready to
 * write.
 */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{

...
    /*
     * If asked to flush, do so
     */
    if (LogwrtResult.Flush < WriteRqst.Flush &&
        LogwrtResult.Flush < LogwrtResult.Write)
    {
        /*
         * Could get here without iterating above loop, in which case we might
         * have no open file or the wrong one.  However, we do not need to
         * fsync more than one file.
         */
        if (sync_method != SYNC_METHOD_OPEN &&
            sync_method != SYNC_METHOD_OPEN_DSYNC)
        {
            if (openLogFile >= 0 &&
                !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
                XLogFileClose();
            if (openLogFile < 0)
            {
                XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
                openLogFile = XLogFileOpen(openLogSegNo);
                openLogOff = 0;
            }

            issue_xlog_fsync(openLogFile, openLogSegNo);
        }

        /* signal that we need to wakeup walsenders later */
        WalSndWakeupRequest();

        LogwrtResult.Flush = LogwrtResult.Write;
    }
...

The asynchronous-commit path, in RecordTransactionCommit, is as follows:

/*
     * Check if we want to commit asynchronously.  We can allow the XLOG flush
     * to happen asynchronously if synchronous_commit=off, or if the current
     * transaction has not performed any WAL-logged operation or didn't assign
     * a xid.  The transaction can end up not writing any WAL, even if it has
     * a xid, if it only wrote to temporary and/or unlogged tables.  It can
     * end up having written WAL without an xid if it did HOT pruning.  In
     * case of a crash, the loss of such a transaction will be irrelevant;
     * temp tables will be lost anyway, unlogged tables will be truncated and
     * HOT pruning will be done again later. (Given the foregoing, you might
     * think that it would be unnecessary to emit the XLOG record at all in
     * this case, but we don't currently try to do that.  It would certainly
     * cause problems at least in Hot Standby mode, where the
     * KnownAssignedXids machinery requires tracking every XID assignment.  It
     * might be OK to skip it only when wal_level < hot_standby, but for now
     * we don't.)
     *
     * However, if we're doing cleanup of any non-temp rels or committing any
     * command that wanted to force sync commit, then we must flush XLOG
     * immediately.  (We must not allow asynchronous commit if there are any
     * non-temp tables to be deleted, because we might delete the files before
     * the COMMIT record is flushed to disk.  We do allow asynchronous commit
     * if all to-be-deleted tables are temporary though, since they are lost
     * anyway if we crash.)
     */
    if ((wrote_xlog && markXidCommitted &&
         synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
        forceSyncCommit || nrels > 0)
    {
        XLogFlush(XactLastRecEnd);

        /*
         * Now we may update the CLOG, if we wrote a COMMIT record above
         */
        if (markXidCommitted)
            TransactionIdCommitTree(xid, nchildren, children);
    }
    else
    {
        /*
         * Asynchronous commit case:
         *
         * This enables possible committed transaction loss in the case of a
         * postmaster crash because WAL buffers are left unwritten. Ideally we
         * could issue the WAL write without the fsync, but some
         * wal_sync_methods do not allow separate write/fsync.
         *
         * Report the latest async commit LSN, so that the WAL writer knows to
         * flush this commit.
         */
        XLogSetAsyncXactLSN(XactLastRecEnd);

        /*
         * We must not immediately update the CLOG, since we didn't flush the
         * XLOG. Instead, we store the LSN up to which the XLOG must be
         * flushed before the CLOG may be updated.
         */
        if (markXidCommitted)
            TransactionIdAsyncCommitTree(xid, nchildren, children, XactLastRecEnd);
    }
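
Note the design choice in the else branch: because the commit record may not be durable yet, TransactionIdAsyncCommitTree stores the LSN up to which the XLOG must be flushed before the CLOG may be updated, i.e. before the CLOG page carrying this xid's status may be written out. That is the same LSN-gating idea applied to CLOG pages. A toy model of that gate, with illustrative names only (not the real PostgreSQL API):

#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Toy model of the asynchronous-commit CLOG gate. */
typedef struct ClogPage
{
    XLogRecPtr  group_lsn;      /* WAL must be flushed past here before writeout */
} ClogPage;

static XLogRecPtr flushed_lsn = 0;  /* WAL durable up to here */

/* Stand-in for XLogFlush(). */
static void
wal_flush(XLogRecPtr upto)
{
    if (upto > flushed_lsn)
        flushed_lsn = upto;         /* the real code would write and fsync WAL */
}

/* Async commit: remember the commit record's LSN on the CLOG page ... */
static void
async_commit(ClogPage *page, XLogRecPtr commit_lsn)
{
    if (commit_lsn > page->group_lsn)
        page->group_lsn = commit_lsn;
}

/* ... and enforce the gate when the CLOG page itself is written out. */
static void
write_clog_page(ClogPage *page)
{
    wal_flush(page->group_lsn);     /* WAL first, the same rule as for data pages */
    /* only now may the CLOG page be written to disk */
}

int
main(void)
{
    ClogPage page = { .group_lsn = 0 };
    async_commit(&page, 42);        /* commit record at LSN 42, WAL not yet flushed */
    write_clog_page(&page);         /* forces the WAL flush through 42 first */
    return 0;
}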