PostgreSQL B+树索引---并发控制

编程知识更新时间:2023-05-03 04:05:15

B+树索引—并发控制

预备知识

《PostgreSQL B+树索引—基本结构》

《PostgreSQL B+树索引—查询》

《PostgreSQL B+树索引—插入》

《PostgreSQL B+树索引—分裂》

概述

最后，我们来讨论B+树的并发控制，准确来说应该是B-Link树的并发控制。并发控制是B-Link的精髓，B-Link树是在B+树的基础上进行了改造，而改造的目的就是为了提升并发性。在《B+树索引—基本结构》中，我们已经简要介绍过了B+树索引并发性的核心思想。基于这个核心思想派生出了许多精妙的实现，在本文中，我们来一一阐述。

分裂锁

基本

首先，我们来看看并发控制中锁最多的操作：分裂。明白了分裂的并发控制，才能更好的阐述其他操作的并发控制。

图1

当前B+树的结构如图1所示，现在我们向节点1中插入一个元素，由于要向节点1插入元素，所以需要对节点1加X锁：

图2

插入导致节点1分裂，由于节点1发生分裂，所以X锁需要继续保持。现在我们需要新分配一个节点：节点4，用于迁移数据，所以节点4也需要加互斥锁：

图3

接着，我们将节点1中分裂点之后的数据迁移到节点4。迁移完成后需要将节点4加入链表，由于加入链表需要修改节点2的前驱，所以需要对节点2加互斥锁：

图4

注意，现在出现了第一个关键动作：持有节点1的锁，然后给节点1的右节点（节点2）加锁，即从左向右加锁。

节点4加入链表后，就可以释放节点2的互斥锁了：

图5

现在，需要将节点4的<minkey, blockno>向上写入父节点（节点3），所以需要给节点3加互斥锁。

图6

注意，现在出现了第二个关键动作：持有节点1和节点4的锁，然后给节1和4的父节点（节点3）加锁，即从下至上加锁。

向上写入完成后，整个分裂结束，释放掉所有的锁。

关键点

整个分裂流程有三个关键点：

在整个分裂过程中，同一时间最多只有三个节点持有锁。所以B+树的并发性是非常高的。
在分裂过程中，存在从左向右加锁的动作。
在分裂过程中，存在从下向上加锁的动作。

从左向右的加锁动作，意味着B+树不能再有从右向左的加锁动作，否则会产生死锁；同样从下向上的加锁动作，意味着B+树不能再有从上向下的加锁动作，否则也会产生死锁。

定位锁

定位流程

下面，我们来看看B+树的定位锁，在查询时，我们需要定位查询的开始位置，在插入时我们需要定位插入位置。定位过程，是从树的根节点出发，向下搜索的过程，是一个从上至下的遍历过程。在遍历过程中，我们需要读取节点中的数据，读取节点就意味着需要给节点加共享锁。前面我们说过，B+树中不能再有从上至下的加锁过程。所以每当需要读取下级节点时，都需要先将当前节点解锁，再给下级节点加共享锁。整个定位过程同一时间，只有一个节点持有锁。

图7

问题

这样的并发控制流程会带来一个问题，如图8所示：

图8

图8，展示了B+树分裂的中间状态，有进程Process A，向B+树中执行了插入操作，导致block1发生了分裂，但还未将block4进行向上写入。此时block1、block4、block2都有互斥锁，而block3没有锁。现在有另一个进程Process B向B+树中插入6。Process B会遍历B+树定位插入位置，于是执行了以下流程：

S Lock block3
二分法，找到6应该写入block1（因为6比8小）
UnLock block3
Lock block1

在Lock block1时，发现block1上的X锁，于是Process B进入等待。而此时Process A对block3加上X锁，执行向上插入，完成分裂，然后解锁block1，block4，block3，并唤醒Process B。Process B成功锁住block1。但显然6不应该插入block1，而应该插入block4。

解决方案

为了应对这样的情况，PostgreSQL提供了一个叫做_bt_moveright的函数，来实现一个右移的逻辑，该函数实现了这样的一个功能：

将scankey（当前为6），与节点的high key进行比较。
如果scankey > high key，就移动到当前节点的右兄弟。

而移动的方式是先Unlock当前节点，再Lock右节点。

遍历锁

回顾一下查询流程，我们利用索引查询时，分为定位和遍历两个阶段。定位到开始位置后，就需要进行遍历，直到满足遍历终止条件。遍历的目的是找到合法元组，并放入结果集。所以遍历过程又可以分为以下步骤：

执行step1：从indextuple中获取tid。
执行step2：根据tid获取元组。
执行step3：判断元组可见性。
执行step4：判断元组合法性。
执行step5：将合法元组加入结果集。
执行step6：获取下一个indextuple。
执行step7：执行step1。

这里，重要的不是步骤本身，而是通过这些步骤我们不难看出，取出一个indextuple之后，还需要干很多事。所以为了提高并发性。PostgreSQL总是一次性获取当前节点中所有满足索引条件的indextuple，存放到当前进程的私有内存中，然后解锁当前节点。后续对indextuple的遍历都是在私有内存中进行的。

从遍历方向上说，PostgreSQL存在向前遍历和向后遍历（向右遍历和向左遍历）。前面说过，分裂阶段存在从左向右上锁。所以在向左遍历时不能存在从右向左上锁。所以，不论是向前遍历还是向后遍历，在访问下一个节点之前，总是需要先解锁当前节点，这和_bt_moveright的逻辑一致。

所以遍历阶段可以归纳为下面几个步骤：

step1：获取当前节点的所有indextuple，存放到进程私有内存。
step2：UnLock当前节点。
step3：遍历私有内存的indextuple，判断可见性、合法性，加入结果集。
step4：Lock下一个节点。

根据遍历方向，下一个节点可能是右节点或左节点。这样的并发控制，使向前遍历和向后遍历有非常大的不同，下面我们来分别阐述。

向前遍历

向前遍历，即向右遍历。在上述并发控制下，向右遍历会有这样一个问题：

图9

B+树的当前状态如图9所示，当前进程正在访问节点1，节点1上有共享锁，当前进程获取了节点1的所有indextuple之后，会Unlock节点1。然后当前进程会做一些列其他事情（遍历所有indextuple，判断可见性等等），当这些事情做完之后，当前进程就需要访问下一个节点，也就是节点2。然而在当前进程做其他事情的时候，有另外的进程对节点1做了插入，导致节点1分裂。

图10

此时，节点1的右兄弟变成了节点3。那么我们应该遍历节点3么？不应该。因为节点3一部分数据是节点1分裂过来的，而节点1已经遍历过了，另外一部分数据是新插入的，这部分数据一定是不可见的，所以节点3不应该遍历，遍历只是浪费时间。当然，在当前进程遍历indextuple期间，节点1可能多次分裂，所以节点1和节点2之间可能不只节点3，而这些节点统统不应该遍历。所以正确的做法是，在我们UnLock节点1之前，需要获取节点的右兄弟的节点编号，即2，然后保存到进程私有空间。当需要访问下一个节点时，直接从私有空间中获取右节点编号，然后给右节点加锁。

那么，如果在当前进程遍历indextuple期间，节点2发生了分裂怎么办呢？这个没关系，因为分裂是向右分裂，所以按照现有逻辑顺着遍历下去就好。

向后遍历

遍历流程

向后遍历，即向左遍历。在上述并发控制下，向左遍历会有这样一个问题：

图11

B+树的当前状态如图11所示，当前进程正在访问节点2，节点2上有共享锁。当前进程获取了节点2的所有indextuple之后，也会Unlock节点2，然后做其他事情，在这个过程中，节点1发生了分裂。如图12所示：

图12

此时，节点2的左兄弟变成了节点3。那么我们应该遍历节点3么？应该，因为节点1还没有遍历过，如果我们采用向右遍历的逻辑，在Unlock节点2之前缓存节点2的左兄弟节点1，之后直接访问节点1，那么我们就会略过节点1的部分数据，从而造成数据丢失。所以，为了正确访问节点2的左兄弟，我们应该采用这样的步骤：

step1：lock block2
step2：get left_blk（block3）
step3：unlock block2
step4：lock left_blk（block3）

这4个步骤中，首先我们需要注意，step3和step4不能交换顺序，因为我们不能在节点2持有锁的时候，给节点3上锁，这是一个从左向右上锁的动作，如果节点3此时正在发生分裂，就有可能造成死锁。所以必须先解锁节点2，再给节点3加锁。然而，这由带来了一个问题，节点3有可能在step3和step4之间发生分裂，这样也会发生数据丢失。所以，我们需要更完善的逻辑。回到我们最初的目的：获取节点2的左兄弟，节点2的左兄弟，也就是右兄弟为节点2的节点。所以，我们需要加入一个判断流程：

step1：lock block2
step2：get current_blk（block2）
step3：get left_blk（block3）
step4：unlock current_blk（block2）
step5：lock left_blk（block3）
step6：check if ( left_blk->right_blk == current_blk)
step7：if true then return
step8：if false then move right
- unlock left_blk
- lock left_blk->right_blk
step9：go to step6

这个流程的关键点是step6-step9。step6-step9是一个循环，核心思想是，锁定一个节点后，我们需要判断该节点右兄弟，是不是节点2。如果不是就右移到他的右节点，再次进行校验。这个逻辑很好很复杂，但这就万事大吉了么？不！还有坑爹的事情。在我们unlock了block2之后，block2可能会被删除！(由节点合并引起)

我们先来理解这个删除，首先在上面的步骤中，block2会被Unlock，但不会被Unpin。也就是说，block2的引用计数不可能为0。所以block2是不可能被彻底回收的。那么所谓的删除，只是将block2从链表中移除。如图13所示：

图13

节点3的右节点变成了节点4，而节点4的左节点变成了节点3。然而节点2依然指向它原始的左右节点（这也是为什么不能Unpin节点2）。这样的变化会导致什么情况发生呢？导致step6-step9永远找不到一个右节点为节点2的节点。那么，我们应该如何来解决这个问题呢？首先，我们需要知道一个很重要的特性：节点2被删除，既然被删除，那么就不会有其他事务再向节点2插入或删除数据。那么节点2就再也不可能发生分裂。所以，节点2的右兄弟也就再也不会发生变化，永远都是节点4。在图13中，我们向右遍历找不到节点2，那么就应该去找节点2的右兄弟。因为此时，节点2右兄弟的左兄弟，就是之前节点2的左兄弟！

所以，在PostgreSQL中，step6-step9只会循环4次（PostgreSQL的经验值），如果循环4次后还是找不到右兄弟为节点2的节点，则跳出循环。然后判断节点2是否被删除，如果节点2被删除了，就去找节点2的右兄弟，这个节点的左兄弟就是我们要找的节点。于是重新goto到step1，执行全部流程。

在寻找节点2右兄弟的时候需要注意，节点2的右兄弟可能也被删除了，如图14所示：

图14

所以，我们在找右兄弟的时候，应该去找第一个没有被删除的右兄弟。

代码实现

现在，我们来看看遍历的代码实现。向前遍历和向后遍历都是由_bt_steppage函数实现，代码如下：

/*
 * so->currPos中记录了当前节点。
 * PostgreSQL将当前节点中的indextuple缓存到进程私有内存后，就会UnLock当前节点。
 * 所以此时该节点处于UnLock状态。
 */
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	Relation	rel;
	Page		page;
	BTPageOpaque opaque;

	Assert(BTScanPosIsValid(so->currPos));

	/* Before leaving current page, deal with any killed items */
	if (so->numKilled > 0)
		_bt_killitems(scan);

	/*
	 * Before we modify currPos, make a copy of the page data if there was a
	 * mark position that needs it.
	 */
	if (so->markItemIndex >= 0)
	{
		/* bump pin on current buffer for assignment to mark buffer */
		if (BTScanPosIsPinned(so->currPos))
			IncrBufferRefCount(so->currPos.buf);
		memcpy(&so->markPos, &so->currPos,
			   offsetof(BTScanPosData, items[1]) +
			   so->currPos.lastItem * sizeof(BTScanPosItem));
		if (so->markTuples)
			memcpy(so->markTuples, so->currTuples,
				   so->currPos.nextTupleOffset);
		so->markPos.itemIndex = so->markItemIndex;
		so->markItemIndex = -1;
	}

	rel = scan->indexRelation;

	if (ScanDirectionIsForward(dir))
	{
        /* 向前遍历 */
		/* Walk right to the next page with data */
		/* We must rely on the previously saved nextPage link! 
		 * 获取之前缓存的右节点编号。
		 */
		BlockNumber blkno = so->currPos.nextPage;

		/* Remember we left a page with data */
		so->currPos.moreLeft = true;

		/* release the previous buffer, if pinned 
		 * 后面不会用到当前节点了，所以可以直接Unpin。
		 */
		BTScanPosUnpinIfPinned(so->currPos);

		for (;;)
		{
			/* if we're at end of scan, give up 
			 * 向右遍历，获取一个合法的右节点
			 */
			if (blkno == P_NONE || !so->currPos.moreRight)
			{
				BTScanPosInvalidate(so->currPos);
				return false;
			}
			/* check for interrupts while we're not holding any buffer lock */
			CHECK_FOR_INTERRUPTS();
			/* step right one page */
			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
			/* check for deleted page */
			page = BufferGetPage(so->currPos.buf);
			TestForOldSnapshot(scan->xs_snapshot, rel, page);
			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
			if (!P_IGNORE(opaque))
			{
				PredicateLockPage(rel, blkno, scan->xs_snapshot);
				/* see if there are any matches on this page */
				/* note that this will clear moreRight if we can stop */
				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
					break;
			}

			/* nope, keep going */
			blkno = opaque->btpo_next;
			_bt_relbuf(rel, so->currPos.buf);
		}
	}
	else
	{
        /* 向后遍历 */
		/* Remember we left a page with data */
		so->currPos.moreRight = true;

		/*
		 * Walk left to the next page with data.  This is much more complex
		 * than the walk-right case because of the possibility that the page
		 * to our left splits while we are in flight to it, plus the
		 * possibility that the page we were on gets deleted after we leave
		 * it.  See nbtree/README for details.
		 *
		 * It might be possible to rearrange this code to have less overhead
		 * in pinning and locking, but that would require capturing the left
		 * pointer when the page is initially read, and using it here, along
		 * with big changes to _bt_walk_left() and the code below.  It is not
		 * clear whether this would be a win, since if the page immediately to
		 * the left splits after we read this page and before we step left, we
		 * would need to visit more pages than with the current code.
		 *
		 * Note that if we change the code so that we drop the pin for a scan
		 * which uses a non-MVCC snapshot, we will need to modify the code for
		 * walking left, to allow for the possibility that a referenced page
		 * has been deleted.  As long as the buffer is pinned or the snapshot
		 * is MVCC the page cannot move past the half-dead state to fully
		 * deleted.
		 *
		 * step1：lock current block
		 */
		if (BTScanPosIsPinned(so->currPos))
			LockBuffer(so->currPos.buf, BT_READ);
		else
			so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

		for (;;)
		{
			/* Done if we know there are no matching keys to the left */
			if (!so->currPos.moreLeft)
			{
				_bt_relbuf(rel, so->currPos.buf);
				BTScanPosInvalidate(so->currPos);
				return false;
			}

			/* Step to next physical page 
			 * 获取当前节点的左兄弟,对应前面的（step2~step9）
			 */
			so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
											scan->xs_snapshot);

			/* if we're physically at end of index, return failure */
			if (so->currPos.buf == InvalidBuffer)
			{
				BTScanPosInvalidate(so->currPos);
				return false;
			}

			/*
			 * Okay, we managed to move left to a non-deleted page. Done if
			 * it's not half-dead and contains matching tuples. Else loop back
			 * and do it all again.
			 */
			page = BufferGetPage(so->currPos.buf);
			TestForOldSnapshot(scan->xs_snapshot, rel, page);
			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
			if (!P_IGNORE(opaque))
			{
				PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
				/* see if there are any matches on this page */
				/* note that this will clear moreLeft if we can stop */
				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
					break;
			}
		}
	}

	/* Drop the lock, and maybe the pin, on the current page 
	 * 给当前节点解锁以及Unpin，主要是向左遍历时使用
	 */
	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);

	return true;
}

其中获取左兄弟的由_bt_walk_left函数来实现，代码如下：

static Buffer
_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
	Page		page;
	BTPageOpaque opaque;

	page = BufferGetPage(buf);
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	for (;;)
	{
		BlockNumber obknum;
		BlockNumber lblkno;
		BlockNumber blkno;
		int			tries;

		/* if we're at end of tree, release buf and return failure */
		if (P_LEFTMOST(opaque))
		{
			_bt_relbuf(rel, buf);
			break;
		}
		/* remember original page we are stepping left from */
		//step2：get current_blk
		obknum = BufferGetBlockNumber(buf);
		/* step left */
		//step3：get left_blk
		blkno = lblkno = opaque->btpo_prev;
        //step4：unlock current_blk
		_bt_relbuf(rel, buf);
		/* check for interrupts while we're not holding any buffer lock */
		CHECK_FOR_INTERRUPTS();
        //step5：lock left_blk
		buf = _bt_getbuf(rel, blkno, BT_READ);
		page = BufferGetPage(buf);
		TestForOldSnapshot(snapshot, rel, page);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);

		/*
		 * If this isn't the page we want, walk right till we find what we
		 * want --- but go no more than four hops (an arbitrary limit). If we
		 * don't find the correct page by then, the most likely bet is that
		 * the original page got deleted and isn't in the sibling chain at all
		 * anymore, not that its left sibling got split more than four times.
		 *
		 * Note that it is correct to test P_ISDELETED not P_IGNORE here,
		 * because half-dead pages are still in the sibling chain.  Caller
		 * must reject half-dead pages if wanted.
		 */
		tries = 0;
		for (;;)
		{
            //step6：check if ( left_blk->right_blk == current_blk)
			if (!P_ISDELETED(opaque) && opaque->btpo_next == obknum)
			{
				/* Found desired page, return it */
                //step7：if true then return
				return buf;
			}
			if (P_RIGHTMOST(opaque) || ++tries > 4)
                //循环4次还没找到，就break
				break;
            //step8：if false then move right
			blkno = opaque->btpo_next;
			buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
			page = BufferGetPage(buf);
			TestForOldSnapshot(snapshot, rel, page);
			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
            //step9：go to step6
		}

        //到这里说明current_blk可能被删除了，所以需要判断current_blk有没有被删除
		/* Return to the original page to see what's up */
		buf = _bt_relandgetbuf(rel, buf, obknum, BT_READ);
		page = BufferGetPage(buf);
		TestForOldSnapshot(snapshot, rel, page);
		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
		if (P_ISDELETED(opaque))
		{
			/*
			 * It was deleted.  Move right to first nondeleted page (there
			 * must be one); that is the page that has acquired the deleted
			 * one's keyspace, so stepping left from it will take us where we
			 * want to be.
			 */
            //current_blk被删除，需要获取current_blk第一个没有被删除的右兄弟
			for (;;)
			{
				if (P_RIGHTMOST(opaque))
					elog(ERROR, "fell off the end of index \"%s\"",
						 RelationGetRelationName(rel));
				blkno = opaque->btpo_next;
				buf = _bt_relandgetbuf(rel, buf, blkno, BT_READ);
				page = BufferGetPage(buf);
				TestForOldSnapshot(snapshot, rel, page);
				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
				if (!P_ISDELETED(opaque))
					break;
			}

			/*
			 * Now return to top of loop, resetting obknum to point to this
			 * nondeleted page, and try again.
			 * 然后goto到line12，去找当前这个右兄弟的左兄弟。
			 */
		}
		else
		{
            //current_blk没有被删除，需要校验下current_blk的左节点是不是真的不等于lblkno
            //如果opaque->btpo_prev == lblkno就说明一定出了问题。
			/*
			 * It wasn't deleted; the explanation had better be that the page
			 * to the left got split or deleted. Without this check, we'd go
			 * into an infinite loop if there's anything wrong.
			 */
			if (opaque->btpo_prev == lblkno)
				elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
					 obknum, RelationGetRelationName(rel));
			/* Okay to try again with new lblkno value */
		}
	}

	return InvalidBuffer;
}