HBase Source Code Analysis: The MemStore Flush Process on HRegionServer (Part 2) - Alibaba Cloud Developer Community

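Picking up the questions left open in the previous article, "HBase Source Code Analysis: The MemStore Flush Process on HRegionServer (Part 1)", this article continues the study of the MemStore flush process on HRegionServer. It focuses on two things: how an HRegion is chosen for flushing to relieve MemStore pressure, and how the flush of an HRegion is initiated.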

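Let's start with the first question: how is an HRegion chosen for flushing to relieve MemStore pressure? As described in the previous article, when the flush handler thread polls flushQueue and gets back either null or a WakeupFlushThread, and isAboveLowWaterMark() reports that the global MemStore size is above the low water mark of its limit, the thread calls flushOneForGlobalPressure(). That method picks one HRegion according to a fixed policy and flushes its MemStore, which reduces the global MemStore size and guards against OOM and similar failures.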

Let's look closely at flushOneForGlobalPressure(). The code is as follows:

/**
   * The memstore across all regions has exceeded the low water mark. Pick
   * one region to flush and flush it synchronously (this is called from the
   * flush thread)
   * 
   * @return true if successful
   */
  private boolean flushOneForGlobalPressure() {
	  
    // Get the online regions on this RegionServer, sorted by memstoreSize in descending order, as regionsBySize
    SortedMap<Long, HRegion> regionsBySize =
        server.getCopyOfOnlineRegionsSortedBySize();

    // The set of regions to skip, excludedRegions (regions whose flush attempt failed end up here)
    Set<HRegion> excludedRegions = new HashSet<HRegion>();

    boolean flushedOne = false; // set to true once a region has been flushed
    while (!flushedOne) { // keep trying until a flush succeeds or no candidate region is left
      
      // Find the biggest region that doesn't have too many storefiles
      // (might be null!)
      // Pick the region with the largest memstore that does not have too many store files
      // as the preferred candidate: bestFlushableRegion
      HRegion bestFlushableRegion = getBiggestMemstoreRegion(
          regionsBySize, excludedRegions, true);
      
      // Find the biggest region, total, even if it might have too many flushes.
      // Pick the region with the largest memstore regardless of its store file count
      // as the fallback candidate: bestAnyRegion
      HRegion bestAnyRegion = getBiggestMemstoreRegion(
          regionsBySize, excludedRegions, false);

      // Above the memory threshold but no region can be flushed: return false
      if (bestAnyRegion == null) {
        LOG.error("Above memory mark but there are no flushable regions!");
        return false;
      }

      HRegion regionToFlush;
      
      // Choose the region to flush:
      // if bestAnyRegion's memstore is more than twice the size of bestFlushableRegion's, take bestAnyRegion
      if (bestFlushableRegion != null &&
          bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {
        // Even if it's not supposed to be flushed, pick a region if it's more than twice
        // as big as the best flushable one - otherwise when we're under pressure we make
        // lots of little flushes and cause lots of compactions, etc, which just makes
        // life worse!
        if (LOG.isDebugEnabled()) {
          LOG.debug("Under global heap pressure: " +
            "Region " + bestAnyRegion.getRegionNameAsString() + " has too many " +
            "store files, but is " +
            StringUtils.humanReadableInt(bestAnyRegion.memstoreSize.get()) +
            " vs best flushable region's " +
            StringUtils.humanReadableInt(bestFlushableRegion.memstoreSize.get()) +
            ". Choosing the bigger.");
        }
        regionToFlush = bestAnyRegion;
      } else { // otherwise prefer bestFlushableRegion (bestAnyRegion if there is none)
        if (bestFlushableRegion == null) {
          regionToFlush = bestAnyRegion;
        } else {
          regionToFlush = bestFlushableRegion;
        }
      }

      // Sanity check: the chosen region's memstoreSize must be greater than 0
      Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);

      LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");
      
      // Call flushRegion() to flush the MemStore of this single region
      flushedOne = flushRegion(regionToFlush, true);
      if (!flushedOne) { // on failure, add the region to excludedRegions so it is not picked again
        LOG.info("Excluding unflushable region " + regionToFlush +
          " - trying to find a different region to flush.");
        excludedRegions.add(regionToFlush);
      }
    }
    return true;
  }

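The logic of this method can be summarized as follows:

1. Get the online regions on the RegionServer, sorted by memstoreSize in descending order, as regionsBySize;

2. Create excludedRegions, the set of regions to be skipped;

3. Set the flag flushedOne to false;

4. Loop until flushedOne becomes true, that is, until a region has been flushed successfully, or return early when no candidate region is left:

4.1. Iterate over regionsBySize and pick the region with the largest MemStore that does not have too many store files as the preferred candidate, bestFlushableRegion:

4.1.1. If the current region is in excludedRegions, skip it;

4.1.2. If the current region's write state shows that it is already flushing, or that writes are not enabled, skip it;

4.1.3. If the store file count has to be checked and the region has too many store files, skip it as well;

4.1.4. Otherwise return this region;

4.2. Iterate over regionsBySize and pick the region with the largest MemStore, even if it has too many store files, as the fallback candidate, bestAnyRegion:

4.2.1. If the current region is in excludedRegions, skip it;

4.2.2. If the current region's write state shows that it is already flushing, or that writes are not enabled, skip it;

4.2.3. Otherwise return this region;

4.3. If memory is above the threshold but no region can be flushed (bestAnyRegion is null), return false;

4.4. Choose the region to flush:

4.4.1. If the MemStore of bestAnyRegion is more than twice the size of that of bestFlushableRegion, choose bestAnyRegion;

4.4.2. Otherwise prefer bestFlushableRegion, falling back to bestAnyRegion when bestFlushableRegion is null;

4.5. Sanity check: the memstoreSize of the chosen region must be greater than 0;

4.6. Call flushRegion() to flush the MemStore of that single region;

4.7. If the flush fails, add the region to excludedRegions so that it is not picked again.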
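The selection in steps 4.1 to 4.4 can be sketched in isolation. This is a minimal illustration, not HBase code: RegionStub, FlushSelectionSketch and their methods are hypothetical names, the write-state check of steps 4.1.2 and 4.2.2 is left out, and the "too many store files" condition is reduced to a boolean field.

```java
import java.util.Collections;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for HRegion: only the fields the selection policy looks at.
class RegionStub {
    final String name;
    final long memstoreSize;
    final boolean tooManyStoreFiles;

    RegionStub(String name, long memstoreSize, boolean tooManyStoreFiles) {
        this.name = name;
        this.memstoreSize = memstoreSize;
        this.tooManyStoreFiles = tooManyStoreFiles;
    }
}

class FlushSelectionSketch {

    // Builds a map keyed by memstore size, largest first (assumes distinct sizes).
    static SortedMap<Long, RegionStub> sortedBySizeDesc(RegionStub... regions) {
        SortedMap<Long, RegionStub> map = new TreeMap<>(Collections.reverseOrder());
        for (RegionStub r : regions) {
            map.put(r.memstoreSize, r);
        }
        return map;
    }

    // Returns the first acceptable region in descending size order, or null.
    static RegionStub biggest(SortedMap<Long, RegionStub> bySizeDesc,
                              Set<RegionStub> excluded,
                              boolean checkStoreFileCount) {
        for (RegionStub r : bySizeDesc.values()) {
            if (excluded.contains(r)) {
                continue; // already failed once, skip it
            }
            if (checkStoreFileCount && r.tooManyStoreFiles) {
                continue; // not eligible as the preferred candidate
            }
            return r;
        }
        return null;
    }

    // Applies the "more than twice as big" rule between the two candidates.
    static RegionStub pick(SortedMap<Long, RegionStub> bySizeDesc, Set<RegionStub> excluded) {
        RegionStub bestFlushable = biggest(bySizeDesc, excluded, true);
        RegionStub bestAny = biggest(bySizeDesc, excluded, false);
        if (bestAny == null) {
            return null; // nothing left to flush
        }
        if (bestFlushable == null) {
            return bestAny;
        }
        return bestAny.memstoreSize > 2 * bestFlushable.memstoreSize ? bestAny : bestFlushable;
    }
}
```

With regions of 100, 60 and 40 units where only the largest has too many store files, the 60-unit region is chosen, because 100 is not more than twice 60. Once the 60-unit region is excluded, the 100-unit region wins over the 40-unit one.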

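That is how one HRegion is chosen, according to a fixed policy, to have its MemStore flushed and relieve MemStore pressure. What remains, flushing the chosen HRegion, is the same problem as the next topic: how the flush of an HRegion is initiated. Let's first look at the one-argument flushRegion() method: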

/*
   * A flushRegion that checks store file count.  If too many, puts the flush
   * on delay queue to retry later.
   * 
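   * In other words: a region due for a flush is first checked for its store file count.
   * If there are too many, the flush is deferred and retried later; otherwise it runs right away.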
   * 
   * @param fqe
   * @return true if the region was successfully flushed, false otherwise. If
   * false, there will be accompanying log messages explaining why the region was
   * not flushed.
   */
  private boolean flushRegion(final FlushRegionEntry fqe) {
    HRegion region = fqe.region;
    if (!region.getRegionInfo().isMetaRegion() &&
        isTooManyStoreFiles(region)) { // the region is not a meta region and has too many store files
      
      if (fqe.isMaximumWait(this.blockingWaitTime)) {
        // Already blocked for longer than the configured time: log it and go ahead with the flush
        LOG.info("Waited " + (EnvironmentEdgeManager.currentTime() - fqe.createTime) +
          "ms on a compaction to clean up 'too many store files'; waited " +
          "long enough... proceeding with flush of " +
          region.getRegionNameAsString());
      } else {
        // If this is first time we've been put off, then emit a log message
        // and request a split or a system compaction for the region.
        if (fqe.getRequeueCount() <= 0) {
          // Note: We don't impose blockingStoreFiles constraint on meta regions
          LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
            "store files; delaying flush up to " + this.blockingWaitTime + "ms");
          
          // Request a split first; if no split is requested, fall back to a system compaction
          if (!this.server.compactSplitThread.requestSplit(region)) {
            try {
              this.server.compactSplitThread.requestSystemCompaction(
                  region, Thread.currentThread().getName());
            } catch (IOException e) {
              LOG.error(
                "Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),
                RemoteExceptionHandler.checkIOException(e));
            }
          }
        }

        // Put back on the queue.  Have it come back out of the queue
        // after a delay of this.blockingWaitTime / 100 ms.
        // (900 ms with the blockingWaitTime assumed in this article; the value is configurable)
        this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
        // Tell a lie, it's not flushed but it's ok
        // (the flush was only deferred, so true is returned to the caller)
        return true;
      }
    }
    
    // Hand over to the two-argument flushRegion(), which tells the HRegion to flush
    return flushRegion(region, false);
  }

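Given fqe, a FlushRegionEntry wrapping an HRegion that is waiting to be flushed, the one-argument flushRegion() method runs a few checks to decide whether to flush right away or to defer. Before the first deferral it also requests a split or a system compaction if needed. The logic is as follows:

1. If the region is not a meta region and has too many store files:

1.1. Use isMaximumWait() to check how long the entry has been blocked. If it has been blocked for longer than the configured time, log a message and go ahead with the flush by jumping to step 2;

1.2. Otherwise, if this is the first deferral, log a message, then request a split for the HRegion and, if no split is requested, request a system compaction;

1.3. Put fqe back on flushQueue with a delay of 900 ms (configurable), so that it is taken off the queue and processed again once the delay expires;

1.4. Tell a white lie: the flush has only been deferred and its outcome is not yet known, so return true;

2. Call the two-argument flushRegion() method, which tells the HRegion to flush.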

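How is the blocked time checked? Simply by testing whether the current time minus the entry's creation time is greater than the given limit. The code is as follows: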

/**
     * @param maximumWait
     * @return True if we have been delayed > <code>maximumWait</code> milliseconds.
     */
    public boolean isMaximumWait(final long maximumWait) {
      return (EnvironmentEdgeManager.currentTime() - this.createTime) > maximumWait;
    }
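The deferral decision and the wait check can be sketched together. This is a simplified illustration under stated assumptions, not the HBase implementation: FlushDeferralSketch, Entry and decide() are hypothetical names, the current time is passed in explicitly so the behaviour is deterministic, and the split and compaction requests made on the first deferral are omitted.

```java
class FlushDeferralSketch {

    enum Decision { FLUSH_NOW, REQUEUED }

    // Hypothetical, simplified stand-in for FlushRegionEntry.
    static final class Entry {
        final long createTime;
        long whenToExpire;
        int requeueCount;

        Entry(long now) {
            this.createTime = now;
            this.whenToExpire = now;
        }

        // Same comparison as isMaximumWait(): strictly greater than the limit.
        boolean isMaximumWait(long maximumWait, long now) {
            return (now - createTime) > maximumWait;
        }

        // Pushes the expiry time into the future and counts the deferral.
        Entry requeue(long delayMillis, long now) {
            this.whenToExpire = now + delayMillis;
            this.requeueCount++;
            return this;
        }
    }

    // Decides whether the entry is flushed now or put back with a delay.
    static Decision decide(Entry entry, boolean isMetaRegion, boolean tooManyStoreFiles,
                           long blockingWaitTime, long now) {
        if (!isMetaRegion && tooManyStoreFiles
                && !entry.isMaximumWait(blockingWaitTime, now)) {
            // The real method requests a split or a system compaction here
            // on the first deferral (requeueCount <= 0).
            entry.requeue(blockingWaitTime / 100, now);
            return Decision.REQUEUED;
        }
        return Decision.FLUSH_NOW;
    }
}
```

Assuming a blockingWaitTime of 90000 ms, which matches the 900 ms requeue delay mentioned above, an entry created at time 0 is requeued at time 10 with an expiry of 910, and is flushed regardless of its store file count once more than 90000 ms have passed.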

Now it is time to analyze the two-argument flushRegion() method. Code first, analysis after:

/*
   * Flush a region.
   * @param region Region to flush.
   * @param emergencyFlush Set if we are being force flushed. If true the region
   * needs to be removed from the flush queue. If false, when we were called
   * from the main flusher run loop and we got the entry to flush by calling
   * poll on the flush queue (which removed it).
   *
   * @return true if the region was successfully flushed, false otherwise. If
   * false, there will be accompanying log messages explaining why the region was
   * not flushed.
   */
  private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
    long startTime = 0;
    synchronized (this.regionsInQueue) {
      
      // First remove the region's entry from regionsInQueue
      FlushRegionEntry fqe = this.regionsInQueue.remove(region);
      // Use the start time of the FlushRegionEntry if available
      if (fqe != null) {
        // Use the entry's creation time as the flush start time
        startTime = fqe.createTime;
      }
      if (fqe != null && emergencyFlush) {
        // Need to remove from region from delay queue.  When NOT an
        // emergencyFlush, then item was removed via a flushQueue.poll.
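        // An emergency flush has to remove the entry from flushQueue itself; for a normal
        // flush the entry was already removed by flushQueue.poll().
        // The flush thread's run() periodically polls flushQueue. When it gets null or a
        // WakeupFlushThread while the global MemStore is above the low water mark, it picks
        // a region by policy and flushes it with emergencyFlush = true. An entry for that
        // region may still be sitting in flushQueue, so it is removed here; a normal,
        // expired entry is passed with false. This also avoids flushing the region twice.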
        flushQueue.remove(fqe);
     }
    }
    
    // Fall back to the current time if no FlushRegionEntry supplied a start time
    if (startTime == 0) {
      // Avoid getting the system time unless we don't have a FlushRegionEntry;
      // shame we can't capture the time also spent in the above synchronized
      // block
      startTime = EnvironmentEdgeManager.currentTime();
    }
    
    // Take the read lock: other read-lock holders can proceed concurrently, a write-lock holder is excluded
    lock.readLock().lock();
    try {
      
      // Tell the registered listeners which type of flush was requested
      notifyFlushRequest(region, emergencyFlush);
      
      // Call HRegion.flushcache() to flush the MemStore
      HRegion.FlushResult flushResult = region.flushcache();
      
      // Decide what to do next based on the flush result

      // Does the flush result call for a compaction?
      boolean shouldCompact = flushResult.isCompactionNeeded();
      // We just want to check the size
      
      // Check whether the region should be split
      boolean shouldSplit = region.checkSplit() != null;
      
      // If a split is needed, request it; otherwise request a system compaction if one is needed
      if (shouldSplit) {
        this.server.compactSplitThread.requestSplit(region);
      } else if (shouldCompact) {
        server.compactSplitThread.requestSystemCompaction(
            region, Thread.currentThread().getName());
      }
      
      // On success, take the end time, compute the elapsed time, and update the RegionServer flush metrics
      if (flushResult.isFlushSucceeded()) {
        long endTime = EnvironmentEdgeManager.currentTime();
        server.metricsRegionServer.updateFlushTime(endTime - startTime);
      }
    } catch (DroppedSnapshotException ex) {
      // Cache flush can fail in a few places. If it fails in a critical
      // section, we get a DroppedSnapshotException and a replay of wal
      // is required. Currently the only way to do this is a restart of
      // the server. Abort because hdfs is probably bad (HBASE-644 is a case
      // where hdfs was bad but passed the hdfs check).
      server.abort("Replay of WAL required. Forcing server shutdown", ex);
      return false;
    } catch (IOException ex) {
      LOG.error("Cache flush failed" +
        (region != null ? (" for region " + Bytes.toStringBinary(region.getRegionName())) : ""),
        RemoteExceptionHandler.checkIOException(ex));
      if (!server.checkFileSystem()) {
        return false;
      }
    } finally {
      // Release the read lock
      lock.readLock().unlock();
      
      // Wake up any other threads that are blocked
      wakeUpIfBlocking();
    }
    return true;
  }

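The overall logic of the two-argument flushRegion() method is as follows:

1. First, handle the regionsInQueue map and the flushQueue queue:

1.1. Remove the HRegion's entry from regionsInQueue; this must be done whether or not the flush is an emergency flush;

1.2. If an entry was found, take its creation time as the flush start time startTime;

1.3. If this is an emergency flush, remove the corresponding fqe from flushQueue; for a non-emergency flush, fqe has already been removed by flushQueue.poll();

2. If startTime is still 0, set it to the current time;

3. Take the read lock, which does not conflict with other threads holding the read lock, so they can proceed concurrently, but excludes a thread holding the write lock (a later article will analyze how locks are used across HBase's internal flows);

4. Notify the registered listeners of the flush type;

5. Call HRegion's flushcache() method to flush the MemStore and obtain the flush result;

6. Decide what to do next based on the flush result:

6.1. Determine from the flush result whether a compaction is needed, which gives the flag shouldCompact;

6.2. Call HRegion's checkSplit() method to check whether the region should be split, which gives the flag shouldSplit;

6.3. Based on the two flags, request a split if one is needed; otherwise, request a system compaction if one is needed;

7. If the flush succeeded, take the end time, compute the elapsed time, and record it in the RegionServer's metrics;

8. Finally, release the read lock and wake up any other blocked threads.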

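The special handling of flushQueue deserves an explanation. For an emergency flush, the corresponding fqe has to be removed from flushQueue explicitly; for a non-emergency flush, fqe has already been removed by flushQueue.poll(). The flush thread's run() method periodically polls flushQueue for an fqe. When the poll returns null or a WakeupFlushThread and the global MemStore size is above the low water mark, the thread picks an HRegion according to its policy and flushes it directly, to bring the MemStore size down and avoid OOM and similar risks. An fqe for that same HRegion may still be sitting in flushQueue at that point, and it has to be removed. The emergencyFlush flag tells the two cases apart: an entry processed normally by the thread once its delay expires is passed with false, while an emergency flush made to relieve pressure is passed with true, and only then is the entry removed from the queue, which also avoids doing the same work twice.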
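The bookkeeping between regionsInQueue and flushQueue can be illustrated with a small sketch. It is a simplification under stated assumptions: FlushBookkeepingSketch and its methods are hypothetical, regions and entries are plain strings, an ArrayDeque stands in for the delay queue, and request() only mimics the idea that a region is enqueued at most once.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class FlushBookkeepingSketch {
    // region name -> queued entry; guards against queuing a region twice
    final Map<String, String> regionsInQueue = new HashMap<>();
    // entries waiting to be handled by the flush thread
    final Queue<String> flushQueue = new ArrayDeque<>();

    // Queues a flush request unless the region is already queued (assumed behaviour).
    void request(String region) {
        synchronized (regionsInQueue) {
            if (!regionsInQueue.containsKey(region)) {
                String entry = "entry-for-" + region;
                regionsInQueue.put(region, entry);
                flushQueue.add(entry);
            }
        }
    }

    // Normal path: the flush thread takes the next entry off the queue.
    String pollForNormalFlush() {
        return flushQueue.poll();
    }

    // Called at the start of a flush, for both the normal and the emergency path.
    void beginFlush(String region, boolean emergencyFlush) {
        synchronized (regionsInQueue) {
            String entry = regionsInQueue.remove(region);
            if (entry != null && emergencyFlush) {
                // The entry was never polled, so it is still queued: drop it
                // to avoid flushing the same region a second time later.
                flushQueue.remove(entry);
            }
        }
    }
}
```

In the normal path the entry leaves flushQueue through the poll and only the map needs cleaning up; in the emergency path nothing was polled, so both structures have to be cleaned in beginFlush().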

Notifying the registered listeners of the flush type is also simple, and the code is commented, so it is shown without further explanation:

private void notifyFlushRequest(HRegion region, boolean emergencyFlush) {
    
    // The default type is FlushType.NORMAL
    FlushType type = FlushType.NORMAL;
    
    // For an emergency flush the type depends on the water mark: FlushType.ABOVE_HIGHER_MARK
    // when above the high water mark, FlushType.ABOVE_LOWER_MARK otherwise
    if (emergencyFlush) {
      type = isAboveHighWaterMark() ? FlushType.ABOVE_HIGHER_MARK : FlushType.ABOVE_LOWER_MARK;
    }
	
    // Notify every listener with the flush type and the region
    for (FlushRequestListener listener : flushRequestListeners) {
      listener.flushRequested(type, region);
    }
  }
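To see the three flush types side by side, here is a self-contained sketch of the same listener pattern. The names FlushNotifySketch, Listener, typeFor() and register() are hypothetical, the region is reduced to a name, and the water mark check is passed in as a boolean instead of being computed.

```java
import java.util.ArrayList;
import java.util.List;

class FlushNotifySketch {

    enum FlushType { NORMAL, ABOVE_LOWER_MARK, ABOVE_HIGHER_MARK }

    // Hypothetical listener interface, modelled on the flushRequested(type, region) call.
    interface Listener {
        void flushRequested(FlushType type, String regionName);
    }

    private final List<Listener> listeners = new ArrayList<>();

    void register(Listener listener) {
        listeners.add(listener);
    }

    // NORMAL unless this is an emergency flush; then the water mark decides.
    static FlushType typeFor(boolean emergencyFlush, boolean aboveHighWaterMark) {
        if (!emergencyFlush) {
            return FlushType.NORMAL;
        }
        return aboveHighWaterMark ? FlushType.ABOVE_HIGHER_MARK : FlushType.ABOVE_LOWER_MARK;
    }

    // Passes the computed type and the region to every registered listener.
    void notifyFlushRequest(String regionName, boolean emergencyFlush, boolean aboveHighWaterMark) {
        FlushType type = typeFor(emergencyFlush, aboveHighWaterMark);
        for (Listener listener : listeners) {
            listener.flushRequested(type, regionName);
        }
    }
}
```

Note that the water mark only matters for emergency flushes: a normal flush is reported as NORMAL even if the server happens to be above the high water mark at that moment.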

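Finally, a word about the flush result, FlushResult. It is a static nested class of HRegion and contains a Result enum with the following values:

1. FLUSHED_NO_COMPACTION_NEEDED: the flush succeeded and no compaction is needed;

2. FLUSHED_COMPACTION_NEEDED: the flush succeeded and a compaction is needed;

3. CANNOT_FLUSH_MEMSTORE_EMPTY: the flush could not be performed because the MemStore is empty;

4. CANNOT_FLUSH: the flush could not be performed.

A flush counts as successful when result is FLUSHED_NO_COMPACTION_NEEDED or FLUSHED_COMPACTION_NEEDED, and a compaction is needed when result is FLUSHED_COMPACTION_NEEDED. The relevant code is as follows: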

    /**
     * Convenience method, the equivalent of checking if result is
     * FLUSHED_NO_COMPACTION_NEEDED or FLUSHED_COMPACTION_NEEDED.
     * @return true if the memstores were flushed, else false.
     */
    public boolean isFlushSucceeded() {
      return result == Result.FLUSHED_NO_COMPACTION_NEEDED || result == Result
          .FLUSHED_COMPACTION_NEEDED;
    }

    /**
     * Convenience method, the equivalent of checking if result is FLUSHED_COMPACTION_NEEDED.
     * @return True if the flush requested a compaction, else false (doesn't even mean it flushed).
     */
    public boolean isCompactionNeeded() {
      return result == Result.FLUSHED_COMPACTION_NEEDED;
    }
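Putting the four results together with the follow-up decision made in the two-argument flushRegion() gives the sketch below. It is an illustration only: FlushResultSketch, FollowUp and followUp() are hypothetical names, and shouldSplit is passed in rather than derived from checkSplit().

```java
class FlushResultSketch {

    enum Result {
        FLUSHED_NO_COMPACTION_NEEDED,
        FLUSHED_COMPACTION_NEEDED,
        CANNOT_FLUSH_MEMSTORE_EMPTY,
        CANNOT_FLUSH
    }

    enum FollowUp { SPLIT, SYSTEM_COMPACTION, NONE }

    // True for the two FLUSHED_* results only.
    static boolean isFlushSucceeded(Result result) {
        return result == Result.FLUSHED_NO_COMPACTION_NEEDED
            || result == Result.FLUSHED_COMPACTION_NEEDED;
    }

    // True only when the flush succeeded and asked for a compaction.
    static boolean isCompactionNeeded(Result result) {
        return result == Result.FLUSHED_COMPACTION_NEEDED;
    }

    // A split takes priority; a compaction is requested only when no split is needed.
    static FollowUp followUp(Result result, boolean shouldSplit) {
        if (shouldSplit) {
            return FollowUp.SPLIT;
        }
        if (isCompactionNeeded(result)) {
            return FollowUp.SYSTEM_COMPACTION;
        }
        return FollowUp.NONE;
    }
}
```

The split check is independent of the flush result in this sketch, mirroring the if / else if structure of the method shown earlier: a region that needs splitting gets a split request even when the flush itself reported FLUSHED_COMPACTION_NEEDED.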

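That completes the analysis of the MemStore flush process on HRegionServer. Split and compaction, touched on at the end, will be covered in dedicated follow-up articles. Stay tuned to my blog, and thank you!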