Why are large synchronous writes slow when a slog is present?


2016-04-05


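Summary:

ZFS applies thresholds to how it uses the ZIL: for example, when the writes in a single commit exceed the threshold, they are written directly to the data vdevs.

Likewise, if logbias=throughput is set, writes go directly to the data vdevs.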

Reference:

Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.

ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:

Write in indirect mode to the data vdevs if:

logbias=throughput, or

There is no slog and the write is larger than zfs_immediate_write_sz.

Write in immediate mode to the data vdevs if logbias=latency and:

There is no slog and the write is smaller than zfs_immediate_write_sz, or

There is a slog and the total commit size is larger than zil_slog_limit.

Write in immediate mode to the slog vdevs if logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit.

The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start().
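The rules listed above can be restated as a small Python model. This is only a sketch of the decision table as described in this post, not the actual ZFS code: the function name choose_log_path, the LogDecision type and the string labels are made up for illustration, the two tunables are passed in explicitly rather than assumed, and treating a size exactly equal to a threshold as "not larger" is an assumption, since the text does not say.

```python
from typing import NamedTuple


class LogDecision(NamedTuple):
    mode: str    # "indirect" or "immediate"
    target: str  # "data vdevs" or "slog"


def choose_log_path(logbias: str, has_slog: bool, write_size: int,
                    commit_size: int, immediate_write_sz: int,
                    slog_limit: int) -> LogDecision:
    """Model of the decision rules described in the text (not ZFS source)."""
    if logbias == "throughput":
        return LogDecision("indirect", "data vdevs")

    # Any other value is treated as logbias=latency (assumption).
    if not has_slog:
        # Without a slog, the size of the individual write decides the mode.
        if write_size > immediate_write_sz:
            return LogDecision("indirect", "data vdevs")
        return LogDecision("immediate", "data vdevs")

    # With a slog, the mode is always immediate; only the target varies,
    # and it depends on the total commit size.
    if commit_size > slog_limit:
        return LogDecision("immediate", "data vdevs")
    return LogDecision("immediate", "slog")
```

For example, with illustrative tunable values of 32 KiB and 1 MiB, choose_log_path("latency", True, 128 * 1024, 2 * 1024**3, 32 * 1024, 1024 * 1024) returns an immediate-mode write to the data vdevs, which is the edge case discussed next.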

The issue is, this decision process makes sense except for one particularly painful edge case, when these conditions are all true:

logbias=latency, and

There is a slog, and

There are large writes in the ZIL to be committed (e.g. > 100 MB).

In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give the minimum latency because this is a large sequential write. For very large writes you don't want immediate mode, because it means writing the data twice. Even writing the log records to the slog would be slower on most pool configurations (e.g. lots of spindles and one SSD slog), because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.

Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. This means that all the (large) data will be committed as ZIL log records on the data disks first, and then immediately afterwards it will be written again by the DMU. The overall throughput is therefore halved, and if this is a sustained load, the ZIL commit latency will be doubled compared to indirect mode.
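The "written twice" argument can be made concrete with a rough byte count per path. The sketch below is a simplification under stated assumptions: it counts payload bytes only and ignores log-record headers, block pointers and metadata, and the function name bytes_written and the labels are hypothetical.

```python
GIB = 1024 ** 3


def bytes_written(data_bytes: int, mode: str, target: str) -> dict:
    """Payload bytes written to each device class for one commit.

    Simplified model: log-record headers and metadata are ignored.
    """
    if mode == "indirect":
        # Data is written once by the DMU; the log record stores a pointer.
        return {"slog": 0, "data vdevs": data_bytes}
    if target == "slog":
        # Data goes into log records on the slog, then the DMU writes it again.
        return {"slog": data_bytes, "data vdevs": data_bytes}
    # Immediate mode on the data vdevs: both copies land on the same disks.
    return {"slog": 0, "data vdevs": 2 * data_bytes}


for mode, target in [("indirect", "data vdevs"),
                     ("immediate", "slog"),
                     ("immediate", "data vdevs")]:
    written = bytes_written(2 * GIB, mode, target)
    print(f"{mode:9} -> {target:10}: "
          f"{written['data vdevs'] / GIB:.0f} GiB on data vdevs, "
          f"{written['slog'] / GIB:.0f} GiB on slog")
```

Under this model, only the immediate-mode path on the data vdevs makes the main disks absorb twice the payload.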

It is shockingly easy to reproduce this issue. In pseudo-code:

open(file)

write(file, lots of data) // e.g. 2 GB

fsync(file)

Watch the zil_stats kstat page when that runs.
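A runnable version of the pseudo-code might look like the following Python sketch. The function name timed_sync_write and the chunk size are arbitrary choices, and random data is used on the assumption that compression should not shrink the writes. Note that the filesystem may flush part of the buffered data on its own before fsync() is called, so the timing is only indicative.

```python
import os
import time


def timed_sync_write(path: str, total_bytes: int,
                     chunk_bytes: int = 1 << 20) -> float:
    """Write total_bytes to path with buffered writes, then time fsync().

    Returns the number of seconds spent inside fsync().
    """
    chunk = os.urandom(chunk_bytes)  # incompressible payload, reused per write
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = total_bytes
        while remaining > 0:
            # os.write() may write fewer bytes than requested; use its result.
            written = os.write(fd, chunk[:min(chunk_bytes, remaining)])
            remaining -= written
        start = time.monotonic()
        os.fsync(fd)  # this is the call that triggers the ZIL commit
        return time.monotonic() - start
    finally:
        os.close(fd)
```

Calling timed_sync_write("/tank/test/bigfile", 2 * 1024**3) on a dataset with logbias=latency (the path here is hypothetical) corresponds to the 2 GB case described in this post.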

If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.

If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.

The solution would be to modify the algorithm in zfs_log_write() and zvol_log_write() so that, under the conditions mentioned above, it switches to indirect writes when the commit size reaches a certain threshold (e.g. 32 MB).

I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement it.
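The proposed change could be modelled as follows. This is a hypothetical sketch of the idea only, not a patch and not ZFS code: choose_log_path_patched and indirect_commit_threshold are invented names, the 32 MB default simply mirrors the example figure given above, and the handling of sizes exactly equal to the other two thresholds is an assumption.

```python
def choose_log_path_patched(logbias: str, has_slog: bool, write_size: int,
                            commit_size: int, immediate_write_sz: int,
                            slog_limit: int,
                            indirect_commit_threshold: int = 32 * 1024 * 1024):
    """Decision model with the proposed large-commit rule added.

    Returns a (mode, target) tuple.
    """
    if logbias == "throughput":
        return ("indirect", "data vdevs")

    if not has_slog:
        if write_size > immediate_write_sz:
            return ("indirect", "data vdevs")
        return ("immediate", "data vdevs")

    # Proposed rule: once the commit reaches the threshold, skip immediate
    # mode entirely so the data is written only once, by the DMU.
    if commit_size >= indirect_commit_threshold:
        return ("indirect", "data vdevs")

    if commit_size > slog_limit:
        return ("immediate", "data vdevs")
    return ("immediate", "slog")
```

With this rule, a 2 GB commit on a pool with a slog would be logged in indirect mode, while small commits would still go to the slog as before.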

[References]

1. https://github.com/zfsonlinux/zfs/issues/1012
