fio
Introduction
While reading the Kafka documentation recently, I came across the following passage: on a JBOD configuration of six 7200rpm SATA disks, sequential writes can reach about 600MB/sec while random writes only reach about 100KB/sec, a difference of over 6000x.
The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec—a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!
FIO benchmark comparison
Test environment
Component | Version |
---|---|
ubuntu | 16.04 |
fio | 2.2.10 |
Disk | ST1000DM010-2EP102 (7200 rpm, 1TB) |
Test results
1. Synchronous sequential write (DIRECT)
fio is used as the benchmark tool, with strace attached to observe the system calls. O_DIRECT is used here to bypass the page cache and measure the raw disk performance; a minimal C sketch of what this looks like at the syscall level follows the fio output below.
strace -f -tt -o /tmp/write.log -D /usr/bin/fio --name=write_test --filename=write_test --bs=4k --size=4G --readwrite=write --direct=1
bw=53MB/s, iops=13.3k
write_test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
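Below is a minimal C sketch of a direct, synchronous, sequential 4k write loop, roughly what fio does internally with --ioengine=sync --direct=1 --bs=4k. The file name direct_test and the 1GiB total size are arbitrary assumptions, not taken from the fio source.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t bs = 4096;            /* block size, matches --bs=4k */
    const size_t blocks = 256 * 1024;  /* 256k blocks * 4k = 1GiB     */
    void *buf;

    /* O_DIRECT requires an aligned buffer and aligned I/O sizes. */
    if (posix_memalign(&buf, bs, bs) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 'a', bs);

    int fd = open("direct_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Each write() bypasses the page cache and blocks until the device
     * has accepted the data, so the loop runs at raw disk speed. */
    for (size_t i = 0; i < blocks; i++) {
        if (write(fd, buf, bs) != (ssize_t)bs) {
            perror("write");
            return 1;
        }
    }

    close(fd);
    free(buf);
    return 0;
}
```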
2. Synchronous random write (DIRECT)
strace -f -tt -o /tmp/rand_write.log -D /usr/bin/fio --name=randwrite_test --filename=randwrite_test --bs=4k --size=1G --readwrite=randwrite --direct=1
bw=938KB/s, iops=229
As the numbers show, random-write performance falls off a cliff once the page cache is bypassed.
randwrite_test: (groupid=0, jobs=1): err= 0: pid=28561: Sun Sep 18 16:05:04 2022
3. Synchronous sequential write (Buffered)
Removing --direct=1 from the synchronous sequential write makes the writes go through the system page cache.
strace -f -tt -o /tmp/writeb.log -D /usr/bin/fio --name=write_btest --filename=write_btest --bs=4k --size=4G --readwrite=write
bw=210MB/s, iops=52k
As a result, throughput improves by roughly 4x over the direct synchronous write.
write_btest: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
4. Synchronous random write (Buffered)
strace -f -tt -o /tmp/rand_bwrite.log -D /usr/bin/fio --name=randwrite_btest --filename=randwrite_btest --bs=4k --size=1G --readwrite=randwrite
bw=115MB/s, iops=28k. With the page cache in play, synchronous random writes are barely hurt by the random access pattern; the reason is analyzed in detail below.
randwrite_btest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
5. Asynchronous sequential write (DIRECT)
strace -f -tt -o /tmp/awrite.log -D /usr/bin/fio --name=write_atest --filename=write_atest --bs=4k --size=1G --readwrite=write --direct=1 --ioengine=libaio --iodepth=64
bw=91.2MB/s, iops=22.8k
avg_req_sz=8, avg_queue_sz=3
write_atest: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
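For reference, here is a hedged sketch of what --ioengine=libaio --iodepth=64 --direct=1 does underneath: submit a batch of 64 direct 4k writes with io_submit() and reap the completions with io_getevents(). Compile with -laio. The file name aio_test and the single-batch structure are simplifying assumptions, not the actual fio implementation.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD 64    /* queue depth, matches --iodepth=64 */
#define BS 4096  /* block size, matches --bs=4k       */

int main(void) {
    int fd = open("aio_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    int ret = io_setup(QD, &ctx);
    if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

    struct iocb iocbs[QD];
    struct iocb *iocbps[QD];
    void *bufs[QD];

    for (int i = 0; i < QD; i++) {
        if (posix_memalign(&bufs[i], BS, BS) != 0) { perror("posix_memalign"); return 1; }
        memset(bufs[i], 'a', BS);
        /* Sequential offsets here; a randwrite test would randomize them. */
        io_prep_pwrite(&iocbs[i], fd, bufs[i], BS, (long long)i * BS);
        iocbps[i] = &iocbs[i];
    }

    /* Hand all 64 requests to the kernel at once; with a deep queue the
     * block layer and the disk can reorder and merge them. */
    ret = io_submit(ctx, QD, iocbps);
    if (ret != QD) { fprintf(stderr, "io_submit returned %d\n", ret); return 1; }

    /* Reap completions until every request has finished. */
    struct io_event events[QD];
    int done = 0;
    while (done < QD) {
        int n = io_getevents(ctx, 1, QD - done, events, NULL);
        if (n < 0) { fprintf(stderr, "io_getevents: %s\n", strerror(-n)); return 1; }
        done += n;
    }

    io_destroy(ctx);
    close(fd);
    return 0;
}
```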
6. Asynchronous random write (DIRECT)
strace -f -tt -o /tmp/rand_awrite.log -D /usr/bin/fio --name=randwrite_atest --filename=randwrite_atest --bs=4k --size=1G --readwrite=randwrite --direct=1 --ioengine=libaio --iodepth=64
bw=1MB/s, iops=248
7. Asynchronous sequential write (Buffered)
strace -f -tt -o /tmp/a-bwrite.log -D /usr/bin/fio --name=write_abtest --filename=write_abtest --bs=4k --size=1G --readwrite=write --ioengine=libaio --iodepth=64
bw=110MB/s, iops=27k
8. Asynchronous random write (Buffered)
strace -f -tt -o /tmp/rand_abwrite.log -D /usr/bin/fio --name=randwrite_abtest --filename=randwrite_abtest --bs=4k --size=1G --readwrite=randwrite --ioengine=libaio --iodepth=64
bw=82MB/s, iops=20k
Summary of results
Mode | direct+sync | buffer+sync | direct + libaio | buffer + libaio |
---|---|---|---|---|
write | bw=53MB/s, iops=13k | bw=210MB/s, iops=52k | bw=91MB/s, iops=22k | bw=110MB/s, iops=27k |
randwrite | bw=938KB/s, iops=229 | bw=115MB/s, iops=28k | bw=1MB/s, iops=248 | bw=82MB/s, iops=20k |
Analysis: why randwrite + buffer + sync can reach 20k+ iops
The reason is that the operating system cheats the application: with the page cache, a write() at the kernel level only copies the buffer into the page cache and then reports success to the application. The I/O subsystem can later reorder and merge the dirty pages to optimize the access pattern before actually writing them to disk. See the reference documentation.
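A small sketch of this effect, using a hypothetical file buffered_test: the buffered write() loop is timed separately from fsync(). The write() phase only copies data into the page cache, so it finishes far faster than the disk could accept the data; the real disk time shows up in the fsync() phase.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    char buf[4096];
    memset(buf, 'a', sizeof(buf));

    int fd = open("buffered_test", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    double t0 = now_sec();
    for (int i = 0; i < 256 * 1024; i++) {   /* 1GiB of buffered 4k writes */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    double t1 = now_sec();

    fsync(fd);                               /* force the dirty pages to disk */
    double t2 = now_sec();

    printf("write() phase: %.2fs, fsync() phase: %.2fs\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}
```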
In test 8 above, you can see a large amount of write I/O continuing after the fio process has exited: that is the operating system flushing the dirty data in the page cache to disk.
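One rough way to watch this flush, kept in C for consistency with the other sketches (it is equivalent to grepping Dirty/Writeback out of /proc/meminfo once per second): run it right after fio exits and the Dirty counter can be seen draining as the kernel writes the cached data back to disk.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    for (int iter = 0; iter < 30; iter++) {  /* sample for ~30 seconds */
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "Dirty:", 6) == 0 ||
                strncmp(line, "Writeback:", 10) == 0)
                fputs(line, stdout);         /* dirty / in-flight page-cache data */
        }
        fclose(f);

        puts("----");
        sleep(1);
    }
    return 0;
}
```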