  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  694  be/3  root      0.00 B/s   0.00 B/s   0.00 %   2.08 %  [jbd2/md0-8]
22780  be/4  root      0.00 B/s   0.00 B/s   0.00 %   0.09 %  [kworker/1:1-events]
22780  be/4  root      0.00 B/s   0.00 B/s   0.00 %   0.08 %  [kworker/1:1-events_freezable_power_]

Note that the listing above is actually a concatenation of lines sampled at different times, not a snapshot of a single moment. What it does not clearly show is that while the [jbd2/md0-8] command (hereafter referred to as jbd2) indeed generated the most I/O at any one time, the [kworker/1:1-events] command (hereafter referred to as kworker) was almost constantly generating minute amounts of I/O. A brief look at the jbd2 command showed it to be a Journaling Block Device process, so I figured that it was perhaps writing journaling data in response to whatever was happening in the kworker process, and I decided to look further into that process.
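Snapshots like the ones above can also be collected over time non-interactively using iotop's batch mode; the flags and interval here are just an illustration, not necessarily what I used:

# log one snapshot per second, only for threads actually doing I/O,
# with timestamps so samples taken at different times can be compared
iotop -b -o -t -d 1 >> iotop.log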
Some searching turned up a way to trace exactly which work items the kworker threads were handling:

$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt

A short while later I also found these same commands in Documentation/core-api/workqueue.rst. Once again, I needed to enable some additional kernel options: CONFIG_DEBUG_FS and CONFIG_FTRACE were the obvious ones, but it wasn't clear which entries under the Tracers menu were important, so I guessed and enabled a large number of them. After another recompile and reboot, I used the above commands to gather my data. They generated a very large amount of output very quickly, but a few recurring entries stuck out at me:
kworker/0:1-20    [000] d...   422.386720: workqueue_queue_work: work struct=0000000062e60b79 function=vmstat_update workqueue=000000005d84bd29 req_cpu=0 cpu=0
kworker/0:1-20    [000] d...   422.826981: workqueue_queue_work: work struct=00000000c7c3684d function=kcryptd_crypt workqueue=000000007cde8437 req_cpu=2 cpu=4294967295
kworker/0:1-20    [000] d...   423.066760: workqueue_queue_work: work struct=00000000b9871e0f function=ata_sff_pio_task workqueue=000000006a76fa14 req_cpu=2 cpu=0
kworker/0:1-20    [000] d.s.   423.246907: workqueue_queue_work: work struct=0000000033e5a536 function=dm_wq_work workqueue=000000005448332a req_cpu=2 cpu=0

I figured that the kcryptd_crypt and dm_wq_work functions were likely related to encryption and the queueing of work and were thus unavoidable, but I was most curious about what the vmstat_update function was doing since, while I did have a test Virtual Machine (VM) that I would occasionally run, it was not running at the time. From reading this LKML e-mail, I deduced that the function periodically updates the kernel's virtual memory statistics (the "VM" in vmstat being virtual memory, not virtual machines):
The current vmstat_update mechanism depends on a deferrable timer firing every other second by default which registers a work queue item that runs on the local CPU, with the result that we have 1 interrupt and one additional schedulable task on each CPU every 2 seconds

Perhaps tuning this mechanism would fix my I/O problem? From the kernel documentation at Documentation/admin-guide/kernel-per-CPU-kthreads.rst I found that the firing interval could be modified by writing to /proc/sys/vm/stat_interval. I then attempted to reduce the updates to once a minute with the following:
# echo 60 > /proc/sys/vm/stat_interval

This did indeed reduce the work queued by vmstat_update, though it did not, unfortunately, quiet my disks.
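The same knob is also reachable through sysctl, which is handy for checking the value currently in effect; this is just the equivalent invocation:

# equivalent to writing /proc/sys/vm/stat_interval directly
sysctl -w vm.stat_interval=60

# confirm the value currently in effect
sysctl vm.stat_interval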
I went back to monitoring the kworker, and eventually noticed through the noise that the following work was being queued every two seconds:
kworker/0:1-18033 [000] d...  8069.014025: workqueue_queue_work: work struct=00000000b9871e0f function=ata_sff_pio_task workqueue=000000006a76fa14 req_cpu=2 cpu=0
kworker/0:1-18033 [000] d...  8071.093989: workqueue_queue_work: work struct=00000000b9871e0f function=ata_sff_pio_task workqueue=000000006a76fa14 req_cpu=2 cpu=0
kworker/0:1-18033 [000] d...  8073.173961: workqueue_queue_work: work struct=00000000b9871e0f function=ata_sff_pio_task workqueue=000000006a76fa14 req_cpu=2 cpu=0
kworker/0:1-18033 [000] d...  8075.253940: workqueue_queue_work: work struct=00000000b9871e0f function=ata_sff_pio_task workqueue=000000006a76fa14 req_cpu=2 cpu=0

SFF? PIO? I had no idea what any of these were. I didn't find anything useful in a search, so I looked through the source code and found that the function was being registered as a delayed work item in drivers/ata/libata-sff.c:
void ata_sff_port_init(struct ata_port *ap)
{
	INIT_DELAYED_WORK(&ap->sff_pio_task, ata_sff_pio_task);
	ap->ctl = ATA_DEVCTL_OBS;
	ap->last_ctl = 0xFF;
}

This function came with some comments for documentation that I did not understand, much less did I understand the libATA documentation at Documentation/driver-api/libata.rst. After some more searching I found out that PIO stood for Programmed Input/Output and that SFF likely stood for Small Form Factor, though this did little to help me understand my I/O issue. One thing I did notice while browsing the code was that the following definition in include/linux/libata.h:
struct delayed_work	sff_pio_task;

...was only defined if CONFIG_ATA_SFF was defined; unfortunately, CONFIG_ATA_SFF was required for CONFIG_ATA_PIIX, and I wasn't about to go unloading my SATA controller drivers. From here I started simply guessing at solutions to see if anything stuck. From reading the Wikipedia article on PIO, I deduced that DMA (Direct Memory Access) was important; perhaps the new drive had some variation in DMA capabilities that was causing the extra I/O? I then used the hdparm utility to examine each drive for differences:
hesse ~ # hdparm -I /dev/sda | grep -i dma
	   DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
hesse ~ # hdparm -I /dev/sdb | grep -i dma
	   DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	   	*	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   	*	URG for READ_STREAM[_DMA]_EXT
	   	*	URG for WRITE_STREAM[_DMA]_EXT
hesse ~ # hdparm -I /dev/sdc | grep -i dma
	   DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	   	DMA Setup Auto-Activate optimization

Well, I had no idea what to do with this information, so I was stuck again. At some point it occurred to me that since the drive was supposedly generating I/O, it was probably generating some kind of interrupt, so I decided to run watch cat /proc/interrupts to see if I could track the I/O to a particular drive. Watching the output, I noticed that the following line was accumulating two interrupts every two seconds:
 17:          0    1738624   IO-APIC   17-fasteoi   pata_marvell

Since the interrupts appeared to come from the pata_marvell driver, I used sysfs to get the PCI path that the driver was attached to:
hesse ~ # realpath /sys/module/pata_marvell/drivers/pci:pata_marvell/0000:02:00.0
/sys/devices/pci0000:00/0000:00:1c.1/0000:02:00.0

I then compared this to the hard drives' PCI paths:
lrwxrwxrwx 1 root root 0 Nov 8 19:33 sda -> ../../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Nov 8 19:33 sdb -> ../../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:1/0:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 8 19:33 sdc -> ../../devices/pci0000:00/0000:00:1f.5/ata3/host2/target2:0:0/2:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 8 19:33 sr0 -> ../../devices/pci0000:00/0000:00:1c.1/0000:02:00.0/ata5/host4/target4:0:0/4:0:0:0/block/sr0

It turns out that the driver was actually attached to an optical drive and not a regular hard drive. Well, I really didn't have any use for the optical drive, so I powered down the computer, unplugged the drive (but kept it in the case), then brought the server back up again. After examining the server for a while I noticed that the periodic blinking of the hard drive light had stopped, the kworker process had stopped generating constant small bits of I/O, and the driver was no longer generating interrupts. That was nice, but when I went to bed that night (reminder: I worked on this very intermittently) the drives were still constantly generating noise, so unplugging the optical drive didn't solve the overall problem.
Going back to iotop, the journaling process still showed up with regular writes:

  825  be/3  root      0.00 B/s  15.34 K/s   0.00 %   4.75 %  [jbd2/md0-8]

Yet there didn't seem to be any other process generating worthwhile I/O, so why would the journaling process be running? Some searching turned up this thread, which suggested mounting with the noatime option. I looked, and I was using relatime, so I switched it to noatime and also tried adding the lazytime option. This seemed to help a little; the journal didn't seem to run unless another process ran first, but at this point irssi seemed to generate enough intermittent I/O to continue to be a problem. Was I really getting data every few seconds, though? It seemed a bit much to me, so I used strace to monitor the irssi process' system calls, a snippet of which is below:
poll([{fd=0, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}, {fd=7, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}], 7, 499) = 0 (Timeout)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
poll([{fd=0, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}, {fd=7, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}], 7, 1) = 0 (Timeout)
poll([{fd=0, events=POLLIN|POLLPRI}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}, {fd=7, events=POLLIN|POLLPRI}, {fd=9, events=POLLIN|POLLPRI}, {fd=12, events=POLLIN|POLLPRI}], 7, 500) = 0 (Timeout)

The calls to poll I could understand, but why stat? Perhaps this was the source of the constant access that the noatime and lazytime options had diminished? This Stack Overflow answer suggested that the stat calls were a glibc artifact of not having the TZ environment variable set. I didn't feel like restarting irssi, so I searched for a way to set an environment variable in a running process and came across this Stack Overflow answer, which suggested using gdb to set the variable like so:
(gdb) attach process_id
(gdb) call putenv ("TZ=:/etc/localtime")
(gdb) detach

Unfortunately, when I tried to make the putenv call, gdb would barf out the following error message: No symbol table is loaded. Use the "file" command. I barely know gdb, so I simply tried pointing the file command at the irssi binary:
(gdb) file /usr/bin/irssi
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Reading symbols from /usr/bin/irssi...
(No debugging symbols found in /usr/bin/irssi)
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)

I guess gdb was unhappy because the program hadn't been compiled with debugging symbols. Whatever. I wasn't really convinced that the stat calls were generating I/O anyway, since I had enabled the noatime option, and I didn't feel like digging into it further. I went back to monitoring irssi with strace and noticed that most writes did correspond to actual messages, though they were mostly channel part and join data. As an experiment, I disabled logging by running /set autolog off in irssi and then listening, and, sure enough, my disks finally became much quieter. This was great, but not really practical, because I wanted to keep logging that data, so I needed a different solution.
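For reference, the kind of monitoring described above can be done by attaching strace to the already-running client and filtering the output down to the interesting system calls; the particular filter below is only an illustration:

# attach to the running irssi process (as root or its owner);
# -f follows threads, -tt adds timestamps, -e limits output to a few syscalls
strace -f -tt -e trace=write,openat,stat -p "$(pidof irssi)"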
I then began looking to see if I could somehow quiet the journaling process. While searching through the ext4 man page, I eventually found the commit=nrsec option, and set it with the following:
# mount -o remount,commit=60 /

This actually had the effect of limiting the jbd2 process to writing once a minute according to iotop, but I noticed that the Actual DISK WRITE statistic would still show intermittent activity in between commits, often accompanied by the clicking of the hard drives. Thinking about it, this actually made sense: the intermittent writes were the log data being written to disk, while the jbd2 writes were the associated metadata being committed to the journal periodically. To delay the writes to the disk themselves, I'd have to tweak the I/O scheduler.
First, I checked which scheduler was active:

hesse /sys/class/block # cat /sys/class/block/sda/queue/scheduler
[mq-deadline] kyber none

Since I had forgotten the details of the scheduler, I did some searching and found this presentation, which suggested that I could use the write_expire parameter to delay writes; the kernel documentation at Documentation/block/deadline-iosched.rst suggested the same thing. Based on this information, I then set a 1-minute write expiry on each drive:
echo 60000 > /sys/block/sda/queue/iosched/write_expire
echo 60000 > /sys/block/sdb/queue/iosched/write_expire
echo 60000 > /sys/block/sdc/queue/iosched/write_expire

This seemed to do the trick! The disks were finally much quieter than before, and the irssi data was still being logged. Looking at the rest of the scheduler's options, I also decided to raise fifo_batch from 16 to 256 so that a larger number of writes could be queued up before being flushed.
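To double-check which values are actually in effect for each member of the array, grep can be abused to print each sysfs file alongside its contents:

# print filename:value pairs for the deadline tunables on every drive
grep . /sys/block/sd[abc]/queue/iosched/write_expire \
       /sys/block/sd[abc]/queue/iosched/fifo_batch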
Thus, in total, the modifications I had made were: first, removing the CD-ROM drive, which may or may not have helped with the hard drive noise, though it did stop my computer's I/O LED from lighting up as much; next, lengthening the VM statistics update interval from one second to one minute, which also may not actually have mattered if those statistics only ever live in memory; after that, changing relatime to noatime and adding the lazytime and commit=60 options to the root filesystem; and lastly, configuring the deadline I/O scheduler with a write expiry of 1 minute rather than 5 seconds and a batch count of 256 rather than 16.
Though I used certain values for testing, I tweaked them slightly afterwards. I bumped the VM statistics update interval up to 3600 seconds (1 hour), because I don't use those statistics. As for the journal commit interval and the scheduler's write expiry, I increased both to 3 minutes (180 seconds and 180000 milliseconds, respectively). I'm really not in any particular rush to get the data written to disk, though it would be a good idea for me to get a UPS that sends a power-down command to my computer when the power starts to run out... I also increased the FIFO queue size to 1024, though I don't have any actual empirical data to back that decision. Finally, it was time to make the settings permanent.
While figuring out the configuration settings needed to quiet the disks was most of the battle, one final obstacle remained: applying the settings at boot so that I wouldn't need to re-configure them by hand. The disconnected CD-ROM drive didn't need any configuration. Since the VM statistics update interval is exposed through procfs, it could be set by adding the line vm.stat_interval = 3600 to /etc/sysctl.conf. Configuring the jbd2 process was a matter of updating the root filesystem mount options in /etc/fstab to include lazytime,commit=180 (I doubt that noatime is needed with lazytime). Configuring the I/O schedulers was a bit more tricky, though: according to this gist, there is no mechanism similar to sysctl for setting sysfs values at boot. Since I am running Gentoo, I decided to use an OpenRC service instead. The first step was to create the following file at /etc/init.d/iohush:
#!/sbin/openrc-run

name="iohush"
description="Increase write starve time in order to silence noisy spinning disks"

depend() {
	after sysfs
}

start() {
	echo 180000 > /sys/block/sda/queue/iosched/write_expire
	echo 180000 > /sys/block/sdb/queue/iosched/write_expire
	echo 180000 > /sys/block/sdc/queue/iosched/write_expire
	echo 256 > /sys/block/sda/queue/iosched/fifo_batch
	echo 256 > /sys/block/sdb/queue/iosched/fifo_batch
	echo 256 > /sys/block/sdc/queue/iosched/fifo_batch
}

Note the dependency on sysfs and the start function, which runs the shell commands that configure the I/O scheduler. With the start-up script written, I then made it executable and added it to the boot runlevel:
chmod 0755 /etc/init.d/iohush
rc-update add iohush boot

With this done, the script would automatically configure the I/O scheduler on boot! All of the configuration was now applied at start-up, and I would hopefully be able to sleep without having to reach for the earplugs once I got tired of hearing the hard drives click.
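For completeness, the persistent part of the configuration outside of the OpenRC script amounted to one sysctl line and one tweak to the root filesystem's fstab entry. The fstab line below is only illustrative; the device, filesystem, and remaining fields will differ per system:

# /etc/sysctl.conf -- update the VM statistics once an hour
vm.stat_interval = 3600

# /etc/fstab -- root filesystem with the lazytime and commit options added
# (device and other fields are placeholders)
/dev/md0    /    ext4    defaults,lazytime,commit=180    0 1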
A better option might have been to use solid state drives instead.
After a prolonged period I had to admit to myself that my solution was not quite working as intended. Although I/O requests were no longer being dispatched immediately, or nearly as frequently, I would still often hear more clicking than I desired (though not enough to keep me awake on all but the most restless nights). I played around with a few more settings, such as raising sd*/queue/nr_requests from 2 to 256, sd*/queue/max_sectors_kb from 128 to 25000, and sd*/queue/iosched/writes_starved from 2 to 128. None of these seemed to help (though my hopes were not high).
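For reference, those experiments were just more writes to the same sysfs tree as before, roughly as follows; note that the kernel may clamp or reject some of these values (max_sectors_kb, for example, cannot exceed the device's max_hw_sectors_kb):

# apply the experimental queue tunables to each member of the array
for d in sda sdb sdc; do
    echo 256   > /sys/block/$d/queue/nr_requests
    echo 25000 > /sys/block/$d/queue/max_sectors_kb
    echo 128   > /sys/block/$d/queue/iosched/writes_starved
done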
After spending some long periods monitoring iotop, I noticed that there appeared to be two main patterns in the I/O:
Pattern 1:

Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 6.77 K/s

Pattern 2 (two subsequent polls):

Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 12.48 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 1966.06 B/s

Curious whether I could find out anything more about these patterns, I used iostat to monitor the disks. I would open iotop, then, in another terminal, sync the disks, run iostat, wait for write activity to appear in the iotop terminal, then quickly run iostat again. Each result was then categorized by the iotop pattern it matched, until I had 3 samples of each (the 2nd pattern appeared less frequently in testing); a rough script for this sampling procedure is sketched further below. This gave me the following data points:
Device  tps   kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read   kB_wrtn   kB_dscd

Pattern #1:

Sample 1:
sda 3.00 6.77 16.12 0.00 19551420 46595192 0
sdb 3.15 6.75 16.08 0.00 19505657 46474388 0
sdc 3.57 6.72 16.10 0.00 19405438 46513572 0
sda 3.00 6.77 16.12 0.00 19551420 46595197 0
sdb 3.15 6.75 16.08 0.00 19505657 46474393 0
sdc 3.57 6.72 16.10 0.00 19405438 46513573 0

Sample 2:
sda 3.00 6.77 16.12 0.00 19551440 46595206 0
sdb 3.15 6.75 16.08 0.00 19505661 46474418 0
sdc 3.57 6.72 16.10 0.00 19405438 46513598 0
sda 3.00 6.77 16.12 0.00 19551440 46595211 0
sdb 3.15 6.75 16.08 0.00 19505661 46474423 0
sdc 3.57 6.71 16.09 0.00 19405438 46513599 0

Sample 3:
sda 3.00 6.76 16.12 0.00 19551592 46595376 0
sdb 3.15 6.75 16.08 0.00 19505689 46474716 0
sdc 3.57 6.71 16.09 0.00 19405442 46513856 0
sda 3.00 6.76 16.12 0.00 19551592 46595381 0
sdb 3.15 6.75 16.08 0.00 19505689 46474721 0
sdc 3.57 6.71 16.09 0.00 19405442 46513857 0

Pattern #2:

Sample 1:
sda 3.00 6.76 16.12 0.00 19551536 46595324 0
sdb 3.15 6.75 16.08 0.00 19505677 46474620 0
sdc 3.57 6.71 16.09 0.00 19405438 46513768 0
sda 3.00 6.76 16.12 0.00 19551536 46595329 0
sdb 3.15 6.75 16.08 0.00 19505681 46474621 0
sdc 3.57 6.71 16.09 0.00 19405438 46513773 0

Sample 2:
sda 3.00 6.76 16.12 0.00 19551624 46595398 0
sdb 3.15 6.75 16.08 0.00 19505689 46474782 0
sdc 3.57 6.71 16.09 0.00 19405450 46513902 0
sda 3.00 6.76 16.12 0.00 19551624 46595403 0
sdb 3.15 6.75 16.08 0.00 19505689 46474787 0
sdc 3.57 6.71 16.09 0.00 19405450 46513903 0

Sample 3:
sda 3.00 6.76 16.12 0.00 19551716 46595823 0
sdb 3.15 6.75 16.08 0.00 19505693 46475127 0
sdc 3.57 6.71 16.09 0.00 19405462 46514383 0
sda 3.00 6.76 16.12 0.00 19551716 46595828 0
sdb 3.15 6.75 16.08 0.00 19505693 46475128 0
sdc 3.57 6.71 16.09 0.00 19405462 46514388 0

The most interesting piece of this data was the consistent change in kB_wrtn. Of note was that in each case two drives would show an increment of 5 while the other would show an increment of 1. This made sense: the RAID-5 set-up was writing the data to one drive and its parity to another, while the 3rd drive only had its event counter updated. Unfortunately, the data I gathered did not show any difference in the quantity of writes between the two I/O patterns observed. Casual observation while gathering the data also appeared to show that the first pattern was generated by irssi while the second pattern was generated by atheme-services, so I decided to re-gather some data while also taking note of the programs generating the I/O:
fail2ban (10k):

Sample 1:
md0 4.34 10.51 28.34 0.00 30382489 81938988 0
sda 3.00 6.76 16.12 0.00 19551716 46596031 0
sdb 3.15 6.75 16.07 0.00 19505697 46475227 0
sdc 3.57 6.71 16.09 0.00 19405466 46514551 0
md0 4.34 10.51 28.34 0.00 30382489 81938996 0
sda 3.00 6.76 16.12 0.00 19551716 46596040 0
sdb 3.14 6.75 16.07 0.00 19505697 46475232 0
sdc 3.57 6.71 16.09 0.00 19405466 46514556 0

irssi (Pattern 1, 6K):

Sample 1:
md0 4.34 10.51 28.34 0.00 30382489 81939088 0
sda 3.00 6.76 16.11 0.00 19551716 46596126 0
sdb 3.14 6.75 16.07 0.00 19505697 46475266 0
sdc 3.57 6.71 16.09 0.00 19405466 46514626 0
md0 4.34 10.51 28.34 0.00 30382489 81939092 0
sda 3.00 6.76 16.11 0.00 19551716 46596131 0
sdb 3.14 6.75 16.07 0.00 19505697 46475267 0
sdc 3.57 6.71 16.09 0.00 19405466 46514631 0

Sample 2:
md0 4.34 10.51 28.34 0.00 30382489 81939120 0
sda 3.00 6.76 16.11 0.00 19551716 46596153 0
sdb 3.14 6.75 16.07 0.00 19505697 46475277 0
sdc 3.57 6.71 16.09 0.00 19405466 46514661 0
md0 4.34 10.51 28.34 0.00 30382489 81939124 0
sda 3.00 6.76 16.11 0.00 19551716 46596158 0
sdb 3.14 6.75 16.07 0.00 19505697 46475278 0
sdc 3.57 6.71 16.09 0.00 19405466 46514666 0

Sample 3:
md0 4.34 10.51 28.33 0.00 30382489 81939388 0
sda 3.00 6.76 16.11 0.00 19551724 46596394 0
sdb 3.14 6.74 16.07 0.00 19505701 46475398 0
sdc 3.57 6.71 16.08 0.00 19405478 46514874 0
md0 4.34 10.51 28.33 0.00 30382489 81939392 0
sda 3.00 6.76 16.11 0.00 19551724 46596399 0
sdb 3.14 6.74 16.07 0.00 19505701 46475399 0
sdc 3.57 6.71 16.08 0.00 19405478 46514879 0

irssi (Pattern 2, 10k-1000b):

Sample 1:
md0 4.34 10.51 28.33 0.00 30382489 81939316 0
sda 3.00 6.76 16.11 0.00 19551724 46596326 0
sdb 3.14 6.75 16.07 0.00 19505697 46475362 0
sdc 3.57 6.71 16.08 0.00 19405474 46514822 0
md0 4.34 10.51 28.33 0.00 30382489 81939320 0
sda 3.00 6.76 16.11 0.00 19551724 46596331 0
sdb 3.14 6.75 16.07 0.00 19505697 46475367 0
sdc 3.57 6.71 16.08 0.00 19405474 46514823 0

Sample 2:
md0 4.34 10.50 28.33 0.00 30382489 81939620 0
sda 3.00 6.76 16.11 0.00 19551728 46596597 0
sdb 3.14 6.74 16.07 0.00 19505737 46475509 0
sdc 3.57 6.71 16.08 0.00 19405482 46515057 0
md0 4.34 10.50 28.33 0.00 30382489 81939624 0
sda 3.00 6.76 16.11 0.00 19551728 46596602 0
sdb 3.14 6.74 16.07 0.00 19505737 46475514 0
sdc 3.57 6.71 16.08 0.00 19405482 46515058 0

atheme (10K):

Sample 1:
md0 4.34 10.51 28.34 0.00 30382489 81939216 0
sda 3.00 6.76 16.11 0.00 19551724 46596231 0
sdb 3.14 6.75 16.07 0.00 19505697 46475323 0
sdc 3.57 6.71 16.09 0.00 19405474 46514747 0
md0 4.34 10.51 28.34 0.00 30382489 81939224 0
sda 3.00 6.76 16.11 0.00 19551724 46596240 0
sdb 3.14 6.75 16.07 0.00 19505697 46475324 0
sdc 3.57 6.71 16.09 0.00 19405474 46514756 0

So not only did the data I gathered begin to undermine my hypothesis, I also started collecting data which didn't fit any of the patterns I had previously defined. Utterly confounding. I also got some rare data from fail2ban, which was mildly interesting. Overall, though, it appeared that irssi was capable of generating both types of I/O pattern. While not conclusive, I thus hypothesized that the different I/O patterns had more to do with variance in disk write location and polling timing than with the program generating the I/O, but I had no way to test this. At this point I had also become rather bored of this data-gathering method (it took quite a few tries to get atheme data), which did not appear to be bearing any fruit, so I decided to try looking at the scheduler code instead.
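As an aside, and as referenced above, the manual sync-and-sample routine can be scripted; the sketch below is only illustrative (the fixed sleep stands in for manually watching iotop, and the device list and sample count are assumptions):

#!/bin/sh
# Take paired iostat snapshots around bursts of write activity.
OUT=iostat-samples.txt

for sample in 1 2 3; do
    sync                                        # flush anything already dirty
    echo "=== sample $sample: $(date) ===" >> "$OUT"
    iostat -d md0 sda sdb sdc >> "$OUT"         # snapshot before the writes land
    sleep 10                                    # stand-in for watching iotop
    iostat -d md0 sda sdb sdc >> "$OUT"         # snapshot right afterwards
done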
Looking through the scheduler source in block/mq-deadline.c, the first things I found were the constants defining the default expiry times:

static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */

Following the use of the write_expire constant showed that it was assigned as part of the queue initialization function:
static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
...
	dd->fifo_expire[READ] = read_expire;
	dd->fifo_expire[WRITE] = write_expire;
...

Tracking the assigned member then showed that it was being used to set the fifo_time member of each request that was added to the queue:
static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
			      bool at_head)
...
	/*
	 * set expire time and add to fifo list
	 */
	rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
...

How was the fifo_time member used, though? I saw two usages: the first was to be set to an adjacent request's fifo_time during a merge:
static void dd_merged_requests(struct request_queue *q, struct request *req,
			       struct request *next)
...
	/*
	 * if next expires before rq, assign its expire time to rq
	 * and move into next position (next will be deleted) in fifo
	 */
	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
		if (time_before((unsigned long)next->fifo_time,
				(unsigned long)req->fifo_time)) {
			list_move(&req->queuelist, &next->queuelist);
			req->fifo_time = next->fifo_time;
		}
	}
...

The second use was to check whether or not the queue had any expired requests:
/*
 * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
 * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
 */
static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
{
	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);

	/*
	 * rq is expired!
	 */
	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
		return 1;

	return 0;
}

While I couldn't easily figure out whether or not write requests were batched together, what this did suggest to me was that each request gets its own separate expiration time unless it is merged with an adjacent request, in which case the surviving request inherits the earlier of the two fifo_time values. This matched what I was seeing: requests were still being dispatched intermittently, just at a delay, because of the high write_expire time. I also posited that I might be able to hackishly batch all write requests together by giving them the same fifo_time when they were inserted into the queue in dd_insert_request. Before trying to modify the code, though, I wanted to see if I could get some data on the queue's contents via debugfs, as the following code I came across suggested might be possible:
#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}

static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
	DEADLINE_QUEUE_DDIR_ATTRS(read),
	DEADLINE_QUEUE_DDIR_ATTRS(write),
	{"batching", 0400, deadline_batching_show},
	{"starved", 0400, deadline_starved_show},
	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
	{},
};
#undef DEADLINE_QUEUE_DDIR_ATTRS

Well, I tried that on a separate system (the running server wasn't readily configured for it and I didn't feel like modifying it), also set up with RAID-5, by generating periodic writes and monitoring the files exported under /sys/kernel/debug/block/sda/sched/*. I only (briefly) saw anything there while some other task was generating heavy I/O; the periodic writes did not register in those files at all.
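For anyone wanting to poke at the same thing, the monitoring amounted to watching those debugfs entries (whose names come from the attribute table above, e.g. write_fifo_list, batching, starved, dispatch); the invocation below is just one way to do it and assumes debugfs is mounted:

# dump every mq-deadline debugfs file for sda once a second
watch -n 1 'grep . /sys/kernel/debug/block/sda/sched/*'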
This left one obvious way forward, as mentioned earlier: try modifying the scheduler so that the write expiration time of a newly inserted request is set to match that of the requests already in the queue. While this would certainly be an interesting task, modifying the scheduler myself would mean running the risk of making a terrible mistake and corrupting my filesystem as a result. A workaround might have been to experiment inside a VM that I had backed up in case something went wrong.
Fortunately for me, though, my circumstances changed and I got a separate room to put the noisy server in. That's much easier than trying to hack the scheduler for an uncommon use case.