Linux: Where The Anticipatory Scheduler Shines

Submitted by Jeremy on February 24, 2003 - 8:59pm.
Linux news

Andrew Morton [interview] recently posted some interesting benchmarks comparing the current 2.4 IO scheduler, a "hacked" version of the deadline IO scheduler [story] in 2.5.61, the CFQ scheduler [story], and the anticipatory scheduler [story]. Offering a succinct "executive summary" of his results, Andrew said, "the anticipatory scheduler is wiping the others off the map, and 2.4 is a disaster." Indeed, in many of the tests the other IO schedulers were measured in minutes, whereas the anticipatory IO scheduler was measured in mere seconds.

Andrea Arcangeli responded to these tests, pointing out that they do not in any way highlight the benefits of the CFQ scheduler, which is instead designed to maintain minimal worst-case latency on each and every IO read and write. Andrea explains, "CFQ is made for multimedia desktop usage only, you want to be sure mplayer or xmms will never skip frames, not for parallel cp reading floods of data at max speed like a database with zillon of threads." This led to an interesting discussion in which Andrew suggested that such programs employ a broken design which should be fixed directly, rather than worked around in the IO scheduler.

Andrew's latest -mm release, 2.5.62-mm3, still includes all three of the 2.5 kernel's IO schedulers. The default is 'as', but you can also select 'cfq' or 'deadline' from the kernel boot commandline.
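Assuming the -mm kernels of this period use Jens Axboe's switchable-elevator boot parameter (named elevator=), that selection amounts to appending something like elevator=cfq or elevator=deadline to the kernel command line at the boot prompt; treat the exact parameter name as an assumption and check the release notes for your tree.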


From: Andrew Morton
Subject: IO scheduler benchmarking
Date: Thu, 20 Feb 2003 21:23:04 -0800

Following this email are the results of a number of tests of various I/O
schedulers:

- Anticipatory Scheduler (AS) (from 2.5.61-mm1 approx)

- CFQ (as in 2.5.61-mm1)

- 2.5.61+hacks (Basically 2.5.61 plus everything before the anticipatory
  scheduler - tweaks which fix the writes-starve-reads problem via a
  scheduling storm)

- 2.4.21-pre4

All these tests are simple things from the command line.

I stayed away from the standard benchmarks because they do not really touch
on areas where the Linux I/O scheduler has traditionally been bad.  (If they
did, perhaps it wouldn't have been so bad..)

Plus all the I/O schedulers perform similarly with the usual benchmarks. 
With the exception of some tiobench phases, where AS does very well.

Executive summary: the anticipatory scheduler is wiping the others off the
map, and 2.4 is a disaster.

I really have not sought to make the AS look good - I mainly concentrated on
things which we have traditionally been bad at.  If anyone wants to suggest
other tests, please let me know.

The known regressions from the anticipatory scheduler are:

1) 15% (ish) slowdown in David Mansfield's database run.  This appeared to
   go away in later versions of the scheduler.

2) 5% dropoff in single-threaded qsbench swapstorms

3) 30% dropoff in write bandwidth when there is a streaming read (this is
   actually good).

The test machine is a fast P4-HT with 256MB of memory.  Testing was against a
single fast IDE disk, using ext2.


From: Andrew Morton
Subject: iosched: parallel streaming reads
Date: Thu, 20 Feb 2003 21:23:52 -0800

Here we see how well the scheduler can cope with multiple processes reading
multiple large files.  We read ten well laid out 100 megabyte files in
parallel (ten readers):

    for i in $(seq 0 9)
    do
            time cat 100-meg-file-$i > /dev/null &
    done

2.4.21-pre4:
0.00s user 0.18s system 2% cpu 6.115 total
0.02s user 0.22s system 1% cpu 14.312 total
0.01s user 0.19s system 1% cpu 14.812 total
0.00s user 0.14s system 0% cpu 20.462 total
0.02s user 0.19s system 0% cpu 23.887 total
0.06s user 0.14s system 0% cpu 27.085 total
0.01s user 0.26s system 0% cpu 32.367 total
0.00s user 0.22s system 0% cpu 34.844 total
0.01s user 0.21s system 0% cpu 35.233 total
0.01s user 0.16s system 0% cpu 37.007 total

2.5.61+hacks:
0.01s user 0.16s system 0% cpu 2:12.00 total
0.01s user 0.15s system 0% cpu 2:12.12 total
0.00s user 0.14s system 0% cpu 2:12.34 total
0.01s user 0.15s system 0% cpu 2:12.68 total
0.00s user 0.15s system 0% cpu 2:12.93 total
0.01s user 0.17s system 0% cpu 2:13.06 total
0.01s user 0.14s system 0% cpu 2:13.18 total
0.01s user 0.17s system 0% cpu 2:13.31 total
0.01s user 0.16s system 0% cpu 2:13.49 total
0.01s user 0.19s system 0% cpu 2:13.51 total

2.5.61+CFQ:
0.01s user 0.16s system 0% cpu 50.778 total
0.01s user 0.16s system 0% cpu 51.067 total
0.01s user 0.16s system 0% cpu 52.854 total
0.01s user 0.17s system 0% cpu 53.303 total
0.01s user 0.17s system 0% cpu 54.565 total
0.01s user 0.18s system 0% cpu 1:07.39 total
0.01s user 0.17s system 0% cpu 1:19.96 total
0.00s user 0.17s system 0% cpu 1:28.74 total
0.01s user 0.18s system 0% cpu 1:31.28 total
0.01s user 0.18s system 0% cpu 1:32.34 total

2.5.61+AS:
0.01s user 0.17s system 0% cpu 27.995 total
0.01s user 0.18s system 0% cpu 30.550 total
0.00s user 0.17s system 0% cpu 31.413 total
0.00s user 0.18s system 0% cpu 32.381 total
0.01s user 0.17s system 0% cpu 33.273 total
0.01s user 0.18s system 0% cpu 33.389 total
0.01s user 0.15s system 0% cpu 34.534 total
0.01s user 0.17s system 0% cpu 34.481 total
0.00s user 0.17s system 0% cpu 34.694 total
0.01s user 0.16s system 0% cpu 34.832 total

AS and 2.4 almost achieved full disk bandwidth.  2.4 does quite well here,
although it was unfair.
As an aside, I reran this test with the VM readahead wound down from the
usual 128k to just 8k:

2.5.61+CFQ:
0.01s user 0.25s system 0% cpu 7:48.39 total
0.01s user 0.23s system 0% cpu 7:48.72 total
0.02s user 0.26s system 0% cpu 7:48.93 total
0.02s user 0.25s system 0% cpu 7:48.93 total
0.01s user 0.26s system 0% cpu 7:49.08 total
0.02s user 0.25s system 0% cpu 7:49.22 total
0.02s user 0.26s system 0% cpu 7:49.25 total
0.02s user 0.25s system 0% cpu 7:50.35 total
0.02s user 0.26s system 0% cpu 8:19.82 total
0.02s user 0.28s system 0% cpu 8:19.83 total

2.5.61 base:
0.01s user 0.25s system 0% cpu 8:10.53 total
0.01s user 0.27s system 0% cpu 8:11.96 total
0.02s user 0.26s system 0% cpu 8:14.95 total
0.02s user 0.26s system 0% cpu 8:17.33 total
0.02s user 0.25s system 0% cpu 8:18.05 total
0.01s user 0.24s system 0% cpu 8:19.03 total
0.02s user 0.27s system 0% cpu 8:19.66 total
0.02s user 0.25s system 0% cpu 8:20.00 total
0.02s user 0.26s system 0% cpu 8:20.10 total
0.02s user 0.25s system 0% cpu 8:20.11 total

2.5.61+AS:
0.02s user 0.23s system 0% cpu 28.640 total
0.01s user 0.23s system 0% cpu 28.066 total
0.02s user 0.23s system 0% cpu 28.525 total
0.01s user 0.20s system 0% cpu 28.925 total
0.01s user 0.22s system 0% cpu 28.835 total
0.02s user 0.21s system 0% cpu 29.014 total
0.02s user 0.23s system 0% cpu 29.093 total
0.01s user 0.20s system 0% cpu 29.175 total
0.01s user 0.23s system 0% cpu 29.233 total
0.01s user 0.21s system 0% cpu 29.285 total

We see here that the anticipatory scheduler is not dependent upon large
readahead to get good performance.
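The posting doesn't say how the readahead was wound down. One user-visible knob for this kind of experiment, assuming the util-linux blockdev tool is wired up to the 2.5 readahead code, is something like "blockdev --setra 16 /dev/hda" (16 sectors, i.e. 8k), with "blockdev --setra 256" restoring the usual 128k.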
From: Andrew Morton
Subject: iosched: effect of streaming write on interactivity
Date: Thu, 20 Feb 2003 21:24:39 -0800

It peeves me that if a machine is writing heavily, it takes *ages* to get a
login prompt.

Here we start a large streaming write, wait for that to reach steady state
and then see how long it takes to pop up an xterm from the machine under
test with

    time ssh testbox xterm -e true

there is quite a lot of variability here.

2.4.21-4:     62 seconds
2.5.61+hacks: 14 seconds
2.5.61+CFQ:   11 seconds
2.5.61+AS:    12 seconds
From: Andrew Morton
Subject: iosched: effect of streaming read on interactivity
Date: Thu, 20 Feb 2003 21:25:21 -0800

Similarly, start a large streaming read on the test box and see how long it
then takes to pop up an x client running on that box with

    time ssh testbox xterm -e true

2.4.21-4:     45 seconds
2.5.61+hacks: 5 seconds
2.5.61+CFQ:   8 seconds
2.5.61+AS:    9 seconds
From: Andrew Morton
Subject: iosched: time to copy many small files
Date: Thu, 20 Feb 2003 21:25:54 -0800

This test simply measures how long it takes to copy a large number of files
within the same filesystem.  It creates a lot of small, competing read and
write I/O's.

Changes which were made to the VFS dirty memory handling early in the 2.5
cycle tends to make 2.5 a bit slower at this.

Three copies of the 2.4.19 kernel tree were placed on an ext2 filesystem.
Measure the time it takes to copy them all to the same filesystem, and to
then sync the system.  This is just

    cp -a ./dir-with-three-kernel-trees/ ./new-dir
    sync

The anticipatory scheduler doesn't help here.  It could, but we haven't got
there yet, and it may need VFS help.

2.4.21-pre4:  70 seconds
2.5.61+hacks: 72 seconds
2.5.61+CFQ:   69 seconds
2.5.61+AS:    66 seconds
From: Andrew Morton
Subject: iosched: concurrent reads of many small files
Date: Thu, 20 Feb 2003 21:26:27 -0800

This test is very approximately the "busy web server" workload.

We set up a number of processes each of which are reading many small files
from different parts of the disk.

Set up six separate copies of the 2.4.19 kernel tree, and then run, in
parallel, six processes which are reading them:

    for i in 1 2 3 4 5 6
    do
            time (find kernel-tree-$i -type f | xargs cat > /dev/null ) &
    done

With this test we have six read requests in the queue all the time.  It's
what the anticipatory scheduler was designed for.

2.4.21-pre4:  6m57.537s 6m57.620s 6m57.741s 6m57.891s 6m57.909s 6m57.916s
2.5.61+hacks: 3m40.188s 3m51.332s 3m55.110s 3m56.186s 3m56.757s 3m56.791s
2.5.61+CFQ:   5m15.932s 5m16.219s 5m16.386s 5m17.407s 5m50.233s 5m50.602s
2.5.61+AS:    0m44.573s 0m45.119s 0m46.559s 0m49.202s 0m51.884s 0m53.087s

This was a little unfair to 2.4 because three of the trees were laid out by
the pre-Orlov ext2.  So I reran the test with 2.4.21-pre4 when all six trees
were laid out by 2.5's Orlov allocator:

    6m12.767s 6m12.974s 6m13.001s 6m13.045s 6m13.062s 6m13.085s

Not much difference there, although Orlov is worth a 4x speedup in this test
when there is only a single reader (or multiple readers + anticipatory
scheduler)
From: Andrew Morton
Subject: iosched: impact of streaming write on streaming read
Date: Thu, 20 Feb 2003 21:27:03 -0800

Here we take a look at the impact which a streaming write has upon streaming
read bandwidth.

A single streaming write was set up with:

    while true
    do
            dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
    done

and we measure how long it takes to read a 100 megabyte file from the same
filesystem with

    time cat 100m-file > /dev/null

I'll include `vmstat 1' snippets here as well.

2.4.21-pre4: 42 seconds

1 3 276 4384 2144 222300 0 0 80 26480 520 743 0 6 94 0
0 3 276 4344 2144 222240 0 0 76 25224 512 492 0 4 96 0
0 3 276 4340 2148 222220 0 0 124 25584 520 536 0 3 97 0
0 3 276 4404 2152 222132 0 0 44 26604 538 533 0 5 95 0
0 4 276 4464 2160 221928 0 0 60 25040 516 559 0 4 96 0
0 4 276 4460 2160 221900 0 0 612 27456 560 621 0 4 96 0
0 4 276 4392 2156 221972 0 0 708 23872 488 566 0 4 95 0
0 4 276 4420 2168 221852 0 0 688 26668 545 653 0 4 96 0
0 4 276 4204 2164 221912 0 0 696 21588 492 884 0 5 95 0
0 4 276 4448 2164 221668 0 0 396 21376 423 833 0 4 96 0
0 4 276 4432 2160 221688 0 0 784 26368 544 705 0 4 96 0
0 4 276 4400 2168 221608 0 0 560 27640 563 596 0 5 95 0
4 1 276 4324 2188 221616 0 0 12476 12996 538 908 0 4 96 0
0 4 276 3516 2196 222408 0 0 12320 16048 529 971 0 2 98 0
0 4 276 3468 2212 222424 0 0 12704 14428 540 1039 0 4 96 0
0 4 276 4112 2208 221700 0 0 552 20824 474 539 0 4 96 0
3 2 276 3768 2208 222040 0 0 524 25428 503 612 0 3 97 0
0 4 276 4452 2216 221344 0 0 536 19548 437 1241 0 3 97 0

2.5.61+hacks: 48 seconds

0 5 0 2140 1296 227700 0 0 0 22236 1213 126 0 4 0 96
0 5 0 2252 1296 227664 0 0 0 23340 1219 123 0 3 0 97
0 6 0 4044 1288 225904 0 0 1844 13632 1183 236 0 2 0 98
0 6 0 4100 1268 225788 0 0 1920 13780 1173 217 0 2 0 98
0 6 0 4156 1248 225908 0 0 2184 14828 1184 236 0 3 0 97
0 6 0 4100 1244 226012 0 0 2176 13720 1173 237 0 2 0 98
0 6 0 4212 1240 225980 0 0 1924 13900 1175 236 0 2 0 98
0 5 0 5444 1192 224824 0 0 2304 11820 1164 206 0 2 0 98
0 6 0 2196 1180 228088 0 0 2308 14460 1180 269 0 3 0 97

2.5.61+CFQ: 27 seconds

1 3 0 6196 2060 222852 0 0 0 23840 1247 220 0 4 4 92
0 2 0 4404 1820 224880 0 0 0 22208 1237 271 0 3 8 89
2 4 0 2884 1680 226588 0 0 1496 26944 1263 355 0 4 2 94
0 4 0 4332 1312 225388 0 0 4592 14692 1244 414 0 3 0 97
0 4 0 4268 1012 225764 0 0 1408 29540 1308 671 0 5 0 95
0 4 0 3316 1016 226752 0 0 2820 27500 1306 668 0 5 0 95
0 4 0 4212 992 225924 0 0 3076 22148 1255 508 0 3 0 97

2.5.61+AS: 3.8 seconds

0 4 0 2236 1320 227548 0 0 0 36684 1335 136 0 5 0 95
0 4 0 2236 1296 227636 0 0 0 37736 1334 134 0 5 0 95
0 5 0 3348 1088 226604 0 0 1232 30040 1320 174 0 4 0 96
0 5 0 2284 1056 227920 0 0 29088 5488 1536 855 0 4 0 96
0 5 0 4916 1080 225672 0 0 26904 8452 1517 993 0 5 0 95
0 5 120 2228 1108 228732 0 120 29472 6752 1545 940 0 3 1 96
0 4 120 4196 1060 226984 0 0 16164 15740 1426 627 0 3 3 93
From: Andrew Morton
Subject: iosched: impact of streaming write on read-many-files
Date: Thu, 20 Feb 2003 21:27:30 -0800

Here we look at what affect a large streaming write has upon an operation
which reads many small files from the same disk.

A single streaming write was set up with:

    while true
    do
            dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
    done

and we measure how long it takes to read all the files from a 2.4.19 kernel
tree off the same disk with

    time (find kernel-tree -type f | xargs cat > /dev/null)

As a reference, the time to read the kernel tree with no competing I/O is
7.9 seconds.

2.4.21-pre4: Don't know.  I killed it after 15 minutes.  Judging from the
vmstat output it would have taken many hours.

2.5.61+hacks: 7 minutes 27 seconds

r b swpd free buff cache si so bi bo in cs us sy id wa
0 8 0 2188 1200 226692 0 0 852 17664 1204 253 0 3 0 97
0 8 0 4148 1212 224804 0 0 1940 16208 1187 245 0 2 0 98
0 7 0 4260 1128 224756 0 0 324 20228 1226 298 0 3 0 97
0 8 0 4204 1048 224944 0 0 500 20856 1227 313 0 3 0 97
1 7 0 2300 1040 226840 0 0 348 20272 1227 313 0 3 0 97
0 8 0 4204 1044 224952 0 0 212 21564 1230 320 0 3 0 97

2.5.61+CFQ: 9 minutes 55 seconds

r b swpd free buff cache si so bi bo in cs us sy id wa
1 2 0 4308 1028 224660 0 0 180 38368 1250 357 0 3 6 91
0 4 0 2180 1020 226852 0 0 324 25196 1266 408 0 4 1 95
0 4 0 2236 1016 226744 0 0 252 26948 1276 449 0 4 2 93
0 4 0 4196 1020 224816 0 0 380 23204 1250 454 0 3 4 93
0 3 0 4356 1036 224632 0 0 2616 25824 1271 490 0 4 0 96
0 4 0 4140 968 224996 0 0 496 29416 1304 609 0 4 0 96
0 4 0 2180 948 226972 0 0 352 29364 1300 688 0 5 0 95
0 3 0 4364 928 224796 0 0 344 22100 1281 656 0 4 22 74

(CFQ had a strange 20-second pause in which it performed no reads at all)
(And a later 4-second one)
(then 10 seconds..)

2.5.61+AS: 17 seconds

r b swpd free buff cache si so bi bo in cs us sy id wa
0 6 0 2280 2716 226112 0 0 0 22388 1205 151 0 3 0 97
0 6 0 4296 2596 224168 0 0 0 21968 1213 148 0 3 0 97
1 6 0 3872 2516 224408 0 0 296 19552 1223 249 0 3 0 97
0 9 0 2176 2584 225324 0 0 5112 14588 1573 1424 0 5 0 94
0 8 0 3364 2668 223116 0 0 17512 8500 3059 6065 0 8 0 92
1 8 0 4156 2708 221340 0 0 12812 9560 2695 4863 0 9 0 91
0 8 0 3740 2956 221188 0 0 17216 7200 2406 4045 0 6 0 94
0 9 0 3828 2668 221192 0 0 9712 8972 1615 1540 0 5 0 94
1 6 0 2060 2924 222272 0 0 8428 17784 1713 1718 0 5 0 95
From: Andrew Morton
Subject: iosched: effect of streaming read on streaming write
Date: Thu, 20 Feb 2003 21:28:29 -0800

Here we look at how much damage a streaming read can do to writeout
performance.

Start a streaming read with:

    while true
    do
            cat 512M-file > /dev/null
    done

and measure how long it takes to write out and fsync a 100 megabyte file:

    time write-and-fsync -f -m 100 outfile

2.4.21-pre4:  6.4 seconds
2.5.61+hacks: 7.7 seconds
2.5.61+CFQ:   8.4 seconds
2.5.61+AS:    11.9 seconds

This is the one where the anticipatory scheduler could show its downside.
It's actually not too bad - the read stream steals 2/3rds of the disk
bandwidth.

Dirty memory will reach the vm threshold and writers will throttle.  This is
usually what we want to happen.

Here is the vmstat 1 trace for the anticipatory scheduler:

r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 8728 2268 2620 233412 0 0 40360 0 1658 802 0 4 0 96
0 2 8728 3780 2508 231924 0 0 40616 4 1668 874 0 5 0 95
0 2 8728 3668 2276 232416 0 0 40740 20 1668 978 0 4 0 96
0 3 8728 3660 2192 232668 40 0 35296 12 1603 904 0 4 0 95
0 5 8728 3612 1964 231672 0 0 26220 18572 1497 1381 0 15 0 85
0 5 8728 2100 1732 233584 0 0 25232 8696 1497 867 0 3 16 81
0 5 8728 3664 1204 232424 0 0 27668 8696 1533 787 0 3 0 97
1 4 8728 2432 792 234108 0 0 27160 8696 1527 965 0 3 0 97
0 6 8728 2208 760 234436 0 0 25904 9584 1513 856 0 3 0 97
2 6 8728 3776 760 233148 0 0 27776 8716 1537 880 0 3 0 97
0 6 8728 2204 624 234968 0 0 27924 8812 1541 991 0 4 0 96
0 4 8716 2508 600 234740 0 0 28188 8216 1537 1038 0 4 0 96
0 4 8716 4072 532 233316 0 16 25624 9644 1515 896 0 3 0 97
0 4 8716 3740 548 233624 0 0 27548 8696 1528 908 0 3 0 97
From: Andrew Morton
Subject: iosched: impact of streaming read on read-many-files
Date: Thu, 20 Feb 2003 21:27:58 -0800

Here we look at what affect a large streaming read has upon an operation
which reads many small files from the same disk.

A single streaming read was set up with:

    while true
    do
            cat 512M-file > /dev/null
    done

and we measure how long it takes to read all the files from a 2.4.19 kernel
tree off the same disk with

    time (find kernel-tree -type f | xargs cat > /dev/null)

2.4.21-pre4:  31 minutes 30 seconds
2.5.61+hacks: 3 minutes 39 seconds
2.5.61+CFQ:   5 minutes 7 seconds (*)
2.5.61+AS:    17 seconds

* CFQ performed very strangely here.  Tremendous amount of seeking and a big
drop in aggregate bandwidth.  See the vmstat 1 output from when the kernel
tree read started up:

r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 1240 125260 1176 109488 0 0 40744 0 1672 725 0 3 49 47
0 1 1240 85892 1220 148788 0 0 39344 0 1651 693 0 3 49 48
0 1 1240 45124 1260 189492 0 0 40744 0 1663 683 0 3 49 47
1 1 1240 4544 1300 230068 0 0 40616 0 1661 837 0 4 49 47
0 2 1348 3468 944 231696 0 108 40488 148 1671 800 0 4 4 91
0 2 1348 2180 936 232920 0 0 40612 64 1668 789 0 4 0 96
0 3 1348 4220 996 230648 0 0 11348 0 1256 352 0 2 0 98
0 3 1348 4052 1064 230472 0 0 9012 0 1207 305 0 1 0 98
0 4 1348 3596 1148 230580 0 0 6756 0 1171 247 0 1 0 99
0 4 1348 4044 1148 229888 0 0 6344 0 1165 237 0 1 0 99
1 3 1348 3708 1160 230212 0 0 7800 0 1187 255 0 1 21 78
From: Andrea Arcangeli
Subject: Re: iosched: impact of streaming read on read-many-files
Date: Fri, 21 Feb 2003 11:40:28 +0100

On Thu, Feb 20, 2003 at 09:27:58PM -0800, Andrew Morton wrote:
> 
> Here we look at what affect a large streaming read has upon an operation
> which reads many small files from the same disk.
> 
> A single streaming read was set up with:
> 
>     while true
>     do
>             cat 512M-file > /dev/null
>     done
> 
> and we measure how long it takes to read all the files from a 2.4.19 kernel
> tree off the same disk with
> 
>     time (find kernel-tree -type f | xargs cat > /dev/null)
> 
> 2.4.21-pre4:  31 minutes 30 seconds
> 2.5.61+hacks: 3 minutes 39 seconds
> 2.5.61+CFQ:   5 minutes 7 seconds (*)
> 2.5.61+AS:    17 seconds
> 
> * CFQ performed very strangely here.  Tremendous amount of seeking and a

strangely? this is the *feature*. Benchmarking CFQ in function of real
time is pointless, apparently you don't understand the whole point about
CFQ and you keep benchmarking like if CFQ was designed for a database
workload. the only thing you care if you run CFQ is the worst case
latency of read, never the throughput, 128k/sec is more than enough as
far as you never wait 2 seconds before you can get the next 128k.

take tiobench with 1 single thread in read mode and keep it running in
background and collect the worst case latency, only *then* you will have
a chance to see a benefit. CFQ is all but a generic purpose elevator.
You must never use CFQ if your object is throughput and you benchmark
the global workload and not the worst case latency of every single read
or write-sync syscall.

CFQ is made for multimedia desktop usage only, you want to be sure
mplayer or xmms will never skip frames, not for parallel cp reading
floods of data at max speed like a database with zillon of threads. For
multimedia not to skip frames 1M/sec is more than enough bandwidth,
doesn't matter if the huge database in background runs much slower as
far as you never skip a frame.

If you don't mind to skip frames you shouldn't use CFQ and everything
will run faster, period.

Andrea
From: Andrew Morton
Subject: Re: iosched: impact of streaming read on read-many-files
Date: Fri, 21 Feb 2003 13:11:58 -0800

Andrea Arcangeli wrote:
> 
> CFQ is made for multimedia desktop usage only, you want to be sure
> mplayer or xmms will never skip frames, not for parallel cp reading
> floods of data at max speed like a database with zillon of threads. For
> multimedia not to skip frames 1M/sec is more than enough bandwidth,
> doesn't matter if the huge database in background runs much slower as
> far as you never skip a frame.

These applications are broken.  The kernel shouldn't be bending over
backwards trying to fix them up.  Because this will never ever work as well
as fixing the applications.

The correct way to design such an application is to use an RT thread to
perform the display/audio device I/O and a non-RT thread to perform the disk
I/O.  The disk IO thread keeps the shared 8 megabyte buffer full.  The RT
thread mlocks that buffer.

The deadline scheduler will handle that OK.  The anticipatory scheduler
(which is also deadline) will handle it better.

If an RT thread performs disk I/O it is bust, and we should not try to fix
it.  The only place where VFS/VM/block needs to care for RT tasks is in the
page allocator.  Because even well-designed RT tasks need to allocate pages.

The 2.4 page allocator has a tendency to cause 5-10 second stalls for a
single page allocation when the system is under writeout load.  That is fixed
in 2.5, but special-casing RT tasks in the allocator would make sense.
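For illustration, here is a minimal user-space sketch of the two-thread design Andrew describes. It is not taken from any real player: the helper names, the simplified fill-once buffer handling, and the priority value are all invented; only the 8 megabyte figure and the mlock/SCHED_FIFO split come from the email.

    /* Minimal, illustrative sketch: a normal-priority thread does the disk
     * reads into an mlock()ed buffer, while a SCHED_FIFO thread only ever
     * touches that buffer (and, in a real player, the output device).
     * Build with: gcc -O2 -pthread player-sketch.c -o player-sketch */
    #include <fcntl.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BUF_SIZE (8 * 1024 * 1024)      /* the "8 megabyte buffer" */

    static char buffer[BUF_SIZE];
    static volatile size_t filled;          /* bytes valid in buffer   */
    static volatile int reader_done;

    /* Non-RT thread: performs all disk I/O and keeps the buffer full. */
    static void *disk_reader(void *arg)
    {
        int fd = open((const char *)arg, O_RDONLY);
        if (fd >= 0) {
            ssize_t n;
            while (filled < BUF_SIZE &&
                   (n = read(fd, buffer + filled, BUF_SIZE - filled)) > 0)
                filled += (size_t)n;
            close(fd);
        } else {
            perror("open");
        }
        reader_done = 1;
        return NULL;
    }

    /* RT thread: never touches the disk; it would feed the audio/video
     * device from the locked buffer.  Here it just waits and reports. */
    static void *rt_player(void *arg)
    {
        (void)arg;
        struct sched_param sp = { .sched_priority = 10 };
        if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
            fprintf(stderr, "SCHED_FIFO needs root; continuing unprivileged\n");
        while (!reader_done) {
            /* ...write some of buffer[0..filled) to the sound card... */
            usleep(10000);
        }
        printf("played %zu buffered bytes\n", (size_t)filled);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <media-file>\n", argv[0]);
            return 1;
        }
        /* Pin the buffer so the RT thread can never page-fault on it. */
        if (mlock(buffer, BUF_SIZE) != 0)
            perror("mlock (needs root or a raised RLIMIT_MEMLOCK)");

        pthread_t reader, player;
        pthread_create(&reader, NULL, disk_reader, argv[1]);
        pthread_create(&player, NULL, rt_player, NULL);
        pthread_join(reader, NULL);
        pthread_join(player, NULL);
        return 0;
    }

The property that matters for this discussion is simply that the thread holding RT priority never issues disk I/O; only the normal-priority thread does, so the I/O scheduler is free to treat it like any other bulk reader.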
From: Andrea Arcangeli
Subject: Re: iosched: impact of streaming read on read-many-files
Date: Sun, 23 Feb 2003 16:16:43 +0100

On Fri, Feb 21, 2003 at 01:11:58PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > CFQ is made for multimedia desktop usage only, you want to be sure
> > mplayer or xmms will never skip frames, not for parallel cp reading
> > floods of data at max speed like a database with zillon of threads. For
> > multimedia not to skip frames 1M/sec is more than enough bandwidth,
> > doesn't matter if the huge database in background runs much slower as
> > far as you never skip a frame.
> 
> These applications are broken.  The kernel shouldn't be bending over
> backwards trying to fix them up.  Because this will never ever work as well
> as fixing the applications.

disagree, if the kernel doesn't provide a lowlatency elevator of some
sort there's no way to workaround it in userspace with just a
partial-mem buffer (unless you do [1])

> The correct way to design such an application is to use an RT thread to
> perform the display/audio device I/O and a non-RT thread to perform the disk
> I/O.  The disk IO thread keeps the shared 8 megabyte buffer full.  The RT
> thread mlocks that buffer.

having an huge buffering introduces the 8m latency during startup Which
is very annoying if the machine is under high load (especially if you
want to apply realtime effects to the audio, ever tried the xmms
equalizer with an 8m buffer? and it still doesn't guarantee that 8megs
are enough.

secondly 8mbytes mlocked are quite a lot for a 128m destkop.

third, applications are just doing what you suggest and still you can
hear seldom skips during heavy I/O i.e. having buffering is not enough
if the elevator only cares about global throughput or if the queue is
very huge (and incidentally you're not using SFQ/CFQ).

It is also possible you don't know what you want to read until the last
minute.

[1] Along your lines you can also buy some giga of ram and copy the
whole multimedia data in ramfs before playback ;) I mean, I agree it's a
problem that can be solved by throwing money into the hardware.

> The deadline scheduler will handle that OK.  The anticipatory scheduler
> (which is also deadline) will handle it better.
> 
> If an RT thread performs disk I/O it is bust, and we should not try to fix
> it.  The only place where VFS/VM/block needs to care for RT tasks is in the
> page allocator.  Because even well-designed RT tasks need to allocate pages.
> 
> The 2.4 page allocator has a tendency to cause 5-10 second stalls for a
> single page allocation when the system is under writeout load.  That is fixed
> in 2.5, but special-casing RT tasks in the allocator would make sense.

the main issue that matters here is not the vm but the blkdev layer and
there you never know if the I/O was submitted by an RT task or not.

and btw the right design for such app is really to use async-io not to
fork off a worthless thread for the I/O.

Andrea
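Andrea's preferred alternative, async I/O rather than a separate disk thread, would look roughly like the following sketch using the Linux AIO (libaio) interface. This is illustrative only, not anything from the thread: the chunk size and file name are invented, error handling is minimal, and with a plain buffered file descriptor the kernel may still complete the read synchronously, so real use typically combines this with O_DIRECT.

    /* Rough sketch of the async-I/O approach using libaio.
     * Build with: gcc -O2 aio-sketch.c -o aio-sketch -laio */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHUNK (128 * 1024)

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(8, &ctx) != 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        static char buf[CHUNK];
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };

        /* Queue one read without blocking; a player would queue the next
         * chunk while the previous one is being decoded and played. */
        io_prep_pread(&cb, fd, buf, CHUNK, 0);
        if (io_submit(ctx, 1, cbs) != 1) {
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }

        /* Later, from the same loop that feeds the audio device, reap
         * the completion. */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            printf("read %ld bytes asynchronously\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        return 0;
    }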


From: Nick Piggin
Subject: Re: iosched: impact of streaming read on read-many-files
Date: Fri, 21 Feb 2003 21:55:00 +1100

Andrea Arcangeli wrote:

>On Thu, Feb 20, 2003 at 09:27:58PM -0800, Andrew Morton wrote:
>
>>Here we look at what affect a large streaming read has upon an operation
>>which reads many small files from the same disk.
>>
>>A single streaming read was set up with:
>>
>>	while true
>>	do
>>	        cat 512M-file > /dev/null
>>	done
>>
>>and we measure how long it takes to read all the files from a 2.4.19 kernel
>>tree off the same disk with
>>
>>	time (find kernel-tree -type f | xargs cat > /dev/null)
>>
>>
>>
>>2.4.21-pre4:	31 minutes 30 seconds
>>
>>2.5.61+hacks:	3 minutes 39 seconds
>>
>>2.5.61+CFQ:	5 minutes 7 seconds (*)
>>
>>2.5.61+AS:	17 seconds
>>
>>
>>
>>
>>
>>* CFQ performed very strangely here.  Tremendous amount of seeking and a
>>
>
>strangely? this is the *feature*. Benchmarking CFQ in function of real
>time is pointless, apparently you don't understand the whole point about
>CFQ and you keep benchmarking like if CFQ was designed for a database
>workload. the only thing you care if you run CFQ is the worst case
>latency of read, never the throughput, 128k/sec is more than enough as
>far as you never wait 2 seconds before you can get the next 128k.
>
>take tiobench with 1 single thread in read mode and keep it running in
>background and collect the worst case latency, only *then* you will have
>a chance to see a benefit. CFQ is all but a generic purpose elevator.
>You must never use CFQ if your object is throughput and you benchmark
>the global workload and not the worst case latency of every single read
>or write-sync syscall.
>
>CFQ is made for multimedia desktop usage only, you want to be sure
>mplayer or xmms will never skip frames, not for parallel cp reading
>floods of data at max speed like a database with zillon of threads. For
>multimedia not to skip frames 1M/sec is  more than enough bandwidth,
>doesn't matter if the huge database in background runs much slower as
>far as you never skip a frame.
>
>If you don't mind to skip frames you shouldn't use CFQ and everything
>will run faster, period.
>
There is actually a point when you have a number of other IO streams
going on where your decreased throughput means *maximum* latency goes
up because robin doesn't go round fast enough. I guess desktop loads
won't often have a lot of different IO streams.

The anticipatory scheduler isn't so strict about fairness, however it
will make as good an attempt as CFQ at keeping maximum read latency
below read_expire (actually read_expire*2 in the current implementation).


From: Andrea Arcangeli
Subject: Re: iosched: impact of streaming read on read-many-files
Date: Fri, 21 Feb 2003 12:23:47 +0100

On Fri, Feb 21, 2003 at 09:55:00PM +1100, Nick Piggin wrote:
> There is actually a point when you have a number of other IO streams
> going on where your decreased throughput means *maximum* latency goes
> up because robin doesn't go round fast enough. I guess desktop loads

this is why it would be nice to set a prctl in the task structure that
defines the latency sensitive tasks, so you could leave enabled the CFQ
always and only xmms and mplayer would take advantage of it (unless you
run then with --skip-frame-is-ok). CFQ in function of pid is the simpler
closer transparent approximation of that.

Andrea


From: David Lang
Subject: Re: IO scheduler benchmarking
Date: Thu, 20 Feb 2003 22:51:37 -0800 (PST)

one other useful test would be the time to copy a large (multi-gig) file.
currently this takes forever and uses very little fo the disk bandwidth, I
suspect that the AS would give more preference to reads and therefor would
go faster.

for a real-world example, mozilla downloads files to a temp directory and
then copies it to the premanent location. When I download a video from my
tivo it takes ~20 min to download a 1G video, during which time the system
is perfectly responsive, then after the download completes when mozilla
copies it to the real destination (on a seperate disk so it is a copy, not
just a move) the system becomes completely unresponsive to anything
requireing disk IO for several min.

David Lang


From: Andrew Morton
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 00:16:24 -0800

David Lang wrote:
> 
> one other useful test would be the time to copy a large (multi-gig) file.
> currently this takes forever and uses very little fo the disk bandwidth, I
> suspect that the AS would give more preference to reads and therefor would
> go faster.

Yes, that's a test.

    time (cp 1-gig-file foo ; sync)

2.5.62-mm2,AS:        1:22.36
2.5.62-mm2,CFQ:       1:25.54
2.5.62-mm2,deadline:  1:11.03
2.4.21-pre4:          1:07.69

Well gee.

> for a real-world example, mozilla downloads files to a temp directory and
> then copies it to the premanent location. When I download a video from my
> tivo it takes ~20 min to download a 1G video, during which time the system
> is perfectly responsive, then after the download completes when mozilla
> copies it to the real destination (on a seperate disk so it is a copy, not
> just a move) the system becomes completely unresponsive to anything
> requireing disk IO for several min.

Well 2.4 is unreponsive period.  That's due to problems in the VM - processes
which are trying to allocate memory get continually DoS'ed by `cp' in page
reclaim.

For the reads-starved-by-writes problem which you describe, you'll see that
quite a few of the tests did cover that.  contest does as well.
From: Andrea Arcangeli
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 11:31:40 +0100

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> Yes, that's a test.
> 
>     time (cp 1-gig-file foo ; sync)
> 
> 2.5.62-mm2,AS:        1:22.36
> 2.5.62-mm2,CFQ:       1:25.54
> 2.5.62-mm2,deadline:  1:11.03
> 2.4.21-pre4:          1:07.69
> 
> Well gee.

It's pointless to benchmark CFQ in a workload like that IMHO. if you
read and write to the same harddisk you want lots of unfariness to go
faster. Your latency is the mixture of read and writes and the writes
are run by the kernel likely so CFQ will likely generate more seeks (it
also depends if you have the magic for the current->mm == NULL).

You should run something on these lines to measure the difference:

    dd if=/dev/zero of=readme bs=1M count=2000
    sync
    cp /dev/zero . &
    time cp readme /dev/null

And the best CFQ benchmark really is to run tiobench read test with 1
single thread during the `cp /dev/zero .`. That will measure the worst
case latency that `read` provided during the benchmark, and it should
make the most difference because that is definitely the only thing one
can care about if you need CFQ or SFQ. You don't care that much about
throughput if you enable CFQ, so it's not even correct to even benchmark
in function of real time, but only the worst case `read` latency
matters.

> > for a real-world example, mozilla downloads files to a temp directory and
> > then copies it to the premanent location. When I download a video from my
> > tivo it takes ~20 min to download a 1G video, during which time the system
> > is perfectly responsive, then after the download completes when mozilla
> > copies it to the real destination (on a seperate disk so it is a copy, not
> > just a move) the system becomes completely unresponsive to anything
> > requireing disk IO for several min.
> 
> Well 2.4 is unreponsive period.  That's due to problems in the VM - processes
> which are trying to allocate memory get continually DoS'ed by `cp' in page
> reclaim.

this depends on the workload, you may not have that many allocations, a
echo 1 >/proc/sys/vm/bdflush will fix it shall your workload be hurted
by too much dirty cache. Furthmore elevator-lowlatency makes the blkdev
layer much more fair under load.

Andrea
From: William Lee Irwin III
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 02:51:46 -0800

On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
>> Well 2.4 is unreponsive period.  That's due to problems in the VM -
>> processes which are trying to allocate memory get continually DoS'ed
>> by `cp' in page reclaim.

On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> this depends on the workload, you may not have that many allocations,
> a echo 1 >/proc/sys/vm/bdflush will fix it shall your workload be hurted
> by too much dirty cache. Furthmore elevator-lowlatency makes
> the blkdev layer much more fair under load.

Restricting io in flight doesn't actually repair the issues raised by
it, but rather avoids them by limiting functionality. The issue raised
here is streaming io competing with processes working within bounded
memory. It's unclear to me how 2.5.x mitigates this but the effects are
far less drastic there. The "fix" you're suggesting is clamping off the
entire machine's io just to contain the working set of a single process
that generates unbounded amounts of dirty data and inadvertently
penalizes other processes via page reclaim, where instead it should be
forced to fairly wait its turn for memory.

-- wli
From: Andrea Arcangeli
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 12:08:07 +0100

On Fri, Feb 21, 2003 at 02:51:46AM -0800, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 12:16:24AM -0800, Andrew Morton wrote:
> >> Well 2.4 is unreponsive period.  That's due to problems in the VM -
> >> processes which are trying to allocate memory get continually DoS'ed
> >> by `cp' in page reclaim.
> 
> On Fri, Feb 21, 2003 at 11:31:40AM +0100, Andrea Arcangeli wrote:
> > this depends on the workload, you may not have that many allocations,
> > a echo 1 >/proc/sys/vm/bdflush will fix it shall your workload be hurted
> > by too much dirty cache. Furthmore elevator-lowlatency makes
> > the blkdev layer much more fair under load.
> 
> Restricting io in flight doesn't actually repair the issues raised by

the amount of I/O that we allow in flight is purerly random, there is no
point to allow several dozen mbytes of I/O in flight on a 64M machine,
my patch fixes that and nothing more.

> it, but rather avoids them by limiting functionality.

If you can show a (throughput) benchmark where you see this limited
functionalty I'd be very interested. Alternatively I can also claim that
2.4 and 2.5 are limiting functionalty too by limiting the I/O in flight
to some hundred megabytes right?

it's like a dma ring buffer size of a soundcard, if you want low latency
it has to be small, it's as simple as that. It's a tradeoff between
latency and performance, but the point here is that apparently you gain
nothing with such an huge amount of I/O in flight. This has nothing to
do with the number of requests, the requests have to be a lot, or seeks
won't be reordered aggressively, but when everything merges using all
the requests is pointless and it only has the effect of locking
everything in ram, and this screw the write throttling too, because we
do write throttling on the dirty stuff, not on the locked stuff, and
this is what elevator-lowlatency address. You may argue on the amount of
in flight I/O limit I choosen, but really the default in mainlines looks
overkill to me for generic hardware.

> The issue raised here is streaming io competing with processes working
> within bounded memory. It's unclear to me how 2.5.x mitigates this but
> the effects are far less drastic there. The "fix" you're suggesting is
> clamping off the entire machine's io just to contain the working set of

show me this claimping off please. take 2.4.21pre4aa3 and trash it
compared to 2.4.21pre4 with the minimum 32M queue, I'd be very
interested, if I've a problem I must fix it ASAP, but all the benchmarks
are in green so far and the behaviour was very bad before these fixes,
go ahead and show me red and you'll make me a big favour. Either that or
you're wrong that I'm claimping off anything.

Just to be clear, this whole thing has nothing to do with the elevator,
or the CFQ or whatever, it only is related to the worthwhile amount of
in flight I/O to keep the disk always running.

> a single process that generates unbounded amounts of dirty data and
> inadvertently penalizes other processes via page reclaim, where instead
> it should be forced to fairly wait its turn for memory.
> 
> -- wli

Andrea
From: Nick Piggin
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 22:17:55 +1100

Andrea Arcangeli wrote:

>it's like a dma ring buffer size of a soundcard, if you want low latency
>it has to be small, it's as simple as that. It's a tradeoff between
>
Although the dma buffer is strictly FIFO, so the situation isn't
quite so simple for disk IO.
From: Andrea Arcangeli
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 12:41:43 +0100

On Fri, Feb 21, 2003 at 10:17:55PM +1100, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> 
> >it's like a dma ring buffer size of a soundcard, if you want low latency
> >it has to be small, it's as simple as that. It's a tradeoff between
> >
> Although the dma buffer is strictly FIFO, so the situation isn't
> quite so simple for disk IO.

In genereal (w/o CFQ or the other side of it that is an extreme unfair
starving elevator where you're stuck regardless the size of the queue)
larger queue will mean higher latencies in presence of flood of async
load like in a dma buffer. This is obvious for the elevator noop for
example. I'm speaking about a stable, non starving, fast, default
elevator (something like in 2.4 mainline incidentally) and for that the
similarity with dma buffer definitely applies, there will be a latency
effect coming from the size of the queue (even ignoring the other issues
that the load of locked buffers introduces).

The whole idea of CFQ is to make some workload work lowlatency
indipendent on the size of the async queue. But still (even with CFQ)
you have all the other problems about write throttling and worthless
amount of locked ram and even wasted time on lots of full just ordered
requests in the elevator (yeah I know you use elevator noop won't waste
almost any time, but again this is not most people will use).

I don't buy Andrew complaining about the write throttling when he still
allows several dozen mbytes of ram in flight and invisible to the VM, I
mean, before complaining about write throttling the excessive worthless
amount of locked buffers must be fixed and so I did and it works very
well from the feedback I had so far. You can take 2.4.21pre4aa3 and
benchmark it as you want if you think I'm totally wrong, the
elevator-lowlatency should be trivial to apply and backout (benchmarking
against pre4 would be unfair).

Andrea
From: Andrew Morton
Subject: Re: IO scheduler benchmarking
Date: Fri, 21 Feb 2003 13:25:49 -0800

Andrea Arcangeli wrote:
> 
> I don't
> buy Andrew complaining about the write throttling when he still allows
> several dozen mbytes of ram in flight and invisible to the VM,

The 2.5 VM accounts for these pages (/proc/meminfo:Writeback) and throttling
decisions are made upon the sum of dirty+writeback pages.

The 2.5 VFS limits the amount of dirty+writeback memory, not just the amount
of dirty memory.

Throttling in both write() and the page allocator is fully decoupled from the
queue size.  An 8192-slot (4 gigabyte) queue on a 32M machine has been
tested.

The only tasks which block in get_request_wait() are the ones which we want
to block there: heavy writers.

Page reclaim will never block page allocators in get_request_wait().  That
causes terrible latency if the writer is still active.

Page reclaim will never block a page-allocating process on I/O against a
particular disk block.  Allocators are instead throttled against _any_ write
I/O completion.  (This is broken in several ways, but it works well enough to
leave it alone I think).
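The Dirty and Writeback counters Andrew refers to can be watched while a test runs; assuming a 2.5-style /proc/meminfo that exports both fields, something like "grep -E 'Dirty|Writeback' /proc/meminfo" shows the two figures the throttling decision is summed from.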



AS in mainline 2.5 now!!

February 24, 2003 - 9:37pm
Anonymous

Not that I'm Linus, but AS looks to be rocking ass. Perhaps this with the other block updates and the new threading fixes will finally set Linux performance into a class all its own, even above the sacred FreeBSD (esp since, as of now, FreeBSD 5 is slower than 4.x). What would have to be done in order to get AS into mainline?

~Christopher

Not before 2.7

February 24, 2003 - 11:57pm

I believe a change like this will have to wait for the next development series unless it's very unintrusive (which I don't know about) and Linus decides it goes in even though they're well into feature freeze.

RE: Not before 2.7

February 25, 2003 - 12:41am
Anonymous

Well, you could say that it's not a new feature; rather, it's a performance-related fix to an existing feature.

Hopefully before 2.6

February 25, 2003 - 1:01am

It is actually quite unobtrusive. Jens has an interchangeable elevator API, which means it is just about all contained in its own file. I am pretty sure it would be included in Linux 2.6 _if_ it performs well enough. The locking and API are simple, the code is small and relatively simple with few strange interactions, so there isn't a big case for the code being upset. It has been in -mm for a while without problems. We'll see though.

Not yet

February 25, 2003 - 12:34am

It needs more test cycles. More instrumentation and heuristics are needed in order to keep regressions down. Being something which has to make a decision to idle the disk, you can imagine the window for regressions is quite large.

Fortunately by now it seems most of them are under control. There are a few things in the works which are looking hopeful.

yes and no :-)

February 25, 2003 - 9:08am

Yes, provided that it's easy for the end user to switch between different schedulers. For example, it's still unclear to me how or whether AS negatively affects the performance of near-real-time applications, such as media players.
But maybe that's just me; I'm running xine all the time and I care very much about multimedia. I agree with you, though, that AS might prove to be beneficial for, say, my servers.

Media bandwidth

February 26, 2003 - 9:14pm
Anonymous

I don't feel that media playback bandwidth is gonna kill most people here anyway; I can watch anything smooth on a K6-2 500, old hardware but still enough to do video. Most people are not going to notice: if you use a system for multimedia, you most likely either are not doing anything else, or you have the hardware to take the problems out.

FBSD 5.0 is not yet performance optimized.

February 26, 2003 - 8:55pm
Anonymous

FreeBSD is supposed to do a lot of tuning between the 5.0 release and
the moment it goes -STABLE, not before the 5.0 release.

This is normal, but the mechanisms and restructuring for performance are already in place. It just needs fine-tuning with real-world workloads, which is why it is released now.

real time threads?

February 24, 2003 - 11:47pm
Anonymous

In the article we've read, they mentioned realtime threads.

Does anyone have some documentation about them?

IEEE has it (Posix)

February 25, 2003 - 7:17am
Anonymous



but you have to pay for the standard


O'Reilly also has book/s about it


which one?

February 25, 2003 - 8:50am
Anonymous

Which one should I choose if I don't want xmms to skip and want to copy huge files from one partition to another, CFQ or AS?

thanks

CFQ

February 25, 2003 - 9:09am

(see subj.)

Where most users are

February 25, 2003 - 10:54am
Anonymous

Perhaps CFQ is good for some professional work, like with Film-Gimp and the sound editor Ardour, and not just XMMS or MPlayer.
Couldn't it eventually be added to the kernel as a module option?
One does not necessarily exclude the other!

But if one excludes the other, and if you ask my opinion on which kernel feature is more useful, which without question has to do with the quantity of possible users, then CFQ should go in and AS > /dev/null

There are zillions of opinions, and sure, the whole "open source, gnu, bsd, linux" movement is complex! But things don't usually fall from the sky, you've got to make it happen... and a desktop Linux revolution is very much needed (could have been first priority).

No

February 25, 2003 - 12:23pm
Anonymous

> "and a desktop linux revolution is very much needed( could have been first priority)"

No, no, no, no. Just remember that the desktop was Microsoft's first priority and look what that did to the quality of their OS.

Microsoft did it the wrong way around: they designed their OS for the client and then tried to get this design to translate to the server. This is silly. Linux is kinda doing it the other way around by basing itself on tried and tested UNIX concepts which come from the world of servers and slowly molding this into a general purpose OS which is good at everything. Coming from the server side of things first ensures that reliability and scalability are already there before you start worrying about the fiddly changes required for the desktop.

It is easier and will result in a higher quality product in the end if we simply accept that the time for Linux on Joe Public's desktop is not here YET. Don't beat yourself up about it, the time will come. Just don't rush it. As a great man once said, "it's ready when it's ready".

(Yes, I do use Linux on all my machines (servers, desktops and laptop) but I don't make all my friends use it, yet. :o)

dont take it from me

February 25, 2003 - 8:29pm
Anonymous

Just look at http://www.aaxnet.com/editor/edit029.html , and if you have the patience to read until the end you'll find out that the security problems with Microsoft Windows arose from the fact that it was built with no security in mind, because it was built with no concern for networking connectivity whatsoever. Remember Microsoft "internet blank"?
I think you will understand why I say fight Microsoft and pass along warnings like the URL above to avoid "it".

AS is *killer* in server task

February 27, 2003 - 2:56am

AS is *killer* in server tasks (i.e. a web server). CFQ is about interactivity.

Here comes the Desktop / Server Divide

February 25, 2003 - 12:30pm
Anonymous

I look at RedHat – Desktop version and Server version
I look at Suse - Desktop version and Server version

It appears the OS is branching into two markets, and distro vendors will probably roll CFQ into the desktop version and AS into the server version, and thus justify why they charge thousands for the (improved) server editions.

The kernel is constantly pulled in all directions... Unless we manage to make it modular with respect to the VM, I see forking as imminent.

Embedded devices and cell pho

February 25, 2003 - 1:20pm
Anonymous

Embedded devices and cell phones will need a lot of VM changes. So there might be a "Embedded Linux" fork, too.

embedded and other issues

February 25, 2003 - 1:40pm
Anonymous

We are talking about the kernel, not really the OS. My humble opinion is that embedded distributions are "platforms" rather than a different iteration of the kernel alone. They are used on particular hardware and are often stripped of generic stuff.

The CFQ vs. AS issue is a general design (philosophy) question. Look at the way Andrew and Andrea differ about software development issues: "2 diff-priority threads for an MP3 player VS. async code." Andrew wants everyone to write FOR his kernel. Andrea seems to write the kernel FOR the users in general. Both have good sides (ideas). But their kernels have very different applications (and implications).

I am stuck looking at this disarray and not being able to choose easily what I want my customers to use. In my perfect world, there would be one distro with tweaking for desktop and server. This is not a perfect world.

Actually

February 25, 2003 - 3:26pm

Andrew was pointing out that you have to do a _lot_ more than CFQ
if you want to ensure there is no skipping. And yes even Andrew's
method could skip.

Here comes the Desktop / Server Divide

February 27, 2003 - 10:04am
Anonymous

That is what they should do!
but the distros don't seem to understand Linux well enough.

Contest benchmark results

February 25, 2003 - 12:48pm

I've been benchmarking the -cfq and -as schedulers with contest, and so far cfq gives far better results than the as scheduler with respect to contest. I've been unable to post results with the most recent as/cfq kernel, 2.5.62-mm3, due to a hardware failure followed by an apparent memory leak, but all the results done to date show cfq significantly better on this benchmark. I posted some results to lkml previously comparing the two. Here is a copy of just the relevant (disk-based) contest results:

ctar_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   137     58.4    2.0     5.8     1.69
2.5.61-mm1cfq       3   104     76.0    1.0     3.8     1.32
xtar_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   158     48.7    2.0     4.4     1.95
2.5.61-mm1cfq       3   104     74.0    1.0     3.8     1.32
io_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   634     12.5    257.3   24.6    7.83
2.5.61-mm1cfq       3   397     19.6    123.3   18.1    5.03
io_other:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   187     41.7    84.7    27.3    2.31
2.5.61-mm1cfq       3   199     39.2    77.2    23.5    2.52
read_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   120     65.8    8.9     5.8     1.48
2.5.61-mm1cfq       3   109     72.5    7.1     5.5     1.38
list_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   97      79.4    0.0     6.2     1.20
2.5.61-mm1cfq       3   97      79.4    0.0     6.2     1.23
dbench_load:
Kernel         [runs]   Time    CPU%    Loads   LCPU%   Ratio
2.5.61-mm1          2   716     10.8    11.0    50.4    8.84
2.5.61-mm1cfq       3   426     18.1    5.7     50.7    5.39

mm1 is with the -as scheduler
mm1cfq is with the -cfq scheduler

Contest results

February 25, 2003 - 9:52pm

The CFQ scheduler is truly impressive. I hope that someone branches off a desktop development kernel for 2.5, to test desktop-related patches.

Shouldn't need to fork

February 26, 2003 - 3:44am
Anonymous

AFAICT these are all modular enough to coexist, and be switched at boot time. 'Twould be even better if you could sync down and change at runtime, e.g. use AS for workaday stuff then come to a screaming halt and proceed with CFQ while you do multimedia work. Might also help TwoKernelMonte-ish situations to practice that.

I'd love to be able to change things like VM management, scheduling etc. on the fly (or even on the brief touch-and-go), and would love even more being able to replace the entire kernel without rebooting (i.e., read in the new kernel, flush queues and momentarily suspend everything (not even closing files), switch kernels, resize data elements etc. if needed (this could be done from symbol tables if new fields in structures always default safely to zero), unsleep everything, resume).

Please pardon the programmer's parentheses.

Basically, what you're descr

February 26, 2003 - 1:28pm
Anonymous

Basically, what you're describing in the "replace the kernel" scenario is task migration, only you're migrating a task to yourself, across a reboot.

You'd probably have to select a subset of tasks to maintain across reboot, and let the rest shutdown/restart as part of normal boot procedure. There's a lot of "kernel state" that's established by userland as part of bootup, and so you need to find some way to bring that across.

If all you're worried about are your mortal-user tasks, save/restore them, and let all the root-owned stuff cycle like a normal reboot.

Atleast

February 27, 2003 - 6:09am
Anonymous

Changing the kernel at runtime isn't the easiest thing in the world, but how about doing a shutdown that results in a new kernel loading, which in turn starts up init and boots the machine? Why, you might ask? Well, simply to avoid the whole process of booting the hardware (some machines are fast; some, well, slooooow) before the OS kicks in. I always wondered why you couldn't do that with Linux (and XP for that matter). I remember it being done in Win95. Could be that they cheated and exited Windows to DOS and then started it again, though ;)

Actually, I seem to recall su

February 28, 2003 - 8:03am
Anonymous

Actually, I seem to recall such a feature going in, or at least being available as a patch -- the ability to load one Linux kernel from another. One of the proposed uses was for fast rebooting of dev kernels. Another was to boot a "light" kernel, probe the machine from userspace, and then fast-reboot a tuned kernel for the particular machine. The latter scenario seems reasonable/important for distribution installers or highly configurable machines.
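The patch being recalled is presumably the kexec work; assuming its userspace tool, the flow is roughly "kexec -l /boot/vmlinuz-new --append=..." to stage the new kernel and then "kexec -e" to jump into it without going back through the firmware.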

Scheduler should be data adaptive

March 9, 2003 - 1:41pm
Anonymous

IMO the right way to do the scheduler would be to make it 'adaptive'. Collect all sorts of data about the processes the kernel runs and then switch/decide which scheduler to use. Alternatively, one could try to generate an optimum scheduler for a given type of load (harder): lots of free parameters, optimize to minimize the estimated workload, then compile and load new scheduler code.

Collecting data probably makes things quite slow, so it should be done only seldom.

Difficult

March 9, 2003 - 6:38pm

A.S. is moving toward being more adaptive based on a process's
previous IO history... Unfortunately, for the most part, IO
latency and throughput needs are very difficult or impossible
for the kernel to detect, and, as they have mostly
mutually exclusive requirements, user intervention has to
play a part at some point for maximum performance in a given
situation.

So yes, data can be collected and used to an advantage;
however, it is a much bigger (if not impossible) step to
really get it to make good "policy" decisions.

Not neccasarily good

February 26, 2003 - 8:55am
Anonymous

Someone in a previous news article about CFQ and AS with respect to contest made an important comment: contest uses gcc, which has too much processing between reads to take advantage of AS. So to take one style of benchmark and say "Wow, this one is horrible and this one is great" does not mean it applies in all cases.

And what does most software use...gcc

February 26, 2003 - 11:06pm
Anonymous

I think that since almost all software on Linux is compiled using gcc, it wouldn't be a good idea to pick a scheduler that only performs well in mostly non-existent scenarios.

Re: And what does most software use...gcc

February 27, 2003 - 3:16am

I think that since almost all software on Linux is compiled using gcc, it wouldn't be a
good idea to pick a scheduler that only performs well in mostly non-existent scenarios.

I think you misunderstand. It's not that contest is compiled with gcc, it's that contest benchmarks gcc. That is, it runs gcc as one way to exercise the CPU and disk. The objection here is that gcc uses too much CPU to benefit much from certain disk scheduling styles, which are designed to perform optimally when you are mostly doing I/O.

?

February 27, 2003 - 5:40am
Anonymous

He said the program itself uses a lot of processing between reads, not GCC in general.

Are you saying that Linux box

February 27, 2003 - 6:43am
Anonymous

Are you saying that Linux boxes are only used for running GCC? In a desktop scenario, I think that's hardly the case. It's kinda like saying you only drive your car around steel foundries and auto manufacturer's plants, because that's where the metal came from and where it was built. Don't confuse the compiler used for building the application with the application itself.

Or are you just a stupid troll?

Damn Slashdot. :-P

Are you a troll, or just ignorant?

February 27, 2003 - 10:57am
Anonymous

Just because most software is compiled with GCC doesn't mean most software uses GCC. There's a very important difference.

A typical automobile plant requires high-current, high-voltage electrical feeds, access to railways and freeways for shipping, large buildings for housing machinery, large parking lots for the factory workers, and tons of other infrastructure. All that to build a car.

Do you need any of that to operate your car? Hardly. But your car wouldn't exist if it weren't built.

Don't confuse the tool used to build an app with the app itself.

The argument levied against ConTest in this setting is that it is not wonderfully representative of interactive workloads, since its "think time" between dependent reads is somewhat longer than in more typical workloads. And the reason for this is that the workload is GCC itself, and not the final end-user applications.

Or are you trying to imply that Linux boxes are only used for compilation? (That's hardly the case on the desktop!)

btw, "justanyone" didn't p

February 27, 2003 - 3:15pm
Anonymous

btw, "justanyone" didn't post that post.

For desktop use?

February 26, 2003 - 11:19am
Anonymous

It seems that CFQ is better for interactive, low-latency tasks. But does AS improve on those things as well when compared to the standard scheduler? How do the implementations stack up when it comes to desktop use?

1. CFQ
2. AS
3. Standard scheduler

Or is it:

1. CFQ
2. Standard Scheduler
3. AS

If CFQ is for low latency and interactivity, where does AS shine (server tasks?)?

from what I can tell...

February 26, 2003 - 1:38pm
Anonymous

CFQ aims to provide fairness among competing tasks: when more than one task requests disk access, the I/O scheduler round-robins among all requestors rather than letting one requestor dominate. It can result in a seek storm, however, when the various tasks want to access different areas of the disk, so it can be bad for total throughput. (This is unlike the network stack, where the cost of switching between two unrelated streams is zero or nearly zero.)

AS seeks to improve throughput by reducing the seek storms that occur when you switch among requestors. It observes that most small reads are so-called dependent reads, so even though a given requestor might have only one request queued at the moment, it will likely soon issue several more for nearby areas of the disk. Thus, when it picks one of these reads to schedule, it decides whether to hold off scheduling anything else for that disk for "a moment" to see if more read requests arrive. The net effect is to reduce the total number of seeks between competing streams by allowing larger batches of nearby reads to execute together. Overall throughput should go up as a result.

CFQ should be better for low-latency tasks *if* the resultant seek storms don't kill throughput. If you have two streams duking it out and no hysteresis to keep you from hopping queues continuously, then the resulting seeks could cause CFQ to leave you crying. In the dependent read case, since the requestor's queue will go empty (and cause a queue switch) after each request, it seems as though these seek storms are practically a given without aggressive readahead.

AS should help avoid total throughput decimation by adding that needed hysteresis. To the extent that higher throughput reduces latency, AS may be the better choice for some applications.

One thing I'm wondering is if AS-like heuristics could be applied to CFQ to control when CFQ decides to move between queues. Seems like you'd get the best of both worlds.
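
A minimal sketch of that anticipation decision, in user-space C with entirely invented costs and field names (this is a toy model of the idea, not the kernel's actual heuristic):

/*
 * Toy model of the anticipation decision described above -- not kernel
 * code.  After a task's read completes, the scheduler either dispatches
 * the best pending request from another task or idles the disk briefly,
 * betting that the current task will soon issue a nearby "dependent"
 * read.  All numbers and field names here are invented.
 */
#include <stdio.h>

struct task_hist {
    double mean_thinktime_ms;    /* typical gap between this task's reads  */
    double mean_seek_distance;   /* how far its next read usually lands    */
};

/* Rough cost (ms) of seeking 'dist' sectors away -- invented model. */
static double seek_cost_ms(double dist)
{
    return 2.0 + dist / 100000.0;
}

/*
 * Return 1 if it looks worth idling the disk for the current task,
 * 0 if we should switch to the competing request immediately.
 */
static int should_anticipate(const struct task_hist *cur,
                             double competing_seek_distance,
                             double max_wait_ms)
{
    /* Don't wait on a task that usually thinks for a long time. */
    if (cur->mean_thinktime_ms > max_wait_ms)
        return 0;

    /* Expected cost of waiting: the think time plus a (short) seek. */
    double wait_cost = cur->mean_thinktime_ms +
                       seek_cost_ms(cur->mean_seek_distance);

    /* Cost of switching now: a (probably long) seek to the other task. */
    double switch_cost = seek_cost_ms(competing_seek_distance);

    return wait_cost < switch_cost;
}

int main(void)
{
    /* A streaming reader: tiny think time, nearly sequential access. */
    struct task_hist streamer = { .mean_thinktime_ms = 0.5,
                                  .mean_seek_distance = 128 };
    /* A compiler-like task: long CPU bursts between reads. */
    struct task_hist compiler = { .mean_thinktime_ms = 40.0,
                                  .mean_seek_distance = 50000 };
    double far_away = 2000000;   /* competing request on the far side of the disk */

    printf("streamer: anticipate=%d\n", should_anticipate(&streamer, far_away, 10.0));
    printf("compiler: anticipate=%d\n", should_anticipate(&compiler, far_away, 10.0));
    return 0;
}

The real scheduler keeps per-process statistics along these lines and bounds how long it is willing to idle the disk, though the actual bookkeeping is of course more involved.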

CFQ? Why?

February 26, 2003 - 7:34pm
Anonymous

If CFQ is only for multimedia applications, then it seems they are going about it the wrong way.

Having a real-time thread that reads from an 8 MB buffer while another thread does the disk reads is indeed the correct solution. What is missing is something that can guarantee disk IO rates to the application, to ensure that the 8 MB buffer is always full.

SGI has had GRIO (Guaranteed-rate I/O) implemented with XFS for quite some time on their IRIX machines, for exactly this purpose. It is very useful in scenarios where you need to read and write, say, 200 MB/s to an array of disks, where the data is coming from an unbuffered source (audio channels from live instruments, etc.).

Of course that's just my opinion, I could be wrong.

GRIO

February 27, 2003 - 5:46am
Anonymous

Well, from the general discussion (and borne out in the benchmarks), the main purpose of CFQ is to serve as a kind of ad-hoc substitute for Guaranteed-Rate I/O. What prevents the Anticipatory Scheduler from *anticipating* this behavior, provided a program can specifically instruct it to do so?

AS absolutely shines in reducing the seek frenzy associated with multiple threads reading different portions of the disk -- this is something CFQ needs, not only to dramatically improve its performance, but to (hopefully) extend the life of the hard disks.

You could kill the argument right here by introducing an API function to tag a thread with a "read floor" (bytes) per "buffer length" (ms). AS would then include this information when deciding whether to continue anticipating a series of reads/writes or to break off back to the GRIO-tagged thread.

The bottom line is: The applications which would benefit from CFQ are so few and so specialized that it is more than reasonable to expect them to add a few lines of code to 'tweak' the scheduler -- the benefits to the application would far outweigh the cost of 10-15 more lines of code. Meanwhile, all the other applications still realize the benefits of the AS engine, even while GRIO-tagged threads are in the scheduling queue.
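
Purely as a sketch of what that proposal might look like from the application side -- the ioctl number, structure, and semantics below are invented and do not exist in any kernel:

/*
 * Hypothetical sketch only: neither this ioctl nor this structure exists.
 * It just illustrates what "tagging a thread with a read floor per buffer
 * length", as proposed above, might look like to an application.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented request number and payload -- purely illustrative. */
#define IOSCHED_SET_RATE_HINT 0x4a53   /* hypothetical */

struct iosched_rate_hint {
    unsigned int read_floor_bytes;   /* bytes the app must receive ...        */
    unsigned int buffer_length_ms;   /* ... within each window of this length */
};

int main(void)
{
    int fd = open("/path/to/media/file", O_RDONLY);   /* placeholder path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* "I need at least 176,400 bytes of this file every 1000 ms"
       (one second of 16-bit 44.1 kHz stereo audio). */
    struct iosched_rate_hint hint = {
        .read_floor_bytes = 176400,
        .buffer_length_ms = 1000,
    };

    if (ioctl(fd, IOSCHED_SET_RATE_HINT, &hint) < 0)
        perror("ioctl (expected to fail: this hint is hypothetical)");

    /* ... the normal read loop would follow ... */
    close(fd);
    return 0;
}

Whether such a hint would end up as an ioctl, an fcntl flag, or something else is beside the point; the argument above is simply that a dozen lines like these in the handful of GRIO-style applications would be cheaper than penalizing everything else in the IO scheduler.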

I will later

March 3, 2003 - 9:05pm

For 2.6 I am aiming to get the scheduler working OK. No explicit fairness, priority, or rates, although even priority fairness is mostly achieved with the one-way elevator algorithm.

After that is all working well, I will look into doing explicit fairness, process IO priority and even GRIO in the AS scheduler.

buffers are not the solution for multimedia

February 27, 2003 - 2:16pm

While simple media streamers and players (xmms, mplayer and the like) would be pretty much fine with a "fat" buffer, since their single aim is to avoid annoying skips, preserving quality in "serious" real-time apps (like, say, ardour) that also have to work in an "assembly" of different tools (a bunch of soft synths, wave streamers and a handful of effects) is a far more serious task.
Perhaps you are familiar with the almost maniacal chase for low latency in the music scene. This chase is not limited to the FIFO I/O of the soundcard (audio and MIDI). If you enlarge your buffer you are safe from "skips and clicks", but I don't think you'll be able to do much work if you have to wait a couple of seconds for that knob turn in the reverb to take effect on the output sound.
So, yes, the GRIO idea is a good approach to the problem, but CFQ might just be right in this niche. I also liked the idea of "interchangeable" schedulers, but I think it will be pretty tough to get that going with relative safety. At the least we could have both of them to select from at compile time, and two different kernels in your bootloader.
Latency issues are a major showstopper for the computer-assisted music scene on Linux -- the other being an immature API for effects, while chaos reigns over the soft synths (damn Steinberg for not allowing a Linux port of VST!)

Different buffering strategies

March 3, 2003 - 5:54am
Anonymous

If you enlarge your buffer you are safe from "skips and clicks" but I don't think you'll be able to do much work if you have to wait a couple of seconds for that knob turn in the reverb to take effect on the output sound.

This isn't relevant, though: if you're designing a real-time multimedia app that plays off disk and you want decent performance, you don't do things the way XMMS does.

XMMS has its processing stage (where decoding is done, and effects are applied) work immediately after data is read from disk, and this processing stage feeds a large output buffer whose tail is fed to the audio hardware.

This is quite a dumb dataflow: it means that latency is inevitably high because of the large amount of buffering between the processing stage and the output, and it wastes lots of memory because it buffers decoded data rather than its much smaller undecoded form. Unfortunately, it's the only way XMMS can get a reasonable level of skip protection without having real-time process scheduling for the processing thread. Without real-time scheduling it can't guarantee it'll always get enough CPU time to decode or apply effects, but even when starved it might still get enough time to despatch the tail of the output buffer to the audio hardware (which is orders of magnitude less work).

It's also why XMMS' CPU usage appears to be quite low, because it only decodes in bursts every few seconds and is mostly dormant the rest of the time - the decoder thread is not constantly running.

If you can depend on having real-time scheduling, then it's much better to have your buffering the other way round: you use a large input buffer and leave your processing stage until immediately before you send the decoded audio to the audio hardware. This reduces latency to the absolute minimum because there's a minimal amount of buffering between the processing stage and the audio hardware (just the processing buffer in which you actually do the decoding/effects), and in the case of an MP3/Ogg player it makes your buffering more efficient, because you store coded rather than decoded data, which means that you can buffer more and are less affected by disk latency. Realistically it requires real-time scheduling in order to work, though, or your decoder thread won't keep the sound hardware satisfied, and it increases the apparent CPU usage in e.g. top, because the processing stage is pretty much constantly running.
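
To make that dataflow concrete, here is a minimal user-space sketch, assuming a placeholder decode_and_play() stub, an invented file path, and made-up buffer sizes: a plain disk thread keeps a large ring buffer of *coded* data full, while the main thread asks for SCHED_FIFO and decodes immediately before "playback".

/*
 * Sketch of the second buffering strategy described above: buffer coded
 * data read from disk in a large ring buffer, and do the decoding in a
 * real-time thread immediately before handing samples to the hardware.
 * decode_and_play() is a stub and error handling is minimal.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define RING_SIZE (1 << 20)          /* 1 MB of coded data, not decoded PCM */

static unsigned char ring[RING_SIZE];
static size_t head, tail;            /* producer writes at head, consumer reads at tail */
static int input_done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Placeholder: decode one chunk of coded data and push it to the sound card. */
static void decode_and_play(const unsigned char *coded, size_t len)
{
    (void)coded; (void)len;
    usleep(10000);                   /* stand-in for ~10 ms of real work */
}

/* Disk thread: keeps the ring buffer of coded data as full as possible. */
static void *disk_reader(void *arg)
{
    int fd = open((const char *)arg, O_RDONLY);
    unsigned char chunk[4096];
    ssize_t n;

    while (fd >= 0 && (n = read(fd, chunk, sizeof(chunk))) > 0) {
        pthread_mutex_lock(&lock);
        while (head - tail > RING_SIZE - (size_t)n)
            pthread_cond_wait(&cond, &lock);      /* ring full: wait */
        for (ssize_t i = 0; i < n; i++)
            ring[(head + i) % RING_SIZE] = chunk[i];
        head += n;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
    if (fd >= 0)
        close(fd);

    pthread_mutex_lock(&lock);
    input_done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/path/to/file.ogg";  /* placeholder */
    pthread_t reader;

    /* Ask for real-time scheduling; this needs privileges and may fail. */
    struct sched_param sp = { .sched_priority = 10 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
        perror("sched_setscheduler (continuing without RT scheduling)");

    pthread_create(&reader, NULL, disk_reader, (void *)path);

    /* Processing loop: pull coded data and decode right before output. */
    for (;;) {
        unsigned char coded[4096];
        size_t avail;

        pthread_mutex_lock(&lock);
        while ((avail = head - tail) == 0 && !input_done)
            pthread_cond_wait(&cond, &lock);      /* ring empty: wait */
        if (avail == 0 && input_done) {
            pthread_mutex_unlock(&lock);
            break;
        }
        if (avail > sizeof(coded))
            avail = sizeof(coded);
        for (size_t i = 0; i < avail; i++)
            coded[i] = ring[(tail + i) % RING_SIZE];
        tail += avail;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        decode_and_play(coded, avail);
    }

    pthread_join(reader, NULL);
    return 0;
}

The layout is the point: the only buffering between processing and output is the small chunk currently being decoded, while the big buffer that absorbs disk latency sits upstream and holds compact coded data.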

A good example of something that uses this buffering strategy is aRts, KDE's sound server/media framework (which e.g. Noatun uses to play music). aRts skips like crazy if you set its processing buffer to small values and it doesn't have RT scheduling, but give it RT scheduling and it's far tougher making it skip than XMMS, even with sub-10ms latencies, and it seems much less bothered by disk seek storms than XMMS.

I can see Andrew Morton's point. It's simply not worth trying to make XMMS skip-proof as a benchmark: just whack up the output buffer size and be done with it. XMMS isn't designed to be both skip-proof and low-latency. Play, stop, pause, seek and the volume control are all independent of the output buffer size anyway, so latency doesn't affect the important controls in XMMS.

If you want real-time response in multimedia apps and want those apps to be skip-proof, use real-time scheduling and a buffering strategy that's designed for that.

FS Design

February 27, 2003 - 3:01am

About filesystem design.

Slightly off topic, but..
Shouldn't filesystem design also be a factor in this discussion, i.e. won't different filesystem designs behave differently with different schedulers?

- Ext2
- Ext3
- Reiserfs

Should the benchmarking also be done on Ext3 and Reiserfs? In my little experiments it seems that, performance-wise, Ext2...

Comparison with more than Linux?

February 27, 2003 - 12:57pm

It would be good to see these benchmarks run on FreeBSD (4 & 5) as well as the various Linux versions.
Are the actual benchmarks available?
What was the exact hardware?
What was the result of the AS on FreeBSD 4.3 where it was developed?
http://www.cs.rice.edu/~ssiyer/r/antsched/
