Description of problem:
High I/O wait (100%) when writing large amounts of data with a 3ware 9650; write performance is awfully slow. A friend also gets the problem with aacraid.

Version-Release number of selected component (if applicable):
2.6.18-53.1.14

How reproducible:
Write something like 1GB of data.

Steps to Reproduce:
1. Import a ~1GB SQL file into PostgreSQL (or write a comparable amount of data).

Actual results:
4 hours to import a 1GB sql file into postgresql.

Expected results:
6 minutes with the fix, and 10 to 15% I/O wait instead of the 100% seen previously.

Additional info:
See http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1e6c38cec08f88b0df88a34e80f15492cace74e9

I've made a quick and dirty fix for local use, without having to backport the pci_try_set_mwi function: I simply used pci_set_mwi instead of pci_try_set_mwi. Will such a fix be included in a future kernel release?

Regards.
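For illustration, a minimal sketch of that quick-and-dirty workaround, assuming a RHEL 2.6.18 tree that lacks pci_try_set_mwi(); the helper name twa_enable_mwi() and the dev_warn() on failure are illustrative additions, not the actual driver change:

#include <linux/pci.h>

/* Called from the 3w-9xxx probe path, right after pci_set_master().
 * pci_set_mwi() may fail (for instance if a valid cache line size
 * cannot be programmed); MWI is only a performance hint, so on
 * failure we just warn and carry on. */
static void twa_enable_mwi(struct pci_dev *pdev)
{
	if (pci_set_mwi(pdev))
		dev_warn(&pdev->dev, "could not enable MWI\n");
}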
Tomas, See if you can reproduce this with the 3ware hardware you have. Either way, please review for applicability on RHEL 5, and keep an eye on the upstream results. If it looks good, we can propose it for 5.3. Tom
AFAIK, it depends on the BIOS: some enable MWI by default, some don't, and the Linux devs chose to override it, just in case. The box has been running fine since the fix (no data corruption occurred, performance is normal, and the sluggish system and I/O have vanished). I think this fix should be included as soon as possible; no need to wait for 5.3. Regards, Laurent.
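For context, "enabling MWI" essentially means setting the Memory-Write-and-Invalidate bit in the device's PCI command register, which the BIOS may or may not have done already. A rough sketch of the idea (not the kernel's actual pci_set_mwi(), which also makes sure a valid cache line size is programmed first):

#include <linux/pci.h>

/* Rough sketch: set PCI_COMMAND_INVALIDATE in the command word if the
 * BIOS (or a previous driver) has not already done so.  The real
 * pci_set_mwi() additionally checks/sets the cache line size register,
 * since MWI is meaningless without one. */
static void sketch_enable_mwi(struct pci_dev *dev)
{
	u16 cmd;

	pci_read_config_word(dev, PCI_COMMAND, &cmd);
	if (!(cmd & PCI_COMMAND_INVALIDATE)) {
		cmd |= PCI_COMMAND_INVALIDATE;
		pci_write_config_word(dev, PCI_COMMAND, cmd);
	}
}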
Created attachment 308525 [details]
modified version

Laurent, thanks for the patch; I had to modify it slightly. I don't have access to the hardware, so could you please verify that the test kernel at http://people.redhat.com/thenzl/kernel/ resolves your issue?
Tomas, I just rebooted the server with the x86_64 version of your kernel. I/O wait went up to 45% with another, previously untested, workload (deleting loads of data from postgresql). Data are being processed again, so we can recreate the test that made me post this bug; I'll keep you informed very soon (say, about one hour). Could you show me the diff you applied?
Duh, I didn't see the attached patch, sorry. OK, we've run the 1GB import again, and the system is behaving normally, the same way as with my patch. Thank you very much. When will that fix be included by default? Please tell me it'll be included in the next kernel update, and not with the 5.3 release :)
Laurent, is the attachment inaccessible for you? It doesn't matter - here it is:

diff -Naurp linux-2.6.18.x86_64/drivers/scsi/3w-9xxx.c linux-2.6.18.x86_64a/drivers/scsi/3w-9xxx.c
--- linux-2.6.18.x86_64/drivers/scsi/3w-9xxx.c	2008-06-05 14:53:15.000000000 +0200
+++ linux-2.6.18.x86_64a/drivers/scsi/3w-9xxx.c	2008-06-05 14:39:11.000000000 +0200
@@ -89,6 +89,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_tcq.h>
 #include <scsi/scsi_cmnd.h>
+#include <linux/libata.h>
 #include "3w-9xxx.h"
 
 /* Globals */
@@ -2062,6 +2063,7 @@ static int __devinit twa_probe(struct pc
 	}
 
 	pci_set_master(pdev);
+	pci_try_set_mwi(pdev);
 
 	retval = pci_set_dma_mask(pdev, sizeof(dma_addr_t) > 4 ? DMA_64BIT_MASK : DMA_32BIT_MASK);
 	if (retval) {
Laurent, is your test system still working well with this test kernel?

> Please tell me it'll be included in the next kernel update, and not with
> the 5.3 release :)

I could tell you what you want to hear :), but I think it will go into the 5.3 release.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Yes, the system still works without problems, and with normal performance. Please include that fix in the next kernel update; it'd be a shame to have to wait until 5.3. Thanks.
The only kernel updates between 5.2 and 5.3 are for critical security fixes, system crashers, or data corruption; this does not fit the bill. The other option is for you to contact Red Hat support and request a hot fix for this.
Enormous performance problems should qualify too, IMHO :) Not only is the write I/O performance increase 40-fold (!), but the system also remains responsive under load, which is absolutely not the case without the fix. I'm a bit unsure about the real cause, but we also saw system crashes under heavy write I/O (good ol' oopses, though I haven't kept the logs, unfortunately). That never happened with the fix. I think it may be due to the ever-growing list of pending write I/O requests taking longer and longer to complete. I can't test that case anymore, as the corresponding server is going into production real soon now. I guess I'll have to apply the fix by hand until 5.3 then, as I'm not a RH customer but a CentOS user - unless I can convince the CentOS people to add it before you do in 5.3. Regards.
Laurent, once again thanks for your report + testing. I'm sorry, but I think that requesting a hot fix via our support is the only way to get it sooner than the 5.3 release.
You're welcome. I'll try to maintain a fixed kernel repo until 5.3 is out then. Time to take a look at how to create a repo.
OK ... I have created a test kernel that has this patch for CentOS. We will be maintaining it as security updates are done until this issue is fixed in the 5.3 kernel. http://people.centos.org/hughesjr/kernel/5/bz444759/
Just a note on this 3ware issue: it seems to be more than a simple performance problem! *EACH* attempt to use a 3ware-based RAID 5 has caused an OS crash while trying to rsync or scp ~800 GB from another box. The failure occurs after anywhere from 2.5 to about 5 hours, and there appears to be a "lost link" error for each of the drives. This has also happened with the above bz444759 CentOS kernel on 5.2 (same issue with 5.1). To compare, exporting the drives from the 3ware and building an md array not only performed much better, but also did not crash the box with the same scp or rsync.
(In reply to comment #17)
> Just a note on this 3ware issue: it seems to be more than a simple
> performance problem! *EACH* attempt to use a 3ware-based RAID 5 has
> caused an OS crash while trying to rsync or scp ~800 GB from another
> box. The failure occurs after anywhere from 2.5 to about 5 hours, and
> there appears to be a "lost link" error for each of the drives. This
> has also happened with the above bz444759 CentOS kernel on 5.2 (same
> issue with 5.1).

This sounds like a different bug.

The crash you describe happens with or without the patch posted in this BZ, right? And the patch posted in this BZ improves performance, at least on some motherboards, right? If so, then we will go ahead with this patch, and you should open a new BZ for the crash you have seen. Please include a stack trace, or even better, a crash dump.

> To compare, exporting the drives from the 3ware and building an md array
> not only performed much better, but also did not crash the box with the
> same scp or rsync.

Just to confirm: you can avoid the crash by setting up the 3ware to export physical disks, not a RAID 5 volume, and then using md to do RAID 5 on the 3ware drives. Is there any kernel/driver version that successfully allows you to run 3ware RAID 5 without crashing?
Sorry for the delay (summer vacation)...

(In reply to comment #18)
...
> This sounds like a different bug.

Yes, after I added my comment, I kinda thought the same thing.

> The crash you describe happens with or without the patch posted in this
> BZ, right?

Yes, the crash happens with all of the CentOS-derived kernels: stock, centosplus, and the BZ kernel noted here. I also tried a kernel.org 2.6.25.7 kernel (with the kernel config from the CentOS kernel), which includes the newer 3ware 2.26.02.010 driver. Sadly, all of them crash after a couple of hours or more of continuous transfer, with an error like:

3w-9xxx: scsi0: WARNING: (0x06: 0x000C): Character ioctl (0x108) timed out, resetting card.
3w-9xxx: scsi0: ERROR: (0x06: 0x0030): Response queue (large) empty failed during reset sequence.

> And the patch posted in this BZ improves performance, at least on some
> motherboards, right?

There is a slight improvement in performance with the bz kernel, compared to the stock CentOS kernels, with:

sync ; iozone -s 20480m -r 64 -i 0 -i 1 -t 1 -b some_file.xls
sync ; hdparm -Tt /dev/sda1
sync ; time `dd if=/dev/md0 of=/dev/null bs=1M count=20480`

It's interesting that using the "deadline" scheduler also seems to make a positive difference.

> If so, then we will go ahead with this patch, and you should open a new
> BZ for the crash you have seen. Please include a stack trace, or even
> better, a crash dump.

I did find a CentOS bug that seems to match my issue:
http://bugs.centos.org/view.php?id=2186
One of the solutions noted there ("noapic" on the kernel command line) did not help. Sadly, that bug was closed as "it's a hardware problem"... I'm not sure I agree ;)

> > To compare, exporting the drives from the 3ware and building an md
> > array not only performed much better, but also did not crash the box
> > with the same scp or rsync.
>
> Just to confirm: you can avoid the crash by setting up the 3ware to
> export physical disks, not a RAID 5 volume, and then using md to do
> RAID 5 on the 3ware drives.

Correct: with all the same hardware, only changing from 3ware RAID 5 to md RAID 5, there is no crash with LONG transfers. Every attempt at a LONG transfer with 3ware RAID 5 crashed, in all x86 kernels tested.

> Is there any kernel/driver version that successfully allows you to run
> 3ware RAID 5 without crashing?

No. There was no kernel/driver version (including 2.6.25.7 with the 2.26.08.010 driver and the latest card firmware) that did not crash with 3ware RAID 5.
(In reply to comment #19)
I think this is definitely a different bug; the patch from this bug, which should increase performance on some motherboards, doesn't make it worse or better. If your problem still persists, please open another bugzilla for it, and we will continue the discussion there.
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Has this been resolved? We recently re-provisioned a server with the latest kernel updates, and this problem seems to have gone away.
The fix is included in the RHEL 5.3 kernel; I've checked the kernel patch list in the 5.3 beta.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html