This is Part III of my most recent project, “Building a custom and cheap storage server yourself”. Part I can be found here and Part II here.
Part II ended with a machine that crashed under high I/O load. During the previous weeks I tried to solve these stability problems, and finally: we solved them!
This third part is about our problems, what caused them, and how we finally solved them.
Problems, Problems
Part I – using Adaptec Controller
Initially we equipped our storage with one Adaptec 52445 or one Adaptec 51645. These adapters were chosen because they offer a large number of SATA ports (16 and 24), so we could attach all remaining disks directly without any expander.
During our tests we noticed a complete halt of all I/O under high I/O load. Not even an “ls” returned any result. The problem was discussed on the OpenSolaris mailing list, but without any insight. Even the message log did not show any error messages.
I tried several settings on the controller side, including creating a RAID array, turning off NCQ, increasing timeout values and so on. But nothing worked. Under load, sooner or later the I/O to all drives attached to the Adaptec controller came to a complete halt.
After noticing that high I/O to disks attached directly to the motherboard did not “lock up” in any way, we were pretty sure our problems were caused by the Adaptec controller… so we replaced it.
Part II – using LSI 1068E-based controllers
In order to solve our problems, which were most probably related to the Adaptec controllers, we replaced them with two LSI 1068E-based controllers plus one expander attached to each controller. We used the Chenbro CK12804 expander.
Unfortunately we observed the same behavior with these controllers as well: the I/O stalled sooner or later.
So I took a look at the hard disks next. For the prototype we ordered forty Seagate ST31000340NS disks with 1 TB each. Note that these disks are – according to the Seagate homepage – designed for:
"The Barracuda ES.2 drive is the perfect solution for high-capacity enterprise storage applications such as the migration of mission-critical transactional data, from tier 1 to tier 2 (nearline) storage, where dollars/GB and GB/watt are a primary concern."
Another criterion for choosing this drive was:
"- 24x7 operation and 1.2 M hrs. MTBF"
After making sure we were using the most recent firmware (“SN06” at the time of writing), I took a look at the Seagate forums and found an interesting thread about firmware problems with the Seagate ES.2 drives (the drives we used). One post in particular caught my attention:
I've got a setup with two 1TB disks which previously used the AN05 firmware and they worked just fine, no matter how many GB's of data I copied to/from the RAID1 mirror. Right after updating the firmware to suggested SN06 version it seems that I must disable NCQ completely or the system will hang in 30 mins if there are any data being copied from the disks. Because there has not been any guarantees that AN05-version of the firmware doesn't suffer from the disk locking bug, I did not feel comfortable in continuing using that version even though it seemed to work just fine.
Well, I thought to myself: “You’ve already tried disabling NCQ at the /etc/system level and at the controller level with the Adaptec controller, but anyway – try it!”. So I turned off NCQ on the LSI controllers with the help of LSIutil.
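For completeness: “disabling NCQ at the /etc/system level” refers to the commonly documented queue depth tunable of the Solaris sata framework shown below. This is just a sketch; the setting requires a reboot and only affects disks handled by the sata module, so it does not necessarily reach disks sitting behind a RAID HBA.
* /etc/system – limit the SATA queue depth to 1, effectively disabling NCQ
* (only applies to disks driven by the Solaris sata framework)
set sata:sata_max_queue_depth = 0x1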
After booting the system again I put some I/O load on it and waited for the problem to reappear… after pessimistically waiting for days, it did not. So I rebooted the system once again, increased the I/O pressure and waited. Soon after, some errors showed up:
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice] scsi_state = 0, transfer count = 1400, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1014,396@0/sd@a,0 (sd52):
Jan 29 12:00:48 openstorage incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage unknown ioc_status = 4
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice] scsi_state = 0, transfer count = 12800, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1014,396@0/sd@8,0 (sd50):
Jan 29 12:00:48 openstorage incomplete read- retrying
Despite these errors, a complete lockup of all I/O was not observed anymore. So disabling NCQ did the trick, as the following days (and weeks) proved.
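For readers who want to put comparable I/O pressure on their own box: nothing sophisticated is needed, a few parallel sequential writers and readers against the pool are enough. A minimal sketch (the pool name “tank” and the file sizes are placeholders):
# start 8 parallel sequential writers (pool name "tank" is just a placeholder)
for i in 1 2 3 4 5 6 7 8; do
  dd if=/dev/zero of=/tank/loadtest.$i bs=1024k count=10240 &
done
wait
# read everything back in parallel to add read pressure
for i in 1 2 3 4 5 6 7 8; do
  dd if=/tank/loadtest.$i of=/dev/null bs=1024k &
done
wait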
Obviously Seagate’s SATA “server” disks have some kind of problem with NCQ. Using an older firmware was not an option because older firmware versions might cause the disks to fail completely, and a firmware newer than SN06 was not available either. I won’t comment on this any further, but most likely I will think twice before buying Seagate disks again. I even contacted Seagate support but have not heard anything from them.
Interestingly, disabling NCQ at the Adaptec controller and even at the /etc/system level did not help there. I have not investigated this further.
Part III – getting rid of “incomplete read- retrying” messages
After replacing the Adaptec controllers with the LSI controllers and disabling NCQ for all disks, the system stabilized. Even extremely high I/O would not lock it up anymore.
But there were still error messages like this:
Jan 29 12:00:48 openstorage incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage unknown ioc_status = 4
While debugging these errors I noticed that every time an “incomplete read- retrying” message was issued, the error counters on the physical link level increased:
Adapter Phy 6:  Link Up
  Invalid DWord Count            1
  Running Disparity Error Count  1
  Loss of DWord Synch Count      0
  Phy Reset Problem Count        0
<INCOMPLETE READ ERROR MESSAGE>
Adapter Phy 6:  Link Up
  Invalid DWord Count            2
  Running Disparity Error Count  2
  Loss of DWord Synch Count      0
  Phy Reset Problem Count        0
So it seemed to be related either to the physical connection or, still, to the hard disks. Before digging further on the hard disk level we replaced the cables with new ones. The error persisted.
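If you want to keep an eye on these counters yourself, one low-tech approach is to save the LSIutil phy counter display before and after a load run and compare the two dumps. A sketch (the file names are just placeholders):
# compare two saved LSIutil phy counter dumps
diff phy_counters_before.txt phy_counters_after.txt
# or only pull out the interesting lines
egrep 'Invalid DWord Count|Running Disparity Error Count' phy_counters_after.txt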
In the thread quoted above the user not only reported problems with NCQ, he also changed the SATA interface speed from 3.0 Gbit/s to 1.5 Gbit/s. According to his findings, changing the speed is not a viable workaround, but I tried it anyway. After changing the drive interface speed to 1.5 Gbit/s, the error disappeared as well.
The controller configuration looks like this:
Main menu, select an option: [1-99 or e/p/w or 0 to quit] 16

SAS1068E's links are 1.5 G, 1.5 G, 1.5 G, 1.5 G, 3.0 G, 3.0 G, 3.0 G, down

 B___T     SASAddress      PhyNum  Handle  Parent  Type
        500605b0017ff110            0001           SAS Initiator
        500605b0017ff111            0002           SAS Initiator
        500605b0017ff112            0003           SAS Initiator
        500605b0017ff113            0004           SAS Initiator
        500605b0017ff114            0005           SAS Initiator
        500605b0017ff115            0006           SAS Initiator
        500605b0017ff116            0007           SAS Initiator
        500605b0017ff117            0008           SAS Initiator
 0   5  09221b095e5c7c67      0     0009    0001   SATA Target
 0   6  09221b095d6f556b      1     000a    0002   SATA Target
 0   7  09221b087d585275      2     000b    0003   SATA Target
 0   8  09221b087a5a7a6c      3     000c    0004   SATA Target
        5001c450000c4700      4     000d    0005   Edge Expander
 0   9  5001c450000c470c     12     000e    000d   SATA Target
 0  10  5001c450000c470d     13     000f    000d   SATA Target
 0  11  5001c450000c470e     14     0010    000d   SATA Target
 0  12  5001c450000c470f     15     0011    000d   SATA Target
 0  13  5001c450000c4710     16     0012    000d   SATA Target
 0  14  5001c450000c4711     17     0013    000d   SATA Target
 0  15  5001c450000c4712     18     0014    000d   SATA Target
 0  16  5001c450000c4713     19     0015    000d   SATA Target
 0  17  5001c450000c4714     20     0016    000d   SATA Target
 0  18  5001c450000c4715     21     0017    000d   SATA Target
 0  19  5001c450000c4716     22     0018    000d   SATA Target
 0  20  5001c450000c4717     23     0019    000d   SATA Target
 0  21  5001c450000c4718     24     001a    000d   SATA Target
 0  22  5001c450000c4719     25     001b    000d   SATA Target
 0  23  5001c450000c471a     26     001c    000d   SATA Target
 0  24  5001c450000c471b     27     001d    000d   SATA Target
 0  25  5001c450000c473d     28     001e    000d   SAS Initiator and Target
Type      NumPhys   PhyNum  Handle     PhyNum  Handle  Port  Speed
Adapter      8         0     0001  -->    0     0009     0    1.5
                       1     0002  -->    0     000a     1    1.5
                       2     0003  -->    0     000b     2    1.5
                       3     0004  -->    0     000c     3    1.5
                       4     0005  -->    0     000d     4    3.0
                       5     0005  -->    1     000d     4    3.0
                       6     0005  -->    2     000d     4    3.0

Expander    30         0     000d  -->    4     0005     4    3.0
                       1     000d  -->    5     0005     4    3.0
                       2     000d  -->    6     0005     4    3.0
                      12     000d  -->    0     000e     4    1.5
                      13     000d  -->    0     000f     4    1.5
                      14     000d  -->    0     0010     4    1.5
                      15     000d  -->    0     0011     4    1.5
                      16     000d  -->    0     0012     4    1.5
                      17     000d  -->    0     0013     4    1.5
                      18     000d  -->    0     0014     4    1.5
                      19     000d  -->    0     0015     4    1.5
                      20     000d  -->    0     0016     4    1.5
                      21     000d  -->    0     0017     4    1.5
                      22     000d  -->    0     0018     4    1.5
                      23     000d  -->    0     0019     4    1.5
                      24     000d  -->    0     001a     4    1.5
                      25     000d  -->    0     001b     4    1.5
                      26     000d  -->    0     001c     4    1.5
                      27     000d  -->    0     001d     4    1.5
                      28     000d  -->    0     001e     4    3.0
As you can see, the Controller –> Expander links run at 3.0 Gbit/s, whereas each Expander –> Drive link is limited to 1.5 Gbit/s.
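To put that into perspective (rough numbers, before protocol overhead): the uplink to the expander consists of three active 3.0 Gbit/s phys, i.e. about 9 Gbit/s aggregate, while the 16 drives behind the expander could in theory request 16 x 1.5 Gbit/s = 24 Gbit/s. So even with the drives throttled to 1.5 Gbit/s, the shared uplink rather than the per-drive link speed remains the limiting factor for sequential throughput.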
Conclusion
After disabling NCQ at the controller level and limiting the link speed between expander and hard disk to 1.5 Gbit/s for each disk, all problems are gone.
All? Yes, indeed. For three weeks there have been NO error messages anymore and the system itself is rock-stable. During that period one hard disk physically failed and had to be replaced (remember, I won’t comment on the quality of these “server” hard disks…).
Tomorrow or the day after tomorrow I will post Part IV, covering the first performance benchmarks.
Still awaiting part IV.
Patiently. =)
How did you set the expander ports to 1.5 Gbit?
I’ve looked through the LSIutil program and it can only set the link speed on the controller ports, not the expander ports.
You need to enable advanced settings. Then “diagnostics” and “configure ports”.