Building a custom and cheap storage server yourself – Part III

This is part III of my most recent project named “Building a custom and cheap storage server yourself”. Part I can be found here and Part II here.

Part II left off with a machine that crashed under high I/O load. During the following weeks I tried to solve the stability problems. And finally: we solved them!

This third part is about our problems, what caused them and how we finally solved them….

Problems, Problems

Part I – using an Adaptec controller

Initially we equipped our storage server with either one Adaptec 52445 or one Adaptec 51645. These adapters were chosen because they offer a large number of SATA ports (16 and 24) so we could attach all remaining disks directly without any expander.

During our tests we noticed a complete halt of all I/O under high I/O load. Not even an “ls” returned any result. The problem was discussed on the OpenSolaris mailing list, but without any insight. Even the message log did not show any error message.

I tried several settings on the controller side, including creating a RAID array, turning off NCQ, setting higher timeout values and so on. But nothing worked. Under load, sooner or later, the I/O to all drives attached to the Adaptec controller came to a complete halt.

After noticing that high I/O to disks attached directly to the motherboard did not “lock up” in any way, we were pretty sure our problems were caused by the Adaptec controller…. so we replaced them.

Part II – using LSI 1068E-based controllers

In order to solve our problems, which were most probably related to the Adaptec controllers, we replaced them with two LSI 1068E-based controllers plus one expander attached to each controller. We used the Chenbro CK12804 expander.

Unfortunately we observed the same behavior with these controllers as well: the I/O stalled sooner or later.

So I took a look at the hard disks next. For the prototype we ordered forty Seagate ST31000340NS disks with 1 TB each. Note that these disks are – according to the Seagate homepage – designed for:

"The Barracuda ES.2 drive is the perfect solution for high-capacity enterprise storage
applications such as the migration of mission-critical transactional data, from tier 1
to tier 2 (nearline) storage, where dollars/GB and GB/watt are a primary concern."

Another criterion for choosing this drive was:

"- 24x7 operation and 1.2 M hrs. MTBF"

After making sure we used the most recent firmware (at the time of writing, “SN06” is the most recent firmware) I took a look at the Seagate forums and found an interesting thread about Seagate ES.2 drives (the drives we used) and some firmware problems. One post in particular caught my attention:

I've got a setup with two 1TB disks which previously used the AN05 firmware and they
worked just fine, no matter how many GB's of data I copied to/from the RAID1 mirror.
Right after updating the firmware to suggested SN06 version it seems that I must
disable NCQ completely or the system will hang in 30 mins if there are any data being
copied from the disks. Because there has not been any guarantees that AN05-version of
the firmware doesn't suffer from the disk locking bug, I did not feel comfortable in
continuing using that version even though it seemed to work just fine.

Well, I thought to myself: “You've already tried disabling NCQ at the /etc/system level and at the controller level with the Adaptec controller, but anyway – try it!”. So I turned off NCQ on the LSI controllers with the help of LSIutil.
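For the record, the /etc/system approach mentioned above works (to my knowledge) through a tunable of the Solaris sata framework. A minimal sketch – the exact value is from memory, so treat it as an assumption rather than a verified setting – would look like this:

* /etc/system – disable NCQ in the Solaris sata framework (sketch, value assumed)
* 0x7 is the default; clearing the 0x2 "NCQ" bit leaves 0x5
set sata:sata_func_enable = 0x5

As far as I know this tunable only affects drivers built on the sata module (e.g. the on-board AHCI ports), so it does not necessarily reach disks sitting behind a dedicated RAID/HBA controller.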

After booting the system again I put some I/O load on it and waited for the problem to appear again… after pessimistically waiting for days it did not appear. So I once again booted the system, increased the I/O pressure and waited again. Soon after, some errors were shown:

Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice]
scsi_state = 0, transfer count = 1400, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@a,0 (sd52):
Jan 29 12:00:48 openstorage     incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage     unknown ioc_status = 4
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice]
scsi_state = 0, transfer count = 12800, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@8,0 (sd50):
Jan 29 12:00:48 openstorage     incomplete read- retrying

Despite these errors, a complete lockup of all I/O was not observed anymore. So disabling NCQ did the trick, as the following days (and weeks) proved.

Obviously Seagate's SATA “server” disks have some kind of problem with NCQ. Using an older firmware was not an option because older firmware versions might cause the disks to fail completely. A firmware newer than SN06 was not available either. I won't comment on this any further, but most likely I will think twice before buying any Seagate disks again. I even contacted Seagate support but have not heard anything from them.

Interestingly, disabling NCQ on the Adaptec controller and even at the /etc/system level did not work. I have not investigated this further.

Part III – getting rid of “incomplete read- retrying” messages

After replacing the Adaptec controllers with the LSI controllers and disabling NCQ for all disks, the system stabilized. Even extremely high I/O load would not lock it up anymore.

But there were still error messages like this:

Jan 29 12:00:48 openstorage     incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage     unknown ioc_status = 4

While debugging these errors I noticed that every time an “incomplete read- retrying” message was issued, the error counters at the physical link level increased:

Adapter Phy 6:  Link Up
 Invalid DWord Count                                      1
 Running Disparity Error Count                            1
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0
<INCOMPLETE READ ERROR MESSAGE>
Adapter Phy 6:  Link Up
 Invalid DWord Count                                      2
 Running Disparity Error Count                            2
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0

The Invalid DWord and Running Disparity counters count 8b/10b decoding errors on the wire, so it seemed related either to the physical connection – or, once again, to the hard disks. Before digging further at the hard disk level we replaced the cables with new ones. The errors persisted.

In the thread quoted above the user not only reported problems with NCQ but also changed the SATA interface speed from 3.0 Gbit/s to 1.5 Gbit/s. According to his findings changing the speed is not a viable workaround, but I tried it anyway. After changing the speed of the drive interfaces to 1.5 Gbit/s this error disappeared as well.

The controller configuration (as shown by LSIutil menu option 16) now looks like this:

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 16

SAS1068E's links are 1.5 G, 1.5 G, 1.5 G, 1.5 G, 3.0 G, 3.0 G, 3.0 G, down
 B___T     SASAddress     PhyNum  Handle  Parent  Type
        500605b0017ff110           0001           SAS Initiator
        500605b0017ff111           0002           SAS Initiator
        500605b0017ff112           0003           SAS Initiator
        500605b0017ff113           0004           SAS Initiator
        500605b0017ff114           0005           SAS Initiator
        500605b0017ff115           0006           SAS Initiator
        500605b0017ff116           0007           SAS Initiator
        500605b0017ff117           0008           SAS Initiator
 0   5  09221b095e5c7c67     0     0009    0001   SATA Target
 0   6  09221b095d6f556b     1     000a    0002   SATA Target
 0   7  09221b087d585275     2     000b    0003   SATA Target
 0   8  09221b087a5a7a6c     3     000c    0004   SATA Target
        5001c450000c4700     4     000d    0005   Edge Expander
 0   9  5001c450000c470c    12     000e    000d   SATA Target
 0  10  5001c450000c470d    13     000f    000d   SATA Target
 0  11  5001c450000c470e    14     0010    000d   SATA Target
 0  12  5001c450000c470f    15     0011    000d   SATA Target
 0  13  5001c450000c4710    16     0012    000d   SATA Target
 0  14  5001c450000c4711    17     0013    000d   SATA Target
 0  15  5001c450000c4712    18     0014    000d   SATA Target
 0  16  5001c450000c4713    19     0015    000d   SATA Target
 0  17  5001c450000c4714    20     0016    000d   SATA Target
 0  18  5001c450000c4715    21     0017    000d   SATA Target
 0  19  5001c450000c4716    22     0018    000d   SATA Target
 0  20  5001c450000c4717    23     0019    000d   SATA Target
 0  21  5001c450000c4718    24     001a    000d   SATA Target
 0  22  5001c450000c4719    25     001b    000d   SATA Target
 0  23  5001c450000c471a    26     001c    000d   SATA Target
 0  24  5001c450000c471b    27     001d    000d   SATA Target
 0  25  5001c450000c473d    28     001e    000d   SAS Initiator and Target
Type      NumPhys    PhyNum  Handle     PhyNum  Handle  Port  Speed
Adapter      8          0     0001  -->    0     0009     0    1.5
                        1     0002  -->    0     000a     1    1.5
                        2     0003  -->    0     000b     2    1.5
                        3     0004  -->    0     000c     3    1.5
                        4     0005  -->    0     000d     4    3.0
                        5     0005  -->    1     000d     4    3.0
                        6     0005  -->    2     000d     4    3.0
Expander    30          0     000d  -->    4     0005     4    3.0
                        1     000d  -->    5     0005     4    3.0
                        2     000d  -->    6     0005     4    3.0
                       12     000d  -->    0     000e     4    1.5
                       13     000d  -->    0     000f     4    1.5
                       14     000d  -->    0     0010     4    1.5
                       15     000d  -->    0     0011     4    1.5
                       16     000d  -->    0     0012     4    1.5
                       17     000d  -->    0     0013     4    1.5
                       18     000d  -->    0     0014     4    1.5
                       19     000d  -->    0     0015     4    1.5
                       20     000d  -->    0     0016     4    1.5
                       21     000d  -->    0     0017     4    1.5
                       22     000d  -->    0     0018     4    1.5
                       23     000d  -->    0     0019     4    1.5
                       24     000d  -->    0     001a     4    1.5
                       25     000d  -->    0     001b     4    1.5
                       26     000d  -->    0     001c     4    1.5
                       27     000d  -->    0     001d     4    1.5
                       28     000d  -->    0     001e     4    3.0

As you can see, the links between controller and expander run at 3.0 Gbit/s, whereas each link between expander and drive is limited to 1.5 Gbit/s.
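A quick back-of-the-envelope calculation – assuming the usual 8b/10b line encoding, i.e. roughly 10 bits on the wire per byte of payload – shows why this limit should not hurt in practice:

1.5 Gbit/s per drive link / 10 bits per byte   ≈ 150 MB/s per disk
3 x 3.0 Gbit/s (wide port to the expander)     ≈ 900 MB/s aggregate

A single 7,200 rpm SATA disk sustains somewhere around 100 MB/s at best, so the 1.5 Gbit/s link is not the limiting factor for an individual drive; the shared wide port between controller and expander is the first bottleneck for the disks behind the expander.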

Conclusion

After disabling NCQ at the controller level and limiting the link speed between expander and hard disk to 1.5 Gbit/s for each disk, all problems are gone.

All? Yes, indeed. For three weeks now there have been NO error messages and the system itself is rock-stable. During that period one hard disk physically failed and had to be replaced (remember, I won't comment on the quality of these “server” hard disks…).

Tomorrow or the day after I will post Part IV covering the first performance benchmarks.


Responses to Building a custom and cheap storage server yourself – Part III

  1. Richard J. says:

    Still awaiting part IV.

    Patiently. =)

  2. Brendan says:

    How did you set the expander ports to 1.5 Gbit?

    I’ve looked through the LSIutil program and it can only set link speed on the controller ports not the expander ports.
