<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ronny Egners Blog &#187; Openstorage</title>
	<atom:link href="http://blog.ronnyegner-consulting.de/category/openstorage/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.ronnyegner-consulting.de</link>
	<description>Ronny Egners Blog about Oracle, UNIX and EMC / Legato Networker</description>
	<lastBuildDate>Sun, 04 Dec 2011 12:10:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Building a custom and cheap storage server yourself – Part III</title>
		<link>http://blog.ronnyegner-consulting.de/2010/02/16/building-a-custom-and-cheap-storage-server-yourself-%e2%80%93-part-iii/</link>
		<comments>http://blog.ronnyegner-consulting.de/2010/02/16/building-a-custom-and-cheap-storage-server-yourself-%e2%80%93-part-iii/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 12:15:53 +0000</pubDate>
		<dc:creator>Ronny Egner</dc:creator>
				<category><![CDATA[Openstorage]]></category>

		<guid isPermaLink="false">http://blog.ronnyegner-consulting.de/?p=1643</guid>
		<description><![CDATA[This is part III of my most recent project named &#8220;Building a custom and cheap storage server yourself&#8221;. Part I can be found here and Part II here. Part II ended with the machine crashing under high I/O load. During the previous weeks I tried to solve the stability problems. And finally: [...]]]></description>
			<content:encoded><![CDATA[<p>This is part III of my most recent project named &#8220;Building a custom and cheap storage server yourself&#8221;. Part I can be found <a href="http://blog.ronnyegner-consulting.de/2009/11/06/building-a-custom-and-cheap-storage-server-yourself/" target="_blank">here</a> and Part II <a href="http://blog.ronnyegner-consulting.de/2010/01/11/building-a-custom-and-cheap-storage-server-yourself-part-ii/" target="_blank">here</a>.</p>
<p>Part II ended with the machine crashing under high I/O load. During the previous weeks I tried to solve the stability problems. <strong>And finally: we solved them!</strong></p>
<p>This third part is about our problems, what caused them, and how we finally solved them.</p>
<h2><strong><span id="more-1643"></span>Problems, Problems</strong></h2>
<h3><strong>Part I &#8211; using Adaptec controllers</strong></h3>
<p>Initially we equipped our storage with one Adaptec 52445 or one Adaptec 51645. These adapters were chosen because they offer a large number of SATA ports (24 and 16, respectively), so we could attach all remaining disks directly without any expander.</p>
<p>During our tests we noticed a complete halt of all I/O under high I/O load. Not even an &#8220;ls&#8221; returned any result. The problem was discussed on the <a href="http://opensolaris.org/jive/thread.jspa?threadID=121445&amp;tstart=0" target="_blank">opensolaris mailing list</a>, but without any insight. Even the message log did not show any error message.</p>
<p>I tried several settings on the controller side, including creating a RAID array, turning off NCQ, raising timeout values and so on. But nothing worked. Under load, sooner or later the I/O to all drives attached to the Adaptec controller came to a complete halt.</p>
<p>After noticing that high I/O to disks attached directly to the motherboard did not &#8220;lock up&#8221; in any way, we were pretty sure our problems were caused by the Adaptec controllers, so we replaced them.</p>
<h3>Part II &#8211; using LSI 1068E-based controllers</h3>
<p>To solve our problems, which were most probably related to the Adaptec controllers, we replaced them with two LSI 1068E-based controllers plus one expander attached to each controller. We used the <a href="http://www.chenbro.com/corporatesite/products_line.php?pos=36" target="_blank">Chenbro CK12804</a> expander.</p>
<p>Unfortunately we observed the same behavior with these controllers as well: the I/O stalled sooner or later.</p>
<p>So I took a look at the hard disks next. For the prototype we ordered forty Seagate ST31000340NS disks with 1 TB each. Note that these disks are &#8211; <a href="http://www.seagate.com/ww/v/index.jsp?vgnextoid=481e83de34b43110VgnVCM100000f5ee0a0aRCRD#tTabContentOverview" target="_blank">according to the Seagate homepage</a> &#8211; designed for:</p>
<pre>"The Barracuda ES.2 drive is the perfect solution for high-capacity enterprise storage
applications such as the migration of mission-critical transactional data, from tier 1
to tier 2 (nearline) storage, where dollars/GB and GB/watt are a primary concern."</pre>
<p>Another criterion for choosing this drive was:</p>
<pre>"- 24x7 operation and 1.2 M hrs. MTBF"</pre>
<p>After making sure we used the most recent firmware (currently &#8220;SN06&#8221; is the most recent firmware) I took a look at the Seagate forums and found an <a href="http://forums.seagate.com/stx/board/message?board.id=ata_drives&amp;message.id=10211&amp;query.id=161260#M10211" target="_blank">interesting thread</a> about Seagate ES.2 drives (the drive we used) and some firmware problems. One post in particular caught my attention:</p>
<pre>I've got a setup with two 1TB disks which previously used the AN05 firmware and they
worked just fine, no matter how many GB's of data I copied to/from the RAID1 mirror.
Right <strong>after updating the firmware to suggested SN06 version it seems that I must
disable <strong>NCQ</strong> completely or the system will hang in 30 mins if there are any data being
copied</strong> from the disks. Because there has not been any guarantees that AN05-version of
the firmware doesn't suffer from the disk locking bug, I did not feel comfortable in
continuing using that version even though it seemed to work just fine.</pre>
<p>Well, I thought to myself: &#8220;You&#8217;ve already tried disabling NCQ at /etc/system level and at controller level with the Adaptec controller, but anyway &#8211; try it!&#8221;. So I turned off NCQ on the LSI controllers with the help of LSIutil.</p>
<p>After booting the system again I put some I/O load on it and waited for the problem to appear again. After waiting pessimistically for days, it did not appear. So I rebooted the system once more, increased the I/O pressure and waited again. Soon afterwards some errors were shown:</p>
<pre>Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice]
scsi_state = 0, transfer count = 1400, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@a,0 (sd52):
Jan 29 12:00:48 openstorage     incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage     unknown ioc_status = 4
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.notice]
scsi_state = 0, transfer count = 12800, scsi_status = 0
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@8,0 (sd50):
Jan 29 12:00:48 openstorage     incomplete read- retrying</pre>
<p>Despite these errors, a complete lockup of all I/O was not observed anymore. So disabling NCQ did the trick, as the following days (and weeks) proved.</p>
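<p>To see which devices are affected, warnings like the ones above can be tallied per driver instance. The following Python snippet is only a minimal sketch: the function name count_incomplete_reads and the sample text are made up for illustration, and it assumes the three-line layout seen in the log excerpt (warning line, device path line, message line).</p>
<pre>import re
from collections import Counter

# Shortened sample in the layout of the messages quoted above.
SAMPLE_LOG = """\
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@a,0 (sd52):
Jan 29 12:00:48 openstorage     incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage     unknown ioc_status = 4
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0/sd@8,0 (sd50):
Jan 29 12:00:48 openstorage     incomplete read- retrying
"""

# A device path line ends with the driver instance in parentheses, e.g. "(sd52):".
DEVICE_LINE = re.compile(r"\((\w+)\):\s*$")


def count_incomplete_reads(log_text):
    """Count 'incomplete read- retrying' messages per driver instance."""
    counts = Counter()
    last_device = None
    for line in log_text.splitlines():
        match = DEVICE_LINE.search(line)
        if match:
            # Remember the device named right before the message line.
            last_device = match.group(1)
        elif "incomplete read- retrying" in line and last_device is not None:
            counts[last_device] += 1
            last_device = None
    return counts


print(count_incomplete_reads(SAMPLE_LOG))</pre>
<p>For the sample text this reports one retry each for sd52 and sd50; the mpt1 warning is not counted because it carries a different message.</p>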
<p>Obviously Seagate&#8217;s SATA &#8220;server&#8221; disks have some kind of problem with NCQ. Using an older firmware was not an option because older firmware versions might cause the disks to fail completely. A newer firmware than SN06 was not available either. I won&#8217;t comment on this any further, but most likely I will think twice before buying any Seagate disks again. I even contacted Seagate support but have not heard anything from them.</p>
<p>Interestingly, disabling NCQ on the Adaptec controller and even at /etc/system level did not work. I have not investigated this further.</p>
<h3>Part III &#8211; getting rid of &#8220;incomplete read- retrying&#8221; messages</h3>
<p>After replacing the Adaptec controller with LSI controllers and disabling NCQ for all disks the system stabilized. Even extremely high I/O did not lock it up anymore.</p>
<p>But there were still error messages like this:</p>
<pre>Jan 29 12:00:48 openstorage     incomplete read- retrying
Jan 29 12:00:48 openstorage scsi: [ID 107833 kern.warning] WARNING:
/pci@0,0/pci8086,340c@5/pci1014,396@0 (mpt1):
Jan 29 12:00:48 openstorage     unknown ioc_status = 4
</pre>
<p>While debugging these errors I noticed that every time an &#8220;incomplete read- retrying&#8221; message was issued, the error counters at the physical link level were incremented:</p>
<pre>Adapter Phy 6:  Link Up
 Invalid DWord Count                                      1
 Running Disparity Error Count                            1
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0
</pre>
<pre>&lt;INCOMPLETE READ ERROR MESSAGE&gt;</pre>
<pre>Adapter Phy 6:  Link Up
 Invalid DWord Count                                      2
 Running Disparity Error Count                            2
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0</pre>
<p>So it seemed related either to the physical connection or, still, to the hard disks. Before digging further at the hard disk level we replaced the cables with new ones. The error persisted.</p>
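<p>Comparing two such counter dumps by hand gets tedious with many phys. The Python sketch below shows one way to compute the difference between a dump taken before and one taken after an error message. It assumes the plain-text layout shown above (counter name, at least two spaces, value); the names parse_counters and counter_deltas are hypothetical helpers and not part of any LSI tool.</p>
<pre>import re

BEFORE = """\
Adapter Phy 6:  Link Up
 Invalid DWord Count                                      1
 Running Disparity Error Count                            1
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0
"""

AFTER = """\
Adapter Phy 6:  Link Up
 Invalid DWord Count                                      2
 Running Disparity Error Count                            2
 Loss of DWord Synch Count                                0
 Phy Reset Problem Count                                  0
"""

# Counter name, two or more spaces, then the numeric value at the end of the line.
COUNTER_LINE = re.compile(r"^\s*(.+?)\s{2,}(\d+)\s*$")


def parse_counters(text):
    """Return a dict mapping counter name to its integer value."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed, with the difference."""
    old = parse_counters(before)
    new = parse_counters(after)
    deltas = {}
    for name, value in new.items():
        diff = value - old.get(name, 0)
        if diff != 0:
            deltas[name] = diff
    return deltas


print(counter_deltas(BEFORE, AFTER))</pre>
<p>For the two dumps above only &#8220;Invalid DWord Count&#8221; and &#8220;Running Disparity Error Count&#8221; show up, each with a difference of 1.</p>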
<p>In the thread quoted above the user not only reported problems with NCQ but also changed the SATA interface speed from 3 Gbit/s to 1.5 Gbit/s. According to his findings changing the speed is not a viable workaround. But I tried it anyway. After changing the speed of the drives&#8217; interface to 1.5 Gbit/s the error disappeared as well.</p>
<p>The controller configuration looks like this:</p>
<pre>Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 16

SAS1068E's links are 1.5 G, 1.5 G, 1.5 G, 1.5 G, 3.0 G, 3.0 G, 3.0 G, down
 B___T     SASAddress     PhyNum  Handle  Parent  Type
        500605b0017ff110           0001           SAS Initiator
        500605b0017ff111           0002           SAS Initiator
        500605b0017ff112           0003           SAS Initiator
        500605b0017ff113           0004           SAS Initiator
        500605b0017ff114           0005           SAS Initiator
        500605b0017ff115           0006           SAS Initiator
        500605b0017ff116           0007           SAS Initiator
        500605b0017ff117           0008           SAS Initiator
 0   5  09221b095e5c7c67     0     0009    0001   SATA Target
 0   6  09221b095d6f556b     1     000a    0002   SATA Target
 0   7  09221b087d585275     2     000b    0003   SATA Target
 0   8  09221b087a5a7a6c     3     000c    0004   SATA Target
        5001c450000c4700     4     000d    0005   Edge Expander
 0   9  5001c450000c470c    12     000e    000d   SATA Target
 0  10  5001c450000c470d    13     000f    000d   SATA Target
 0  11  5001c450000c470e    14     0010    000d   SATA Target
 0  12  5001c450000c470f    15     0011    000d   SATA Target
 0  13  5001c450000c4710    16     0012    000d   SATA Target
 0  14  5001c450000c4711    17     0013    000d   SATA Target
 0  15  5001c450000c4712    18     0014    000d   SATA Target
 0  16  5001c450000c4713    19     0015    000d   SATA Target
 0  17  5001c450000c4714    20     0016    000d   SATA Target
 0  18  5001c450000c4715    21     0017    000d   SATA Target
 0  19  5001c450000c4716    22     0018    000d   SATA Target
 0  20  5001c450000c4717    23     0019    000d   SATA Target
 0  21  5001c450000c4718    24     001a    000d   SATA Target
 0  22  5001c450000c4719    25     001b    000d   SATA Target
 0  23  5001c450000c471a    26     001c    000d   SATA Target
 0  24  5001c450000c471b    27     001d    000d   SATA Target
 0  25  5001c450000c473d    28     001e    000d   SAS Initiator and Target</pre>
<pre>Type      NumPhys    PhyNum  Handle     PhyNum  Handle  Port  Speed
Adapter      8          0     0001  --&gt;    0     0009     0    1.5
                        1     0002  --&gt;    0     000a     1    1.5
                        2     0003  --&gt;    0     000b     2    1.5
                        3     0004  --&gt;    0     000c     3    1.5
                        4     0005  --&gt;    0     000d     4    3.0
                        5     0005  --&gt;    1     000d     4    3.0
                        6     0005  --&gt;    2     000d     4    3.0
Expander    30          0     000d  --&gt;    4     0005     4    3.0
                        1     000d  --&gt;    5     0005     4    3.0
                        2     000d  --&gt;    6     0005     4    3.0
                       12     000d  --&gt;    0     000e     4    1.5
                       13     000d  --&gt;    0     000f     4    1.5
                       14     000d  --&gt;    0     0010     4    1.5
                       15     000d  --&gt;    0     0011     4    1.5
                       16     000d  --&gt;    0     0012     4    1.5
                       17     000d  --&gt;    0     0013     4    1.5
                       18     000d  --&gt;    0     0014     4    1.5
                       19     000d  --&gt;    0     0015     4    1.5
                       20     000d  --&gt;    0     0016     4    1.5
                       21     000d  --&gt;    0     0017     4    1.5
                       22     000d  --&gt;    0     0018     4    1.5
                       23     000d  --&gt;    0     0019     4    1.5
                       24     000d  --&gt;    0     001a     4    1.5
                       25     000d  --&gt;    0     001b     4    1.5
                       26     000d  --&gt;    0     001c     4    1.5
                       27     000d  --&gt;    0     001d     4    1.5
                       28     000d  --&gt;    0     001e     4    3.0</pre>
<p>As you can see the connection Controller &#8211;&gt; Expander has a bandwidth of 3.0 Gbit/s whereas Expander &#8211;&gt; Drive is limited to 1.5 Gbit/s.</p>
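<p>As a rough sanity check of what the lower link speed costs, the usable bandwidth can be estimated with a few lines of Python. This is only a back-of-the-envelope sketch. It assumes 8b/10b line encoding (10 line bits per payload byte), which SATA and SAS use at these speeds, and it reads the listing above as three 3.0 Gbit/s lanes between controller and expander with 16 disks behind the expander. Protocol overhead is ignored.</p>
<pre>def payload_mb_per_s(line_rate_gbit):
    """Usable payload rate in MB/s of one 8b/10b encoded link."""
    # 8b/10b encoding puts 10 bits on the wire for every payload byte.
    return line_rate_gbit * 1000 / 10


DISK_LINK_GBIT = 1.5      # expander to disk, after the change
UPLINK_GBIT = 3.0         # controller to expander, per lane
UPLINK_LANES = 3          # adapter phys 4, 5 and 6 in the listing
DISKS_ON_EXPANDER = 16    # expander phys 12 to 27

per_disk_link = payload_mb_per_s(DISK_LINK_GBIT)
uplink_total = UPLINK_LANES * payload_mb_per_s(UPLINK_GBIT)
fair_share = uplink_total / DISKS_ON_EXPANDER

print("per-disk link limit:", per_disk_link, "MB/s")
print("uplink total:", uplink_total, "MB/s")
print("uplink share per disk:", fair_share, "MB/s")</pre>
<p>Under these assumptions each disk link still carries 150 MB/s, while the shared uplink offers about 56 MB/s per disk when all 16 disks stream at once. So for the disks behind the expander, the 1.5 Gbit/s setting is unlikely to be the limiting factor.</p>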
<h2>Conclusion</h2>
<p>After disabling NCQ at controller level and limiting the bandwidth between expander and hard disk to 1.5 Gbit/s for each disk, all problems are gone.</p>
<p>All? Yes, indeed. For three weeks there have been NO error messages and the system itself is rock-stable. During that period one hard disk physically failed and had to be replaced (remember, I won&#8217;t comment on the quality of these &#8220;server&#8221; hard disks).</p>
<p>Tomorrow or the day after tomorrow I will post Part IV covering first performance benchmarks.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ronnyegner-consulting.de/2010/02/16/building-a-custom-and-cheap-storage-server-yourself-%e2%80%93-part-iii/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Building a custom and cheap storage server yourself &#8211; Part II</title>
		<link>http://blog.ronnyegner-consulting.de/2010/01/11/building-a-custom-and-cheap-storage-server-yourself-part-ii/</link>
		<comments>http://blog.ronnyegner-consulting.de/2010/01/11/building-a-custom-and-cheap-storage-server-yourself-part-ii/#comments</comments>
		<pubDate>Mon, 11 Jan 2010 15:15:58 +0000</pubDate>
		<dc:creator>Ronny Egner</dc:creator>
				<category><![CDATA[Openstorage]]></category>

		<guid isPermaLink="false">http://blog.ronnyegner-consulting.de/?p=1552</guid>
		<description><![CDATA[It&#8217;s been a few months since I posted the idea of building a custom-made storage system in my blog. During this time we convinced people to give our idea a try, made some minor changes to the box layout and ordered the parts. Finally they arrived on 27 December 2009. So this is part II [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a few months since I posted the idea of building a custom-made storage system in my blog. During this time we convinced people to give our idea a try, made some minor changes to the box layout and ordered the parts. Finally they arrived on 27 December 2009.</p>
<p>So this is part II of the project called &#8220;Building a custom and cheap storage server yourself&#8221;. For people who don&#8217;t know what I am talking about: part I can be found <a href="http://blog.ronnyegner-consulting.de/2009/11/06/building-a-custom-and-cheap-storage-server-yourself/" target="_blank">here</a> and part III <a href="http://blog.ronnyegner-consulting.de/2010/02/16/building-a-custom-and-cheap-storage-server-yourself-%e2%80%93-part-iii/" target="_blank">here</a>.</p>
<h2>Building the box</h2>
<p>We had a budget of approx. 12,000 euros for building the prototype of the storage box. We decided to build the prototype with 40 disks from the start and fit a SAN HBA in it to try <a href="http://hub.opensolaris.org/bin/view/Project+comstar/" target="_blank">COMSTAR</a> (this enables us to export the storage via SAN to other servers). As operating system we chose <a href="http://hub.opensolaris.org/bin/view/Main/" target="_blank">Open Solaris</a>. Two disks are dedicated to the operating system and are attached directly to the mainboard. The remaining 38 disks are attached to either one Adaptec 52445 or one Adaptec 51645 controller.</p>
<p>Below are some pictures of the components and the final assembled box. Click on the image for a larger version:</p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_001.jpg"><img class="alignnone size-medium wp-image-1618" title="openstorage_001" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_001-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><span id="more-1552"></span><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_002.jpg"><img class="alignnone size-medium wp-image-1619" title="openstorage_002" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_002-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_003.jpg"><img class="alignnone size-medium wp-image-1620" title="openstorage_003" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_003-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_004.jpg"><img class="alignnone size-medium wp-image-1621" title="openstorage_004" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_004-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_005.jpg"><img class="alignnone size-medium wp-image-1622" title="openstorage_005" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_005-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_006.jpg"><img class="alignnone size-medium wp-image-1623" title="openstorage_006" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_006-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_007.jpg"><img class="alignnone size-medium wp-image-1624" title="openstorage_007" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_007-300x199.jpg" alt="" width="300" height="199" /></a></p>
<p><a href="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_009.jpg"><img class="alignnone size-medium wp-image-1626" title="openstorage_009" src="http://blog.ronnyegner-consulting.de/wp-content/uploads/openstorage_009-300x199.jpg" alt="" width="300" height="199" /></a></p>
<h2>Installing</h2>
<p>After putting everything in place we started to install Open Solaris for the first time. But wait &#8211; we had forgotten a DVD drive. Fortunately USB-attached DVD drives are widely available, so we attached one and started the installation.</p>
<p>This time we were able to boot from DVD, but the installation did not start. A short glance at the mainboard manufacturer&#8217;s site yielded:</p>
<pre>Fixed an issue where the system could not boot from a DVD ROM when an Adaptec Raid Card
(ASR-5805) was installed on any PCI-e Slot</pre>
<p>Yup, we had a similar controller installed. So &#8211; instead of quickly installing the operating system and playing around a little bit &#8211; we patched all components to the most recent version (BIOS, HBA, and so on).</p>
<p>After doing so installation went fine and the system booted Open Solaris 2009.06 for the first time.</p>
<h2>Patching (again)</h2>
<p>Open Solaris 2009.06 is quite old and we needed the latest features of ZFS and COMSTAR so we upgraded our Open Solaris. There are basically two versions available:</p>
<ul>
<li>A &#8220;release&#8221; version (currently as of December 2009: Build 111) of Open Solaris (<a href="http://pkg.opensolaris.org/release/en/index.shtml">Repository Link</a>)</li>
<li>A &#8220;development&#8221; version (currently as of December 2009: Build 130) of Open Solaris (<a href="http://pkg.opensolaris.org/dev/en/index.shtml" target="_blank">Repository Link</a>)</li>
</ul>
<p>The upgrade process is pretty straightforward:</p>
<p>Upgrading to the most recent &#8220;release&#8221; version is done by entering:</p>
<pre>pfexec pkg image-update
</pre>
<p>Upgrading to the most recent &#8220;development&#8221; build is done by:</p>
<pre>$ pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
$ pfexec pkg image-update</pre>
<p>We first patched to the most recent &#8220;release&#8221; build and afterwards patched to the most recent &#8220;development&#8221; build. Apart from the painfully slow fetching of the packages (*really* slow&#8230; it ran for several hours for a few hundred MB) everything worked fine and we ended up with two bootable configurations:</p>
<ul>
<li>Release &#8211; Build 111</li>
<li>Development &#8211; Build 130</li>
</ul>
<p>We continued our work with the development build due to the fact we wanted deduplication and the most recent version with the most recent fixes included.</p>
<h2>Configure</h2>
<h3>Replacing the Adaptec controller driver</h3>
<p>In order to use the drives attached to the Adaptec controller we had to replace the driver shipped with Open Solaris with the appropriate driver from Adaptec.</p>
<h3>Creating the ZFS pool</h3>
<p>After successfully booting with Build 130 and replacing the Adaptec driver we created our first ZFS pool &#8220;pool1&#8221; with a total capacity of 20 TB:</p>
<pre>zpool create pool1 raidz2 c10t0d0s0 c11t0d0s0 c12t0d0s0 c13t0d0s0 c14t0d0s0 c15t0d0s0 c16t0d0s0 c17t0d0s0 \
c18t0d0s0 c19t0d0s0 c20t0d0s0 c21t0d0s0 c22t0d0s0 c23t0d0s0 c24t0d0s0 c25t0d0s0 c26t0d0s0 c27t0d0s0 \
c28t0d0s0 c29t0d0s0 c30t0d0s0 c31t0d0s0</pre>
<p>After several seconds the zpool was created successfully. So far &#8211; so cool.</p>
<p>I know a raid1z pool with 20 disks is pretty uncommon and I would never ever use this configuration in production due to the extremely high probability of a double-disk-failure. But for the very first tests it seemed acceptable.</p>
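<p>The 20 TB figure can be cross-checked with a small Python calculation. This is a sketch under stated assumptions: the disks are 1 TB each, the pool uses double parity (raidz2) as the zpool command above suggests, and ZFS metadata overhead as well as the TB versus TiB difference are ignored. The helper name raidz_usable_tb is made up for this example.</p>
<pre>def raidz_usable_tb(disk_count, parity_disks, disk_size_tb):
    """Rough usable capacity of one raidz group, ignoring metadata overhead."""
    return (disk_count - parity_disks) * disk_size_tb


# Device names c10t0d0s0 through c31t0d0s0, as listed in the zpool command.
devices = ["c%dt0d0s0" % n for n in range(10, 32)]

print(len(devices))
print(raidz_usable_tb(len(devices), parity_disks=2, disk_size_tb=1.0))</pre>
<p>The command lists 22 devices, so two parity disks leave 20 TB of usable space. With single parity the same formula would give 21 TB.</p>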
<h3>Install missing packages</h3>
<p>For using COMSTAR (and iSCSI as well) we needed several packages which could be installed with the help of the package manager (&#8220;pkg&#8221;) easily.</p>
<h3>Booting after COMSTAR and ZPool creation</h3>
<p>The first boot after installing the COMSTAR packages, replacing the qlc with the qlt driver (to export our storage over SAN) and creating the large zpool caused the system to crash during boot.</p>
<p>It took me some time to find out that package management, when installing the COMSTAR packages, had silently uninstalled the Adaptec-supplied controller driver we installed before and replaced it with the original Open Solaris driver (which cannot use disks attached to the Adaptec controller). Booting afterwards caused the system to crash.</p>
<h3>Testing and Crashing</h3>
<p>Currently we are testing overall system performance. While doing so we faced problems where I/O to the Adaptec controller would hang occasionally. The system is still responsive (most probably because the system disks are not attached via the Adaptec controller) but I/O to the data pool is impossible. I&#8217;ve posted the problem on the <a href="http://opensolaris.org/jive/thread.jspa?threadID=121445&amp;tstart=0" target="_blank">opensolaris mailing list</a> but so far without replies.</p>
<h2>Conclusion: currently crashing</h2>
<p>Although we are facing some problems we will continue this project. First of all we separated the hard disks on each controller into their own zpools. If this is a hardware issue the error should be confined to one zpool only. In addition to that we will replace the controllers with LSI-based ones, which are also used in the Sun Thumper systems.</p>
<p>When the system runs, performance is good: we observed up to 1 GB/s (not Gbit!), i.e. 1000 MB/s, sequential read/write speed and up to 7500 I/O operations per second. Exporting storage via SAN works as well with decent speed: we observed up to 320 MB/s on a QLogic 4 Gbit/s HBA running under Linux SLES 10 SP3.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ronnyegner-consulting.de/2010/01/11/building-a-custom-and-cheap-storage-server-yourself-part-ii/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Building a custom and cheap storage server yourself</title>
		<link>http://blog.ronnyegner-consulting.de/2009/11/06/building-a-custom-and-cheap-storage-server-yourself/</link>
		<comments>http://blog.ronnyegner-consulting.de/2009/11/06/building-a-custom-and-cheap-storage-server-yourself/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 05:00:18 +0000</pubDate>
		<dc:creator>Ronny Egner</dc:creator>
				<category><![CDATA[Openstorage]]></category>

		<guid isPermaLink="false">http://ronnyegner.wordpress.com/?p=784</guid>
		<description><![CDATA[Recently I came across a project where they built their own cheap storage. The whole story is documented here and here. A colleague of mine and I saw this project and wondered if this kind of storage could be used for databases as well. So we analyzed the design and noticed some problems from our [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I came across a project where they built their own cheap storage. The whole story is documented <a href="http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/" target="_blank">here</a> and <a href="http://blog.backblaze.com/2009/10/07/backblaze-storage-pod-vendors-tips-and-tricks/" target="_blank">here</a>.</p>
<p>A colleague of mine and I saw this project and wondered if this kind of storage could be used for databases as well. So we analyzed the design and noticed some problems from our point of view:</p>
<ul>
<li>data access only via HTTP</li>
<li>they used the JFS file system which is not widely used</li>
<li>the hard disks are hot-swappable in principle, but this is not used for fear of problems</li>
<li>optimized for space rather than for speed</li>
<li>relatively &#8220;weak&#8221; power supply</li>
</ul>
<p>So we tried to improve the layout with the following constraints:</p>
<ul>
<li>approx. 10,000 euros (approx. 15,000 US dollars) in total</li>
<li>Storage accessible via multiple protocols:
<ul>
<li>NFS</li>
<li>iSCSI</li>
<li>CIFS</li>
<li>if possible SAN</li>
</ul>
</li>
<li>Reliable</li>
<li>Optimized for speed rather than capacity (remember: we talked about databases)</li>
<li>Hot-swappable hard disks</li>
</ul>
<p>This is part I of our journey towards building a storage system ourselves. Part II is <a href="http://blog.ronnyegner-consulting.de/2010/01/11/building-a-custom-and-cheap-storage-server-yourself-part-ii/" target="_blank">here</a> and Part III <a href="http://blog.ronnyegner-consulting.de/2010/02/16/building-a-custom-and-cheap-storage-server-yourself-%e2%80%93-part-iii/" target="_blank">here</a>.</p>
<p><span id="more-784"></span></p>
<h2>Required Components</h2>
<p>What components do we need?</p>
<ul>
<li>Case</li>
<li>Power Supply</li>
<li>Motherboard</li>
<li>CPU</li>
<li>Memory</li>
<li>Storage HBAs for the hard disks (SAS/SATA)</li>
<li>SAS/SATA Expander if needed</li>
<li>Network Interface Cards (1 / 10 Gbit/s)</li>
<li>SAN HBA</li>
<li>Operating System</li>
</ul>
<h3>Case</h3>
<p>Cases in several sizes are available from several vendors. Most of them are built in Asia and sold by resellers all over the world.</p>
<p>To support hot-swapping hard disks, the case has to support this as well. So we looked at several cases and came up with <a href="http://www.hq-solutions.de/index.php?id=111" target="_blank">this case here</a>:</p>
<ul>
<li>up to 42 hot swappable hard disks</li>
<li>three redundant power supplies</li>
<li>room for a quad cpu board</li>
<li>Backplane
<ul>
<li>ten backplanes connecting four drives each</li>
<li>backplane connected by SFF 8087 (aka. &#8220;mini-SAS&#8221;) cables with controller</li>
<li>separate SFF8087 cable required for connecting system drives with controller</li>
<li>so for connecting all 42 disks (40 + 2) you need 11 SFF8087 cables</li>
</ul>
</li>
</ul>
<p>We are running this project in Germany, so we chose a German vendor. Searching Google for the case&#8217;s name (&#8220;RSC-8ED-0Q1&#8221;) shows that you can buy the case from several vendors, for instance <a href="http://www.aicipc.com/ProductDetail.aspx?ref=RSC-8ED-0Q1" target="_blank">here</a>.</p>
<p>Some pictures and technical data sheet:</p>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case.jpg"><img class="alignnone size-full wp-image-786" title="cheap_storage_case" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case.jpg" alt="cheap_storage_case" width="292" height="277" /></a></p>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case_datasheet1.png"><img class="alignnone size-full wp-image-787" title="cheap_storage_case_datasheet1" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case_datasheet1.png" alt="cheap_storage_case_datasheet1" width="440" height="431" /></a></p>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case_datasheet2.png"><img class="alignnone size-full wp-image-788" title="cheap_storage_case_datasheet2" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_case_datasheet2.png" alt="cheap_storage_case_datasheet2" width="442" height="478" /></a></p>
<h3>Mainboard</h3>
<p>According to the data sheet there are two system boards available. We have chosen the Tyan board with the following technical specifications:</p>
<ul>
<li>integrated graphic controller</li>
<li>Slots
<ul>
<li>2x PCI-E x16; x16 sig.</li>
<li>2x PCI-E x16; x4 sig.</li>
<li>1 PCI 32-bit</li>
</ul>
</li>
<li>8 SATA-2-Ports</li>
<li>2x USB 2.0</li>
<li>3x 1 Gbit/s-NIC</li>
<li>4x 1207-pin socket for AMD Opteron (Rev. F) 8000er CPUs</li>
<li>16x DDR2-DIMMS (max. 64 GB RAM)</li>
</ul>
<p>The costs for the system board are approx 250 euros or 368 US-$.</p>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_mainboard.gif"><img class="alignnone size-full wp-image-791" title="cheap_storage_mainboard" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_mainboard.gif" alt="cheap_storage_mainboard" width="450" height="215" /></a></p>
<h3>Memory</h3>
<ul>
<li>system board supports up to 16 DDR2 DIMM Modules</li>
<li>up to 64 GB Memory</li>
</ul>
<p>Due to the current low price of memory we decided to put the largest supported amount of memory (64 GB) on the system board.</p>
<p>One 4 GB memory module costs approx. 80 euros (or 117.8 US-$). So sixteen of these modules add up to 1280 euros or 1885 US-$.</p>
<h3>CPU</h3>
<p>The board imposes some restrictions on the type of CPU we can use. We decided to fit four quad core CPUs. The CPU chosen was:</p>
<ul>
<li>AMD Opteron 2350 (Socket F, 65nm, Barcelona, OS2350WAL4BGHWOF)</li>
<li>Socket F</li>
<li>Clock Frequency: 2,000 MHz</li>
<li>Quad-Core</li>
<li>Type of core: Barcelona</li>
<li>Stepping (Revision): B3-Stepping</li>
<li>FSB: 1,000 MHz</li>
<li>QuickPath Interfac</li>
<li>HyperTransport: 2.0 GT/s</li>
<li>Second-Level-Cache: 4 x 512 KB</li>
</ul>
<p>One CPU costs approx. 250 euros (or 368.2 US-$), so four CPUs add up to 1000 euros (or 1472.75 US-$).</p>
<h3>SAS/SATA HBA and Expander Cards</h3>
<p>We needed to attach 42 hard disks to the system. Two disks are attached with a single mini-SAS cable. The remaining forty disks are attached by ten mini-SAS cables (each cable attaching four disks). So we had to attach a total of eleven mini-SAS (SFF-8087) cables.</p>
<h4>SAS Cabling</h4>
<ul>
<li>two disks connected over a single mini-SAS cable directly to the system board</li>
<li>20 disks connected over 5 mini-SAS cables to sas-expander A
<ul>
<li>sas-expander A connected over one mini-SAS cable to sas-controller A</li>
</ul>
</li>
<li>20 disks connected over 5 mini-SAS cables to sas-expander B
<ul>
<li>sas-expander B connected over one mini-SAS cable to sas-controller B</li>
</ul>
</li>
</ul>
<h4>Storage Layout</h4>
<p>In summary, our cabling layout gives us three distinct failure components from a storage point of view:</p>
<ul>
<li>sas controller on system board</li>
<li>sas controller A</li>
<li>sas controller B</li>
</ul>
<p>We will bear this in mind when designing the RAID groups later on.</p>
<h4>Components</h4>
<ul>
<li>2x SAS Controller: Adaptec 2405 with the following features according to the documentation:
<ul>
<li>128 MB Cache</li>
<li>Supports 4 direct-attached or up to 128 SATA or SAS disk drives using SAS expanders</li>
<li>Quick initialization</li>
<li>Online Capacity Expansion</li>
<li>Copyback Hot Spare</li>
<li>Dynamic caching algorithm</li>
<li>Native Command Queuing (NCQ)</li>
<li>Background initialization</li>
<li>Hot-plug drive support</li>
<li>RAID Level Migration</li>
<li>Hot spares – global, dedicated, and pooled</li>
<li>Automatic/manual rebuild of hot spares</li>
<li>SAF-TE enclosure management</li>
<li>Configurable stripe size</li>
<li>S.M.A.R.T. support</li>
<li>Multiple arrays per disk drive</li>
<li>Bad stripe table</li>
<li>Dynamic sector repair</li>
<li>Staggered drive spin-up</li>
<li>Bootable array support</li>
<li>Optimized Disk Utilization</li>
<li>The controller offers RAID0, RAID1 and RAID10 in &#8220;hardware&#8221;; we do not use this, however, because we use ZFS</li>
</ul>
</li>
</ul>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_controller.png"><img class="alignnone size-full wp-image-789" title="cheap_storage_controller" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_controller.png" alt="cheap_storage_controller" width="285" height="220" /></a></p>
<ul>
<li>2x SAS Expander: CHENBRO Low Profile 28-port SAS expander card
<ul>
<li>Input: up to 6 mini-SAS cables coming from the back plane</li>
<li>Output: one mini-SAS cable to the raid controller</li>
</ul>
</li>
</ul>
<p><a href="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_expander.jpg"><img class="alignnone size-full wp-image-790" title="cheap_storage_expander" src="http://ronnyegner.files.wordpress.com/2009/10/cheap_storage_expander.jpg" alt="cheap_storage_expander" width="336" height="223" /></a></p>
<h3>NICs (1 Gbit/s)</h3>
<p>Three NICs are already on-board, so nothing needs to be done here.</p>
<p>When building the system we have to measure the network performance of these interfaces. If the performance is insufficient, we need to replace the on-board NICs with dedicated NICs, for instance from Intel. When doing so, the price of 10 GE NICs should be checked again.</p>
<h3>NICs (10 Gbit/s)</h3>
<p>Due to lack of support in the core switches we did not install a 10 GE NIC.</p>
<p>If you want to do so, the following card will fit:</p>
<ul>
<li>Intel 10 GE XF SR NIC PCI-E</li>
<li>Price: 2300 euro or 3387 US-$</li>
</ul>
<h3>SAN HBAs</h3>
<p>Due to the lack of a SAN environment we did not add this feature. If you want to turn your storage system into a SAN target (i.e. one that exports storage), you can do this with the <a href="http://www.opensolaris.org/os/project/comstar/" target="_blank">COMSTAR project</a> shipping with OpenSolaris.</p>
<p>For an impressive demonstration, see <a href="http://www.opensolaris.org/os/project/comstar/video/" target="_blank">here</a>.</p>
<p>For enabling SAN features you need a QLogic (not Emulex) HBA. I recommend two single-port HBAs. One single-port HBA costs approx. 500 euros.</p>
<h3>Operating system</h3>
<p>A main goal of our storage project was minimal cost, so paying for an operating system was not desired. In addition, freely available operating systems like *BSD, Linux or Solaris are extremely stable, ship with a lot of features and mostly perform much better than commercial products.</p>
<p>We tested several operating systems and finally ended up with the OpenSolaris Project for many reasons:</p>
<ul>
<li>extremely stable</li>
<li>updated on a regular basis</li>
<li>wide protocol support available (iSCSI, FCoE, NFS, CIFS, HTTP, &#8230;)</li>
<li>the ZFS file system</li>
<li>the COMSTAR project</li>
</ul>
<p>The last two features are the main reasons for choosing OpenSolaris. ZFS is a highly integrated, extremely powerful and flexible file system, perhaps the most powerful file system currently available, while the COMSTAR project enables us to export our storage over a SAN.</p>
<h3>Storage Layout</h3>
<p>Based on the hardware and software configuration we decided not to use the RAID features of our SAS controllers and to use ZFS with double-parity RAID (RAID-Z2 in ZFS terms) instead. Because parity calculations are CPU intensive we fitted four quad-core CPUs.</p>
<p>Due to our cabling we have three main failure components:</p>
<ul>
<li>the internal (on-board) storage controller</li>
<li>sas controller A in PCI-e-Slot</li>
<li>sas controller B in PCI-e-Slot</li>
</ul>
<p>For keeping the failure groups separated we designed the following raid configuration:</p>
<ul>
<li>two disks attached to the on-board controller:
<ul>
<li>one two-way-mirror for the operating system</li>
</ul>
</li>
<li>twenty disks attached via a sas expander card to sas controller A
<ul>
<li>one raid group consisting of 20 disks with one Hot Spare, two Parity Disks and 17 data disks</li>
</ul>
</li>
<li>twenty disks attached via a sas expander card to sas controller B
<ul>
<li>one raid group consisting of 20 disks with one Hot Spare, two Parity Disks and 17 data disks</li>
</ul>
</li>
</ul>
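<p>As a rough sketch of how one of the 20-disk raid groups above could be expressed in ZFS, the following Python snippet assembles a <code>zpool create</code> command line with 19 disks in one double-parity <code>raidz2</code> group (17 data + 2 parity) and one hot spare. The pool name and the device names are hypothetical placeholders, and mapping each raid group to a single raidz2 group is an assumption, not a tested configuration.</p>

```python
def build_zpool_command(pool_name, devices):
    """Return a 'zpool create' command line for one 20-disk raid group."""
    if len(devices) != 20:
        raise ValueError("expected exactly 20 devices per raid group")
    raidz2_members = devices[:19]  # 17 data disks + 2 parity disks
    hot_spare = devices[19]        # the last disk is kept as hot spare
    return "zpool create {} raidz2 {} spare {}".format(
        pool_name, " ".join(raidz2_members), hot_spare)


# Hypothetical device names for the 20 disks behind sas controller A.
devices_a = ["c1t{}d0".format(i) for i in range(20)]
command_a = build_zpool_command("poola", devices_a)
print(command_a)
```

<p>The snippet only prints the command. Replace the device names with the ones reported by your system before running anything.</p>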
<h3>Hard disks</h3>
<p>In our calculation there is one component left: the hard disks. We fit 42 of them. Forty disks are used as storage to be exported, while two disks are used for the operating system. These two disks can be somewhat smaller.</p>
<p>In the following calculations we will evaluate four different scenarios:</p>
<ul>
<li>40 disks with 1 TB capacity each</li>
<li>40 &#8220;server&#8221; disks with 1 TB capacity each</li>
<li>40 disks with 1.5 TB each</li>
<li>40 disks with  2 TB each</li>
</ul>
<p>For the operating system we chose two disks with 500 GB each. The total price of these two disks is approx. 100 euros or 148 US-$.</p>
<h5>Scenario #1: 40 disks with 1 TB each</h5>
<ul>
<li>1 TB HDD: 53 Euros (for instance: &#8220;Maxtor DiamondMax 23&#8221;)</li>
<li>40x 1 TB HDDs: 2120 Euros</li>
</ul>
<h5>Scenario #2: 40 &#8220;server&#8221; disks with 1 TB capacity each</h5>
<ul>
<li>1 TB &#8220;server&#8221; HDD: 84 Euros (for instance: &#8220;Samsung F1 RAID Class&#8221;)</li>
<li>40x 1 TB &#8220;server&#8221; HDDs: 3360 Euros</li>
</ul>
<h5>Scenario #3: 40 disks with 1.5 TB each</h5>
<ul>
<li>1.5 TB HDD: 84 Euros (for instance: &#8220;Seagate Barracuda 7200.11&#8221;)</li>
<li>40x 1.5 TB HDDs: 3360 Euros</li>
</ul>
<h5>Scenario #4: 40 disks with 2 TB each</h5>
<ul>
<li>2 TB HDD: 134 Euros (for instance: &#8220;WD20000CSRTL2&#8221;)</li>
<li>40x 2 TB HDDs: 5360 Euros</li>
</ul>
<h2>Calculations</h2>
<h3>Usable Capacity</h3>
<ul>
<li>Usable capacity with 1 TB disks:
<ul>
<li>2 pools x (20 - 1 hot spare - 2 parity) x 1000 GB = 34 TB</li>
<li>usable capacity approx. 31.6 TB</li>
</ul>
</li>
<li>Usable capacity with 1.5 TB disks:
<ul>
<li>2 pools x (20 - 1 hot spare - 2 parity) x 1500 GB = 51 TB</li>
<li>usable capacity approx. 47.5 TB</li>
</ul>
</li>
<li>Usable capacity with 2 TB disks:
<ul>
<li>2 pools x (20 - 1 hot spare - 2 parity) x 2000 GB = 68 TB</li>
<li>usable capacity approx. 63.3 TB</li>
</ul>
</li>
</ul>
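<p>The capacity figures above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are ours, and the factor used for the lower &#8220;usable&#8221; figure (decimal GB converted to binary GiB, 10^9 / 2^30) is an assumption that happens to match the numbers in the list; the real usable space also depends on file system overhead.</p>

```python
POOLS = 2
DISKS_PER_POOL = 20
HOT_SPARES_PER_POOL = 1
PARITY_DISKS_PER_POOL = 2


def data_disks():
    """Number of disks that actually hold data across both pools."""
    per_pool = DISKS_PER_POOL - HOT_SPARES_PER_POOL - PARITY_DISKS_PER_POOL
    return POOLS * per_pool


def raw_capacity_tb(disk_size_gb):
    """Capacity of all data disks in decimal TB."""
    return data_disks() * disk_size_gb / 1000.0


def binary_adjusted_tb(disk_size_gb):
    """Assumed conversion: decimal GB (10^9) expressed in binary units (2^30)."""
    return raw_capacity_tb(disk_size_gb) * 10**9 / 2**30


for size_gb in (1000, 1500, 2000):
    print(size_gb, raw_capacity_tb(size_gb), round(binary_adjusted_tb(size_gb), 1))
```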
<h3>Throughput calculation</h3>
<ul>
<li>34 data disks total</li>
<li>technical data for the Seagate 1.5 TB disk:
<ul>
<li>Spindle speed: 7,200 rpm</li>
<li>Average latency: 4.16 msec</li>
<li>Random read seek time: &lt;8.5 msec</li>
<li>Random write seek time: &lt;10.0 msec</li>
<li>Calculation: avg. seek time = (8.5 + 10) / 2 = 9.25 ms (for a 50/50 mixture of reads and writes)</li>
</ul>
</li>
<li>IO calculations:
<ul>
<li>Rotations per millisecond = 7200 rpm / 60000 = 0.12</li>
<li>Full rotation time = 1 / rotations per ms = 1 / 0.12 = 8.33 ms</li>
<li>avg. rot. latency = 8.33 / 2 = 4.17 ms</li>
<li>IO time in ms = avg. seek time + avg. rot. latency = 9.25 + 4.17 = 13.42 ms</li>
<li>IOPS = 1000 / IO time in ms = 74.53 IOPS</li>
</ul>
</li>
<li><strong>Total possible IOPS: 34 disks x 74 IOPS = approx. 2516 IOPS</strong></li>
</ul>
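<p>The same estimate expressed as a small Python sketch. It follows the steps of the list above (average seek time plus average rotational latency); the function name is ours, and the result is a theoretical upper bound for random I/O, not a measured value.</p>

```python
def disk_iops(rpm, read_seek_ms, write_seek_ms):
    """Estimate random IOPS of one disk for a 50/50 read/write mix."""
    avg_seek_ms = (read_seek_ms + write_seek_ms) / 2.0
    full_rotation_ms = 60000.0 / rpm          # one full rotation in ms
    avg_rot_latency_ms = full_rotation_ms / 2.0
    io_time_ms = avg_seek_ms + avg_rot_latency_ms
    return 1000.0 / io_time_ms


DATA_DISKS = 34

per_disk = disk_iops(7200, 8.5, 10.0)
# Round down to whole IOPS per disk before multiplying, as in the text.
total = DATA_DISKS * int(per_disk)
print(round(per_disk, 2), total)
```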
<h3>Costs calculation</h3>
<h4>System components</h4>
<table style="height: 188px;" border="3" width="644">
<tbody>
<tr>
<td><strong>Component</strong></td>
<td><strong>Price in Euros</strong></td>
</tr>
<tr>
<td>Case</td>
<td>2375 €</td>
</tr>
<tr>
<td>system board</td>
<td>250 €</td>
</tr>
<tr>
<td>4x CPU</td>
<td>1000 €</td>
</tr>
<tr>
<td>16x 4 GB DDR2 memory module</td>
<td>1280 €</td>
</tr>
<tr>
<td>2x Adaptec 2405 SAS Controller</td>
<td>270 €</td>
</tr>
<tr>
<td>11x SFF-8087 to SFF-8087 mini-SAS cable, 0.5 meters</td>
<td>228 €</td>
</tr>
<tr>
<td>2x CHENBRO Low Profile 28-port SAS expander card</td>
<td>356 €</td>
</tr>
<tr>
<td>2x 500 GB HDDs for Operating system</td>
<td>100 €</td>
</tr>
<tr>
<td>SUM</td>
<td><strong>5859 € or 8729 US-$</strong></td>
</tr>
</tbody>
</table>
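<p>As a quick cross-check of the sum, here is a short Python sketch that adds up the component prices. Note that the total of 5859 € is only reached when the sixteen memory modules (1280 €, see the Memory section) are included; the dictionary keys are just labels chosen for this example.</p>

```python
# Component prices in euros as quoted in this article.
component_prices_eur = {
    "case": 2375,
    "system board": 250,
    "4x CPU": 1000,
    "16x 4 GB memory module": 1280,  # from the Memory section
    "2x Adaptec 2405 SAS controller": 270,
    "11x mini-SAS cable": 228,
    "2x SAS expander card": 356,
    "2x 500 GB HDD for operating system": 100,
}

base_price_eur = sum(component_prices_eur.values())
print(base_price_eur)
```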
<h4>Hard disks</h4>
<table border="3">
<tbody>
<tr>
<td><strong>Component</strong></td>
<td><strong>Price in Euros</strong></td>
<td><strong>Total price for 40 HDDs in euros</strong></td>
</tr>
<tr>
<td>1 TB HDD</td>
<td>53 €</td>
<td>2120 €</td>
</tr>
<tr>
<td>1 TB &#8220;server&#8221; HDD</td>
<td>84 €</td>
<td>3360 €</td>
</tr>
<tr>
<td>1.5 TB HDD</td>
<td>84 €</td>
<td>3360 €</td>
</tr>
<tr>
<td>2 TB HDD</td>
<td>134 €</td>
<td>5360 €</td>
</tr>
</tbody>
</table>
<table border="3">
<tbody>
<tr>
<td><strong>Hard disk size</strong></td>
<td><strong>Base components price in Euros</strong></td>
<td><strong>Hard disks price in Euros</strong></td>
<td><strong>Total price in Euros</strong></td>
<td><strong>Total price in US-$</strong></td>
<td><strong>Price in Euros per TB</strong></td>
</tr>
<tr>
<td>1 TB</td>
<td>5859 €</td>
<td>2120 €</td>
<td>7979 €</td>
<td>11877 US-$</td>
<td>252.5 €</td>
</tr>
<tr>
<td>1 TB &#8220;server&#8221;</td>
<td>5859 €</td>
<td>3360 €</td>
<td>9219 €</td>
<td>13735 US-$</td>
<td>291.7 €</td>
</tr>
<tr>
<td>1.5 TB</td>
<td>5859 €</td>
<td>3360 €</td>
<td>9219 €</td>
<td>13735 US-$</td>
<td>194 €</td>
</tr>
<tr>
<td>2 TB</td>
<td>5859 €</td>
<td>5360 €</td>
<td>11219 €</td>
<td>16714 US-$</td>
<td>177.24 €</td>
</tr>
</tbody>
</table>
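<p>The price per TB in the last column is the total price divided by the usable capacity from the capacity calculation. A minimal Python sketch of that calculation follows; the function name is ours and the usable capacities are the approximate values from above.</p>

```python
BASE_PRICE_EUR = 5859  # base components without the 40 data disks


def price_per_usable_tb(hdd_total_eur, usable_tb):
    """Total system price in euros divided by usable capacity in TB."""
    return (BASE_PRICE_EUR + hdd_total_eur) / usable_tb


# scenario name: (price for 40 HDDs in euros, usable capacity in TB)
scenarios = {
    "1 TB": (2120, 31.6),
    "1 TB server": (3360, 31.6),
    "1.5 TB": (3360, 47.5),
    "2 TB": (5360, 63.3),
}

for name, (hdd_total_eur, usable_tb) in scenarios.items():
    print(name, round(price_per_usable_tb(hdd_total_eur, usable_tb), 2))
```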
<h2>Usage scenarios</h2>
<ul>
<li>cheap, but reliable and performant multi-purpose storage</li>
<li>if one system seems not reliable enough:
<ul>
<li>use two of them</li>
<li>and mirror them with Oracle ASM for highest availability</li>
</ul>
</li>
</ul>
<h2>The authors</h2>
<p><strong>
<table id="wp-table-reloaded-id-9-no-1" class="wp-table-reloaded wp-table-reloaded-id-9">
<thead>
	<tr class="row-1 odd">
		<th class="column-1">Name</th><th class="column-2">Volodymr Dubinin</th><th class="column-3">Ronny Egner</th>
	</tr>
</thead>
<tbody>
	<tr class="row-2 even">
		<td class="column-1">Company</td><td class="column-2">SIV.AG<br />
Konrad-Zuse-Straße 1<br />
18184 Rogentin<br />
Germany</td><td class="column-3">Ronny Egner Consulting<br />
Vinckestraße 22<br />
40470 Dusseldorf<br />
Germany</td>
	</tr>
	<tr class="row-3 odd">
		<td class="column-1">EMail</td><td class="column-2">vd@siv.de</td><td class="column-3">ronnyegner@gmx.de</td>
	</tr>
	<tr class="row-4 even">
		<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td>
	</tr>
</tbody>
</table>
</strong><strong> </strong></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.ronnyegner-consulting.de/2009/11/06/building-a-custom-and-cheap-storage-server-yourself/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

