
August 14, 2009

How to Measure Storage Efficiency - Part II - Taxes

Only two things in life are certain: death and taxes. This applies not only to life, but also to storage arrays. Drives die, data is written to the wrong place, and data gets lost. However we, like most other storage vendors, do all we can to make sure our customers are not affected by these events. This is done via a variety of practices, most of which involve taking away some of the "Raw" capacity in order to provide some level of redundancy. For that reason I've decided to call them taxes. My rationale for this is that while none of us like taxes, most of us value the services they pay for. I could have called them reserves, or something less scary and more marketyish (I just made that word up, I kind of like it), but as a label I think "taxes" is sufficiently descriptive.

In this post, I'll only cover those areas that I think should be measured in Base-10 SI units, and only those things over which the customer has little or no choice (much like real taxes), or where there are default or best practice recommendations that are implemented most of the time. If you think the breakdown is wrong, confusing, or misleading, or that I've left something out, let me know and I'll try to address it in subsequent posts.

Whole Disk Taxes

Hot Spare Tax

Explanation

Disks fail, often at inconvenient times, which is why we have RAID (I'll discuss that later). Unfortunately, when one of the disks fails, the RAID group in question runs in a “degraded” mode. Depending on the RAID configuration, this degraded mode may have a negative performance impact and may leave the RAID group unprotected. Although neither of these things is true of RAID-DP, we, like all other vendors, strongly recommend that some disks or disk capacity be reserved so that the data on the failed disk can be reconstructed quickly and easily from the data contained in the rest of the RAID group. For this reason it's usually a good idea to have two disks of each type available for reconstruction, as you never want to be left without at least one hot spare if you can avoid it. Now in theory, if you've got dual parity RAID and a fairly short delivery time for a replacement disk, you should be able to get by without any hot spares at all, but this is hardly what I'd call “Best Practice”, and it goes against NetApp's engineering approach, which emphasizes reliability and preservation of data above all else.

Definition

The "Raw" Capacity of the disks allocated to Hot Spares

Measurement Units

SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes

Issues

Some architectures don't use dedicated disks for hot spares, but instead allocate spare areas on a number of disks to fulfill the same function. As far as I know, EVA and XIV both fall into this category; however, the same amount of disk space is allocated to hot spare space as would be for dedicated physical disks, so logically it ends up the same.
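As a worked example, here's a quick back-of-the-envelope calculation of the hot spare tax in Python (the drive count and size are numbers I've picked purely for illustration):

# Hot spare tax for a hypothetical configuration:
# 48 x 300GB drives with 2 reserved as hot spares (numbers are illustrative only).

RAW_GB_PER_DISK = 300   # "Raw" capacity in SI gigabytes
TOTAL_DISKS = 48
HOT_SPARES = 2          # two spares of this disk type, per the recommendation above

hot_spare_tax_gb = HOT_SPARES * RAW_GB_PER_DISK
tax_percent = 100.0 * hot_spare_tax_gb / (TOTAL_DISKS * RAW_GB_PER_DISK)

print(f"Hot spare tax: {hot_spare_tax_gb} GB ({tax_percent:.1f}% of raw)")
# Hot spare tax: 600 GB (4.2% of raw)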

Other “Whole Disk” Taxes

Explanation

An example of this would be where a NetApp customer uses dedicated disks for root volumes. This may be considered a “desirable” configuration where there is a significant number of spindles in the overall system. For those of you who are interested, the reasons for using (or not using) dedicated disks for root volumes can be found at http://media.netapp.com/documents/tr-3437.pdf

Another example is the disks used by a CLARiiON to hold the FLARE operating system and act as a location to dump uncommitted writes from cache in the case of a complete power failure. I've seen configurations where these disks were dedicated to this purpose and the customer would never place any production load on them. I'm not sure if this is typical; however, I assume it too would be considered “desirable”, if not a best practice.

Definition

The "Raw" capacity of whole disks allocated to "Vendor only" functions other than "hot spares"

Measurement Units

SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes

RAID Tax

Explanation

I wasn't sure whether to put RAID under data protection, under whole disk taxes, or in a category all of its own. For the most part, traditional RAID is a kind of whole disk tax, but while this holds true for many vendors, there are examples like LeftHand and NetApp who mix RAID and cross-site replication, and others who use de-clustered RAID schemes where RAID groups are not built out of whole disks. Because of that, and because it's such a well known aspect of storage efficiency, I think it deserves a category of its own. Given its close ties to the physical disk infrastructure, I feel that it should also be measured in "Raw" capacity (Base-10) units, just like the physical disks it protects.

Definition

The amount of "Raw" capacity allocated to RAID Protection.

e.g. a “5+1” RAID-5 group made up of six 300GB disks has 300GB of capacity allocated to RAID protection, whereas a RAID-10 group made up of the same six 300GB disks has 900GB of capacity allocated to RAID protection.
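To make the arithmetic explicit, here's a small Python sketch covering the six-disk example above, plus a dual parity group for comparison (illustrative only, not how any array actually reports this):

# Raw capacity allocated to RAID protection for a single RAID group (illustrative).

def raid_tax_gb(disks, disk_gb, layout):
    if layout == "raid5":               # single parity: one disk's worth of capacity
        return disk_gb
    if layout in ("raid6", "raid-dp"):  # dual parity: two disks' worth
        return 2 * disk_gb
    if layout == "raid10":              # mirroring: half of the raw capacity
        return disks * disk_gb // 2
    raise ValueError(f"unknown layout: {layout}")

for layout in ("raid5", "raid-dp", "raid10"):
    print(layout, raid_tax_gb(6, 300, layout), "GB")
# raid5 300 GB, raid-dp 600 GB, raid10 900 GB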

Measurement Units

SI Gigabytes or SI Terabytes i.e. 1 Gigabyte = 1,000,000,000 bytes

Rightsize Tax

Explanation

Again, after a fair amount of thought, I'm combining a couple of things into the rightsizing tax: the first is "homogenizing" the disk drives so that all disks of a certain "size" end up having exactly the same number of usable blocks, and the other is converting from 512 to 520 byte sectors.

Homogenizing the disks

If you look at the following output from the OnTap command sysconfig -r, you'll see an entry for one kind of "144GB" drive

Device            Used (MB/blks)    Phys (MB/blks)
------------ ...  --------------    --------------
2a.20             136000/278528000  138959/284589376

The thing I'd like to focus on for the moment is the number of physical blocks reported, which is the 284589376 on the far right-hand side. That is the number of 520 byte formatted sectors reported by that particular drive type. If you do the math, you'll see that this "144GB drive" actually has 147986475520 bytes, so it's very nearly a 148GB drive. So how big is this drive? 147.98GB, as it reports? 146GB, as EMC and many other vendors would sell it? Or 144GB, as NetApp would sell it?

The answer is none of the above, at least from a NetApp perspective. What we do is standardize every drive by saying we will only use 278528000 of those 520 byte blocks, regardless of how many might actually be on there. This works out to 144834560000 bytes, or 144.83GB raw, which is why we sell our drives as 144GB drives. I'm pretty sure that many other vendors do similar kinds of rightsizing; it allows them to get drives from multiple manufacturers and provides some resilience to slight changes in technology within the same drive vendor. As a customer, this rightsizing simplifies purchasing and design decisions.
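For anyone who wants to check the arithmetic, here's the sum in Python (nothing NetApp ships, just the numbers from the paragraphs above):

# Reproducing the rightsizing numbers above.

SECTOR_BYTES = 520                 # 520-byte formatted sectors
PHYSICAL_SECTORS = 284_589_376     # sectors this particular drive reports
RIGHTSIZED_SECTORS = 278_528_000   # sectors OnTap will actually use

physical_bytes = PHYSICAL_SECTORS * SECTOR_BYTES      # 147,986,475,520 bytes (~148GB)
rightsized_bytes = RIGHTSIZED_SECTORS * SECTOR_BYTES  # 144,834,560,000 bytes (144.83GB)

print(f"physical:   {physical_bytes:,} bytes")
print(f"rightsized: {rightsized_bytes:,} bytes")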

The next thing you'll notice is that I've just said this disk reports as being 147.98GB, so how does that work out to 138959 MB? There are two reasons. The first is that although each sector is 520 bytes long, 8 bytes of each sector are reserved for checksum and data integrity purposes. The other reason is that the MB column is actually a Base-2 number, not Base-10, and IMHO should read MiB, but more on that later. The same factors also explain how we get 136000MB (really MiB) out of the 144GB "rightsized" capacity. So what's up with this checksum overhead?

Adding in checksum information

Most storage vendors use some form of checksums to improve data integrity at the block level. For most vendors this involves reformatting the disks to change the sector size from 512 bytes to 520 bytes. For SATA disks, where the sector size is fixed and cannot be changed, a variety of other techniques, such as slip mask checksums or zoned checksums, can be used with varying capacity and performance tradeoffs.

NetApp has two different methods of adding checksum information. The first approach we used was called Zone Check Sums (ZCS); the now generally preferable method is called Block Check Sums (BCS). There has been a bit of confusion about both of these approaches, and numerous explanations. One of the best can be found in a response by Steve Strange on John Toigo's blog (http://www.drunkendata.com/?p=385), which I've edited slightly for readability and included below.

“ZCS works by taking every 64th 4K block in the filesystem and using it to store checksums for the preceding 63 4K blocks. We originally did it this way so we could do on-the-fly upgrades of WAFL volumes (from not-checksum-protected to checksum-protected). Clearly, reformatting each drive from 512-byte sectors to 520 would not make for an easy, on-line upgrade. One of the primary drawbacks of ZCS is performance, particularly on reads. Since the data does not always live adjacent to its checksum, a 4K read from WAFL often turns into two I/O requests to the disk. Thus was born the NetApp 520-byte-formatted drive and Block Checksums (BCS), which is the preferred checksum method. Note that a volume cannot use a combination of both methods; a volume is either ZCS or BCS.

When ATA drives came along, we were stuck with 512-byte sectors. But we wanted to use BCS for performance reasons. So rather than going back to using ZCS, we use what we call an “8/9ths” scheme down in the storage layer of the software stack (underneath RAID). Every 9th 512-byte sector is deemed a checksum sector that contains checksums for each of the previous 8 512-byte sectors (which make up a single 4K WAFL block). This scheme allows RAID to treat the disk as if it were formatted with 520-byte sectors, and therefore these are considered BCS drives. And because the checksum data lives adjacent to the data it protects, a single disk I/O can read both the data and checksum, so it really does perform similarly to a 520-byte sector FC drive (modulo the fact that ATA drives have slower seek times and data transfer/rotational speeds).”

Now, one thing about BCS is that for FC drives you lose around 1% of your available space to checksums; on SATA, which uses the 8/9ths BCS scheme, that figure is a little over 11% (ouch!). If we use ZCS, the loss from checksums is a little under 2%, which is consumed from the WAFL reserve (explained later), so the net loss is zero. So why do we bother? Why add another 11% tax if it's not necessary?
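Before getting to the answer, here's where those percentages come from; this is my own back-of-the-envelope arithmetic, not anything out of NetApp documentation:

# Approximate checksum overheads for each scheme (illustrative only).

bcs_fc = 8 / 520    # BCS on FC: 8 of every 520 bytes, roughly 1.5% of formatted capacity
bcs_sata = 1 / 9    # BCS "8/9ths" on SATA: every 9th sector, the "little over 11%" above
zcs = 1 / 64        # ZCS: every 64th 4K block, the "little under 2%" above

for name, frac in [("BCS (FC)", bcs_fc), ("BCS 8/9ths (SATA)", bcs_sata), ("ZCS", zcs)]:
    print(f"{name:20s} {frac:.1%}")
# BCS (FC)             1.5%
# BCS 8/9ths (SATA)    11.1%
# ZCS                  1.6%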

Well, for a start, BCS is one of the technologies that helps us maintain a high level of performance, including high performance for FAS deduplication. This means that while you might lose 11% to BCS on SATA, you'll probably save a lot more than that through dedupe, so overall you're ahead of the game. Another thing about maintaining high performance on SATA is that when you combine what can be thought of as "wide striping" via flexvols on large aggregates, high performance dual parity RAID, and intelligent caching, we can start using SATA for workloads previously reserved for RAID-10 on high speed FC drives. It's this combination of efficiency technologies that makes the big difference, but more on that later.

The other reason for using BCS is that we store a lot of interesting metadata inside those 8 bytes, not just CRC checksums. That metadata allows us to do some cool things, the first of which is "lost write protection", something that is, as far as I'm aware, unique to OnTap. I'm going to quote Steve here again, as this is one of the better explanations of it.

“Though it is rare, disk drives occasionally indicate that they have written a block (or series of blocks) of data, when in fact they have not. Or, they have written it in the wrong place! Because we control both the filesystem and RAID, we have a unique ability to catch these errors when the blocks are subsequently read. In addition to the checksum of the data, we also store some WAFL metadata in each checksum block, which can help us determine if the block we are reading is valid. For example, we might store the inode number of the file containing the block, along with the offset of that block in the file, in the checksum block. If it doesn’t match what WAFL was expecting, RAID can reconstruct the data from the other drives and see if that result is what is expected. With RAID-DP, this can be done even if a disk is currently missing!”

We’ve also found ways of leveraging this lost write capability to safely and transparently move blocks from one part of a disk to another part of the same or even a completely different disk. This is used in a number of ways to maintain and improve the performance of a FAS via mechanisms such as the read_realloc volume option.
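To make the idea concrete, here's a purely hypothetical sketch of how checksum-block metadata can catch a lost or misplaced write. The field names and logic are my own invention for illustration, not WAFL's actual on-disk format:

# Hypothetical illustration of lost-write detection via checksum-block metadata.
# These fields and checks are invented for clarity, NOT the real WAFL/RAID format.

import zlib
from dataclasses import dataclass

@dataclass
class ChecksumBlock:
    crc: int          # checksum of the 4K data block
    inode: int        # which file the block belongs to
    offset: int       # block offset within that file

def verify_read(data, cs, expected_inode, expected_offset):
    """Return True if the block is intact AND is the block the filesystem expected."""
    if zlib.crc32(data) != cs.crc:
        return False  # corrupted or torn write: data doesn't match its checksum
    if (cs.inode, cs.offset) != (expected_inode, expected_offset):
        return False  # lost or misplaced write: valid-looking data, wrong block
    return True

# If verify_read() fails, RAID can reconstruct the block from the other drives
# in the group (even with a disk missing, under RAID-DP) and compare the result.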

Definition

The amount of "Raw" capacity reserved by the array for the purposes of homogenizing and checksumming disks.

Data Layout Taxes

Modern storage arrays all have some form of data layout engine, where the storage that is presented to a host as a single logical LUN is assigned to a number of physical disks within the array. The methods for doing so can be broadly categorized in the following ways, as defined in the SNIA dictionary:

Algorithmic mapping

If a volume is algorithmically mapped, the physical location of a block of data may be calculated from its virtual volume address using known characteristics of the volume (e.g., stripe depth and number of member disks).

Dynamic mapping

A form of mapping in which the correspondence between addresses in the two address spaces can change over time.

Tabular mapping

A form of mapping in which a lookup table contains the correspondence between the two address spaces being mapped to each other. If a mapping between two address spaces is tabular, there is no mathematical formula that will convert addresses in one space to addresses in the other. 
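As a rough illustration of the difference between the first and last of these, here's a toy example of my own (not any vendor's implementation):

# Toy contrast of algorithmic vs tabular mapping (not any real array's code).

STRIPE_DEPTH_BLOCKS = 64   # blocks written to one disk before moving to the next
MEMBER_DISKS = 8

def algorithmic_map(virtual_block):
    """Compute (disk, physical_block) purely from the volume geometry."""
    chunk = virtual_block // STRIPE_DEPTH_BLOCKS
    disk = chunk % MEMBER_DISKS
    physical_block = (chunk // MEMBER_DISKS) * STRIPE_DEPTH_BLOCKS + virtual_block % STRIPE_DEPTH_BLOCKS
    return disk, physical_block

# Tabular mapping: the correspondence lives in a lookup table, and that table
# has to be stored somewhere -- which is exactly the metadata tax discussed below.
block_map = {}

def tabular_map(virtual_block):
    return block_map[virtual_block]            # no formula; whatever the table says

def tabular_remap(virtual_block, new_location):
    block_map[virtual_block] = new_location    # dynamic: the mapping can change over time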

Thoughts

Most array vendors are beginning to move towards dynamic and tabular mapping methods, because these allow them to provide functionality such as thin provisioning and to non-disruptively allocate more spindles to existing workloads. While this is relatively new for most array vendors, NetApp has been using tabular mapping since its inception. This allows us to make some substantial savings later on; however, keeping this table metadata requires that some disk capacity be dedicated to the system. In NetApp's case, the amount of space reserved is 10% of the rightsized disk capacity, as this provides space for the metadata and also makes the write allocator's job a lot easier (and hence faster).
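Carrying on the "144GB" drive example from earlier, here's roughly what that 10% means per drive. I'm assuming the reserve is taken from the post-checksum capacity, so treat the exact base as an approximation:

# Rough arithmetic for the 10% mapping/metadata reserve (illustrative only).

RIGHTSIZED_SECTORS = 278_528_000   # 520-byte sectors OnTap will use
DATA_BYTES_PER_SECTOR = 512        # 8 of the 520 bytes go to BCS checksums/metadata
RESERVE_FRACTION = 0.10            # reserved for mapping metadata / write allocation

data_bytes = RIGHTSIZED_SECTORS * DATA_BYTES_PER_SECTOR
reserve_bytes = data_bytes * RESERVE_FRACTION

print(f"after checksums: {data_bytes / 2**20:,.0f} MiB")                    # ~136,000 MiB
print(f"10% reserve:     {reserve_bytes / 2**20:,.0f} MiB")                 # ~13,600 MiB
print(f"left for data:   {(data_bytes - reserve_bytes) / 2**20:,.0f} MiB")  # ~122,400 MiB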

I've seen a lot of announcements regarding thin/dynamic/virtual provisioning from various vendors, but little if any disclosure on how much space needs to be reserved to keep the mapping information. It's possible the overheads are negligible, but in the interests of full disclosure and apples-for-apples comparisons, I think this is a general category that should be included, or explicitly excluded, in any efficiency comparisons or claims that are made.

Just because approaches that provide a mathematical formula for finding the location of a requested block have no need to keep metadata doesn't mean they are excluded from this category. The requirement to have rigidly defined stripe sizes, widths and so on often leads to areas of storage that cannot be allocated to users. Any wastage/tax required for algorithmic mapping should also be included.

Definition

The amount of storage hidden from the user by the array in the process of creating a map between the physical storage on the array and the logical storage presented to the users.

Measurement Units

Either SI units (GB / TB, Base-10) or IEC units (GiB / TiB, Base-2), depending on whichever makes for the fairest comparison or most understandable calculation, provided that the units are explicitly stated.

No More Taxes

OK, no more taxes. From here on I'll start talking about what we can do with the high performing, highly available, self-repairing storage that array vendors create out of the same disk drives that you can buy at your local computer store. If there are taxes you think I've left out, or areas where I've been unclear, please let me know.

Regards

John Martin

Consulting Systems Engineer - ANZ

Comments

I never liked the word "overhead", as it makes it seem like something is taken without any return.

So "checksum overheads" doesn't have a nice sound to it, while checksums are a very, very valuable component in the overall data protection realm. Like you said, it is a "tax", something we may not like but does have an intrinsic value (which you explained very well BTW).

Checksum "reserve" would be the next best term IMHO. Same goes for the "wastage" reference made later on. Maybe it's nitpicking, but just my $0.02...

Great series of posts, can't wait for the rest of 'em.

Thanks Geert,
I think I've read and said "checksum overhead" so often that I say it unconsciously. Reserve is a pretty good word, but I have plans for keeping the word "reserve" for something the storage admin or user controls and sets. If I'm going to try and keep my terminology consistent I should probably call it a tax.

I should probably edit out wastage too, because it's emotive and not particularly descriptive. The trouble is that I don't see what benefit the customer derives from those pieces of unusable storage in traditional algorithmic mapping methods. Maybe once I've done some of the worked examples I've got planned it will become clearer.
