XFS

From Wikitech

XFS and our servers

This page describes how we format our XFS partitions and why.

How they're formatted

root@db1047:/a/sqldata# xfs_info /dev/sda6
meta-data=/dev/sda6              isize=256    agcount=4, agsize=109108672 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=436434688, imaxpct=5
         =                       sunit=64     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=213120, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
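As a sanity check on the geometry above (an illustrative sketch, not part of the original tooling): in the data section of `xfs_info`, sunit and swidth are reported in filesystem blocks (bsize=4096), which works out to a 256 KiB stripe unit and a 2 MiB stripe width, i.e. 8 data spindles:

```python
# Interpret the xfs_info data-section geometry shown above.
# In that section, sunit/swidth are expressed in filesystem blocks (bsize).
bsize = 4096           # bytes per filesystem block (bsize=4096)
sunit_blocks = 64      # stripe unit, in blocks (sunit=64)
swidth_blocks = 512    # stripe width, in blocks (swidth=512)

sunit_bytes = sunit_blocks * bsize     # 262144 bytes = 256 KiB per disk
swidth_bytes = swidth_blocks * bsize   # 2097152 bytes = 2 MiB full stripe
data_disks = swidth_blocks // sunit_blocks  # 8 data spindles

print(sunit_bytes // 1024, "KiB stripe unit across", data_disks, "data disks")
```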

How they get formatted

fenari:/home/midom/xfsfix is a Python script that determines the correct device and UUID names, then emits executable bash that looks something like this:

root@db1047:/a/sqldata# python /root/xfsfix
umount /dev/sda6
mkfs.xfs -f -d sunit=512,swidth=4096 -L data /dev/sda6
xfs_admin -U f1363f7d-8a44-4abe-9e38-bf2171e265c8 /dev/sda6
mount /dev/sda6

You'll notice that the sunit and swidth numbers put out by the script don't look like what xfs_info prints. This is a unit difference: mkfs.xfs takes sunit/swidth in 512-byte sectors, while xfs_info reports them in 4096-byte filesystem blocks, so both describe the same 256 KiB stripe unit. In the conversation below, Domas initially guesses that the script's numbers are too large, then notes the sector units and concludes that the resulting alignment is acceptable as is.
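The two sets of numbers can be reconciled with a quick check (illustrative, not from the original page): mkfs.xfs's `-d sunit=512,swidth=4096` is in 512-byte sectors, and xfs_info's `sunit=64 swidth=512 blks` is in 4096-byte blocks, so they describe the same layout:

```python
SECTOR = 512   # mkfs.xfs -d sunit/swidth are given in 512-byte sectors
BSIZE = 4096   # xfs_info's data section reports sunit/swidth in blocks

mkfs_sunit, mkfs_swidth = 512, 4096  # from the xfsfix output
info_sunit, info_swidth = 64, 512    # from xfs_info

assert mkfs_sunit * SECTOR == info_sunit * BSIZE    # both 256 KiB stripe unit
assert mkfs_swidth * SECTOR == info_swidth * BSIZE  # both 2 MiB stripe width
print("mkfs.xfs and xfs_info agree: 256 KiB sunit, 2 MiB swidth")
```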

Why they're formatted this way

The reasoning is preserved in this IRC conversation between maplebed, Jeff_Green, and domas:

[11:56 AM] <maplebed> so domas any thoughts on that pastebin?
[11:57 AM] <domas> hmmmm
[11:57 AM] <domas> *shrug*
[11:57 AM] • domas looks some more
[11:57 AM] <maplebed> Jeff_Green notices that the sunit=64 and swidth=512 is also present on db26
[11:57 AM] <domas> not on other machines?
[11:57 AM] <maplebed> (in the quest to see what's "right" that seems like a good place to start)
[11:57 AM] <domas> db26 is LVM
[11:57 AM] <maplebed> I haven't looked at other machines yet.
[11:58 AM] <maplebed> db42 is the same...
[11:59 AM] <domas> I guess I just have too high numbers there
[11:59 AM] <domas> it is not in bytes but in 512b sectors
[12:00 PM] <maplebed> not blocks? (which are set to 4096)?
[12:01 PM] <domas> pain oh pain
[12:01 PM] <Jeff_Green> ha
[12:01 PM] <domas> 'sectors' is usually in 512 in linux
[12:02 PM] <domas> 512*512 is 256k alignment
[12:02 PM] <maplebed> at any rate, I've got to run; if you think the current settings are fine I'll update the docs.
[12:02 PM] <domas> they are good enough
[12:02 PM] <domas> 32k alignment is good enough too
[12:02 PM] <domas> the major thing is not to have 16k partitioned
[12:02 PM] <domas> meh, we're talking 10% perf here
[12:03 PM] <domas> and we're not overloading i/o anyway
[12:03 PM] <Jeff_Green> domas: could you email/wiki/something us some notes on your tweaks?
[12:03 PM] <domas> jeff_green: there're not that many!
[12:03 PM] <domas> but I can try!
[12:03 PM] <Jeff_Green> i saw we tweak raid-related stripe stuff only?
[12:04 PM] <Jeff_Green> at CL we ended up tweaking only agcount (to 32) and the usual mount options, curious what/why you tweak
[12:04 PM] <domas> jeff_green a/g doesn't matter much, we have just one file that is big enough =)
[12:05 PM] <domas> jeff_green: stripe alignment is to avoid multiple reads for one block
[12:08 PM] <Jeff_Green> how does that interact with striping in hardware RAID?
[12:10 PM] <domas> well
[12:10 PM] <domas> if you don't align files on stripe boundaries
[12:10 PM] <domas> if a file is made out of 16k blocks
[12:10 PM] <domas> and you have 64k stripe
[12:10 PM] <domas> and it is not aligned
[12:11 PM] <domas> so, 25% of blocks will need two I/Os instead of one
[12:11 PM] <domas> because the block will reside on two disks
[12:11 PM] <domas> now, if you align them all on stripe boundary, all blocks are residing just on one disk
[12:11 PM] <domas> (I'm not counting mirrors)
[12:12 PM] <domas> back in the day it was much more painful, as we had to align partitions too
[12:12 PM] <domas> (which is what xfsfix was mostly for)
[12:12 PM] <domas> we were editing partitiontable with xfsfix before
[12:12 PM] <Jeff_Green> ok. I'm going to apply this to db1040 as a comprehension exercise
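Domas's 25% figure can be reproduced with a little arithmetic (an illustrative sketch, not part of the original page): with 16 KiB blocks on a 64 KiB stripe, a misaligned file puts one block in four across a stripe-unit boundary, and each such block costs two disk I/Os instead of one.

```python
def split_fraction(block=16 * 1024, stripe=64 * 1024, shift=0, n=1024):
    """Fraction of consecutive `block`-sized reads that straddle a stripe
    boundary when the file starts `shift` bytes past a stripe boundary."""
    split = 0
    for i in range(n):
        start = shift + i * block
        end = start + block - 1
        if start // stripe != end // stripe:  # block spans two stripe units
            split += 1
    return split / n

print(split_fraction(shift=4 * 1024))  # misaligned: 0.25 -> 25% take two I/Os
print(split_fraction(shift=0))         # stripe-aligned: 0.0
```

Aligning files on stripe boundaries drops the fraction to zero, which is exactly what the sunit/swidth settings tell XFS to do.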