Labs NFS

NFS is served to eqiad labs from one of two servers (labstore1001 and labstore1002), which are connected to a set of five MD1200 disk shelves.

== Hardware setup ==

Each server is (ostensibly, see below) connected to all five shelves, with three shelves on one port of the controller and two shelves on the other. Each shelf holds 12 1.8TB SAS drives, and the controller is configured to expose each of them to the OS as a single-disk raid 0 (the H800 controller does not support a true JBOD configuration). In addition, both servers have (independently) 12 more 1.8TB SAS drives in their internal bays.

The shelves have been disconnected from labstore1001 since the July outage, as we no longer trust the OS not to attempt to assemble the raid arrays on both servers simultaneously; the connection is intended to be restored once SCSI reservations have been tested.

The internal disks are visible to the OS as <code>/dev/sda</code> to <code>/dev/sdl</code>, and the shelves' disks are <code>/dev/sdm</code> to <code>/dev/sdbt</code>. (A quick early diagnostic is visible at the end of POST as the PERCs start up; normal operation should report 72 exported disks.)
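
As a quick sanity check from the OS side, the visible disks can be counted with standard tools; this is an illustrative sketch based on the numbers above (12 internal drives plus 60 shelf drives), not an official procedure:

<syntaxhighlight lang="bash">
# Count the sd* block devices the kernel currently sees (internal bays plus shelves).
# With all five shelves connected this should come to 72 (12 internal + 60 shelf disks).
lsblk -d -n -o NAME | grep -c '^sd'

# For a closer look, list each disk with its size and the controller-reported model.
lsblk -d -n -o NAME,SIZE,MODEL | grep '^sd'
</syntaxhighlight>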

== Software RAID ==

The external shelves are configured as raid10 arrays of 12 drives, constructed from six drives on one shelf, and six drives on a different shelf (such that no single raid10 array relies on any one shelf). MD numbering is not guaranteed to be stable between boots, but the current arrays are normally numbered <code>md122</code>-<code>md126</code>.

When the raid arrays were originally constructed, they were named arbitrarily according to the order in which they were connected (since, at the time, each shelf was a self-contained raid6 array) as <code>shelf1</code>-<code>shelf4</code> matching labstore1001-shelf1 to labstore1001-shelf4. When a fifth shelf was installed, requiring a split between two ports, labstore1001-shelf4 was renamed to labstore1002-shelf1 and the new shelf was added as labstore1002-shelf2 (and named <code>shelf5</code>).

This naming was kept conceptually when the raids were converted to raid 10:

* <code>/dev/md/shelf32</code> (First 6 drives of <code>shelf3</code>, last 6 drives of <code>shelf2</code>)
* <code>/dev/md/shelf23</code> (First 6 drives of <code>shelf2</code>, last 6 drives of <code>shelf3</code>)
* <code>/dev/md/shelf51</code> (First 6 drives of <code>shelf5</code>, last 6 drives of <code>shelf1</code>)
* <code>/dev/md/shelf15</code> (First 6 drives of <code>shelf1</code>, last 6 drives of <code>shelf5</code>)
* <code>/dev/md/shelf44</code> (All 12 drives of <code>shelf4</code> (6-6))
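
The membership of any of these arrays can be checked with mdadm. The sketch below is illustrative only: the array name follows the list above, but the drive letters in the commented creation command are made up and do not reflect the real device mapping:

<syntaxhighlight lang="bash">
# Show all currently assembled md arrays and their member devices.
cat /proc/mdstat

# Detail for one of the stably named shelf arrays; the output lists the twelve
# member drives (six from each of the two shelves involved).
mdadm --detail /dev/md/shelf32

# For reference, an array of this shape would originally have been created with
# something along these lines (the drive letters are purely illustrative):
# mdadm --create /dev/md/shelf32 --level=10 --raid-devices=12 \
#     /dev/sd{m..r} /dev/sd{s..x}
</syntaxhighlight>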

There is one shelf that is known to have had issues with the controller on labstore1002 (<code>shelf4</code>, above), which was avoided in the current setup and is not currently used.

In addition, the first two drives of the internal bay are configured as a raid1 (<code>md0</code>) for the OS.

== LVM ==

Each shelf array is configured as an LVM physical volume, and pooled in the <code>labstore</code> volume group, from which all shared volumes are allocated.
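
The resulting layout can be inspected with the standard LVM tools; a minimal sketch, using only the <code>labstore</code> name mentioned above:

<syntaxhighlight lang="bash">
# Physical volumes: each shelf raid10 array should appear as one PV
# belonging to the labstore volume group.
pvs

# Summary of the volume group itself (total size, free extents, PV/LV counts).
vgs labstore

# Logical volumes carved out of it, including any backup snapshots.
lvs labstore
</syntaxhighlight>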

There is still a <code>backup</code> volume group made up of the internal drives of labstore1002 (not counting the OS-allocated drives) that holds old images, but that VG is not in active use anymore.

The <code>labstore</code> volume group contains four primary logical volumes:

* <code>labstore/tools</code>, shared storage for the tools project
* <code>labstore/maps</code>, shared storage for the maps project
* <code>labstore/others</code>, containing storage for all other labs projects
* <code>labstore/scratch</code>, containing the labs-wide scratch storage
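
Creating an additional project-specific volume would look roughly like the sketch below; the project name, size and filesystem type are illustrative assumptions rather than values taken from the live setup (the mountpoint and fstab conventions are described next):

<syntaxhighlight lang="bash">
# Carve a new logical volume out of the labstore volume group
# ("myproject" and the 2T size are hypothetical).
lvcreate -L 2T -n myproject labstore

# Put a filesystem on it (ext4 is an assumption; use whatever the other
# shared volumes actually use) and mount it under /srv/project/.
mkfs.ext4 /dev/labstore/myproject
mkdir -p /srv/project/myproject
mount /dev/labstore/myproject /srv/project/myproject
</syntaxhighlight>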

Conceptually, the volumes are mounted under <code>/srv/{project,others}/$project</code>, with <code>/srv/others</code> being the mountpoint of the "others" volume and the project-specific volumes mounted under <code>/srv/project/</code>; this is configured in <code>/etc/fstab</code>, which must be adjusted accordingly if new project-specific volumes are made.
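
For illustration, an <code>/etc/fstab</code> entry following that convention might look like the commented line below; the filesystem type and mount options are assumptions, not copied from the live servers:

<syntaxhighlight lang="bash">
# Hypothetical fstab line for a project-specific volume (filesystem type and
# mount options are assumptions):
#   /dev/labstore/tools  /srv/project/tools  ext4  defaults,noatime  0  0

# Show everything currently mounted below /srv, including any bind mounts.
mount | grep ' /srv'
</syntaxhighlight>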

In addition to the shared storage volumes, the volume group also contains transient snapshots made during the backup process.

== NFS Exports ==

NFS version 4 serves its exports from a single, unified tree (<code>/exp/</code> in our setup). This tree is populated with bind mounts of the various subdirectories of <code>/srv</code>, and is kept in sync with changes there by the <code>/usr/local/sbin/sync-exports</code> script. This is matched with the actual NFS exports in <code>/etc/exports.d</code>, one file per project.
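
Each file under <code>/etc/exports.d</code> pairs one of the bind-mounted directories with the clients allowed to reach it. The excerpt below is a hypothetical example of the format only (the client network and option list are assumptions), followed by the standard command to re-read the export table:

<syntaxhighlight lang="bash">
# Hypothetical /etc/exports.d/tools.exports entry; the client network and the
# option list are illustrative assumptions, not the production values:
#   /exp/project/tools  10.0.0.0/8(rw,sync,no_subtree_check)

# After editing files in /etc/exports.d, re-read the export table:
exportfs -ra
</syntaxhighlight>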

One huge caveat that needs to be noted: it is imperative that sync-exports be executed before NFS is started, as this sets up the actual filesystems to be exported (through the bind mounts). If NFS is started before that point, any NFS client will notice the changed root inode and will remain stuck with "stale NFS handle" errors until it is rebooted (whereas clients should otherwise be able to recover from any outage, since all NFS mounts are hard).

Actual NFS service is provided through a service IP (distinct from the servers' own addresses), which is set up by the start-nfs script as the last step before the NFS server itself is started; this allows the IP to be moved to whichever server is the active one. Provided that the same filesystems are presented, the clients will not even notice the interruption in service.
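
Put together, bringing service up on the active server conceptually follows the sequence sketched below. This is not the contents of the actual start-nfs script; the IP address, interface and NFS unit name are placeholders:

<syntaxhighlight lang="bash">
# 1. Populate /exp/ with the bind mounts so the exported tree is complete
#    *before* the NFS server starts (see the caveat above).
/usr/local/sbin/sync-exports

# 2. Bring up the floating service IP on the active server
#    (address and interface are placeholders, not the real values).
ip addr add 192.0.2.10/24 dev eth0

# 3. Finally start the NFS server itself (the unit name is an assumption;
#    it may differ on the actual hosts).
systemctl start nfs-kernel-server
</syntaxhighlight>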

== Backups ==

Backups are handled through systemd units, invoked by timers. Copies are made by (a) making a snapshot of the filesystem, (b) mounting it read-only, and (c) doing an rsync to codfw's labstore to update that copy.
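
Those three steps translate roughly into the commands below. This is a sketch only: the snapshot name and size, the mountpoint, and the destination host are assumptions, and the real units may use different options:

<syntaxhighlight lang="bash">
# (a) Snapshot the live filesystem (snapshot name and reserved size are made up).
lvcreate --snapshot -L 500G -n tools-snap labstore/tools

# (b) Mount the snapshot read-only.
mkdir -p /mnt/tools-snap
mount -o ro /dev/labstore/tools-snap /mnt/tools-snap

# (c) rsync the snapshot to the codfw labstore (hostname and path are placeholders).
rsync -a --delete /mnt/tools-snap/ labstore-codfw.example:/srv/tools/
</syntaxhighlight>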

Every "true" filesystem is copied daily through the replicate-maps, replicate-tools and replicate-others units, one for each respectively named filesystem. The snapshots are kept until they fill up, and are cleaned up by the cleanup-snapshots-labstore unit.
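
The state of the timers and their most recent runs can be checked with systemctl; the <code>.service</code> suffixes below are assumed, since only the unit base names are given above:

<syntaxhighlight lang="bash">
# When did each replication timer last fire, and when is the next run due?
systemctl list-timers | grep replicate

# Result of the most recent run of one of the replication units.
systemctl status replicate-tools.service

# Same for the snapshot cleanup unit.
systemctl status cleanup-snapshots-labstore.service
</syntaxhighlight>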

There are icinga alerts that fire if any of those units has not run (successfully) in the past 25 hours.