Obsolete:Labs NFS
Revision as of 14:30, 16 September 2015
NFS is served to eqiad labs from one of two servers (labstore1001 and labstore1002) which are connected to a set of five MD1200 disk shelves.
Hardware setup
Each server is (ostensibly, see below) connected to all five shelves, with three shelves on one port of the controller and two on the other. Each shelf holds 12 1.8TB SAS drives, and the controller is configured to expose them to the OS as single-disk raid 0 volumes (the H800 controller does not support a true JBOD configuration). In addition, both servers have (independently) 12 more 1.8TB SAS drives in their internal bays.
The shelves have been disconnected from labstore1001 since the July outage, as we no longer trust the OS not to attempt to assemble the raid arrays on both servers simultaneously; they are intended to be reconnected once SCSI reservation has been tested.
The internal disks are visible to the OS as /dev/sda to /dev/sdl, and the shelves' disks as /dev/sdm to /dev/sdbt. (A quick early diagnostic is visible at the end of POST as the PERCs start up; normal operation should report 72 exported disks.)
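A quick from-the-shell version of that sanity check can be sketched as below; the device-name globs and the expected total of 72 come from the description above, and the helper only reports a count rather than taking any action.

```shell
# Sanity check: count the single-disk raid 0 exports visible to the OS and
# compare against the 72 disks this setup should present (5 shelves x 12
# plus 12 internal). The count is passed as an argument so the expectation
# stays explicit and the check is easy to reuse.
check_disks() {
  if [ "$1" -eq 72 ]; then
    echo "OK: all 72 exported disks visible"
  else
    echo "MISMATCH: $1 of 72 disks visible"
  fi
}

# /dev/sd? matches sda-sdl style names; /dev/sd?? matches sdaa-sdbt style.
check_disks "$(ls /dev/sd? /dev/sd?? 2>/dev/null | wc -l)"
```

A count below 72 usually points at a cabling or controller problem on one of the shelf chains.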
Software RAID
The external shelves are configured as raid10 arrays of 12 drives, constructed from six drives on one shelf and six drives on a different shelf (such that no single raid10 array relies on any one shelf). MD numbering is not guaranteed to be stable between boots, but the current arrays are normally numbered md122-md126.
When the raid arrays were originally constructed, they were named arbitrarily according to the order in which they were connected (since, at the time, each shelf was a self-contained raid6 array) as shelf1 to shelf4, matching labstore1001-shelf1 to labstore1001-shelf4. When a fifth shelf was installed, requiring a split between the two ports, labstore1001-shelf4 was renamed to labstore1002-shelf1 and the new shelf was added as labstore1002-shelf2 (and named shelf5).
This naming was kept conceptually when the raids were converted to raid 10:

* /dev/md/shelf32 (first 6 drives of shelf3, last 6 drives of shelf2)
* /dev/md/shelf23 (first 6 drives of shelf2, last 6 drives of shelf3)
* /dev/md/shelf51 (first 6 drives of shelf5, last 6 drives of shelf1)
* /dev/md/shelf15 (first 6 drives of shelf1, last 6 drives of shelf5)
* /dev/md/shelf44 (all 12 drives of shelf4, 6-6)
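Since mdN numbering is not boot-stable, the named /dev/md/shelfXY devices are the ones to rely on; those names can be pinned in mdadm.conf. A hypothetical sketch follows — the UUIDs are placeholders, and the real lines come from running mdadm --detail --scan on the server itself:

```
# /etc/mdadm/mdadm.conf -- illustrative only; obtain the real ARRAY lines
# with `mdadm --detail --scan` rather than copying these placeholder UUIDs.
ARRAY /dev/md/shelf32 metadata=1.2 name=shelf32 UUID=00000000:00000000:00000000:00000001
ARRAY /dev/md/shelf23 metadata=1.2 name=shelf23 UUID=00000000:00000000:00000000:00000002
ARRAY /dev/md/shelf51 metadata=1.2 name=shelf51 UUID=00000000:00000000:00000000:00000003
ARRAY /dev/md/shelf15 metadata=1.2 name=shelf15 UUID=00000000:00000000:00000000:00000004
ARRAY /dev/md/shelf44 metadata=1.2 name=shelf44 UUID=00000000:00000000:00000000:00000005
```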
There is one shelf that is known to have had issues with the controller on labstore1002 (shelf4, above); it was avoided in the current setup and is not currently used.
In addition, the first two drives of the internal bay are configured as a raid1 (md0) for the OS.
LVM
Each shelf array is configured as an LVM physical volume and pooled in the labstore volume group, from which all shared volumes are allocated.
There is still a backup volume group containing the internal drives of labstore1002 (not counting the OS-allocated drives) that holds old images, but that VG is not in active use anymore.
The labstore volume group contains four primary logical volumes:
* labstore/tools, shared storage for the tools project
* labstore/maps, shared storage for the maps project
* labstore/others, containing storage for all other labs projects
* labstore/scratch, containing the labs-wide scratch storage
Conceptually, the volumes are mounted under /srv/{project,others}/$project, with /srv/others being the mountpoint of the "others" volume and the project-specific volumes mounted under /srv/project/; this is configured in /etc/fstab and must be adjusted accordingly if new project-specific volumes are made.
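The steps for carving out a new project-specific volume can be sketched as follows. The project name, size, and mount options are hypothetical, and the function deliberately prints the commands instead of running them, since the real steps need root and the actual labstore volume group:

```shell
# Sketch: provisioning a new project-specific volume in the labstore VG.
# Prints each step (lvcreate, mkfs, mountpoint, fstab line) rather than
# executing it; "example" and "8T" below are placeholders.
new_project_volume() {
  proj="$1"; size="$2"
  echo "lvcreate -L $size -n $proj labstore"
  echo "mkfs.ext4 /dev/labstore/$proj"
  echo "mkdir -p /srv/project/$proj"
  echo "/dev/labstore/$proj /srv/project/$proj ext4 defaults 0 0  # append to /etc/fstab"
}

new_project_volume example 8T
```

The last printed line is the /etc/fstab entry the text says must be kept up to date whenever a new volume is made.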
In addition to the shared storage volume, the volume group also contains transient snapshots made during the backup process.
NFS Exports
NFS version 4 exports from a single, unified tree (/exp/ in our setup). This tree is populated with bind mounts of the various subdirectories of /srv, and is kept in sync with changes there by the /usr/local/sbin/sync-exports script. This is matched with the actual NFS exports in /etc/exports.d, one file per project.
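As an illustration of that layout, for a single project it might look like the fragment below; the export options and network range are assumptions, not the deployed configuration, which lives in the real per-project files under /etc/exports.d:

```
# Bind mount populating the export tree (what sync-exports maintains):
#   mount --bind /srv/project/tools /exp/project/tools
#
# /etc/exports.d/tools.exports (hypothetical contents):
/exp/project/tools  10.0.0.0/8(rw,sync,no_subtree_check)
```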
One huge caveat that needs to be noted: it is imperative that sync-exports be executed before NFS is started, as this sets up the actual filesystems to be exported (through the bind mounts). If NFS is started before that point, any NFS client will notice the changed root inode and will remain stuck in "stale NFS handle" errors until a reboot (whereas clients should otherwise be able to recover from any outage, since all NFS mounts are hard).
Actual NFS service is provided through a service IP (distinct from the servers' own) which is set up by the start-nfs script as the last step before the NFS server itself starts; this allows the IP to be moved to whichever server is the active one. Provided that the same filesystems are presented, the clients will not even notice the interruption in service.
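The required ordering (exports first, service IP, then NFS itself) can be sketched as below. The IP, prefix length, and interface are placeholders, and the function only prints the steps rather than performing them; the real sequence is implemented by the start-nfs script:

```shell
# Sketch of the startup ordering described above. Printing instead of
# executing keeps the ordering visible without needing root or the real
# service IP (192.0.2.10/27 and eth0 below are placeholders).
start_nfs_steps() {
  ip="$1"
  echo "/usr/local/sbin/sync-exports      # must run before NFS starts"
  echo "ip addr add $ip/27 dev eth0       # bring up the service IP"
  echo "service nfs-kernel-server start   # finally, start NFS itself"
}

start_nfs_steps 192.0.2.10
```

Failing over then amounts to tearing the service IP down on one server and running the same sequence on the other.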
Backups
Backups are handled through systemd units, invoked by timers. Copies are made by (a) making a snapshot of the filesystem, (b) mounting it read-only, and (c) running an rsync to codfw's labstore to update that copy.
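The three steps above can be sketched for one filesystem as follows. The snapshot name, snapshot size, mountpoint, and remote target are hypothetical (the real work is done by the replicate-* units), and the function prints the commands instead of running them, since they need root and the real volume group:

```shell
# Sketch of one replication pass: (a) LVM snapshot, (b) read-only mount,
# (c) rsync to the codfw copy. All names below are placeholders.
replicate_steps() {
  vol="$1"
  echo "lvcreate -s -n ${vol}-snap -L 1T labstore/$vol            # (a) snapshot"
  echo "mount -o ro /dev/labstore/${vol}-snap /mnt/${vol}-snap    # (b) mount read-only"
  echo "rsync -a /mnt/${vol}-snap/ codfw-labstore:/srv/$vol/      # (c) update the copy"
}

replicate_steps tools
```

Snapshotting first means the rsync sees a consistent point-in-time image even while the live filesystem keeps changing.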
Every "true" filesystem is copied daily through the replicate-maps, replicate-tools and replicate-others units, one for each respectively named filesystem. The snapshots are kept until full, and are cleaned up by the cleanup-snapshots-labstore unit.
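A minimal sketch of what one such timer pair might look like is below; the unit names match the text, but the schedule and the ExecStart path are assumptions, not the deployed configuration:

```
# replicate-tools.timer (hypothetical)
[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# replicate-tools.service (hypothetical; the real unit invokes the actual
# replication script, whose path is not shown in this document)
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/replicate-nfs tools
```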
There are icinga alerts for any of those units not having been run (successfully) in the past 25 hours.