| LVM device fault handling |
| ========================= |
| |
| Introduction |
| ------------ |
| This document serves as the definitive source for information |
| regarding the policies and procedures surrounding device failures |
| in LVM. It codifies LVM's responses to device failures as well as |
| the responsibilities of administrators. |
| |
| Device failures can be permanent or transient. A permanent failure |
| is one where a device becomes inaccessible and will never be |
| revived. A transient failure is a failure that can be recovered |
| from (e.g. a power failure, intermittent network outage, block |
| relocation, etc). The policies for handling both types of failures |
| are described herein. |
| |
| Users need to be aware that there are two implementations of RAID1 in LVM. |
| The first is defined by the "mirror" segment type. The second is defined by |
| the "raid1" segment type. The characteristics of each of these are defined |
| in lvm.conf under 'mirror_segtype_default' - the configuration setting used to |
| identify the default RAID1 implementation used for LVM operations. |
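| |
| As a minimal sketch, the following shell fragment shows one way to check |
| which implementation applies. It assumes 'mirror_segtype_default' lives in |
| the 'global' section of lvm.conf and that the 'lvmconfig' command is |
| available; the volume group name 'vg00', the 'run' helper, and the DRY_RUN |
| variable are hypothetical names used only for illustration. Commands are |
| printed rather than executed unless DRY_RUN=0 is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the configured default RAID1 implementation ("mirror" or "raid1"),
# then the segment type actually used by each LV in a volume group.
show_raid1_implementation() {
    vg=$1
    run lvmconfig global/mirror_segtype_default
    run lvs -o lv_name,segtype "$vg"
}

show_raid1_implementation vg00
```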
| |
| Available Operations During a Device Failure |
| -------------------------------------------- |
| When there is a device failure, LVM behaves somewhat differently because |
| only a subset of the available devices will be found for the particular |
| volume group. The number of operations available to the administrator |
| is diminished. It is not possible to create new logical volumes while |
| PVs cannot be accessed, for example. Operations that create, convert, or |
| resize logical volumes are disallowed, such as: |
| - lvcreate |
| - lvresize |
| - lvreduce |
| - lvextend |
| - lvconvert (unless '--repair' is used) |
| Operations that activate, deactivate, remove, report, or repair logical |
| volumes are allowed, such as: |
| - lvremove |
| - vgremove (will remove all LVs, but not the VG until consistent) |
| - pvs |
| - vgs |
| - lvs |
| - lvchange -a [yn] |
| - vgchange -a [yn] |
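| |
| The two lists above can be summarized as a small lookup. This is only an |
| illustrative sketch: the function name 'allowed_during_failure' is |
| hypothetical, it matches on the command name alone, and it does not query |
| LVM in any way. |
| |
```shell
# Classify an LVM command line according to the two lists above.
# Prints "allowed", "disallowed", or "unlisted" (not named in either list).
allowed_during_failure() {
    case "$*" in
        lvcreate*|lvresize*|lvreduce*|lvextend*) echo disallowed ;;
        lvconvert*--repair*) echo allowed ;;      # the one lvconvert exception
        lvconvert*) echo disallowed ;;
        lvremove*|vgremove*|pvs*|vgs*|lvs*) echo allowed ;;
        lvchange\ -a*|vgchange\ -a*) echo allowed ;;
        *) echo unlisted ;;
    esac
}

allowed_during_failure lvconvert --repair vg00/lv_data
allowed_during_failure lvextend -L +1G vg00/lv_data
```
| |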
| Operations specific to the handling of failed devices are allowed and |
| are as follows: |
| |
| - 'vgreduce --removemissing <VG>': This action is designed to remove |
| the reference of a failed device from the LVM metadata stored on the |
| remaining devices. If there are (portions of) logical volumes on the |
| failed devices, the ability of the operation to proceed will depend |
| on the type of logical volumes found. If an image (i.e. leg or side) |
| of a mirror is located on the device, that image/leg of the mirror |
| is eliminated along with the failed device. The result of such a |
| mirror reduction could be a no-longer-redundant linear device. If |
| a linear, stripe, or snapshot device is located on the failed device, |
| the command will not proceed without a '--force' option. The result |
| of using the '--force' option is the entire removal and complete |
| loss of the non-redundant logical volume. If an image or metadata area |
| of a RAID logical volume is on the failed device, the sub-LV affected is |
| replaced with an error target device - appearing as <unknown> in 'lvs' |
| output. RAID logical volumes cannot be completely repaired by vgreduce - |
| 'lvconvert --repair' (listed below) must be used. Once this operation is |
| complete on volume groups not containing RAID logical volumes, the volume |
| group will again have a complete and consistent view of the devices it |
| contains. Thus, all operations will be permitted - including creation, |
| conversion, and resizing operations. The currently preferred method is |
| to call 'lvconvert --repair' on the individual logical volumes to repair |
| them, followed by 'vgreduce --removemissing' to remove the physical volume's |
| representation from the volume group. |
| |
| - 'lvconvert --repair <VG/LV>': This action is designed specifically |
| to operate on individual logical volumes. If, for example, a failed |
| device happened to contain the images of four distinct mirrors, it would |
| be necessary to run 'lvconvert --repair' on each of them. The ultimate |
| result is to leave the faulty device in the volume group, but have no logical |
| volumes referencing it. (This allows for 'vgreduce --removemissing' to |
| remove the physical volumes cleanly.) In addition to removing mirror or |
| RAID images that reside on failed devices, 'lvconvert --repair' can also |
| replace the failed device if there are spare devices available in the |
| volume group. The user is prompted whether to simply remove the failed |
| portions of the mirror or to also allocate a replacement, if run from the |
| command-line. Optionally, the '--use-policies' flag can be specified which |
| will cause the operation not to prompt the user, but instead respect |
| the policies outlined in the LVM configuration file - usually, |
| /etc/lvm/lvm.conf. Once this operation is complete, the logical volumes |
| will be consistent. However, the volume group will still be inconsistent - |
| due to the referenced-but-missing device/PV - and operations will still be |
| restricted to the aforementioned actions until either the device is |
| restored or 'vgreduce --removemissing' is run. |
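| |
| The preferred ordering described above (repair each logical volume, then |
| remove the missing PV from the volume group) can be sketched as follows. |
| The names 'vg00', 'mirror1', 'mirror2', the 'run' helper, and the DRY_RUN |
| variable are hypothetical; commands are printed rather than executed |
| unless DRY_RUN=0 is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Repair each affected LV first, then drop the missing PV from the VG.
# Usage: repair_after_failure VG VG/LV...
repair_after_failure() {
    vg=$1
    shift
    for lv in "$@"; do
        # --use-policies avoids the interactive prompt and follows lvm.conf.
        run lvconvert --repair --use-policies "$lv" || return 1
    done
    run vgreduce --removemissing "$vg"
}

repair_after_failure vg00 vg00/mirror1 vg00/mirror2
```
| |
| The sketch deliberately omits '--force': as noted above, forcing the |
| reduction removes any non-redundant logical volume on the failed device. |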
| |
| Device Revival (transient failures): |
| ------------------------------------ |
| The above section describes the limitations a user can expect during |
| a device failure. However, if the device returns after a period of |
| time, what to expect will depend on what happened while the device |
| was unavailable. If no automated actions (described |
| below) or user actions were necessary or performed, then no change in |
| operations or logical volume layout will occur. However, if an |
| automated action or one of the aforementioned repair commands was |
| manually run, the returning device will be perceived as having stale |
| LVM metadata. In this case, the user can expect to see a warning |
| concerning inconsistent metadata. The metadata on the returning |
| device will be automatically replaced with the latest copy of the |
| LVM metadata - restoring consistency. Note that while most LVM commands |
| will automatically update the metadata on a restored device, the |
| following possible exceptions exist: |
| - pvs (when it does not read/update VG metadata) |
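| |
| One way to inspect a volume group after a device returns is sketched |
| below. It assumes an LVM version whose reporting commands provide the |
| 'vg_missing_pv_count' and 'pv_missing' fields; 'vg00', the 'run' helper, |
| and the DRY_RUN variable are hypothetical. Commands are printed rather |
| than executed unless DRY_RUN=0 is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Report whether a VG still counts any PV as missing, then list the
# per-PV missing flag. 'vgs' is run first because it reads VG metadata.
check_returned_device() {
    vg=$1
    run vgs -o vg_name,pv_count,vg_missing_pv_count "$vg" || return 1
    run pvs -o pv_name,vg_name,pv_missing
}

check_returned_device vg00
```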
| |
| Automated Target Response to Failures: |
| -------------------------------------- |
| The only LVM target types (i.e. "personalities") that have an automated |
| response to failures are the mirror and RAID logical volumes. The other target |
| types (linear, stripe, snapshot, etc) will simply propagate the failure. |
| [A snapshot becomes invalid if its underlying device fails, but the |
| origin will remain valid - presuming the origin device has not failed.] |
| |
| Starting with the "mirror" segment type, there are three types of errors that |
| a mirror can suffer - read, write, and resynchronization errors. Each is |
| described in depth below. |
| |
| Mirror read failures: |
| If a mirror is 'in-sync' (i.e. all images have been initialized and |
| are identical), a read failure will only produce a warning. Data is |
| simply pulled from one of the other images and the fault is recorded. |
| Sometimes - like in the case of bad block relocation - read errors can |
| be recovered from by the storage hardware. Therefore, it is up to the |
| user to decide whether to reconfigure the mirror and remove the device |
| that caused the error. Managing the composition of a mirror is done with |
| 'lvconvert' and removing a device from a volume group can be done with |
| 'vgreduce'. |
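| |
| A minimal sketch of that reconfiguration follows. It assumes the optional |
| PV argument to 'lvconvert -m' selects which image is removed; 'vg00', |
| 'lv_data', '/dev/sdb1', the 'run' helper, and the DRY_RUN variable are |
| hypothetical. Commands are printed rather than executed unless DRY_RUN=0 |
| is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Drop the mirror image that lives on a suspect PV, then remove that PV
# from the volume group.
# Usage: drop_mirror_device VG LV PV REMAINING_MIRROR_COUNT
drop_mirror_device() {
    vg=$1
    lv=$2
    pv=$3
    mirrors=$4
    run lvconvert -m "$mirrors" "$vg/$lv" "$pv" || return 1
    run vgreduce "$vg" "$pv"
}

# Reduce a 2-way mirror to linear (-m 0) by removing the image on /dev/sdb1.
drop_mirror_device vg00 lv_data /dev/sdb1 0
```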
| |
| If a mirror is not 'in-sync', a read failure will produce an I/O error. |
| This error will propagate all the way up to the applications above the |
| logical volume (e.g. the file system). No automatic intervention will |
| take place in this case either. It is up to the user to decide what |
| can be done/salvaged in this scenario. If the user is confident that the |
| images of the mirror are the same (or they are willing to simply attempt |
| to retrieve whatever data they can), 'lvconvert' can be used to eliminate |
| the failed image and proceed. |
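| |
| Which of the two read-failure cases applies depends on whether the mirror |
| is in-sync. The sketch below assumes the sync state is read from the |
| 'copy_percent' report field of 'lvs' and that a fully synchronized mirror |
| reports 100.00; the function name 'is_in_sync' is hypothetical. The value |
| is passed in as text so the check can be tried without LVM present. |
| |
```shell
# Decide whether a mirror is in-sync from the value printed by, e.g.,
#   lvs --noheadings -o copy_percent VG/LV
# Returns 0 (true) only when the value is 100 percent.
is_in_sync() {
    # Strip the surrounding whitespace that report output may carry.
    pct=$(printf '%s' "$1" | tr -d '[:space:]')
    case "$pct" in
        100|100.0|100.00) return 0 ;;
        *) return 1 ;;
    esac
}

if is_in_sync "  100.00"; then
    echo "in-sync: a read failure only produces a warning"
else
    echo "not in-sync: a read failure surfaces as an I/O error"
fi
```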
| |
| Mirror resynchronization errors: |
| A resynchronization error is one that occurs when trying to initialize |
| all mirror images to be the same. It can happen due to a failure to |
| read the primary image (the image considered to have the 'good' data), or |
| due to a failure to write the secondary images. This type of failure |
| only produces a warning, and it is up to the user to take action in this |
| case. If the error is transient, the user can simply reactivate the |
| mirrored logical volume to make another attempt at resynchronization. |
| If attempts to finish resynchronization fail, 'lvconvert' can be used to |
| remove the faulty device from the mirror. |
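| |
| Reactivating the mirror to retry resynchronization can be sketched as |
| below. The logical volume must not be in use (for example, mounted) for |
| the deactivation to succeed. 'vg00/lv_data', the 'run' helper, and the |
| DRY_RUN variable are hypothetical; commands are printed rather than |
| executed unless DRY_RUN=0 is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Deactivate and then reactivate a mirrored LV so that resynchronization
# is attempted again on activation.
retry_resync() {
    lv=$1
    run lvchange -an "$lv" || return 1
    run lvchange -ay "$lv"
}

retry_resync vg00/lv_data
```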
| |
| TODO... |
| Some sort of response to this type of error could be automated. |
| Since this document is the definitive source for how to handle device |
| failures, the process should be defined here. If the process is defined |
| but not implemented, it should be noted as such. One idea might be to |
| make a single attempt to suspend/resume the mirror in an attempt to |
| redo the sync operation that failed. On the other hand, if there is |
| a permanent failure, it may simply be best to wait for the user or the |
| automated response that is sure to follow from a write failure. |
| ...TODO |
| |
| Mirror write failures: |
| When a write error occurs on a mirror constituent device, an attempt |
| to handle the failure is automatically made. This is done by calling |
| 'lvconvert --repair --use-policies'. The policies implied by this |
| command are set in the LVM configuration file. They are: |
| - mirror_log_fault_policy: This defines what action should be taken |
| if the device containing the log fails. The available options are |
| "remove" and "allocate". Either of these options will cause the |
| faulty log device to be removed from the mirror. The "allocate" |
| policy will attempt the further action of trying to replace the |
| failed disk log by using space that might be available in the |
| volume group. If the allocation fails (or the "remove" policy |
| is specified), the mirror log will be maintained in memory. Should |
| the machine be rebooted or the logical volume deactivated, a |
| complete resynchronization of the mirror will be necessary upon |
| the following activation - such is the nature of a mirror with a 'core' |
| log. The default policy for handling log failures is "allocate". |
| The service disruption incurred by replacing the failed log is |
| negligible, while the benefits of having a persistent log are |
| pronounced. |
| - mirror_image_fault_policy: This defines what action should be taken |
| if a device containing an image fails. Again, the available options |
| are "remove" and "allocate". Both of these options will cause the |
| faulty image device to be removed - adjusting the logical volume |
| accordingly. For example, if one image of a 2-way mirror fails, the |
| mirror will be converted to a linear device. If one image of a |
| 3-way mirror fails, the mirror will be converted to a 2-way mirror. |
| The "allocate" policy takes the further action of trying to replace |
| the failed image using space that is available in the volume group. |
| Replacing a failed mirror image will incur the cost of |
| resynchronizing - degrading the performance of the mirror. The |
| default policy for handling an image failure is "remove". This |
| allows the mirror to still function, but gives the administrator the |
| choice of when to incur the extra performance costs of replacing |
| the failed image. |
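| |
| The policies above can be inspected and applied by hand as sketched below. |
| The sketch assumes both settings live in the 'activation' section of |
| lvm.conf and that 'lvmconfig' is available; 'vg00/lv_data', the 'run' |
| helper, the DRY_RUN variable, and the function names are hypothetical. |
| Commands are printed rather than executed unless DRY_RUN=0 is set. |
| |
```shell
# Hypothetical helper: print a command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Layout left once one image of an N-way mirror has been removed,
# following the examples in the text (2-way becomes linear, 3-way
# becomes 2-way).
layout_after_image_loss() {
    remaining=$(( $1 - 1 ))
    if [ "$remaining" -le 1 ]; then
        echo linear
    else
        echo "${remaining}-way mirror"
    fi
}

# Show both fault policies, then run the same repair command that the
# automated response uses.
repair_with_policies() {
    lv=$1
    run lvmconfig activation/mirror_log_fault_policy || return 1
    run lvmconfig activation/mirror_image_fault_policy || return 1
    run lvconvert --repair --use-policies "$lv"
}

layout_after_image_loss 3
repair_with_policies vg00/lv_data
```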
| |
| RAID logical volume device failures are handled differently from the "mirror" |
| segment type. Discussion of this can be found in lvm2-raid.txt. |