| The cluster MD is a shared-device RAID for a cluster. | 
 |  | 
 |  | 
 | 1. On-disk format | 
 |  | 
 | A separate write-intent bitmap is used for each cluster node. |
 | Each bitmap records all writes that may have been started on its node, |
 | and may not yet have finished. The on-disk layout is: |
 |  | 
 | 0                    4k                     8k                    12k | 
 | ------------------------------------------------------------------- | 
 | | idle                | md super            | bm super [0] + bits | | 
 | | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   | | 
 | | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  | | 
 | | bm bits [3, contd]  |                     |                     | | 
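
Reading the offsets off the diagram above, each node's bitmap region (superblock plus bits) spans two 4k blocks, and the first region starts at 8k. The arithmetic can be sketched as follows. This assumes the 8k-per-node stride drawn in the diagram; the actual stride would depend on the bitmap size. `bitmap_super_offset` is a hypothetical helper name, not an md function.

```python
KIB = 1024
FIRST_BITMAP_OFFSET = 8 * KIB   # "bm super[0] + bits" starts at 8k in the diagram
BITMAP_STRIDE = 8 * KIB         # assumption: super + bits take two 4k blocks per node


def bitmap_super_offset(slot):
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("bitmap slots start from zero")
    return FIRST_BITMAP_OFFSET + slot * BITMAP_STRIDE


# Offsets (in KiB) of the four bitmap superblocks shown in the diagram.
offsets_in_kib = [bitmap_super_offset(s) // KIB for s in range(4)]
```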
 |  | 
 | During "normal" functioning we assume the filesystem ensures that only one |
 | node writes to any given block at a time, so a write request will: |
 |  - set the appropriate bit (if not already set) | 
 |  - commit the write to all mirrors | 
 |  - schedule the bit to be cleared after a timeout. | 
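
The three steps above can be sketched with a toy in-memory model. This is only an illustration: the class name, the chunk size, the tick-based timeout and the dict-backed mirrors are all assumptions made for the example, not md internals.

```python
class WriteIntentBitmap:
    """Toy per-node write-intent bitmap: one bit per fixed-size chunk."""

    def __init__(self, chunk_sectors=128, clear_delay=5):
        self.chunk_sectors = chunk_sectors  # assumed chunk size, in sectors
        self.clear_delay = clear_delay      # assumed timeout, in abstract ticks
        self.bits = set()                   # chunks currently marked dirty
        self.clear_at = {}                  # chunk -> tick at which to clear it

    def write(self, sector, data, mirrors, now):
        chunk = sector // self.chunk_sectors
        self.bits.add(chunk)                # set the bit (no-op if already set)
        for mirror in mirrors:              # commit the write to all mirrors
            mirror[sector] = data
        self.clear_at[chunk] = now + self.clear_delay  # schedule the clear

    def tick(self, now):
        """Clear every bit whose timeout has expired."""
        for chunk, deadline in list(self.clear_at.items()):
            if now >= deadline:
                self.bits.discard(chunk)
                del self.clear_at[chunk]


mirrors = [{}, {}]                          # two mirrors, modelled as dicts
bitmap = WriteIntentBitmap()
bitmap.write(sector=300, data=b"x", mirrors=mirrors, now=0)
dirty_after_write = sorted(bitmap.bits)     # sector 300 falls in chunk 2
bitmap.tick(now=5)
dirty_after_timeout = sorted(bitmap.bits)
```

A bit that is still set after a crash marks a chunk whose mirrors may disagree, which is what the recovery path in section 4.1 relies on.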
 |  | 
 | Reads are just handled normally.  It is up to the filesystem to | 
 | ensure one node doesn't read from a location where another node (or the same | 
 | node) is writing. | 
 |  | 
 |  | 
 | 2. DLM Locks for management | 
 |  | 
 | The following DLM lock resource is used for managing the device: |
 |  | 
 | 2.1 Bitmap lock resource (bm_lockres) | 
 |  | 
 |  The bm_lockres protects individual node bitmaps. They are named in the |
 |  form bitmap001 for node 1, bitmap002 for node 2 and so on. When a node |
 |  joins the cluster, it acquires the lock in PW mode and holds it for as |
 |  long as the node is part of the cluster. The lock resource |
 |  number is based on the slot number returned by the DLM subsystem. Since | 
 |  DLM starts node count from one and bitmap slots start from zero, one is | 
 |  subtracted from the DLM slot number to arrive at the bitmap slot number. | 
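
The slot arithmetic and the naming form can be sketched as below. Both function names are hypothetical, and the sketch assumes the number embedded in the lock name is the node number, as in the bitmap001 example above.

```python
def bitmap_slot(dlm_slot):
    """Convert a one-based DLM slot number to a zero-based bitmap slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return dlm_slot - 1


def bitmap_lock_name(node):
    """Lock resource name in the 'bitmap001 for node 1' form (assumed format)."""
    return "bitmap%03d" % node


first_slot = bitmap_slot(1)
first_name = bitmap_lock_name(1)
```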
 |  | 
 | 3. Communication | 
 |  | 
 | Each node has to communicate with the other nodes when starting or ending a |
 | resync, and when updating the metadata superblock. |
 |  | 
 | 3.1 Message Types | 
 |  | 
 |  The following types of messages are passed: |
 |  | 
 |  3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been |
 |    updated, and that they must re-read the md superblock. This is performed |
 |    synchronously. |
 |  |
 |  3.1.2 RESYNC: informs other nodes that a resync has been initiated or has |
 |    ended, so that each node may suspend or resume the region. |
 |  | 
 | 3.2 Communication mechanism | 
 |  | 
 |  The DLM LVB is used to communicate between the nodes of the cluster. Three |
 |  resources are used for this purpose: |
 |  | 
 |   3.2.1 Token: The resource which protects the entire communication | 
 |    system. The node having the token resource is allowed to | 
 |    communicate. | 
 |  | 
 |   3.2.2 Message: The lock resource which carries the data to | 
 |    communicate. | 
 |  | 
 |   3.2.3 Ack: The resource whose acquisition means that the message has been |
 |    acknowledged by all nodes in the cluster. The BAST of the resource |
 |    is used to inform the receiving node that a node wants to communicate. |
 |  | 
 | The algorithm is: | 
 |  | 
 |  1. receive status | 
 |  | 
 |    sender                         receiver                   receiver | 
 |    ACK:CR                          ACK:CR                     ACK:CR | 
 |  | 
 |  2. sender get EX of TOKEN | 
 |     sender get EX of MESSAGE | 
 |     sender                        receiver                 receiver | 
 |     TOKEN:EX                       ACK:CR                   ACK:CR | 
 |     MESSAGE:EX | 
 |     ACK:CR | 
 |  | 
 |     Sender checks that it still needs to send a message. Messages received | 
 |     or other events that happened while waiting for the TOKEN may have made | 
 |     this message inappropriate or redundant. | 
 |  | 
 |  3. sender write LVB. | 
 |     sender down-convert MESSAGE from EX to CR | 
 |     sender try to get EX of ACK | 
 |     [ wait until all receivers have *processed* the MESSAGE ] |
 |  | 
 |                                      [ triggered by bast of ACK ] | 
 |                                      receiver get CR of MESSAGE | 
 |                                      receiver read LVB | 
 |                                      receiver processes the message | 
 |                                      [ wait finish ] | 
 |                                      receiver release ACK | 
 |  | 
 |    sender                         receiver                   receiver | 
 |    TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR | 
 |    MESSAGE:CR | 
 |    ACK:EX | 
 |  | 
 |  4. triggered by grant of EX on ACK (indicating all receivers have processed | 
 |     message) | 
 |     sender down-convert ACK from EX to CR | 
 |     sender release MESSAGE | 
 |     sender release TOKEN | 
 |                                receiver upconvert to EX of MESSAGE | 
 |                                receiver get CR of ACK | 
 |                                receiver release MESSAGE | 
 |  | 
 |    sender                      receiver                   receiver | 
 |    ACK:CR                       ACK:CR                     ACK:CR | 
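
The handshake above can be traced with a small single-threaded model. It is a sketch under stated assumptions, not the md or DLM implementation: the `Resource` class and `send` function are hypothetical, a conflicting request is refused instead of queued as a real DLM would do, and the receivers' up-convert of MESSAGE in step 4 is left out.

```python
# Compatibility of a requested mode (first) with a mode held by another node
# (second), restricted to the two DLM modes the handshake uses.
COMPATIBLE = {
    ("CR", "CR"): True,
    ("CR", "EX"): False,
    ("EX", "CR"): False,
    ("EX", "EX"): False,
}


class Resource:
    """Toy lock resource: records the mode each node holds, plus an LVB."""

    def __init__(self):
        self.holders = {}
        self.lvb = None

    def lock(self, node, mode):
        """Grant or convert a lock if no other holder conflicts; else refuse."""
        for other, held in self.holders.items():
            if other != node and not COMPATIBLE[(mode, held)]:
                return False
        self.holders[node] = mode
        return True

    def unlock(self, node):
        self.holders.pop(node, None)


def send(sender, receivers, payload):
    token, message, ack = Resource(), Resource(), Resource()
    for node in [sender] + receivers:       # 1. everyone holds CR on ACK
        ack.lock(node, "CR")

    assert token.lock(sender, "EX")         # 2. sender takes TOKEN and MESSAGE
    assert message.lock(sender, "EX")

    message.lvb = payload                   # 3. write LVB, down-convert MESSAGE
    message.lock(sender, "CR")
    blocked = not ack.lock(sender, "EX")    # refused while receivers hold CR

    delivered = {}
    for node in receivers:                  # receivers react to the BAST on ACK
        assert message.lock(node, "CR")
        delivered[node] = message.lvb       # read and "process" the message
        ack.unlock(node)

    assert ack.lock(sender, "EX")           # 4. all receivers have released ACK
    ack.lock(sender, "CR")
    message.unlock(sender)
    token.unlock(sender)
    for node in receivers:                  # receivers return to the idle state
        assert ack.lock(node, "CR")
        message.unlock(node)
    return blocked, delivered, ack.holders


blocked, delivered, final_ack = send("n1", ["n2", "n3"], "METADATA_UPDATED")
```

The point of the model is the ordering guarantee: EX on ACK cannot be granted until every receiver has released its CR, so the grant doubles as the acknowledgement.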
 |  | 
 |  | 
 | 4. Handling Failures | 
 |  | 
 | 4.1 Node Failure | 
 |  When a node fails, the DLM informs the cluster of the failed node's slot |
 |  number. A surviving node then starts a cluster recovery thread, which: |
 | 	- acquires the bitmap<number> lock of the failed node | 
 | 	- opens the bitmap | 
 | 	- reads the bitmap of the failed node | 
 | 	- copies the set bits to the local node's bitmap |
 | 	- clears the bitmap of the failed node |
 | 	- releases bitmap<number> lock of the failed node | 
 | 	- initiates resync of the bitmap on the current node | 
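
The bitmap handling in the recovery steps above amounts to a merge followed by a clear. A minimal sketch, assuming each bitmap is modelled as a set of dirty chunk numbers keyed by bitmap slot; `recover_failed_node` is a hypothetical name, and the lock acquire/release and the on-disk I/O are omitted.

```python
def recover_failed_node(bitmaps, failed_slot, local_slot):
    """Fold the failed node's dirty bits into the local bitmap, then clear them.

    `bitmaps` maps a bitmap slot to the set of dirty chunk numbers for that
    node. Returns the chunks the local node now has to resync.
    """
    failed_bits = set(bitmaps[failed_slot])   # read the failed node's bitmap
    bitmaps[local_slot] |= failed_bits        # copy the set bits to the local node
    bitmaps[failed_slot].clear()              # clear the failed node's bitmap
    return sorted(bitmaps[local_slot])        # resync then runs on the local bitmap


bitmaps = {0: {4, 9}, 1: {9, 17}}
to_resync = recover_failed_node(bitmaps, failed_slot=1, local_slot=0)
```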
 |  | 
 |  The resync process is the regular md resync. However, in a clustered |
 |  environment the node performing a resync needs to tell the other nodes |
 |  which areas are suspended. Before a resync starts, the node |
 |  sends out RESYNC_START with the (lo,hi) range of the area which needs |
 |  to be suspended. Each node maintains a suspend_list, which contains |
 |  the list of ranges which are currently suspended. On receiving |
 |  RESYNC_START, the node adds the range to the suspend_list. Similarly, |
 |  when the node performing the resync finishes, it sends RESYNC_FINISHED |
 |  to the other nodes, which remove the corresponding entry from |
 |  the suspend_list. |
 |  | 
 |  A helper function, should_suspend(), can be used to check if a particular |
 |  I/O range should be suspended or not. | 
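
The check described above is a range-overlap test against the suspend_list. The sketch below is an assumption about its shape: the document names should_suspend() but does not give its signature, and the half-open [lo, hi) convention and the tuple-based list are chosen here for illustration only.

```python
def should_suspend(suspend_list, lo, hi):
    """Return True if the I/O range [lo, hi) overlaps any suspended range."""
    return any(lo < s_hi and s_lo < hi for s_lo, s_hi in suspend_list)


suspend_list = []
suspend_list.append((1000, 2000))        # RESYNC_START received for (lo, hi)
inside = should_suspend(suspend_list, 1500, 1600)
outside = should_suspend(suspend_list, 2000, 2100)
suspend_list.remove((1000, 2000))        # RESYNC_FINISHED received
after_finish = should_suspend(suspend_list, 1500, 1600)
```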
 |  | 
 | 4.2 Device Failure | 
 |  Device failures are handled and communicated with the metadata update | 
 |  routine. | 
 |  | 
 | 5. Adding a new Device | 
 | For adding a new device, it is necessary that all nodes "see" the new device | 
 | to be added. For this, the following algorithm is used: | 
 |  | 
 |     1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues |
 |        ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) |
 |     2. Node 1 sends NEWDISK with uuid and slot number | 
 |     3. Other nodes issue kobject_uevent_env with uuid and slot number | 
 |        (Steps 4,5 could be a udev rule) | 
 |     4. In userspace, the node searches for the disk, perhaps | 
 |        using blkid -t UUID_SUB="" |
 |     5. Other nodes issue either of the following depending on whether the disk | 
 |        was found: | 
 |        ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and | 
 |                 disc.number set to slot number) | 
 |        ioctl(CLUSTERED_DISK_NACK) | 
 |     6. Other nodes drop lock on no-new-devs (CR) if device is found | 
 |     7. Node 1 attempts EX lock on no-new-devs | 
 |     8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk | 
 |        as SpareLocal | 
 |     9. If node 1 does not get the no-new-devs lock, it fails the operation |
 |        and sends METADATA_UPDATED |
 |     10. Other nodes learn whether the disk was added from the |
 |         METADATA_UPDATED message that follows. |
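
The outcome of steps 6 to 9 depends only on whether every other node found the disk and therefore dropped its CR lock on no-new-devs. A minimal sketch of that decision; `add_disk_outcome` and its return strings are hypothetical names used for illustration.

```python
def add_disk_outcome(found_by_node):
    """Decide whether node 1's add succeeds.

    `found_by_node` maps each other node to True if it located the new disk.
    A node that found the disk drops its CR lock on no-new-devs; node 1 can
    take EX only when no CR holders remain.
    """
    cr_holders = {node for node, found in found_by_node.items() if not found}
    return "added" if not cr_holders else "failed"


all_found = add_disk_outcome({"n2": True, "n3": True})
one_missing = add_disk_outcome({"n2": True, "n3": False})
```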