|  | The cluster MD is a shared-device RAID for a cluster. | 
|  |  | 
|  |  | 
|  | 1. On-disk format | 
|  |  | 
|  | A separate write-intent bitmap is used for each cluster node. | 
|  | The bitmaps record all writes that may have been started on that node, | 
|  | and may not yet have finished. The on-disk layout is: | 
|  |  | 
|  | 0                    4k                     8k                    12k | 
|  | ------------------------------------------------------------------- | 
|  | | idle                | md super            | bm super [0] + bits | | 
|  | | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   | | 
|  | | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  | | 
|  | | bm bits [3, contd]  |                     |                     | | 
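
Read literally, the diagram places each node's bitmap superblock at a fixed stride after the md superblock. A minimal sketch of that arithmetic follows, assuming the two-blocks-per-node spacing drawn above; the real spacing depends on the bitmap size, and the constant and function names are hypothetical.

```python
BLOCK = 4096              # 4k block size used in the diagram
FIRST_BITMAP_BLOCK = 2    # blocks 0 and 1 hold the idle area and the md super
BLOCKS_PER_NODE = 2       # assumption: read off the diagram, not a fixed rule


def bitmap_super_offset(slot):
    """Byte offset of 'bm super[slot]' under the assumptions above."""
    if slot < 0:
        raise ValueError("bitmap slots start from zero")
    return (FIRST_BITMAP_BLOCK + slot * BLOCKS_PER_NODE) * BLOCK


for slot in range(4):
    print("bm super[%d] at %dk" % (slot, bitmap_super_offset(slot) // 1024))
```

With these assumptions the four superblocks land at 8k, 16k, 24k and 32k, matching the positions in the diagram.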
|  |  | 
|  | During "normal" functioning we assume the filesystem ensures that only one | 
|  | node writes to any given block at a time, so a write request will | 
|  | - set the appropriate bit (if not already set) | 
|  | - commit the write to all mirrors | 
|  | - schedule the bit to be cleared after a timeout. | 
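
The three steps of a write request can be sketched as a toy model. This is only an illustration: the class, its chunk-based bit granularity and the explicit timeout() call standing in for a timer are all assumptions, and each write is assumed to fit within a single chunk.

```python
class NodeBitmap:
    """Toy write-intent bitmap for one node (hypothetical helper)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.dirty = set()          # chunk numbers whose bit is set
        self.pending_clear = set()  # bits scheduled to be cleared later

    def write(self, mirrors, offset, data):
        chunk = offset // self.chunk_size
        self.dirty.add(chunk)            # set the bit (no-op if already set)
        for mirror in mirrors:           # commit the write to all mirrors
            mirror[offset:offset + len(data)] = data
        self.pending_clear.add(chunk)    # schedule the bit to be cleared

    def timeout(self):
        """Stand-in for the timer: clear every bit that was scheduled."""
        self.dirty -= self.pending_clear
        self.pending_clear.clear()


mirrors = [bytearray(16), bytearray(16)]
bitmap = NodeBitmap(chunk_size=8)
bitmap.write(mirrors, 9, b"ab")
```

Until timeout() runs, chunk 1 stays marked, so a crash in that window leaves a record of the region that may be out of sync.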
|  |  | 
|  | Reads are just handled normally.  It is up to the filesystem to | 
|  | ensure one node doesn't read from a location where another node (or the same | 
|  | node) is writing. | 
|  |  | 
|  |  | 
|  | 2. DLM Locks for management | 
|  |  | 
|  | There are two locks for managing the device: | 
|  |  | 
|  | 2.1 Bitmap lock resource (bm_lockres) | 
|  |  | 
|  | The bm_lockres protects individual node bitmaps. They are named in the | 
|  | form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node | 
|  | joins the cluster, it acquires the lock in PW mode and holds it for as | 
|  | long as the node is part of the cluster. The lock resource | 
|  | number is based on the slot number returned by the DLM subsystem. Since | 
|  | DLM starts node count from one and bitmap slots start from zero, one is | 
|  | subtracted from the DLM slot number to arrive at the bitmap slot number. | 
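
The naming and numbering rule can be sketched as follows. This assumes the lock name embeds the DLM slot number directly, zero-padded to three digits as the bitmap001 example suggests; both function names are hypothetical.

```python
def bitmap_lock_name(dlm_slot):
    """Lock resource name for a node, e.g. 'bitmap001' for DLM slot 1."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return "bitmap%03d" % dlm_slot


def bitmap_slot(dlm_slot):
    """Zero-based bitmap slot derived from the one-based DLM slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return dlm_slot - 1


print(bitmap_lock_name(2), bitmap_slot(2))
```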
|  |  | 
|  | 3. Communication | 
|  |  | 
|  | Each node has to communicate with other nodes when starting or ending | 
|  | a resync, and when updating the metadata superblock. | 
|  |  | 
|  | 3.1 Message Types | 
|  |  | 
|  | There are 3 types of messages which are passed: | 
|  |  | 
|  | 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been | 
|  | updated, and the node must re-read the md superblock. This is performed | 
|  | synchronously. | 
|  |  | 
|  | 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended | 
|  | so that each node may suspend or resume the region. | 
|  |  | 
|  | 3.1.3 NEWDISK: informs other nodes that a device is being added to the | 
|  | array. The message carries the uuid and slot number of the device; see | 
|  | section 5 for details. | 
|  |  | 
|  | 3.2 Communication mechanism | 
|  |  | 
|  | The DLM lock value block (LVB) is used to communicate between nodes of | 
|  | the cluster. There are three resources used for the purpose: | 
|  |  | 
|  | 3.2.1 Token: The resource which protects the entire communication | 
|  | system. The node having the token resource is allowed to | 
|  | communicate. | 
|  |  | 
|  | 3.2.2 Message: The lock resource which carries the data to | 
|  | communicate. | 
|  |  | 
|  | 3.2.3 Ack: The resource whose acquisition in EX mode means the message | 
|  | has been acknowledged by all nodes in the cluster. The BAST (blocking | 
|  | AST) of the resource is used to inform the receiving node that a node | 
|  | wants to communicate. | 
|  |  | 
|  | The algorithm is: | 
|  |  | 
|  | 1. receive status | 
|  |  | 
|  | sender                         receiver                   receiver | 
|  | ACK:CR                          ACK:CR                     ACK:CR | 
|  |  | 
|  | 2. sender get EX of TOKEN | 
|  | sender get EX of MESSAGE | 
|  | sender                        receiver                 receiver | 
|  | TOKEN:EX                       ACK:CR                   ACK:CR | 
|  | MESSAGE:EX | 
|  | ACK:CR | 
|  |  | 
|  | Sender checks that it still needs to send a message. Messages received | 
|  | or other events that happened while waiting for the TOKEN may have made | 
|  | this message inappropriate or redundant. | 
|  |  | 
|  | 3. sender write LVB. | 
|  | sender down-convert MESSAGE from EX to CR | 
|  | sender try to get EX of ACK | 
|  | [ wait until all receivers have *processed* the MESSAGE ] | 
|  |  | 
|  | [ triggered by bast of ACK ] | 
|  | receiver get CR of MESSAGE | 
|  | receiver read LVB | 
|  | receiver processes the message | 
|  | [ wait finish ] | 
|  | receiver release ACK | 
|  |  | 
|  | sender                         receiver                   receiver | 
|  | TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR | 
|  | MESSAGE:CR | 
|  | ACK:EX | 
|  |  | 
|  | 4. triggered by grant of EX on ACK (indicating all receivers have processed | 
|  | message) | 
|  | sender down-convert ACK from EX to CR | 
|  | sender release MESSAGE | 
|  | sender release TOKEN | 
|  | receiver upconvert to EX of MESSAGE | 
|  | receiver get CR of ACK | 
|  | receiver release MESSAGE | 
|  |  | 
|  | sender                      receiver                   receiver | 
|  | ACK:CR                       ACK:CR                     ACK:CR | 
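
The lock transitions above can be traced with a small simulation. This is a sketch under stated assumptions, not the DLM API: Resource is a hypothetical stand-in that models only the CR and EX modes, a lock request that would have to wait simply returns False, and the receivers' up-convert of MESSAGE in step 4 is left out for brevity.

```python
def compatible(held, wanted):
    """CR is compatible with CR; EX is compatible with nothing."""
    return held == "CR" and wanted == "CR"


class Resource:
    """Toy stand-in for a DLM lock resource (hypothetical)."""

    def __init__(self):
        self.holders = {}   # node name -> granted mode
        self.lvb = None     # lock value block

    def lock(self, node, mode):
        """Grant or convert a lock; return False if it would have to wait."""
        for other, held in self.holders.items():
            if other != node and not compatible(held, mode):
                return False
        self.holders[node] = mode
        return True

    def unlock(self, node):
        self.holders.pop(node, None)


def send(sender, receivers, payload):
    token, message, ack = Resource(), Resource(), Resource()
    for node in [sender] + receivers:      # 1. everyone holds ACK:CR
        ack.lock(node, "CR")

    token.lock(sender, "EX")               # 2. sender gets EX of TOKEN
    message.lock(sender, "EX")             #    and EX of MESSAGE

    message.lvb = payload                  # 3. sender writes the LVB,
    message.lock(sender, "CR")             #    down-converts MESSAGE to CR,
    blocked = not ack.lock(sender, "EX")   #    and has to wait for EX of ACK

    processed = []
    for node in receivers:                 # triggered by the BAST of ACK
        message.lock(node, "CR")           # receiver gets CR of MESSAGE
        processed.append((node, message.lvb))   # reads LVB and processes it
        ack.unlock(node)                   # receiver releases ACK

    granted = ack.lock(sender, "EX")       # all receivers have processed

    ack.lock(sender, "CR")                 # 4. sender down-converts ACK,
    message.unlock(sender)                 #    releases MESSAGE
    token.unlock(sender)                   #    and releases TOKEN
    for node in receivers:                 # receivers return to ACK:CR
        ack.lock(node, "CR")
        message.unlock(node)

    return blocked, granted, processed, ack.holders


print(send("node1", ["node2", "node3"], "RESYNC"))
```

The sender's EX request on ACK is refused while any receiver still holds CR, which is what makes the later grant a reliable signal that every receiver has processed the message.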
|  |  | 
|  |  | 
|  | 4. Handling Failures | 
|  |  | 
|  | 4.1 Node Failure | 
|  | When a node fails, the DLM informs the cluster of the failed node's | 
|  | slot. The notified node starts a cluster recovery thread, which: | 
|  | - acquires the bitmap<number> lock of the failed node | 
|  | - opens the bitmap | 
|  | - reads the bitmap of the failed node | 
|  | - copies the set bits to the local node's bitmap | 
|  | - cleans the bitmap of the failed node | 
|  | - releases bitmap<number> lock of the failed node | 
|  | - initiates resync of the bitmap on the current node | 
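
The bitmap handling in the recovery steps amounts to merging the failed node's set bits into the local bitmap and then clearing the source. A minimal sketch, assuming each bitmap is held in memory as an integer bit set and each bitmap<number> lock is modelled by a threading.Lock; the function name and data layout are hypothetical.

```python
import threading


def recover_failed_node(bitmaps, locks, local_slot, failed_slot):
    """Merge the failed node's bitmap into the local one.

    bitmaps maps slot -> int used as a bit set; locks maps slot -> Lock.
    Returns the local bitmap, whose set bits now need a resync.
    """
    with locks[failed_slot]:                  # acquire bitmap<number> lock
        failed_bits = bitmaps[failed_slot]    # read the failed node's bitmap
        bitmaps[local_slot] |= failed_bits    # copy set bits to the local node
        bitmaps[failed_slot] = 0              # clean the failed node's bitmap
    return bitmaps[local_slot]                # lock released on leaving 'with'


bitmaps = {0: 0b0101, 1: 0b0110}
locks = {0: threading.Lock(), 1: threading.Lock()}
to_resync = recover_failed_node(bitmaps, locks, local_slot=0, failed_slot=1)
print(bin(to_resync))
```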
|  |  | 
|  | The resync process is the regular md resync. However, in a clustered | 
|  | environment when a resync is performed, it needs to tell other nodes | 
|  | of the areas which are suspended. Before a resync starts, the node | 
|  | sends out RESYNC_START with the (lo,hi) range of the area which needs | 
|  | to be suspended. Each node maintains a suspend_list, which contains | 
|  | the list of ranges which are currently suspended. On receiving | 
|  | RESYNC_START, the node adds the range to the suspend_list. Similarly, | 
|  | when the node performing resync finishes, it sends RESYNC_FINISHED | 
|  | to other nodes and other nodes remove the corresponding entry from | 
|  | the suspend_list. | 
|  |  | 
|  | A helper function, should_suspend(), can be used to check if a particular | 
|  | I/O range should be suspended or not. | 
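
The suspend_list bookkeeping can be sketched as below. The text does not give the signature of should_suspend() or say whether ranges are inclusive, so this assumes half-open (lo, hi) ranges and a method taking the I/O range as two arguments; the SuspendList class itself is hypothetical.

```python
class SuspendList:
    """Toy model of a node's suspend_list (hypothetical helper)."""

    def __init__(self):
        self.ranges = []    # (lo, hi) ranges currently suspended

    def handle(self, msg_type, lo, hi):
        if msg_type == "RESYNC_START":
            self.ranges.append((lo, hi))
        elif msg_type == "RESYNC_FINISHED":
            if (lo, hi) in self.ranges:
                self.ranges.remove((lo, hi))

    def should_suspend(self, lo, hi):
        """True if the I/O range [lo, hi) overlaps any suspended range."""
        return any(lo < s_hi and s_lo < hi for s_lo, s_hi in self.ranges)


suspended = SuspendList()
suspended.handle("RESYNC_START", 100, 200)
print(suspended.should_suspend(150, 160))
```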
|  |  | 
|  | 4.2 Device Failure | 
|  | Device failures are handled and communicated with the metadata update | 
|  | routine. | 
|  |  | 
|  | 5. Adding a new Device | 
|  | For adding a new device, it is necessary that all nodes "see" the new device | 
|  | to be added. For this, the following algorithm is used: | 
|  |  | 
|  | 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues | 
|  | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) | 
|  | 2. Node 1 sends NEWDISK with uuid and slot number | 
|  | 3. Other nodes issue kobject_uevent_env with uuid and slot number | 
|  | (Steps 4,5 could be a udev rule) | 
|  | 4. In userspace, the node searches for the disk, perhaps | 
|  | using blkid -t SUB_UUID="" | 
|  | 5. Other nodes issue either of the following depending on whether the disk | 
|  | was found: | 
|  | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and | 
|  | disc.number set to slot number) | 
|  | ioctl(CLUSTERED_DISK_NACK) | 
|  | 6. Other nodes drop lock on no-new-devs (CR) if device is found | 
|  | 7. Node 1 attempts EX lock on no-new-devs | 
|  | 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk | 
|  | as SpareLocal | 
|  | 9. If it does not get the no-new-devs lock, it fails the operation and | 
|  | sends METADATA_UPDATED | 
|  | 10. Other nodes learn whether the disk was added from the subsequent | 
|  | METADATA_UPDATED message. |
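
The outcome of steps 6 to 9 depends only on whether every other node dropped its CR lock on no-new-devs. A minimal sketch of that decision, assuming the EX attempt fails as soon as any CR lock remains (the text does not say whether node 1 waits or uses a timeout); the function name and return values are hypothetical.

```python
def add_disk_outcome(found_by_node):
    """found_by_node maps each other node to True if it found the disk."""
    # Every other node starts out holding a CR lock on no-new-devs.
    cr_holders = set(found_by_node)
    for node, found in found_by_node.items():
        if found:                      # step 6: drop the lock if found
            cr_holders.discard(node)
    # Step 7: EX on no-new-devs can only be granted once no CR lock remains.
    got_ex = not cr_holders
    # Steps 8 and 9: either way, METADATA_UPDATED is sent afterwards.
    return "added" if got_ex else "failed"


print(add_disk_outcome({"node2": True, "node3": True}))
print(add_disk_outcome({"node2": True, "node3": False}))
```

A single node that cannot see the device keeps its CR lock, which is enough to make the whole operation fail.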