 The design of LVMetaD
 =====================

 Invocation and setup
 --------------------

 The daemon should be started automatically by the first LVM command issued on
 the system, when needed. The usage of the daemon should be configurable in
 lvm.conf, probably with its own section. Say

     lvmetad {
         enabled = 1 # default
         autostart = 1 # default
         socket = "/path/to/socket" # defaults to /var/run/lvmetad or such
     }

 Library integration
 -------------------

 When a command needs to access metadata, it currently needs to perform a scan
 of the physical devices available in the system. This is a possibly quite
 expensive operation, especially if many devices are attached to the system. In
 most cases, LVM needs a complete image of the system's PVs to operate
 correctly, so all devices need to be read, to at least determine presence (and
 content) of a PV label. Additional IO is done to obtain or write metadata
 areas, but this is only marginally related and addressed by Dave's
 metadata-balancing work.

 In the existing scanning code, a cache layer exists, under
 lib/cache/lvmcache.[hc]. This layer keeps a textual copy of the metadata
 for a given volume group, in a format_text form, as a character string. We can
 plug the lvmetad interface at this level: in lvmcache_get_vg, which is
 responsible for looking up metadata in a local cache, we can, if the metadata
 is not available in the local cache, query lvmetad. Under normal circumstances,
 when a VG is not cached yet, this operation fails and prompts the caller to
 perform a scan. Under the lvmetad enabled scenario, this would never happen and
 the fall-through would only be activated when lvmetad is disabled, which would
 lead to local cache being populated as usual through a locally executed scan.

 Therefore, existing stand-alone (i.e. no lvmetad) functionality of the tools
 would not be compromised by adding lvmetad. With lvmetad enabled, however,
 significant portions of the code would be short-circuited.
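
 The lookup order described above can be sketched as follows. This is only an
 illustration in Python, not the C code in lib/cache: get_vg, query_lvmetad and
 scan_devices are hypothetical names, and the only thing taken from the text is
 the order of fall-through (local cache, then lvmetad, then a local scan).

```python
from typing import Callable, Dict, Optional


def get_vg(
    vg_name: str,
    local_cache: Dict[str, str],
    lvmetad_enabled: bool,
    query_lvmetad: Callable[[str], Optional[str]],
    scan_devices: Callable[[], Dict[str, str]],
) -> Optional[str]:
    """Return format_text metadata for vg_name, or None if the VG is unknown."""
    # 1. The local cache is always consulted first.
    if vg_name in local_cache:
        return local_cache[vg_name]

    # 2. With lvmetad enabled, the daemon answers and no scan is performed.
    if lvmetad_enabled:
        metadata = query_lvmetad(vg_name)
        if metadata is not None:
            local_cache[vg_name] = metadata
        return metadata

    # 3. Stand-alone mode: populate the local cache through a local scan.
    local_cache.update(scan_devices())
    return local_cache.get(vg_name)
```

 Passing the two back ends in as callables keeps the stand-alone path and the
 lvmetad path visibly separate, which is the property the text relies on.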

 Scanning
 --------

 Initially (at least), lvmetad will not be allowed to read disks: it will
 rely on an external program to provide the metadata. In the ideal case, this
 will be triggered by udev. The role of lvmetad is then to collect and maintain
 an accurate (up to the data it has received) image of the VGs available in the
 system. I imagine we could extend the pvscan command (or add a new one, say
 lvmetad_client, if pvscan is found to be inappropriate):

     $ pvscan --cache /dev/foo
     $ pvscan --cache --remove /dev/foo

 These commands would simply read the label and the MDA (if applicable) from the
 given PV and feed that data to the running lvmetad, using
 lvmetad_{add,remove}_pv (see lvmetad_client.h).

 We however need to ensure a couple of things here:

 1) only LVM commands ever touch PV labels and VG metadata
 2) when a device is added or removed, udev fires a rule to notify lvmetad

 While the latter is straightforward, there are issues with the first. We
 *might* want to invoke the dreaded "watch" udev rule in this case, however it
 ends up being implemented. Of course, we can also rely on the sysadmin to be
 reasonable and not write over existing LVM metadata without first telling LVM
 to let go of the respective device(s).

 Even if we simply ignore the problem, metadata write should fail in these
 cases, so the admin should be unable to do substantial damage to the system. If
 there were active LVs on top of the vanished PV, they are in trouble no matter
 what happens there.

 Incremental scan
 ----------------

 There are some new issues arising with the "udev" scan mode. Namely, the
 devices of a volume group will be appearing one by one. The behaviour in this
 case will be very similar to the current behaviour when devices are missing:
 the volume group, until *all* its physical volumes have been discovered and
 announced by udev, will be in a state with some of its devices flagged as
 MISSING_PV. This means that the volume group will be, for most purposes,
 read-only until it is complete and LVs residing on yet-unknown PVs won't
 activate without --partial. Under usual circumstances, this is not a problem
 and the current code for dealing with MISSING_PVs should be adequate.

 However, the code for reading volume groups from disks will need to be adapted,
 since it currently does not work incrementally. Such support will need to track
 metadata-less PVs that have been encountered so far and to provide a way to
 update an existing volume group. When the first PV with metadata of a given VG
 is encountered, the VG is created in lvmetad (probably in the form of "struct
 volume_group") and it is assigned any previously cached metadata-less PVs it is
 referencing. Any PVs that were not yet encountered will be marked as MISSING_PV
 in the "struct volume_group". Upon scanning a new PV, if it belongs to any
 already-known volume group, this PV is checked for consistency with the already
 cached metadata (in a case of mismatch, the VG needs to be recovered or
 declared conflicted), and is subsequently unmarked MISSING_PV. Care needs to be
 taken not to unmark MISSING_PV on PVs that have this flag in their persistent
 metadata, though.
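
 The bookkeeping described above might look roughly like the sketch below. It
 is written in Python for brevity and does not mirror "struct volume_group";
 the VG and Assembler classes, their fields, and the use of a sequence number
 as the consistency check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class VG:
    name: str
    seqno: int
    pv_uuids: Set[str]
    # PVs flagged MISSING_PV in the persistent metadata; a scan never clears these.
    persistent_missing: Set[str] = field(default_factory=set)
    # PVs currently treated as MISSING_PV.
    missing: Set[str] = field(default_factory=set)
    conflicted: bool = False


class Assembler:
    def __init__(self) -> None:
        self.vgs: Dict[str, VG] = {}
        self.seen_pvs: Set[str] = set()  # every PV announced so far

    def add_pv(self, pv_uuid: str, metadata: Optional[VG] = None) -> None:
        """Record one scanned PV; metadata is None for a metadata-less PV."""
        self.seen_pvs.add(pv_uuid)
        if metadata is not None:
            known = self.vgs.get(metadata.name)
            if known is None:
                # First PV carrying metadata for this VG: create the VG and
                # mark every PV that is unseen or persistently missing.
                metadata.missing = {
                    u for u in metadata.pv_uuids
                    if u not in self.seen_pvs or u in metadata.persistent_missing
                }
                self.vgs[metadata.name] = metadata
                return
            if known.seqno != metadata.seqno:
                # Mismatch with cached metadata: needs recovery, do not unmark.
                known.conflicted = True
                return
        # A PV of an already-known VG has arrived: unmark it, unless the flag
        # comes from the persistent metadata.
        for vg in self.vgs.values():
            if pv_uuid in vg.pv_uuids and pv_uuid not in vg.persistent_missing:
                vg.missing.discard(pv_uuid)
```

 Keeping the persistent flag in a separate set is one simple way to make sure
 that a device merely showing up never clears a MISSING_PV flag that was
 written to disk.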

 The most problematic aspect of the whole design may be orphan PVs. At any given
 point, a metadata-less PV may appear orphaned, if a PV of its VG with metadata
 has not been scanned yet. Eventually, we will have to decide that this PV is
 really an orphan and enable its usage for creating or extending VGs. In
 practice, the decision might be governed by a timeout or assumed immediately --
 the former case is a little safer, the latter is probably more transparent. I
 am not very keen on using timeouts and we can probably assume that the admin
 won't blindly try to re-use devices in a way that would trip up LVM in this
 respect. I would be in favour of just assuming that metadata-less PVs with no
 known referencing VGs are orphans -- after all, this is the same approach as we
 use today. The metadata balancing support may stress this a bit more than the
 usual contemporary setups do, though.

 Automatic activation
 --------------------

 It may also be prudent to provide a command that will block until a volume
 group is complete, so that scripts can reliably activate/mount LVs and such. Of
 course, some PVs may never appear, so a timeout is necessary. Again, this is
 something not handled by current tools, but may become more important in
 future. It probably does not need to be implemented right away though.
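
 As a rough illustration of "block until complete, with a timeout", the sketch
 below uses a condition variable. It is not a proposal for the actual command:
 the VGCompletion class and its method names are hypothetical, and it tracks a
 single VG only.

```python
import threading
from typing import Set


class VGCompletion:
    """Tracks which PVs of one VG are still missing and lets callers wait."""

    def __init__(self, missing: Set[str]) -> None:
        self._missing = set(missing)
        self._cond = threading.Condition()

    def pv_arrived(self, pv_uuid: str) -> None:
        # Called when a scan announces a PV; wakes waiters once none are left.
        with self._cond:
            self._missing.discard(pv_uuid)
            if not self._missing:
                self._cond.notify_all()

    def wait_complete(self, timeout: float) -> bool:
        """Block until no PV is missing; return False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._missing, timeout=timeout)
```

 A script would call wait_complete() before activating or mounting anything and
 treat a False result as "some PVs never appeared".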

 The other aspect of the progressive VG assembly is automatic activation.
 Currently, the only problem with that is that we would like to avoid having
 activation code in lvmetad, so we would prefer to fire up an event of some sort
 and let someone else handle the activation and whatnot.

 Cluster support
 ---------------

 When working in a cluster, clvmd integration will be necessary: clvmd will need
 to instruct lvmetad to re-read metadata as appropriate due to writes on remote
 hosts. Overall, this is not hard, but the devil is in the details. I would
 possibly disable lvmetad for clustered volume groups in the first phase and
 only proceed when the local mode is robust and well tested.

 With lvmlockd, lvmetad state is kept up to date by flagging either an
 individual VG as "invalid", or the global state as "invalid".  When either
 the VG or the global state are read, this invalid flag is returned along
 with the data.  The client command can check for this invalid state and
 decide to read the information from disk rather than use the stale cached
 data.  After the latest data is read from disk, the command may choose to
 send it to lvmetad to update the cache.  lvmlockd uses version numbers
 embedded in its VG and global locks to detect when cached data becomes
 invalid, and it then tells lvmetad to set the related invalid flag.
 dct, 2015-06-23
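
 The invalid-flag handshake can be pictured with the following sketch. It is a
 simplified model in Python, covering only the per-VG flag (the global flag
 would work the same way); MetadataCache and read_vg are hypothetical names
 and do not correspond to the real lvmetad or lvmlockd interfaces.

```python
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Stand-in for lvmetad state: metadata per VG plus an 'invalid' flag."""

    def __init__(self) -> None:
        self._vgs: Dict[str, str] = {}
        self._invalid: Dict[str, bool] = {}

    def update(self, vg_name: str, metadata: str) -> None:
        # Fresh data from disk replaces the entry and clears the flag.
        self._vgs[vg_name] = metadata
        self._invalid[vg_name] = False

    def set_invalid(self, vg_name: str) -> None:
        # What lvmlockd would request after seeing a newer lock version.
        self._invalid[vg_name] = True

    def lookup(self, vg_name: str) -> Tuple[str, bool]:
        # The flag is returned along with the data; the client decides.
        return self._vgs[vg_name], self._invalid[vg_name]


def read_vg(
    cache: MetadataCache,
    vg_name: str,
    read_from_disk: Callable[[str], str],
) -> str:
    """Client side: use cached data unless it is flagged invalid."""
    metadata, invalid = cache.lookup(vg_name)
    if invalid:
        metadata = read_from_disk(vg_name)
        cache.update(vg_name, metadata)  # optional, as the text notes
    return metadata
```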

 Protocol & co.
 --------------

 I expect a simple text-based protocol executed on top of a Unix domain socket
 to be the communication interface for lvmetad. Ideally, the requests and
 replies will be well-formed "config file" style strings, so we can re-use
 existing parsing infrastructure.
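
 To make the idea concrete, here is a minimal sketch of one request travelling
 over a Unix domain socket as flat key = "value" lines. The request name
 "pv_add", the field names, the blank-line terminator and the helper functions
 are all invented for this example; the real message layout is not specified
 here, and the parser handles only this flat subset without any escaping.

```python
import socket
from typing import Dict


def format_request(name: str, fields: Dict[str, str]) -> bytes:
    """Render a request as key = "value" lines ended by a blank line."""
    lines = ['request = "%s"' % name]
    lines += ['%s = "%s"' % (key, value) for key, value in fields.items()]
    return ("\n".join(lines) + "\n\n").encode("ascii")


def parse_message(raw: bytes) -> Dict[str, str]:
    """Parse the flat key = "value" subset produced by format_request."""
    result: Dict[str, str] = {}
    for line in raw.decode("ascii").splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip().strip('"')
    return result


def demo_round_trip() -> Dict[str, str]:
    # On Unix, socketpair() returns two connected Unix domain sockets,
    # so no socket path is needed for the demonstration.
    client, server = socket.socketpair()
    try:
        client.sendall(
            format_request("pv_add", {"uuid": "pv-a", "device": "/dev/foo"})
        )
        data = b""
        while not data.endswith(b"\n\n"):
            chunk = server.recv(4096)
            if not chunk:
                break
            data += chunk
        return parse_message(data)
    finally:
        client.close()
        server.close()
```

 Because the messages are plain text, a test client can build arbitrary
 requests with ordinary string handling.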

 Since we already have two daemons, I would probably look into factoring some
 common code for daemon-y things, like sockets, communication (including thread
 management) and maybe logging and re-using it in all the daemons (clvmd,
 dmeventd and lvmetad). This shared infrastructure should live under
 daemons/common, and the existing daemons shall be gradually migrated to the
 shared code.

 Future extensions
 -----------------

 The above should basically cover the use of lvmetad as a cache-only
 daemon. Writes could still be executed locally, and the new metadata version
 can be provided to lvmetad through the socket the usual way. This is fairly
 natural and in my opinion reasonable. The lvmetad acts like a cache that will
 hold metadata, no more no less.

 Above this, there are a couple of things that could be worked on later, when the
 above basic design is finished and implemented.

 _Metadata writing_: We may want to support writing new metadata through
 lvmetad. This may or may not be a better design, but the write itself should be
 more or less orthogonal to the rest of the story outlined above.

 _Locking_: Other than directing metadata writes through lvmetad, one could
 conceivably also track VG/LV locking through the same.

 _Clustering_: A deeper integration of lvmetad with clvmd might be possible and
 maybe desirable. Since clvmd communicates over the network with other clvmd
 instances, this could be extended to metadata exchange between lvmetad daemons,
 further cutting down scanning costs. This would combine well with the
 write-through-lvmetad approach.

 Testing
 -------

 Since (at least bare-bones) lvmetad has no disk interaction and is fed metadata
 externally, it should be very amenable to automated testing. We need to provide
 a client that can feed arbitrary, synthetic metadata to the daemon and request
 the data back, providing reasonable (nearly unit-level) testing infrastructure.

 Battle plan & code layout
 =========================

 - config_tree from lib/config needs to move to libdm/
 - daemon/common *client* code can go to libdm/ as well (say
   libdm/libdm-daemon.{h,c} or such)
 - daemon/common *server* code stays, is built in daemon/ toplevel as a static
   library, say libdaemon-common.a
 - daemon/lvmetad *client* code goes to lib/lvmetad
 - daemon/lvmetad *server* code stays (links in daemon/libdaemon_common.a)