[/============================================================================
Boost.odeint
Copyright 2013 Karsten Ahnert
Copyright 2013 Pascal Germroth
Copyright 2013 Mario Mulansky
Use, modification and distribution is subject to the Boost Software License,
Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
=============================================================================/]
[section Parallel computation with OpenMP and MPI]
Parallelization is a key feature of modern numerical libraries due to the wide
availability of multi-core hardware nowadays, even on laptops.
odeint currently supports parallelization with OpenMP and MPI, as described in
the following sections.
However, it should be made clear from the beginning that the difficulty of
efficiently distributing ODE integration on many cores/machines lies in the
parallelization of the system function, which is still the user's
responsibility.
Simply using a parallel odeint backend without parallelizing the system function
will bring you almost no performance gains.
[section OpenMP]
[import ../examples/openmp/phase_chain.cpp]
odeint's OpenMP support is implemented as an external backend, which needs to be
manually included. Depending on the compiler, some additional flags may be
needed, e.g. [^-fopenmp] for GCC.
[phase_chain_openmp_header]
In the simplest parallelization approach with OpenMP we use a standard `vector`
as the state type:
[phase_chain_vector_state]
We initialize the state with some random data:
[phase_chain_init]
Now we have to configure the stepper to use the OpenMP backend.
This is done by explicitly providing the `openmp_range_algebra` as a template
parameter to the stepper.
This algebra requires the state type to be a model of Random Access Range and
will access it from multiple threads.
[phase_chain_stepper]
In addition to providing the stepper with OpenMP parallelization, we also need
a parallelized system function to exploit the available cores.
Here this is shown for a simple one-dimensional chain of phase oscillators with
nearest neighbor coupling:
[phase_chain_rhs]
[note In the OpenMP backends the system function will always be called
sequentially from the thread used to start the integration.]
Finally, we perform the integration by using one of the integrate functions from
odeint.
As you can see, the parallelization is completely hidden in the stepper and the
system function.
OpenMP will take care of distributing the work among the threads and will join
them automatically.
[phase_chain_integrate]
After integrating, the data can be accessed immediately and processed
further.
Note that you can specify the OpenMP scheduling by calling `omp_set_schedule`
at the beginning of your program:
[phase_chain_scheduling]
See [github_link examples/openmp/phase_chain.cpp
openmp/phase_chain.cpp] for the complete example.
[heading Split state]
[import ../examples/openmp/phase_chain_omp_state.cpp]
For advanced cases odeint offers another approach to using OpenMP that allows
for more precise control of the parallelization.
For example, for odd-sized data where OpenMP's thread boundaries don't match
cache lines and hurt performance, it might be advisable to copy the data from
the contiguous `vector<T>` into separate, individually aligned vectors.
For this, odeint provides the `openmp_state<T>` type, essentially an alias for
`vector<vector<T>>`.
Here, the initialization is done with a `vector<double>`, but then we use
odeint's `split` function to fill an `openmp_state`.
The splitting is done such that the sizes of the individual regions differ at
most by 1 to make the computation as uniform as possible.
[phase_chain_state_init]
Of course, the system function has to be changed to deal with the
`openmp_state`.
Note that each sub-region of the state is computed in a single task, but at the
borders read access to the neighbouring regions is required.
[phase_chain_state_rhs]
Using the `openmp_state<T>` state type automatically selects `openmp_algebra`,
which executes odeint's internal computations in parallel regions.
Hence, no manual configuration of the stepper is necessary.
At the end of the integration, we use `unsplit` to concatenate the sub-regions
back together into a single vector.
[phase_chain_state_integrate]
[note You don't actually need to use `openmp_state<T>` for advanced use cases:
`openmp_algebra` is simply an alias for `openmp_nested_algebra<range_algebra>`,
which supports any model of Random Access Range as the outer, parallel state type
and will use the given algebra on its elements.]
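For illustration, here is a minimal sketch (not taken from the example files) of
how a stepper could be configured explicitly with `openmp_nested_algebra`,
assuming a plain `vector< vector<double> >` as the outer state instead of
`openmp_state<double>`:
```
#include <vector>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using namespace boost::numeric::odeint;

// outer state: any Random Access Range of inner states
typedef std::vector< std::vector<double> > state_type;

// explicitly select the nested algebra; with openmp_state<double> the
// equivalent openmp_algebra would be chosen automatically
typedef runge_kutta4<
    state_type , double , state_type , double ,
    openmp_nested_algebra< range_algebra >
> stepper_type;
```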
See [github_link examples/openmp/phase_chain_omp_state.cpp
openmp/phase_chain_omp_state.cpp] for the complete example.
[endsect]
[section MPI]
[import ../examples/mpi/phase_chain.cpp]
To expand the parallel computation across multiple machines we can use MPI.
The system function implementation is similar to the OpenMP variant with split
data. The main difference is that while OpenMP uses a spawn/join model in which
everything not explicitly parallelized is executed only in the main thread, in
MPI's model each node enters `main()` independently, diverging based on its
rank and synchronizing through message passing and explicit barriers.
odeint's MPI support is implemented as an external backend, too.
Depending on the MPI implementation, the code might need to be compiled with
e.g. [^mpic++].
[phase_chain_mpi_header]
Instead of reading another thread's data, we asynchronously send and receive the
relevant data from neighbouring nodes, performing some computation in the interim
to hide the latency.
[phase_chain_rhs]
Analogous to `openmp_state<T>` we use `mpi_state< InnerState<T> >`, which
automatically selects `mpi_nested_algebra` and the appropriate MPI-oblivious
inner algebra (since our inner state is a `vector`, the inner algebra will be
`range_algebra` as in the OpenMP example).
[phase_chain_state]
In the main program we construct a `communicator`, which tells us the `size` of
the cluster and the current node's `rank` within it.
We generate the input data on the master node only, avoiding unnecessary work on
the other nodes.
Instead of simply copying chunks, `split` acts as an MPI collective function
here and sends/receives the regions from the master to each slave.
The input argument is ignored on the slaves, but the master node receives
a region in its output and will participate in the computation.
[phase_chain_mpi_init]
Now that `x_split` contains (only) the local chunk for each node, we start the
integration.
To print the result on the master node, we send the processed data back using
`unsplit`.
[phase_chain_mpi_integrate]
[note `mpi_nested_algebra::for_each`[~N] doesn't use any MPI constructs; it
simply calls the inner algebra on the local chunk. The system function is not
guarded by any barriers either, so if you don't place any manually (for example
in parameter studies where the elements are completely independent) you might
see the nodes diverge, returning from this call at different times.]
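If you want the nodes to stay in lock-step, you can place a barrier yourself,
for example at the end of the system function. A minimal sketch, assuming the
system function object holds a copy of the communicator (the member name `world`
is chosen here purely for illustration):
```
#include <boost/mpi.hpp>

struct synchronized_system
{
    boost::mpi::communicator world;  // hypothetical member holding the communicator

    template< class State , class Deriv >
    void operator()( const State &x , Deriv &dxdt , double t ) const
    {
        // ... compute dxdt from x as in the example ...

        // explicit synchronization point: all nodes wait here before returning
        world.barrier();
    }
};
```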
See [github_link examples/mpi/phase_chain.cpp
mpi/phase_chain.cpp] for the complete example.
[endsect]
[section Concepts]
[section MPI State]
As used by `mpi_nested_algebra`.
[heading Notation]
[variablelist
[[`InnerState`] [The inner state type]]
[[`State`] [The MPI-state type]]
[[`state`] [Object of type `State`]]
[[`world`] [Object of type `boost::mpi::communicator`]]
]
[heading Valid Expressions]
[table
[[Name] [Expression] [Type] [Semantics]]
[[Construct a state with a communicator]
[`State(world)`] [`State`] [Constructs the State.]]
[[Construct a state with the default communicator]
[`State()`] [`State`] [Constructs the State.]]
[[Get the current node's inner state]
[`state()`] [`InnerState`] [Returns a (const) reference.]]
[[Get the communicator]
[`state.world`] [`boost::mpi::communicator`] [See __boost_mpi.]]
]
[heading Models]
* `mpi_state<InnerState>`
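A minimal usage sketch of this concept (variable names chosen for illustration
only):
```
#include <vector>
#include <boost/mpi.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>

using namespace boost::numeric::odeint;

int main( int argc , char **argv )
{
    boost::mpi::environment env( argc , argv );
    boost::mpi::communicator world;

    // construct the MPI state from the communicator
    mpi_state< std::vector<double> > state( world );

    // state() yields this node's inner state, state.world the communicator
    state().resize( 16 );
    int rank = state.world.rank();
    (void) rank;

    return 0;
}
```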
[endsect]
[section OpenMP Split State]
As used by `openmp_nested_algebra`, essentially a Random Access Container with
`ValueType = InnerState`.
[heading Notation]
[variablelist
[[`InnerState`] [The inner state type]]
[[`State`] [The split state type]]
[[`state`] [Object of type `State`]]
]
[heading Valid Expressions]
[table
[[Name] [Expression] [Type] [Semantics]]
[[Construct a state for `n` chunks]
[`State(n)`] [`State`] [Constructs underlying `vector`.]]
[[Get a chunk]
[`state[i]`] [`InnerState`] [Accesses underlying `vector`.]]
[[Get the number of chunks]
[`state.size()`] [`size_type`] [Returns size of underlying `vector`.]]
]
[heading Models]
* `openmp_state<ValueType>` with `InnerState = vector<ValueType>`
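A minimal usage sketch of this concept:
```
#include <cstddef>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

using namespace boost::numeric::odeint;

int main()
{
    // a split state consisting of 4 chunks, each a vector<double>
    openmp_state<double> state( 4 );

    state[0].resize( 25 );              // access an individual chunk
    std::size_t chunks = state.size();  // number of chunks
    (void) chunks;

    return 0;
}
```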
[endsect]
[section Splitter]
[heading Notation]
[variablelist
[[`Container1`] [The contiguous-data container type]]
[[`x`] [Object of type `Container1`]]
[[`Container2`] [The chunked-data container type]]
[[`y`] [Object of type `Container2`]]
]
[heading Valid Expressions]
[table
[[Name] [Expression] [Type] [Semantics]]
[[Copy chunks of input to output elements]
[`split(x, y)`] [`void`]
[Calls `split_impl<Container1, Container2>::split(x, y)`, splits `x` into
`y.size()` chunks.]]
[[Join chunks of input elements to output]
[`unsplit(y, x)`] [`void`]
[Calls `unsplit_impl<Container2, Container1>::unsplit(y, x)`, assumes `x`
is of the correct size ['__sigma `y[i].size()`], does not resize `x`.]]
]
[heading Models]
* defined for `Container1` = __boost_range and `Container2 = openmp_state`
* and `Container2 = mpi_state`.
To implement splitters for containers incompatible with __boost_range,
specialize the `split_impl` and `unsplit_impl` types:
```
template< class Container1 , class Container2 , class Enabler = void >
struct split_impl {
    static void split( const Container1 &from , Container2 &to );
};

template< class Container2 , class Container1 , class Enabler = void >
struct unsplit_impl {
    static void unsplit( const Container2 &from , Container1 &to );
};
```
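As a sketch, specializations for a hypothetical chunked container
`my_chunked_data<T>` (not part of odeint) could look as follows; the actual
copying logic depends on that container's interface:
```
namespace boost { namespace numeric { namespace odeint {

template< class T >
struct split_impl< std::vector<T> , my_chunked_data<T> >
{
    static void split( const std::vector<T> &from , my_chunked_data<T> &to )
    {
        // distribute `from` into to.size() chunks of (almost) equal size
    }
};

template< class T >
struct unsplit_impl< my_chunked_data<T> , std::vector<T> >
{
    static void unsplit( const my_chunked_data<T> &from , std::vector<T> &to )
    {
        // concatenate the chunks of `from` back into `to` (which is not resized)
    }
};

} } }
```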
[endsect]
[endsect]
[endsect]