| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| |
| <html> |
| <head> |
| <meta http-equiv="Content-Language" content="en-us"> |
| <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> |
| <meta name="GENERATOR" content="Microsoft FrontPage 6.0"> |
| <meta name="ProgId" content="FrontPage.Editor.Document"> |
| <link rel="stylesheet" type="text/css" href="../../../boost.css"> |
| |
| <title>The Boost Statechart Library - Performance</title> |
| </head> |
| |
| <body link="#0000FF" vlink="#800080"> |
| <table border="0" cellpadding="7" cellspacing="0" width="100%" summary= |
| "header"> |
| <tr> |
| <td valign="top" width="300"> |
| <h3><a href="../../../index.htm"><img alt="C++ Boost" src= |
| "../../../boost.png" border="0" width="277" height="86"></a></h3> |
| </td> |
| |
| <td valign="top"> |
| <h1 align="center">The Boost Statechart Library</h1> |
| |
| <h2 align="center">Performance</h2> |
| </td> |
| </tr> |
| </table> |
| <hr> |
| |
| <dl class="index"> |
| <dt><a href="#SpeedVersusScalabilityTradeoffs">Speed versus scalability |
| tradeoffs</a></dt> |
| |
| <dt><a href="#MemoryManagementCustomization">Memory management |
| customization</a></dt> |
| |
| <dt><a href="#RttiCustomization">RTTI customization</a></dt> |
| |
| <dt><a href="#ResourceUsage">Resource usage</a></dt> |
| </dl> |
| |
| <h2><a name="SpeedVersusScalabilityTradeoffs" id= |
| "SpeedVersusScalabilityTradeoffs">Speed versus scalability |
| tradeoffs</a></h2> |
| |
| <p>Quite a bit of effort has gone into making the library fast for small, |
| simple machines <b>and</b> scalable at the same time (this applies only to |
| <code>state_machine<></code>; there is still some room for optimizing |
| <code>fifo_scheduler<></code>, especially for multi-threaded builds). |
| While I believe it should perform reasonably in most applications, the |
| scalability does not come for free. Small, carefully handcrafted state |
| machines will thus easily outperform equivalent Boost.Statechart machines. |
| To get a picture of how big the gap is, I implemented a simple benchmark in |
| the BitMachine example. The Handcrafted example is a handcrafted variant of |
| the 1-bit-BitMachine implementing the same benchmark.</p> |
| |
| <p>I tried to create a fair but somewhat unrealistic <b>worst-case</b> |
| scenario:</p> |
| |
| <ul> |
| <li>For both machines, exactly one object of the only event type is |
| allocated before starting the test. This same object is then sent to the |
| machines over and over</li> |
| |
| <li>The Handcrafted machine employs GOF-visitor double dispatch. The |
| states are preallocated so that event dispatch &amp; transition amounts |
| to nothing more than two virtual calls and one pointer assignment</li> |
| </ul> |
| |
| <p>The benchmarks, running on a 3.2GHz Intel Pentium 4, produced the |
| following dispatch and transition times per event:</p> |
| |
| <ul> |
| <li>Handcrafted: |
| |
| <ul> |
| <li>MSVC7.1: 10ns</li> |
| |
| <li>GCC3.4.2: 20ns</li> |
| |
| <li>Intel9.0: 20ns</li> |
| </ul> |
| </li> |
| |
| <li>1-bit-BitMachine with customized memory management: |
| |
| <ul> |
| <li>MSVC7.1: 150ns</li> |
| |
| <li>GCC3.4.2: 320ns</li> |
| |
| <li>Intel9.0: 170ns</li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p>Although this is a big difference, I still think it will not be |
| noticeable in most real-world applications. Whether an application uses |
| handcrafted or Boost.Statechart machines, it will...</p> |
| |
| <ul> |
| <li>almost never run into a situation where a state machine is swamped |
| with as many events as in the benchmarks. A state machine will almost |
| always spend a good deal of time waiting for events (which typically come |
| from a human operator, from machinery or from electronic devices over |
| often comparatively slow I/O channels). Parsers are just about the only |
| application of FSMs where this is not the case. However, parser FSMs are |
| usually not directly specified on the state machine level but on a higher |
| one that is better suited for the task. Examples of such higher levels |
| are: Boost.Regex, Boost.Spirit, XML Schemas, etc. Moreover, the nature of |
| parsers allows for a number of optimizations that are not possible in a |
| general-purpose FSM framework.<br> |
| Bottom line: While it is possible to implement a parser with this |
| library, it is almost never advisable to do so because other approaches |
| lead to better performing and more expressive code</li> |
| |
| <li>often run state machines in their own threads. This adds considerable |
| locking and thread-switching overhead. Performance tests with the |
| PingPong example, where two asynchronous state machines exchange events, |
| gave the following times to process one event and perform the resulting |
| in-state reaction (using the library with |
| <code>boost::fast_pool_allocator<></code>): |
| |
| <ul> |
| <li>Single-threaded (no locking and waiting): 840ns / 840ns</li> |
| |
| <li>Multi-threaded with one thread (the scheduler uses mutex locking |
| but never has to wait for events): 6500ns / 4800ns</li> |
| |
| <li>Multi-threaded with two threads (both schedulers use mutex |
| locking and exactly one always waits for an event): 14000ns / |
| 7000ns</li> |
| </ul> |
| |
| <p>As mentioned above, there definitely is some room to improve the |
| timings for the asynchronous machines. Moreover, these are very crude |
| benchmarks, designed to show the overhead of locking and thread context |
| switching. The overhead in a real-world application will typically be |
| smaller and other operating systems can certainly do better in this |
| area. However, I strongly believe that on most platforms the threading |
| overhead is usually larger than the time that Boost.Statechart spends |
| for event dispatch and transition. Handcrafted machines will inevitably |
| have the same overhead, making raw single-threaded dispatch and |
| transition speed much less important</p> |
| </li> |
| |
| <li>almost always allocate events with <code>new</code> and destroy them |
| after consumption. This will add a few cycles, even if event memory |
| management is customized</li> |
| |
| <li>often use state machines that employ orthogonal states and other |
| advanced features. This forces the handcrafted machines to use more |
| elaborate and thus more time-consuming book-keeping</li> |
| </ul> |
| |
| <p>Therefore, in real-world applications event dispatch and transition |
| does not normally constitute a bottleneck, and the relative gap between |
| handcrafted and Boost.Statechart machines also becomes much smaller than |
| in the worst-case scenario.</p> |
| |
| <h3>Detailed performance data</h3> |
| |
| <p>In an effort to identify the main performance bottleneck, the example |
| program "Performance" has been written. It measures the time that is spent |
| to process one event in different BitMachine variants. In contrast to the |
| BitMachine example, which uses only transitions, Performance uses a varying |
| number of in-state reactions together with transitions. The only difference |
| between in-state reactions and transitions is that the former neither enter |
| nor exit any states. Apart from that, the same amount of code needs to be |
| run to dispatch an event and execute the resulting action.</p> |
| |
| <p>The following diagrams show the average time the library spends to |
| process one event, depending on the percentage of in-state reactions |
| employed. 0% means that all triggered reactions are transitions. 100% means |
| that all triggered reactions are in-state reactions. I draw the following |
| conclusions from these measurements:</p> |
| |
| <ul> |
| <li>The fairly linear course of the curves suggests that the measurements |
| with a 100% in-state reaction ratio are accurate and not merely a product |
| of optimizations in the compiler. Such optimizations might have been |
| possible due to the fact that in the 100% case it is known at |
| compile-time that the current state will never change</li> |
| |
| <li>The data points with 100% in-state reaction ratio and speed-optimized |
| RTTI show that modern compilers seem to inline the complex-looking |
| dispatch code so aggressively that dispatch is reduced to what it |
| essentially is: one virtual function call followed by a linear search for |
| a suitable reaction. For instance, in the case of the 1-bit BitMachine, |
| Intel9.0 seems to produce dispatch code that is just as efficient as the |
| two virtual function calls in the Handcrafted machine</li> |
| |
| <li>On all compilers and in all variants the time spent in event dispatch |
| is dwarfed by the time spent to exit the current state and enter the |
| target state. It is worth noting that BitMachine is a flat and |
| non-orthogonal state machine, representing a close-to-worst case. |
| Real-world machines will often exit and enter multiple states during a |
| transition, which further dwarfs pure dispatch time. This makes the |
| implementation of constant-time dispatch (as requested by a few people |
| during formal review) an undertaking with little merit. Instead, the |
| future optimization effort will concentrate on state-entry and |
| state-exit</li> |
| |
| <li>Intel9.0 seems to have problems optimizing/inlining code as soon as |
| the amount of code grows beyond a certain threshold. Unlike with the |
| other two compilers, I needed to compile the tests for the 1, 2, 3 and |
| 4-bit BitMachine into separate executables to get good performance. Even |
| then, the performance of the 4-bit BitMachine was overly bad, and it was |
| much worse when I compiled all 4 tests into the same executable. This |
| surely looks like a bug in the compiler</li> |
| </ul> |
| |
| <h4>Out of the box</h4> |
| |
| <p>The library is used as is, without any optimizations/modifications.</p> |
| |
| <p><img alt="PerformanceNormal1" src="PerformanceNormal1.gif" border="0" |
| width="371" height="284"><img alt="PerformanceNormal2" src= |
| "PerformanceNormal2.gif" border="0" width="371" height="284"><img alt= |
| "PerformanceNormal3" src="PerformanceNormal3.gif" border="0" width="371" |
| height="284"><img alt="PerformanceNormal4" src="PerformanceNormal4.gif" |
| border="0" width="371" height="284"></p> |
| |
| <h4>Native RTTI</h4> |
| |
| <p>The library is used with <code><a href= |
| "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code> |
| defined.</p> |
| |
| <p><img alt="PerformanceNative1" src="PerformanceNative1.gif" border="0" |
| width="371" height="284"><img alt="PerformanceNative2" src= |
| "PerformanceNative2.gif" border="0" width="371" height="284"><img alt= |
| "PerformanceNative3" src="PerformanceNative3.gif" border="0" width="371" |
| height="284"><img alt="PerformanceNative4" src="PerformanceNative4.gif" |
| border="0" width="371" height="284"></p> |
| |
| <h4>Customized memory-management</h4> |
| |
| <p>The library is used with customized memory management |
| (<code>boost::fast_pool_allocator</code>).</p> |
| |
| <p><img alt="PerformanceCustom1" src="PerformanceCustom1.gif" border="0" |
| width="371" height="284"><img alt="PerformanceCustom2" src= |
| "PerformanceCustom2.gif" border="0" width="371" height="284"><img alt= |
| "PerformanceCustom3" src="PerformanceCustom3.gif" border="0" width="371" |
| height="284"><img alt="PerformanceCustom4" src="PerformanceCustom4.gif" |
| border="0" width="371" height="284"></p> |
| |
| <h3><a name="DoubleDispatch" id="DoubleDispatch">Double dispatch</a></h3> |
| |
| <p>At the heart of every state machine lies an implementation of double |
| dispatch. This is due to the fact that the incoming event <b>and</b> the |
| active state define exactly which <a href= |
| "definitions.html#Reaction">reaction</a> the state machine will produce. |
| For each event dispatch, one virtual call is followed by a linear search |
| for the appropriate reaction, using one RTTI comparison per reaction. The |
| following alternatives were considered but rejected:</p> |
| |
| <ul> |
| <li><a href= |
| "http://www.objectmentor.com/resources/articles/acv.pdf">Acyclic |
| visitor</a>: This double-dispatch variant satisfies all scalability |
| requirements but performs badly due to costly inheritance tree |
| cross-casts. Moreover, a state must store one v-pointer for <b>each</b> |
| reaction, which slows down construction and makes memory management |
| customization inefficient. In addition, C++ RTTI must inevitably be |
| turned on, with negative effects on executable size. Boost.Statechart |
| originally employed acyclic visitor and was about 4 times slower than it |
| is now (MSVC7.1 on Intel Pentium M). The dispatch speed might be better |
| on other platforms but the other negative effects will remain</li> |
| |
| <li><a href="http://en.wikipedia.org/wiki/Visitor_pattern">GOF |
| Visitor</a>: The GOF Visitor pattern inevitably makes the whole machine |
| depend upon all events. That is, whenever a new event is added there is |
| no way around recompiling the whole state machine. This is contrary to |
| the scalability requirements</li> |
| |
| <li>Single two-dimensional array of function pointers: To satisfy |
| requirement 6, it should be possible to spread a single state machine |
| over several translation units. This however means that the dispatch |
| table must be filled at runtime and the different translation units must |
| somehow make themselves "known", so that their part of the state machine |
| can be added to the table. There simply is no way to do this |
| automatically <b>and</b> portably. The only portable way that a state |
| machine distributed over several translation units could employ |
| table-based double dispatch relies on the user. The programmer(s) would |
| somehow have to <b>manually</b> tie together the various pieces of the |
| state machine. Not only does this scale badly, but it is also quite |
| error-prone</li> |
| </ul> |
| |
| <h2><a name="MemoryManagementCustomization" id= |
| "MemoryManagementCustomization">Memory management customization</a></h2> |
| |
| <p>Out of the box, everything (event objects, state objects, internal data, |
| etc.) is allocated through <code>std::allocator< void ></code> (the |
| default for the Allocator template parameter). This should be satisfactory |
| for applications meeting the following prerequisites:</p> |
| |
| <ul> |
| <li>There are no deterministic reaction time (hard real-time) |
| requirements</li> |
| |
| <li>The application will never run long enough for heap fragmentation to |
| become a problem. This is of course an issue for all long-running |
| programs, not only the ones employing this library. However, it should be |
| noted that fragmentation problems could show up earlier than with |
| traditional FSM frameworks</li> |
| </ul> |
| |
| <p>Should an application not meet these prerequisites, Boost.Statechart's |
| memory management can be customized as follows:</p> |
| |
| <ul> |
| <li>By passing a model of the standard Allocator concept to the class |
| templates that support a corresponding parameter |
| (<code>event<></code>, <code>state_machine<></code>, |
| <code>asynchronous_state_machine<></code>, |
| <code>fifo_scheduler<></code>, <code>fifo_worker<></code>). |
| This redirects all allocations to the passed custom allocator and should |
| satisfy the needs of just about any project</li> |
| |
| <li>Additionally, it is possible to <b>separately</b> customize |
| <b>state</b> memory management by overloading <code>operator new()</code> |
| and <code>operator delete()</code> for all state classes but this is |
| probably only useful under rare circumstances</li> |
| </ul> |
| |
| <h2><a name="RttiCustomization" id="RttiCustomization">RTTI |
| customization</a></h2> |
| |
| <p>RTTI is used for event dispatch and |
| <code>state_downcast<>()</code>. Currently, there are exactly two |
| options:</p> |
| |
| <ol> |
| <li>By default, a speed-optimized internal implementation is |
| employed</li> |
| |
| <li>The library can be instructed to use native C++ RTTI instead by |
| defining <code><a href= |
| "configuration.html#ApplicationDefinedMacros">BOOST_STATECHART_USE_NATIVE_RTTI</a></code></li> |
| </ol> |
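<p>For example, a project opting for native C++ RTTI would define the macro before including any library header. Since all translation units of a state machine must agree on the setting, it is typically defined project-wide:</p>

```cpp
// All translation units of the state machine must see the same setting,
// so this is usually put on the compiler command line or in a shared
// configuration header rather than in an individual source file.
#define BOOST_STATECHART_USE_NATIVE_RTTI
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
```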
| |
| <p>There are two reasons to favor option 2:</p> |
| |
| <ul> |
| <li>When a state machine (or parts of it) are compiled into a DLL, |
| problems could arise from the use of the internal RTTI mechanism (see the |
| FAQ item "<a href="faq.html#Dll">How can I compile a state machine into a |
| dynamic link library (DLL)?</a>"). Using option 2 is one way to work |
| around such problems (on some platforms, it seems to be the only |
| way)</li> |
| |
| <li>State and event objects need to store one pointer less, meaning that |
| in the best case the memory footprint of a state machine object could |
| shrink by 15% (an empty event is typically 30% smaller, which can be an |
| advantage when there are bursts of events rather than a steady flow). |
| However, on most platforms executable size grows when C++ RTTI is turned |
| on. So, given the small per machine object savings, this only makes sense |
| in applications where both of the following conditions hold: |
| |
| <ul> |
| <li>Event dispatch will never become a bottleneck</li> |
| |
| <li>There is a need to reduce the memory allocated at runtime (at the |
| cost of a larger executable)</li> |
| </ul> |
| |
| <p>Obvious candidates are embedded systems where the executable resides |
| in ROM. Other candidates are applications running a large number of |
| identical state machines where this measure could even reduce the |
| <b>overall</b> memory footprint</p> |
| </li> |
| </ul> |
| |
| <h2><a name="ResourceUsage" id="ResourceUsage">Resource usage</a></h2> |
| |
| <h3>Memory</h3> |
| |
| <p>On a 32-bit box, one empty active state typically needs less than 50 |
| bytes of memory. Even <b>very</b> complex machines will usually have |
| fewer than 20 simultaneously active states, so just about every machine |
| should run with less than one kilobyte of memory (not counting event |
| queues). |
| Obviously, the per-machine memory footprint is offset by whatever |
| state-local members the user adds.</p> |
| |
| <h3>Processor cycles</h3> |
| |
| <p>The following ranking should give a rough picture of what feature will |
| consume how many cycles:</p> |
| |
| <ol> |
| <li><code>state_cast<>()</code>: By far the most cycle-consuming |
| feature. Searches linearly for a suitable state, using one |
| <code>dynamic_cast</code> per visited state</li> |
| |
| <li>State entry and exit: Profiling of the fully optimized |
| 1-bit-BitMachine suggested that roughly three quarters of the total event |
| processing time is spent destructing the exited state and constructing |
| the entered state. Obviously, transitions where the <a href= |
| "definitions.html#InnermostCommonContext">innermost common context</a> is |
| "far" from the leaf states and/or with lots of orthogonal states can |
| easily cause the destruction and construction of quite a few states, |
| leading to significant amounts of time spent on a transition</li> |
| |
| <li><code>state_downcast<>()</code>: Searches linearly for the |
| requested state, using one virtual call and one RTTI comparison per |
| visited state</li> |
| |
| <li>Deep history: For all innermost states inside a state passing either |
| <code>has_deep_history</code> or <code>has_full_history</code> to its |
| state base class, a binary search through the (usually small) history map |
| must be performed on each exit. History slot allocation is performed |
| exactly once, at first exit</li> |
| |
| <li>Shallow history: For all direct inner states of a state passing |
| either <code>has_shallow_history</code> or <code>has_full_history</code> |
| to its state base class, a binary search through the (usually small) |
| history map must be performed on each exit. History slot allocation is |
| performed exactly once, at first exit</li> |
| |
| <li>Event dispatch: One virtual call followed by a linear search for a |
| suitable <a href="definitions.html#Reaction">reaction</a>, using one RTTI |
| comparison per visited reaction</li> |
| |
| <li>Orthogonal states: One additional virtual call for each exited state |
| <b>if</b> there is more than one active leaf state before a transition. |
| It should also be noted that the worst-case event dispatch time is |
| multiplied in the presence of orthogonal states. For example, if two |
| orthogonal leaf states are added to a given state configuration, the |
| worst-case time is tripled</li> |
| </ol> |
| <hr> |
| |
| <p><a href="http://validator.w3.org/check?uri=referer"><img border="0" src= |
| "../../../doc/images/valid-html401.png" alt="Valid HTML 4.01 Transitional" |
| height="31" width="88"></a></p> |
| |
| <p>Revised |
| <!--webbot bot="Timestamp" s-type="EDITED" s-format="%d %B, %Y" startspan -->03 December, 2006<!--webbot bot="Timestamp" endspan i-checksum="38512" --></p> |
| |
| <p><i>Copyright © 2003-<!--webbot bot="Timestamp" s-type="EDITED" s-format="%Y" startspan -->2006<!--webbot bot="Timestamp" endspan i-checksum="770" --> |
| <a href="contact.html">Andreas Huber Dönni</a></i></p> |
| |
| <p><i>Distributed under the Boost Software License, Version 1.0. (See |
| accompanying file <a href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</a> or |
| copy at <a href= |
| "http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</a>)</i></p> |
| </body> |
| </html> |