Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the working draft N4659 for the next version of the C++ standard, commonly called C++17. The implementation also supports the unsequenced execution policy specified in the ISO* C++ working group paper P0076R3.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms for Intel® processors. For sequential execution, it relies on an available implementation of the C++ standard library.

Parallel STL is available as a part of Intel® Parallel Studio XE and Intel® System Studio.

Prerequisites

To use Parallel STL, you must have the following software installed:

For better performance of Parallel STL algorithms, the latest version of the Intel® C++ Compiler is recommended over previous compiler versions.

To build an application that uses Parallel STL on the command line, you need to set the environment variables for compilation and linkage. You can do this by calling suite-level environment scripts such as compilervars.{sh|csh|bat}, or you can set just the Parallel STL environment variables by running pstlvars.{sh|csh|bat} in <install_dir>/{linux|mac|windows}/pstl/bin.

<install_dir> is the installation directory. By default, it is:

For Linux* and macOS*:

For Windows*:

Using Parallel STL

Follow these steps to add Parallel STL to your application:

  1. Add the <install_dir>/pstl/include folder to the compiler include paths. You can do this by calling the pstlvars script.

  2. Add #include "pstl/execution" to your code. Then add whichever of the following lines you need, depending on the algorithms you intend to use:

    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"
  3. When using algorithms and execution policies, specify the namespaces std and std::execution, respectively. See the 'Examples' section below.
  4. For any of the implemented algorithms, pass one of the values seq, unseq, par or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meaning:

    Execution policy    Meaning
    seq                 Sequential execution.
    unseq               Try to use SIMD. This policy requires that all functions provided are SIMD-safe.
    par                 Use multithreading.
    par_unseq           Combined effect of unseq and par.

  5. Compile the code as C++11 (or later) and enable vectorization with the following compiler options:

    • For the Intel® C++ Compiler:
      • For Linux* and macOS*: -qopenmp-simd or -qopenmp
      • For Windows*: /Qopenmp-simd or /Qopenmp
    • For other compilers, find a switch that enables OpenMP* 4.0 SIMD constructs.

    To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are:

    • For Linux* and macOS*: -xHOST, -xSSE4.1, -xCORE-AVX2, -xMIC-AVX512.
    • For Windows*: /QxHOST, /QxSSE4.1, /QxCORE-AVX2, /QxMIC-AVX512.
    If using a different compiler, see its documentation.

  6. Link with the Intel TBB dynamic library for parallelism. For the Intel C++ Compiler, use the options:

    • For Linux* and macOS*: -tbb
    • For Windows*: /Qtbb (optional, this should be handled by #pragma comment(lib, <libname>))
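
Steps 2 to 4 can be sketched as follows. This is a minimal illustration rather than code from the Parallel STL distribution: the function name scale_in_place, the SKETCH_HAS_PSTL macro and the __has_include guard are assumptions added here so that the sketch also builds, sequentially, on a system where the "pstl/..." headers are not installed.

```cpp
#include <vector>

// Assumption: this guard is not part of Parallel STL. It only lets the sketch
// fall back to the plain standard algorithm when the headers are missing.
#ifdef __has_include
#  if __has_include("pstl/execution") && __has_include("pstl/algorithm")
#    include "pstl/execution"   // step 2: execution policies
#    include "pstl/algorithm"   // step 2: policy-aware algorithms
#    define SKETCH_HAS_PSTL 1
#  endif
#endif
#ifndef SKETCH_HAS_PSTL
#  include <algorithm>
#  define SKETCH_HAS_PSTL 0
#endif

// Multiplies every element of v by factor (hypothetical helper).
// The lambda touches only its own element, takes no locks and does not throw,
// so it is reasonable to treat it as SIMD-safe for the unseq policy.
inline void scale_in_place(std::vector<float>& v, float factor) {
    auto scale = [factor](float x) { return x * factor; };
#if SKETCH_HAS_PSTL
    // Steps 3 and 4: algorithm from std, policy from std::execution,
    // policy passed as the first argument.
    std::transform(std::execution::unseq, v.begin(), v.end(), v.begin(), scale);
#else
    std::transform(v.begin(), v.end(), v.begin(), scale);
#endif
}
```

Switching to another policy only changes the first argument of the call; the compile and link options from steps 5 and 6 still apply.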

Macros

PSTL_USE_PARALLEL_POLICIES

This macro controls the use of parallel policies.

When set to 0, it disables the par and par_unseq policies, making their use a compilation error. This setting is recommended for code that uses only vectorization with the unseq policy, because it avoids a dependency on the Intel® TBB runtime library.

When the macro is not defined (default) or evaluates to a non-zero value, all execution policies are enabled.

PSTL_USE_NONTEMPORAL_STORES

This macro enables the use of #pragma vector nontemporal in the algorithms std::copy, std::copy_n, std::fill, std::fill_n, std::generate, std::generate_n with the unseq policy. For further details about the pragma, see the User and Reference Guide for the Intel® C++ Compiler at https://software.intel.com/en-us/node/524559.

If the macro evaluates to a non-zero value, the use of #pragma vector nontemporal is enabled.

When the macro is not defined (default) or set to 0, the pragma is not used.
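
The two macros could be combined as in the sketch below. It assumes that the macros take effect when defined before the first Parallel STL header is included (defining them on the compiler command line should be equivalent); the text above does not state where they must be set. The helper make_filled, the SKETCH_HAS_PSTL macro and the __has_include guard are hypothetical names added for illustration.

```cpp
// Assumption: both macros must be visible before the first Parallel STL header.
#define PSTL_USE_PARALLEL_POLICIES 0   // par and par_unseq become compile errors
#define PSTL_USE_NONTEMPORAL_STORES 1  // allow #pragma vector nontemporal with unseq

#include <cstddef>
#include <vector>

// Assumption: this guard is not part of Parallel STL. It only lets the sketch
// fall back to the plain standard algorithm when the headers are missing.
#ifdef __has_include
#  if __has_include("pstl/execution") && __has_include("pstl/algorithm")
#    include "pstl/execution"
#    include "pstl/algorithm"
#    define SKETCH_HAS_PSTL 1
#  endif
#endif
#ifndef SKETCH_HAS_PSTL
#  include <algorithm>
#  define SKETCH_HAS_PSTL 0
#endif

// Returns a buffer of n elements, each set to value (hypothetical helper).
// std::fill with unseq is one of the calls the nontemporal macro applies to.
inline std::vector<float> make_filled(std::size_t n, float value) {
    std::vector<float> out(n);
#if SKETCH_HAS_PSTL
    std::fill(std::execution::unseq, out.begin(), out.end(), value);
    // A call with std::execution::par would not compile in this file,
    // because PSTL_USE_PARALLEL_POLICIES is 0.
#else
    std::fill(out.begin(), out.end(), value);
#endif
    return out;
}
```

Whether nontemporal stores help depends on the workload; they are generally aimed at large buffers that are written once and not read back soon.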

Examples

Example 1

The following code calls vectorized copy:

#include "pstl/execution"
#include "pstl/algorithm"

// Copies n floats from a to b; unseq requests a vectorized (SIMD) copy.
void foo(float* a, float* b, int n) {
    std::copy(std::execution::unseq, a, a+n, b);
}

Example 2

This example calls the parallelized version of fill_n:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

int main()
{
    std::vector<int> data(10000000);
    std::fill_n(std::execution::par_unseq, data.begin(), data.size(), -1);  // Fill the vector with -1

    return 0;
}
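
A further sketch, not taken from the original examples, shows an algorithm from "pstl/numeric". It computes a dot product with transform_reduce under the par_unseq policy. The function name dot, the SKETCH_HAS_PSTL macro and the __has_include guard are assumptions; where the Parallel STL headers are not installed, the sketch falls back to the sequential std::inner_product, which yields the same value for this input.

```cpp
#include <vector>

// Assumption: this guard is not part of Parallel STL. It only lets the sketch
// fall back to a sequential standard algorithm when the headers are missing.
#ifdef __has_include
#  if __has_include("pstl/execution") && __has_include("pstl/numeric")
#    include "pstl/execution"
#    include "pstl/numeric"
#    define SKETCH_HAS_PSTL 1
#  endif
#endif
#ifndef SKETCH_HAS_PSTL
#  include <numeric>
#  define SKETCH_HAS_PSTL 0
#endif

// Dot product of two equally sized vectors (hypothetical helper).
inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
#if SKETCH_HAS_PSTL
    // Multiplies element pairs and sums the products; the policy allows the
    // work to be split across threads and vectorized.
    return std::transform_reduce(std::execution::par_unseq,
                                 a.begin(), a.end(), b.begin(), 0.0);
#else
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
#endif
}
```

Because a parallel reduction may add the products in a different order than a sequential loop, floating-point results can differ in the last bits for general inputs.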

Implemented Algorithms

Parallel STL supports all of the aforementioned execution policies only for the algorithms listed in the following table. Passing a policy argument to any other C++ standard library algorithm results in sequential execution.

Algorithm                           Algorithm page at cppreference.com
adjacent_find                       http://en.cppreference.com/w/cpp/algorithm/adjacent_find
all_of                              http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
any_of                              http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
copy                                http://en.cppreference.com/w/cpp/algorithm/copy
copy_if                             http://en.cppreference.com/w/cpp/algorithm/copy
copy_n                              http://en.cppreference.com/w/cpp/algorithm/copy_n
count                               http://en.cppreference.com/w/cpp/algorithm/count
count_if                            http://en.cppreference.com/w/cpp/algorithm/count
destroy                             http://en.cppreference.com/w/cpp/memory/destroy
destroy_n                           http://en.cppreference.com/w/cpp/memory/destroy_n
equal                               http://en.cppreference.com/w/cpp/algorithm/equal
exclusive_scan                      http://en.cppreference.com/w/cpp/algorithm/exclusive_scan
fill                                http://en.cppreference.com/w/cpp/algorithm/fill
fill_n                              http://en.cppreference.com/w/cpp/algorithm/fill_n
find                                http://en.cppreference.com/w/cpp/algorithm/find
find_if                             http://en.cppreference.com/w/cpp/algorithm/find
find_if_not                         http://en.cppreference.com/w/cpp/algorithm/find
for_each                            http://en.cppreference.com/w/cpp/algorithm/for_each
for_each_n                          http://en.cppreference.com/w/cpp/algorithm/for_each_n
generate                            http://en.cppreference.com/w/cpp/algorithm/generate
generate_n                          http://en.cppreference.com/w/cpp/algorithm/generate_n
inclusive_scan                      http://en.cppreference.com/w/cpp/algorithm/inclusive_scan
is_sorted                           http://en.cppreference.com/w/cpp/algorithm/is_sorted
is_sorted_until                     http://en.cppreference.com/w/cpp/algorithm/is_sorted_until
none_of                             http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
reduce                              http://en.cppreference.com/w/cpp/algorithm/reduce
remove_copy                         http://en.cppreference.com/w/cpp/algorithm/remove_copy
remove_copy_if                      http://en.cppreference.com/w/cpp/algorithm/remove_copy
sort                                http://en.cppreference.com/w/cpp/algorithm/sort
stable_sort                         http://en.cppreference.com/w/cpp/algorithm/stable_sort
transform                           http://en.cppreference.com/w/cpp/algorithm/transform
transform_exclusive_scan            http://en.cppreference.com/w/cpp/algorithm/transform_exclusive_scan
transform_inclusive_scan            http://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
transform_reduce                    http://en.cppreference.com/w/cpp/algorithm/transform_reduce
uninitialized_copy                  http://en.cppreference.com/w/cpp/memory/uninitialized_copy
uninitialized_copy_n                http://en.cppreference.com/w/cpp/memory/uninitialized_copy_n
uninitialized_default_construct     http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct
uninitialized_default_construct_n   http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct_n
uninitialized_fill                  http://en.cppreference.com/w/cpp/memory/uninitialized_fill
uninitialized_fill_n                http://en.cppreference.com/w/cpp/memory/uninitialized_fill_n
uninitialized_move                  http://en.cppreference.com/w/cpp/memory/uninitialized_move
uninitialized_move_n                http://en.cppreference.com/w/cpp/memory/uninitialized_move_n
uninitialized_value_construct       http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct
uninitialized_value_construct_n     http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct_n
unique_copy                         http://en.cppreference.com/w/cpp/algorithm/unique_copy

Known limitations

Parallel and vectorized execution is supported only for a subset of the algorithms listed above, and only when random access iterators are provided. In all other cases, execution remains serial.
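
The iterator requirement can be illustrated with the sketch below, which calls the same algorithm on a std::vector (random access iterators) and on a std::list (bidirectional iterators). Both calls are valid and return the same count; according to the limitation above, only the std::vector call is a candidate for parallel or vectorized execution. The helper count_negative, the SKETCH_HAS_PSTL macro and the __has_include guard are assumptions added for illustration.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Assumption: this guard is not part of Parallel STL. It only lets the sketch
// fall back to the plain standard algorithm when the headers are missing.
#ifdef __has_include
#  if __has_include("pstl/execution") && __has_include("pstl/algorithm")
#    include "pstl/execution"
#    include "pstl/algorithm"
#    define SKETCH_HAS_PSTL 1
#  endif
#endif
#ifndef SKETCH_HAS_PSTL
#  include <algorithm>
#  define SKETCH_HAS_PSTL 0
#endif

// Counts the negative values in any container of int (hypothetical helper).
template <typename Container>
std::ptrdiff_t count_negative(const Container& c) {
    auto is_negative = [](int x) { return x < 0; };
#if SKETCH_HAS_PSTL
    // The policy is a request: with iterators that are not random access,
    // the call is still correct but is expected to run serially.
    return std::count_if(std::execution::par_unseq, c.begin(), c.end(), is_negative);
#else
    return std::count_if(c.begin(), c.end(), is_negative);
#endif
}
```

If the parallel or vectorized path matters for performance, copying the data into a contiguous container such as std::vector before calling the algorithm is one option to consider.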