The first real difference between scalar code and SIMD code relates to conditional execution of calculations. C++ is, by design, a language heavily supporting scalar operations. Apart from the arithmetic statements, the language offers also control flow statements, such as: **if-then-else**, **switch** and **goto**. All these can be mixed with UME::SIMD code, but there are few restrictions imposed by the language. We will only discuss here the ‘if-else’ block, but the same considerations apply for other conditional statmeents.

## Scalar conditional statements

A simple scalar conditional statement would look like this:

int x; int a = 2; int b = 3; int c; ... if(x) { c = a + b; } else { c = a - b; }

Now depending on the value of `x`

, the variable `c`

will take either value of ‘5’ or ‘-1’. To understand which of two branches (` if`

branch or `else`

branch) will be selected, we have to look at the *predicate expression* in the `if`

clause:

if(<predicate expression>)

According to C++ rules, the type of the predicate expression, needs to be converted into a boolean. That is, if the expression is using only `bool`

variables, then the result will be prety straightforward to predict. A different case is when we use expressions that evaluate to integers (in our case `x`

is an expression that evaluates to an integer!). In such case we need to know how the C++ converts this integer expression to a boolean expression. The rule to do so is called *implicit conversion* rule, and is defined within the language standard. The rule goes like this:

Zero value, null pointer value, or null member pointer value is converted to false; any other value is converted to true.

In our particular case the same predicate expression will be therefore interpreted as a boolean predicate (or a boolean expression):

bool predicateValue = (x!=0) if(predicateValue)

Understanding this relation is critical in understanding why we can’t use vector types directly within `if()`

or `while()`

statements.

## Vector conditional statements

Let’s visualise what we will be talking about in this part using a piece of code:

UME::SIMD::SIMDVec<int,2> x; UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; ... if(x) { // Error: What does this mean?! c = a + b; } else { c = a - b; }

In the above code we are running into a very interesting problem: how to convert a vector `x`

into a boolean value? Unfortunately, since C++ is not aware of the vector types, it cannot make any assumptions as for the implicit conversions to boolean expression. More of that: you might want to express different operations by this piece of code. Do you want to execute each block for specific vector elements that fulfill the predicate:

UME::SIMD::SIMDVec<int,2> x; UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; ... for(int i = 0; i < 2; i++) { if(x[i] != 0) { c[i] = a[i] + b[i]; } else { c[i] = a[i] - b[i]; } }

or perhaps you want to execute whole block only if all values in the vector fulfill certain condition:

UME::SIMD::SIMDVec<int,2> x; UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; ... bool predicate = (x[0] != 0) && (x[1] != 0) if(predicate) { c = a + b; } else { c = a - b; }

or maybe you want to execute the blocks if any of the elements meets the predicate criteria:

UME::SIMD::SIMDVec<int,2> x; UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; ... bool predicate = (x[0] != 0) || (x[1] != 0) if(predicate) { c = a + b; } else { c = a - b; }

As you can see, the original language is too ambiguous to let you simpy use the same syntax. At the same time, a library such as UME::SIMD does not have capabilities of a compiler, allowing it to look forward into the code and make a decision based on forward analysis. We therefore need more straightforward, yet unambiguous mechanisms to express our intent.

## Mask types

UME::SIMD uses a concept of *data flow* apart from the already well established *control flow* to steer the way the actual numerical values are being passed through the program. The data flow concept doeasn’t require any conditional statements. It rather uses a concept of an *execution mask*. Mask is a vector consisting of only boolean values. A ‘true’ value means that specific vector lanes should execute operation, while a ‘false’ value means that operation should be blocked, that is the result should not be written to the output.

The library exposes a category of types named *SIMDVecMask* for date-flow control. We will now go through basic operations that are allowed with these types.

### Mask initialization using scalars

A basic declaration of looks similar to a vector declaration:

// Broadcast single scalar value UME::SIMD::SIMDVecMask<4> mask0(true); // Initialize using scalars UME::SIMD::SIMDVecMask<4> mask1(true, false, false, true); ... bool maskValues[4] = {true, false, false, true}; // Initialize using an array UME::SIMD::SIMDVecMask<4> mask2(&maskValues[0]);

### Mask initialization using predicate operations

A mask can be also obtained using one of the predicate operations applied on arithmetic vectors:

UME::SIMD::SIMDVec<float, 8> vec1, vec2; // 'CoMPare EQual' UME::SIMD::SIMDVecMask<8> m0=vec1.cmpeq(vec2); // or 'm0=vec1==vec2' // 'CoMPare Not EQual' UME::SIMD::SIMDVecMask<8> m1=vec1.cmpneq(vec2); // or 'm1=vec1!=vec2' // 'CoMPare Greater Than' UME::SIMD::SIMDVecMask<8> m2=vec1.cmpgt(vec2); // or 'm2=vec1>vec2' // 'CoMPare Greater than or Equal' UME::SIMD::SIMDVecMask<8> m3=vec1.cmpge(vec2); // or 'm3=vec1>=vec2' // 'CoMPare Less Than' UME::SIMD::SIMDVecMask<8> m4=vec1.cmplt(vec2); // or 'm4=vec1<vec2' // 'CoMPare Less than or Equal' UME::SIMD::SIMDVecMask<8> m5=vec1.cmple(vec2); // or 'm5=vec1<=vec2'

For each predicate a specific comparison operation is applied in an elementwise manner for elements of vectors `vec1`

and `vec2`

.

### Mask arithmetics

Masks can be also operated bit there are slightly different operations than for arithmetic vectors. This is due to the fact, that for portability reasons. It is not possible to mix arithmetic vectors, and masks, except for few special situations. Code below shows most of the operations allowed on masks, and their interaction with arithmetic types:

// Some vertical operations: UME::SIMD::SIMDVecMask<8> m0, m1, m2; .... m0 = m1.land(m2); m0 = m1 && m2; // the same as above m0 = m1.lor(m2); m0 = m1 || m2; // the same as above m0 = m1.lxor(m2); m0 = m1 ^ m2; // the same as above m0 = m1.landnot(m2); // equivalent to: !m1 && m2 // Also some horizontal (reduction) operations bool b0; ... b0 = m1.hland() // m1[0] && m1[1] && ... && m1[7] b0 = m1.hlor() // m1[0] || m1[1] && ... && m1[7]

### Using masks

Once the mask is calculated, it can be passed as an optional parameter to most of the arithmetic operations:

UME::SIMD::SIMDVec<float, 4> vec0, vec1, vec2, vec3; UME::SIMD::SIMDVecMask<4> m0; ... m0 = vec0 > vec1; // Add only pairs where m0 is true vec2 = vec0.add(m0, vec1); // Similar expressed using operator syntax vec2[m0] = vec0 + vec1 // 'Blend' two vectors using masking. // 'vec2[i]' becomes 'vec1[i]' where 'mask[i]' // is true, and 'vec0[i]' where false. vec2 = vec0.blend(mask, vec1);

Now let’s look at specific examples we discussed before. **Scenario 1** would look like this in SIMD arithmetic:

UME::SIMD::SIMDVec<int,2> x(0, 10); UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; UME::SIMD::SIMDVec<int,2> t0, t1; UME::SIMD::SIMDVecMask<2> m0; ... m0 = x!=0; // (false, true) t0 = a+b; // (5, 2) t1 = a-b; // (-1, 8) c = t1.blend(m0, t0); // (-1, 2) // Or briefly: c=(a-b).blend(x!=0, a+b); // (-1, 2)

**PERFORAMNCE PITFALL**: when executing scalar code, only one branch of code gets executed for every element of a vector. The decision about which branch to execute is made on a per-iteration basis. This allows the microprocessor to completely skip certain parts of the code.

In SIMD world, an arithmetic operation happens atomicaly for all elements of the vector, and it is not possible to ‘skip’ the execution. As a result we have to execute both `if`

and ‘else’ statements in every iteration. For long ‘if-else’ blocks this might mean executing large number of additional operations (as compared to scalar code), and therefore might decrease the expected vectorization gains.

The second example is the execution of the variant with ‘if all true’. Mind that in this case, we might still use the same ‘if-else’ statement:

UME::SIMD::SIMDVec<int,2> x; UME::SIMD::SIMDVec<int,2> a(2, 5); UME::SIMD::SIMDVec<int,2> b(3, -3); UME::SIMD::SIMDVec<int,2> c; ... // The original code was using following formula: // bool predicate = (x[0] != 0) && (x[1] != 0) // We first declare the mask, and perform a proper // comparison for values of 'x': UME::SIMD::SIMDVecMask<4> m0 = x.cmpneq(0); // Now we have to perform a 'Horizontal Logical AND' // (reduction) operation: bool predicate = m0.hland(); // At this moment we already know that // we have to execute either 'if' or 'else' block // for all elements of a vector. if(predicate) { c = a + b; } else { c = a - b; }

The same format can be used for **Scenario #3** but using ‘Horizontal Logical OR’ operation. In general case it is up to you, to decide which construct is more usefull in a given situation.

**PERFORMANCE PITFALL** Using reduction operations is expensive on it’s own, so it should be avoided. While we might be able to decrease the number of instructions executed in scenarios #2 and #3, the cost of reduction together with the cost of handling ‘if-else’ statement might be actually higher than execution of equivalent scalar code.

## Summary

This tutorial discussed basic usage of *masks*. While *control-flow* still remains an important part of UME::SIMD based programs, the SIMD programming model requires also *data flow* to open performance opportunities to user programs.

Masks can be treated as vectors of boolean values, and obtained either by explicit initialization or by calling a specific comparison operation engaging arithmetic vectors. Then they can be used to block the flow of data within a UME::SIMD program.