# Vectorizing with vDSP and vecLib

## Introduction

### Background

While finishing the work on my Octave Shifter Audio Unit I spent lots of hours using Apple's profiling tool called Shark. During the process I found out that a lot of the code could be optimized using vector operations instead of so-called scalar code.

Vector operations work on large sets of data and offer a way to apply a single transformation on an array or matrix of values. Usually these vector operations get implemented using *SIMD* hardware instructions. SIMD stands for *single instruction, multiple data*.

Using SIMD instructions speeds up processing quite a lot compared to plain loops iterating over data. By optimizing my Octave Shifter Audio Unit I gained a lot of valuable experience with Apple's *Accelerate* and *vecLib* libraries, both which offer vector operations.

If you're interested in doing image processing or DSP work you'll find the information in this article indispensable for getting awesome performance of your code. Even though the actual code in this article is specifically tailored to use under Mac OS X and *Xcode*, you'll find that most operating systems and programming environments will offer functions quite like the ones mentioned here.

One thing to mention first is that GCC offers a feature called *auto-vectorization* which supposedly automatically vectorizes your code, but it doesn't really achieve the amount of speedup you can get by manually implementing vector operations and array alignment.

### Adding the Accelerate framework to your project

To be able to use the actual vector operations you should add the Accelerate framework to your Xcode project. To do this, follow these steps:

- Right click on the "External Frameworks and Libraries" folder in the project view
- From the popup menu, choose "Add" -> "External frameworks..."
- In the file browser, navigate to the "/System/Library/Frameworks" folder on your main hard disk
- Double click on the "Accelerate.framework" directory
- Confirm the addition by clicking "Add" in the dialog that pops up

With the framework added to your project, you're all set.

## Accelerate vDSP Library

### Function naming

One of the libraries that comes with the Accelerate framework is called *vDSP*. This library contains a lot of functions useful in DSP processing, including some exotic ones specifically tailored to sound processing. vDSP works with regular floating point arrays, both single precision *float*s and *double*s.

All vDSP functions start with the prefix *vDSP_* followed by a "v" if it works on a vector of input values. The "v" can be followed by an "s" indicating that a scalar (single value) is involved in the operation. Example of this is *vDSP_vsmul* which is a vDSP function that will multiply (mul) a vector (v) by a scalar (s).

Optionally the function can be postfixed by a "D" indicating that the operation works on double precision floating point numbers instead of single precision *float*s. As an example, there is a function called *vDSP_vsmulD* which is basically the same as *vDSP_vsmul* but works on double precision numbers.

### Parameter order

Most vDSP functions will take a few vectors and scalars as input and output values. Most function will of course yield a vector.

One thing you should know about vectors is that vDSP wants you to provide a *stride* everytime you supply one. The stride is simply the spacing of each vector element. If you use normal arrays in your code, the stride will always be 1, indicating that elements are stored right next to eachother.

While the stride seems a superficial parameter it does have its uses in cases where you only want the operations to work on even or odd elements of your array. Important to keep in mind is the fact that only operations with strides equal to one will deliver blazingly fast vectorized code.

The parameter order for most vDSP calls is:

- input vector one
- input stride one
- input vector two*
- input stride two*
- output vector one
- output stride one
- number of elements

*) For the functions that take a scalar, like *vDSP_vsmul* the second input vector and stride will be replaced by the single scalar number.

With this introduction out of the way, let's take a look at how to use the vDSP operations.

### Basic vDSP example

To introduce actually working with the vDSP library I'll provide a simple example first. Please note that the following examples all use pseudo C / C++ syntax to quickly illustrate the usage of the library. This first example shows how to add two single precision vectors together:

#include <Accelerate/Accelerate.h>

int main() {

float inVector1[8] = {1, 2, 3, 4, 5, 6, 7, 8};

float inVector2[8] = {1, 2, 3, 4, 5, 6, 7, 8};

float outVector[8];

vDSP_vadd(inVector1, 1, inVector2, 1, outVector, 1, 8);

}

I hope this example is simple enough to introduce vDSP. The first lines includes the header files which contain the function prototypes for the library so that it can actually be used. The main function declaration should be clear to any programmer out there.

The next three lines initialize three arrays of floating point numbers: two input arrays (or, vectors) containing some numbers and an output array. The output array is not initialized since it'll be overwritten by the result of the sum operation anyway.

The most interesting line is the line with the call to *vDSP_vadd* in it. The actual function name tells us it will *add* two vectors (v) together. The parameters are a pointer to the first input vector with a stride of one, a pointer to the second input vector and a stride of one, a pointer to the output vector and a stride of one and lastly the number of elements in each array (or, vector).

The result of this code is that after execution the outVector will contain {2, 4, 6, 8, 10, 12, 14, 16}. These numbers are the sum of each of both input vector's numbers.

To clarify what happened you can picture the following code actually being executed behind the scenes:

outVector[0] = inVector1[0] + inVector2[0];

outVector[1] = inVector1[1] + inVector2[1];

...

outVector[7] = inVector1[7] + inVector2[7];

This is exactly the base of vector operations, for each element of input arrays the output element is calculated and stored in the output vector. If you get this concept you'll be able to apply vector operations in situations where you can benefit most of them.

### Basic arithmetic vector functions

In the first example I showed the *vDSP_vadd* function which adds the input vectors together. For each of the basic arithmetic operations the vDSP library provides a vectorized version, see the following list:

Function | Description | Vectorized code |
---|---|---|

vDSP_vadd | Add two vectors together. | out[n] = in1[n] + in2[n] |

vDSP_vsub | Subtract two vectors. | out[n] = in1[n] - in2[n] |

vDSP_vmul | Multiply two vectors. | out[n] = in1[n] * in2[n] |

vDSP_vdiv | Divide two vectors. | out[n] = in1[n] / in2[n] |

All of these functions take three input vectors (and their strides) plus the number of elements as their parameters. The only function you should test first is *vDSP_vsub* as it reverses its input vectors on some versions of Mac OS X.

### Basic arithmetic scalar functions

On top of the aforementioned vector operations that work on two input vectors there are variants which work on a single input vector and a single scalar floating point number. For example, if you wish to multiply all elements of an array by the number PI for instance, you would use the function *vDSP_vsmul* like in this example:

#include <Accelerate/Accelerate.h>

#include <math.h>

int main() {

float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8};

float outVector[8];

float pi = M_PI;

vDSP_vsmul(inVector, 1, &pi, outVector, 1, 8);

}

To start, the function name *vDSP_vsmul* indicates that it works on both a vector (v) and a scalar (s). Again, the term *scalar* just means a single floating point number.

After execution of this piece of example code, the output vector will contain the values {π, 2 π, .. , 8 π}. To illustrate what is happening behind the scenes of the *vDSP_vsmul* function, see this:

outVector[0] = inVector[0] * pi;

outVector[1] = inVector[1] * pi;

...

outVector[7] = inVector[7] * pi;

With this in mind, vDSP offers the following functions to do basic arithmetic on vectors and scalars:

Function | Description | Vectorized code |
---|---|---|

vDSP_vsadd | Add a scalar number to each element of the vector. | outVector[n] = inVector[n] + number |

vDSP_vsmul | Multiply each element of a vector by a scalar. | outVector[n] = inVector[n] * number |

vDSP_vsdiv | Divide each element of a vector by a scalar. | outVector[n] = inVector[n] / number |

Oddly enough, a *vDSP_vssub* function seems to be missing, but can easily be replaced by supplying a negative scalar to *vDSP_vsadd*.

All of the vector scalar functions take their parameters in the form of the input vector, its stride, the scalar, the output vector, its stride and the number of elements in the vector. One thing to note is that you need to supply a pointer to the scalar instead of the value itself. Use the & operator do this, see the example above.

### Filling vectors with vDSP - ramp example

The vDSP library offers a lot of different functions besides simple arithmetic on vectors and scalars. If you look through the API documentation you will find functions for doing forward and inverse Fast Fourier Transformations, creating ramped vectors and for indirectly addressing vector elements based on an input vector.

I won't explain every single function of the vDSP library in detail, but let's take a look at one more function which holds the middle ground between vector and scalar operations. It's *vDSP_vramp* which builds a ramped vector, which is an array of monotonically increasing elements, like {2, 4, 6, 8} for instance.

To generate a ramped vector you will need to supply a starting value, the increment for each step and the amount of elements to generate. Consider this example:

#include <Accelerate/Accelerate.h>

#include <math.h>

int main() {

float outVector[8];

float pi = M_PI;

float initial = 0;

vDSP_vramp(&initial, &pi, outVector, 1, 8);

}

This will fill the output vector with values {0, π, 2 π , .. , 7 π}.

### Other vDSP functions

To end the vDSP introduction, here's a small list containing some interesting functions:

Function | Description | Vectorized code |
---|---|---|

vDSP_vdist | Calculate the distance between elements in two vectors. | outVector[n] = sqrt(pow(inVector1[n], 2) + pow(inVector2[n], 2)) |

vDSP_vindex | Copy elements from a source vector to an output vector based on indices in a second input vector. | outVector[inVector2[n]] = inVector1[n] |

vDSP_mmov | Copy elements from a submatrix or vector to another vector. | outVector[n] = inVector[n] |

vDSP_vfill | Set all elements of a vector to a certain scalar. | outVector[n] = number |

vDSP_vclr | Set all elements of a vector to zero. | outVector[n] = 0.0 |

There are far more functions in the vDSP library than the ones mentioned above, see the complete API for more information. In this article we'll look at another library provided by the Accelerate framework called *vecLib*.

## Accelerate vecLib Library

### Got the basic arithmetic? Lets try trigonometric!

Big absentees in the vDSP library are trigonometric functions and other advanced mathematical operations. I looked for them everywhere but vDSP just doesn't offer them. The good thing is that the Accelerate framework in itself does contain a library that provides these functions, called *vecLib*.

The basic use of vecLib is the same as the vDSP library, but parameters are passed in a different order and using different data types. The following example shows you how to use vecLib for vectorized cosine computation:

#include <Accelerate/Accelerate.h>

int main() {

float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8};

float outVector[8];

vvcosf(outVector, inVector, 8);

}

From this example you can see the naming convention of *vecLib* functions: "vv" indicates the function works and results in a vector, "cos" indicates the actual function and the "f" indicates the function works on single precision *float*s. Several functions have variants without the "f" indicating they work on double precision values, like *vvcos* for example.

The example above will calculate the cosine of 1, 2, .., 8 and store the result in the elements of the output vector. Most vecLib functions take the output vector as their first parameter, followed by the input vectors and ending with the number of elements to process.

Along the lines of *vvcosf* vecLib provides these trigonometric functions, amongst others:

Function | Description | Vectorized code |
---|---|---|

vvcosf | Calculate the cosine of each vector element. | outVector[n] = cosf(inVector[n]) |

vvsinf | Calculate the sine of each vector element. | outVector[n] = sinf(inVector[n]) |

vvtanf | Calculate the tangent of each vector element. | outVector[n] = tanf(inVector[n]) |

vvacosf | Calculate the arc cosine of each vector element. | outVector[n] = acosf(inVector[n]) |

vvasinf | Calculate the arc sine of each vector element. | outVector[n] = asinf(inVector[n]) |

vvatanf | Calculate the arc tangent of each vector element. | outVector[n] = atanf(inVector[n]) |

There are other functions like *vvatan2f* which are listed and described in detail in the vecLib API reference.

### Logarithms and exponential functions

Besides trigonometric functions, vecLib provides everything you need to work with logarithms and exponential functions. Check the following list for an indication:

Function | Description | Vectorized code |
---|---|---|

vvlogf | Calculate the natural logarithm for each vector element. | outVector[n] = logf(inVector[n]) |

vvlog10f | Calculate the base 10 logarithm for each vector element. | outVector[n] = log10f(inVector[n]) |

vvexpf | Calculate the exponential for each vector element. | outVector[n] = expf(inVector[n]) |

vvrecf | Calculate the reciprocal for each vector element. | outVector[n] = 1 / inVector[n] |

As you can see, *vecLib* reaches beyond trigonometric functions. See the vecLib API documentation for a complete list of functions available for use. This concludes the short look at vecLib, we'll move on to some very important information regarding manual or automatic vectorization now.

## Vector Alignment

### The most important thing

After discussing the vector libraries themselves, I've almost forgot to explain the *most* important thing when using with vector functions.

This most important thing is the *alignment* of the variables that act as vectors. If you do not properly align you vectors to certain memory boundaries, the performance of *vDSP* and *vecLib* functions will be about the same as normal code using *for* or other regular loop statements.

To align the vector array variables, use the "*__attribute__ ((aligned))*" postfix for GCC. For example:

#include <Accelerate/Accelerate.h>

int main() {

float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8} __attribute__ ((aligned));

float outVector[8] __attribute__ ((aligned));

vDSP_mmov(inVector, outVector, 8, 1, 8, 8);

}

You need to make sure that all arrays which are used as vectors have this property assigned to them. You don't need to provide this property to scalar values, at least it won't make a noticable impact on performance.

### Conclusion

In this article we took an introductionary look into the vDSP and vecLib libraries which are part of the Accelerate framework on Mac OS X. By using properly aligned arrays and manually vectorized code you can increase the performance of your hot loops by large amounts. Because it's almost impossible to illustrate the speedup for any given case you should try and implement vectorized versions of your code and benchmark them against normal scalar code.

In the case of my Audio Unit work, I've noticed speedups in the range of 5 to 10 times more performance.

### About this article

This article was added to the site on the 5th of November, 2006.

### Relevant links

For more information and details on the vDSP and vecLib libraries, visit these pages: