Vectorizing with vDSP and vecLib

Introduction

Background

While finishing the work on my Octave Shifter Audio Unit I spent lots of hours using Apple's profiling tool called Shark. During the process I found out that a lot of the code could be optimized using vector operations instead of so-called scalar code.

Vector operations work on large sets of data and offer a way to apply a single transformation on an array or matrix of values. Usually these vector operations get implemented using SIMD hardware instructions. SIMD stands for single instruction, multiple data.

Using SIMD instructions speeds up processing quite a lot compared to plain loops iterating over data. By optimizing my Octave Shifter Audio Unit I gained a lot of valuable experience with Apple's Accelerate and vecLib libraries, both which offer vector operations.

If you're interested in doing image processing or DSP work you'll find the information in this article indispensable for getting awesome performance of your code. Even though the actual code in this article is specifically tailored to use under Mac OS X and Xcode, you'll find that most operating systems and programming environments will offer functions quite like the ones mentioned here.

One thing to mention first is that GCC offers a feature called auto-vectorization which supposedly automatically vectorizes your code, but it doesn't really achieve the amount of speedup you can get by manually implementing vector operations and array alignment.

Adding the Accelerate framework to your project

To be able to use the actual vector operations you should add the Accelerate framework to your Xcode project. To do this, follow these steps:

Right click on the "External Frameworks and Libraries" folder in the project view
From the popup menu, choose "Add" -> "External frameworks..."
In the file browser, navigate to the "/System/Library/Frameworks" folder on your main hard disk
Double click on the "Accelerate.framework" directory
Confirm the addition by clicking "Add" in the dialog that pops up

With the framework added to your project, you're all set.

Accelerate vDSP Library

Function naming

One of the libraries that comes with the Accelerate framework is called vDSP. This library contains a lot of functions useful in DSP processing, including some exotic ones specifically tailored to sound processing. vDSP works with regular floating point arrays, both single precision floats and doubles.

All vDSP functions start with the prefix vDSP_ followed by a "v" if it works on a vector of input values. The "v" can be followed by an "s" indicating that a scalar (single value) is involved in the operation. Example of this is vDSP_vsmul which is a vDSP function that will multiply (mul) a vector (v) by a scalar (s).

Optionally the function can be postfixed by a "D" indicating that the operation works on double precision floating point numbers instead of single precision floats. As an example, there is a function called vDSP_vsmulD which is basically the same as vDSP_vsmul but works on double precision numbers.

Parameter order

Most vDSP functions will take a few vectors and scalars as input and output values. Most function will of course yield a vector.

One thing you should know about vectors is that vDSP wants you to provide a stride everytime you supply one. The stride is simply the spacing of each vector element. If you use normal arrays in your code, the stride will always be 1, indicating that elements are stored right next to eachother.

While the stride seems a superficial parameter it does have its uses in cases where you only want the operations to work on even or odd elements of your array. Important to keep in mind is the fact that only operations with strides equal to one will deliver blazingly fast vectorized code.

The parameter order for most vDSP calls is:

input vector one
input stride one
input vector two*
input stride two*
output vector one
output stride one
number of elements

*) For the functions that take a scalar, like vDSP_vsmul the second input vector and stride will be replaced by the single scalar number.

With this introduction out of the way, let's take a look at how to use the vDSP operations.

Basic vDSP example

To introduce actually working with the vDSP library I'll provide a simple example first. Please note that the following examples all use pseudo C / C++ syntax to quickly illustrate the usage of the library. This first example shows how to add two single precision vectors together:

#include <Accelerate/Accelerate.h>

int main() {
    float inVector1[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float inVector2[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float outVector[8];

    vDSP_vadd(inVector1, 1, inVector2, 1, outVector, 1, 8);
}

I hope this example is simple enough to introduce vDSP. The first lines includes the header files which contain the function prototypes for the library so that it can actually be used. The main function declaration should be clear to any programmer out there.

The next three lines initialize three arrays of floating point numbers: two input arrays (or, vectors) containing some numbers and an output array. The output array is not initialized since it'll be overwritten by the result of the sum operation anyway.

The most interesting line is the line with the call to vDSP_vadd in it. The actual function name tells us it will add two vectors (v) together. The parameters are a pointer to the first input vector with a stride of one, a pointer to the second input vector and a stride of one, a pointer to the output vector and a stride of one and lastly the number of elements in each array (or, vector).

The result of this code is that after execution the outVector will contain {2, 4, 6, 8, 10, 12, 14, 16}. These numbers are the sum of each of both input vector's numbers.

To clarify what happened you can picture the following code actually being executed behind the scenes:

outVector[0] = inVector1[0] + inVector2[0];
outVector[1] = inVector1[1] + inVector2[1];
...
outVector[7] = inVector1[7] + inVector2[7];

This is exactly the base of vector operations, for each element of input arrays the output element is calculated and stored in the output vector. If you get this concept you'll be able to apply vector operations in situations where you can benefit most of them.

Basic arithmetic vector functions

In the first example I showed the vDSP_vadd function which adds the input vectors together. For each of the basic arithmetic operations the vDSP library provides a vectorized version, see the following list:

Function	Description	Vectorized code
vDSP_vadd	Add two vectors together.	out[n] = in1[n] + in2[n]
vDSP_vsub	Subtract two vectors.	out[n] = in1[n] - in2[n]
vDSP_vmul	Multiply two vectors.	out[n] = in1[n] * in2[n]
vDSP_vdiv	Divide two vectors.	out[n] = in1[n] / in2[n]

All of these functions take three input vectors (and their strides) plus the number of elements as their parameters. The only function you should test first is vDSP_vsub as it reverses its input vectors on some versions of Mac OS X.

Basic arithmetic scalar functions

On top of the aforementioned vector operations that work on two input vectors there are variants which work on a single input vector and a single scalar floating point number. For example, if you wish to multiply all elements of an array by the number PI for instance, you would use the function vDSP_vsmul like in this example:

#include <Accelerate/Accelerate.h>
#include <math.h>

int main() {
    float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float outVector[8];
    float pi = M_PI;

    vDSP_vsmul(inVector, 1, &pi, outVector, 1, 8);
}

To start, the function name vDSP_vsmul indicates that it works on both a vector (v) and a scalar (s). Again, the term scalar just means a single floating point number.

After execution of this piece of example code, the output vector will contain the values {π, 2 π, .. , 8 π}. To illustrate what is happening behind the scenes of the vDSP_vsmul function, see this:

outVector[0] = inVector[0] * pi;
outVector[1] = inVector[1] * pi;
...
outVector[7] = inVector[7] * pi;

With this in mind, vDSP offers the following functions to do basic arithmetic on vectors and scalars:

Function	Description	Vectorized code
vDSP_vsadd	Add a scalar number to each element of the vector.	outVector[n] = inVector[n] + number
vDSP_vsmul	Multiply each element of a vector by a scalar.	outVector[n] = inVector[n] * number
vDSP_vsdiv	Divide each element of a vector by a scalar.	outVector[n] = inVector[n] / number

Oddly enough, a vDSP_vssub function seems to be missing, but can easily be replaced by supplying a negative scalar to vDSP_vsadd.

All of the vector scalar functions take their parameters in the form of the input vector, its stride, the scalar, the output vector, its stride and the number of elements in the vector. One thing to note is that you need to supply a pointer to the scalar instead of the value itself. Use the & operator do this, see the example above.

Filling vectors with vDSP - ramp example

The vDSP library offers a lot of different functions besides simple arithmetic on vectors and scalars. If you look through the API documentation you will find functions for doing forward and inverse Fast Fourier Transformations, creating ramped vectors and for indirectly addressing vector elements based on an input vector.

I won't explain every single function of the vDSP library in detail, but let's take a look at one more function which holds the middle ground between vector and scalar operations. It's vDSP_vramp which builds a ramped vector, which is an array of monotonically increasing elements, like {2, 4, 6, 8} for instance.

To generate a ramped vector you will need to supply a starting value, the increment for each step and the amount of elements to generate. Consider this example:

#include <Accelerate/Accelerate.h>
#include <math.h>

int main() {
    float outVector[8];
    float pi = M_PI;
    float initial = 0;

    vDSP_vramp(&initial, &pi, outVector, 1, 8);
}

This will fill the output vector with values {0, π, 2 π , .. , 7 π}.

Other vDSP functions

To end the vDSP introduction, here's a small list containing some interesting functions:

Function	Description	Vectorized code
vDSP_vdist	Calculate the distance between elements in two vectors.	outVector[n] = sqrt(pow(inVector1[n], 2) + pow(inVector2[n], 2))
vDSP_vindex	Copy elements from a source vector to an output vector based on indices in a second input vector.	outVector[inVector2[n]] = inVector1[n]
vDSP_mmov	Copy elements from a submatrix or vector to another vector.	outVector[n] = inVector[n]
vDSP_vfill	Set all elements of a vector to a certain scalar.	outVector[n] = number
vDSP_vclr	Set all elements of a vector to zero.	outVector[n] = 0.0

There are far more functions in the vDSP library than the ones mentioned above, see the complete API for more information. In this article we'll look at another library provided by the Accelerate framework called vecLib.

Accelerate vecLib Library

Got the basic arithmetic? Lets try trigonometric!

Big absentees in the vDSP library are trigonometric functions and other advanced mathematical operations. I looked for them everywhere but vDSP just doesn't offer them. The good thing is that the Accelerate framework in itself does contain a library that provides these functions, called vecLib.

The basic use of vecLib is the same as the vDSP library, but parameters are passed in a different order and using different data types. The following example shows you how to use vecLib for vectorized cosine computation:

#include <Accelerate/Accelerate.h>

int main() {
    float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float outVector[8];

    vvcosf(outVector, inVector, 8);
}

From this example you can see the naming convention of vecLib functions: "vv" indicates the function works and results in a vector, "cos" indicates the actual function and the "f" indicates the function works on single precision floats. Several functions have variants without the "f" indicating they work on double precision values, like vvcos for example.

The example above will calculate the cosine of 1, 2, .., 8 and store the result in the elements of the output vector. Most vecLib functions take the output vector as their first parameter, followed by the input vectors and ending with the number of elements to process.

Along the lines of vvcosf vecLib provides these trigonometric functions, amongst others:

Function	Description	Vectorized code
vvcosf	Calculate the cosine of each vector element.	outVector[n] = cosf(inVector[n])
vvsinf	Calculate the sine of each vector element.	outVector[n] = sinf(inVector[n])
vvtanf	Calculate the tangent of each vector element.	outVector[n] = tanf(inVector[n])
vvacosf	Calculate the arc cosine of each vector element.	outVector[n] = acosf(inVector[n])
vvasinf	Calculate the arc sine of each vector element.	outVector[n] = asinf(inVector[n])
vvatanf	Calculate the arc tangent of each vector element.	outVector[n] = atanf(inVector[n])

There are other functions like vvatan2f which are listed and described in detail in the vecLib API reference.

Logarithms and exponential functions

Besides trigonometric functions, vecLib provides everything you need to work with logarithms and exponential functions. Check the following list for an indication:

Function	Description	Vectorized code
vvlogf	Calculate the natural logarithm for each vector element.	outVector[n] = logf(inVector[n])
vvlog10f	Calculate the base 10 logarithm for each vector element.	outVector[n] = log10f(inVector[n])
vvexpf	Calculate the exponential for each vector element.	outVector[n] = expf(inVector[n])
vvrecf	Calculate the reciprocal for each vector element.	outVector[n] = 1 / inVector[n]

As you can see, vecLib reaches beyond trigonometric functions. See the vecLib API documentation for a complete list of functions available for use. This concludes the short look at vecLib, we'll move on to some very important information regarding manual or automatic vectorization now.

Vector Alignment

The most important thing

After discussing the vector libraries themselves, I've almost forgot to explain the most important thing when using with vector functions.

This most important thing is the alignment of the variables that act as vectors. If you do not properly align you vectors to certain memory boundaries, the performance of vDSP and vecLib functions will be about the same as normal code using for or other regular loop statements.

To align the vector array variables, use the "__attribute__ ((aligned))" postfix for GCC. For example:

#include <Accelerate/Accelerate.h>

int main() {
    float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8} __attribute__ ((aligned));
    float outVector[8] __attribute__ ((aligned));

    vDSP_mmov(inVector, outVector, 8, 1, 8, 8);
}

You need to make sure that all arrays which are used as vectors have this property assigned to them. You don't need to provide this property to scalar values, at least it won't make a noticable impact on performance.

Conclusion

In this article we took an introductionary look into the vDSP and vecLib libraries which are part of the Accelerate framework on Mac OS X. By using properly aligned arrays and manually vectorized code you can increase the performance of your hot loops by large amounts. Because it's almost impossible to illustrate the speedup for any given case you should try and implement vectorized versions of your code and benchmark them against normal scalar code.

In the case of my Audio Unit work, I've noticed speedups in the range of 5 to 10 times more performance.

About this article

This article was added to the site on the 5th of November, 2006.

Relevant links

For more information and details on the vDSP and vecLib libraries, visit these pages:

Forever For Now

Menu

Search