
Introduction to GEN Assembly



Introduction

To better optimize and debug OpenCL kernels, it is sometimes very helpful to look at the underlying assembly. This article shows you the tools in the Intel® SDK for OpenCL™ Applications that let you view the assembly generated by the offline compiler for individual kernels and highlight the regions of the assembly that correspond to your OpenCL C code, and it attempts to explain, at a high level, the different portions of the generated assembly. We also give you a brief overview of the register region syntax and semantics, show the different types of registers, and summarize the available assembly instructions and the data types they can operate on. We hope to give you enough ammunition to get started. In upcoming articles we will cover assembly debugging as well as assembly profiling with Intel® VTune™ Amplifier.

Assembly for Simple OpenCL Kernels

Let us start with a simple kernel:

kernel void empty() {
}

This is as simple as kernels get. We are going to build this kernel in the Code Builder Session Explorer. Go ahead and create a new session by going to CODE-BUILDER/OpenCL Kernel Development/New Session, copying the kernel above into an empty program.cl file, and then building it. If you have a 5th generation Intel processor (Broadwell) or a 6th generation Intel processor (Skylake), you will notice that one of the artifacts generated is a program_empty.gen file. Go ahead and double-click on it. What you will see is something like this:

The assembly for the kernel is on the right: let me annotate it for you:

// Start of Thread
LABEL0
(W)      and      (1|M0)        r2.6<1>:ud    r0.5<0;1,0>:ud    0x1FF:ud         // id:

// End of thread
(W)      mov      (8|M0)        r127.0<1>:ud  r0.0<8;8,1>:ud   {Compacted}                 // id:
         send     (8|M0)        null          r127              0x27      0x2000010 {EOT}  // id:

Not much, but it is a start.

Now, let’s complicate life a little. Copy the following into program.cl:

kernel void meaning_of_life(global uchar* out)
{
 out[31] = 42;
}

After rebuilding, you will notice a program_meaning_of_life.gen file. Double-clicking on it shows something more complex:

What you can do now is click on different parts of the kernel on the left and see the corresponding parts of the assembly highlighted:

Here are instructions corresponding to the beginning of the kernel:

The body of the kernel:

And the end of the kernel:

We are going to rearrange the assembly to make it a little bit more understandable:

// Start of Thread
LABEL0
(W)      and      (1|M0)        r2.6<1>:ud    r0.5<0;1,0>:ud    0x1FF:ud         // id:
// r3 and r4 contain the address of out variable (8 unsigned quadwords – uq)
// we are going to place them in r1 and r2
(W)      mov      (8|M0)        r1.0<1>:uq    r3.0<0;1,0>:uq                   // id:


// Move 42 (0x2A:ud – ud is unsigned dword) into 32 slots (our kernel is compiled SIMD32)
// We are going to use registers r7, r10, r13 and r16, each register fitting 8 values
         mov      (8|M0)        r7.0<1>:ud    0x2A:ud          {Compacted}                 // id:
         mov      (8|M8)        r10.0<1>:ud   0x2A:ud          {Compacted}                 // id:
         mov      (8|M16)       r13.0<1>:ud   0x2A:ud                          // id:
         mov      (8|M24)       r16.0<1>:ud   0x2A:ud                          // id:

// Add 31 (0x1F:ud) to eight quadwords in r1 and r2 and place the results in r3 and r4
// Essentially, we get &out[31]
 (W)      add      (8|M0)        r3.0<1>:q     r1.0<0;1,0>:q     0x1F:ud          // id:

// Now we spread &out[31] into r5,r6, r8,r9, r11,r12, and r14,r15 – 32 values in all.
         mov      (8|M0)        r5.0<1>:uq    r3.0<0;1,0>:uq                   // id:
         mov      (8|M8)        r8.0<1>:uq    r3.0<0;1,0>:uq                   // id:1
         mov      (8|M16)       r11.0<1>:uq   r3.0<0;1,0>:uq                   // id:1
         mov      (8|M24)       r14.0<1>:uq   r3.0<0;1,0>:uq                   // id:1

// Write the values in r7 (and r10, r13, r16) into the addresses in r5, r6, etc.
         send     (8|M0)        null          r5                0xC       0x60680FF                 // id:1
         send     (8|M8)        null          r8                0xC       0x60680FF                 // id:1
         send     (8|M16)       null          r11               0xC       0x60680FF                 // id:1
         send     (8|M24)       null          r14               0xC       0x60680FF                 // id:1

// End of thread
(W)      mov      (8|M0)        r127.0<1>:ud  r0.0<8;8,1>:ud   {Compacted}                 // id:
         send     (8|M0)        null          r127              0x27      0x2000010 {EOT}                 // id:1

Now, we are going to complicate life ever so slightly, by using get_global_id(0) instead of a fixed index to write things out:

kernel void meaning_of_life2(global uchar* out)
{
 int i = get_global_id(0);
 out[i] = 42;
}

Note that the addition of get_global_id(0) increases the size of our kernel by 9 assembly instructions. This is mainly because we need to calculate an increasing address for each subsequent work item in a thread (there are 32 work items in the thread):

// Start of Thread
LABEL0
(W)      and      (1|M0)        r7.6<1>:ud    r0.5<0;1,0>:ud    0x1FF:ud         // id:

// Move 42 (0x2A:ud – ud is unsigned dword) into 32 slots (our kernel is compiled SIMD32)
// We are going to use registers r17, r20, r23 and r26, each register fitting 8 values
         mov      (8|M0)        r17.0<1>:ud   0x2A:ud          {Compacted}                 // id:
         mov      (8|M8)        r20.0<1>:ud   0x2A:ud          {Compacted}                 // id:
         mov      (8|M16)       r23.0<1>:ud   0x2A:ud                          // id:
         mov      (8|M24)       r26.0<1>:ud   0x2A:ud                          // id:
// get_global_id(0) calculation, r0.1, r7.0 and r7.3 will contain the necessary starting values
(W)      mul      (1|M0)        r3.0<1>:ud    r0.1<0;1,0>:ud    r7.3<0;1,0>:ud   // id:
(W)      mul      (1|M0)        r5.0<1>:ud    r0.1<0;1,0>:ud    r7.3<0;1,0>:ud   // id:
(W)      add      (1|M0)        r3.0<1>:ud    r3.0<0;1,0>:ud    r7.0<0;1,0>:ud   {Compacted} // id:
(W)      add      (1|M0)        r5.0<1>:ud    r5.0<0;1,0>:ud    r7.0<0;1,0>:ud   {Compacted} // id:1
// r3 thru r6 will contain the get_global_id(0) offsets; r1 and r2 contain 32 increasing values
         add      (16|M0)       r3.0<1>:ud    r3.0<0;1,0>:ud    r1.0<8;8,1>:uw   // id:1
         add      (16|M16)      r5.0<1>:ud    r5.0<0;1,0>:ud    r2.0<8;8,1>:uw   // id:1
// r8 and r9 contain the address of out variable (8 unsigned quadwords – uq)
// we are going to place these addresses in r1 and r2
 (W)      mov      (8|M0)        r1.0<1>:uq    r8.0<0;1,0>:uq                   // id:1

// Move the offsets in r3 thru r6 to r7, r8, r9, r10, r11, r12, r13, r14
         mov      (8|M0)        r7.0<1>:q     r3.0<8;8,1>:d                    // id:1
         mov      (8|M8)        r9.0<1>:q     r4.0<8;8,1>:d                    // id:1
         mov      (8|M16)       r11.0<1>:q    r5.0<8;8,1>:d                    // id:1
         mov      (8|M24)       r13.0<1>:q    r6.0<8;8,1>:d                    // id:1

// Add the offsets to address of out in r1 and place them in r15, r16, r18, r19, r21, r22, r24, r25
         add      (8|M0)        r15.0<1>:q    r1.0<0;1,0>:q     r7.0<4;4,1>:q    // id:1
         add      (8|M8)        r18.0<1>:q    r1.0<0;1,0>:q     r9.0<4;4,1>:q    // id:1
         add      (8|M16)       r21.0<1>:q    r1.0<0;1,0>:q     r11.0<4;4,1>:q   // id:2
         add      (8|M24)       r24.0<1>:q    r1.0<0;1,0>:q     r13.0<4;4,1>:q   // id:2

// write into addresses in r15, r16, values in r17, etc.
         send     (8|M0)        null          r15               0xC       0x60680FF                 // id:2
         send     (8|M8)        null          r18               0xC       0x60680FF                 // id:2
         send     (8|M16)       null          r21               0xC       0x60680FF                 // id:2
         send     (8|M24)       null          r24               0xC       0x60680FF                 // id:2

// End of thread
(W)      mov      (8|M0)        r127.0<1>:ud  r0.0<8;8,1>:ud   {Compacted}                 // id:
         send     (8|M0)        null          r127              0x27      0x2000010 {EOT}                 // id:2

And finally, let’s look at a kernel that does reading, writing, and some math:

kernel void foo(global float* in, global float* out) {
 int i = get_global_id(0);

 float f = in[i];
 float temp = 0.5f * f;
 out[i] = temp;
}

It will be translated to the following (note that I rearranged some assembly instructions for better understanding):

// Start of Thread
LABEL0
(W)      and      (1|M0)        r7.6<1>:ud    r0.5<0;1,0>:ud    0x1FF:ud         // id:

// r3 and r4 will contain the address of out buffer
(W)      mov      (8|M0)        r3.0<1>:uq    r8.1<0;1,0>:uq                     // id:
// int i = get_global_id(0);
(W)      mul      (1|M0)        r5.0<1>:ud    r0.1<0;1,0>:ud    r7.3<0;1,0>:ud   // id:
(W)      mul      (1|M0)        r9.0<1>:ud    r0.1<0;1,0>:ud    r7.3<0;1,0>:ud   // id:
(W)      add      (1|M0)        r5.0<1>:ud    r5.0<0;1,0>:ud    r7.0<0;1,0>:ud   {Compacted} // id:
(W)      add      (1|M0)        r9.0<1>:ud    r9.0<0;1,0>:ud    r7.0<0;1,0>:ud   {Compacted} // id:
         add      (16|M0)       r5.0<1>:ud    r5.0<0;1,0>:ud    r1.0<8;8,1>:uw   // id:
         add      (16|M16)      r9.0<1>:ud    r9.0<0;1,0>:ud    r2.0<8;8,1>:uw   // id:

// r1 and r2 will contain the address of in buffer
(W)      mov      (8|M0)        r1.0<1>:uq    r8.0<0;1,0>:uq                   // id:1
// r11, r12, r13, r14, r15, r16, r17 and r18 will contain 32 qword offsets
         mov      (8|M0)        r11.0<1>:q    r5.0<8;8,1>:d                    // id:1
         mov      (8|M8)        r13.0<1>:q    r6.0<8;8,1>:d                    // id:1
         mov      (8|M16)       r15.0<1>:q    r9.0<8;8,1>:d                    // id:1
         mov      (8|M24)       r17.0<1>:q    r10.0<8;8,1>:d                   // id:1

//  float f = in[i];
         shl      (8|M0)        r31.0<1>:uq   r11.0<4;4,1>:uq   0x2:ud           // id:1
         shl      (8|M8)        r33.0<1>:uq   r13.0<4;4,1>:uq   0x2:ud           // id:1
         shl      (8|M16)       r35.0<1>:uq   r15.0<4;4,1>:uq   0x2:ud           // id:1
         shl      (8|M24)       r37.0<1>:uq   r17.0<4;4,1>:uq   0x2:ud           // id:1
         add      (8|M0)        r19.0<1>:q    r1.0<0;1,0>:q     r31.0<4;4,1>:q   // id:1
         add      (8|M8)        r21.0<1>:q    r1.0<0;1,0>:q     r33.0<4;4,1>:q   // id:2
         add      (8|M16)       r23.0<1>:q    r1.0<0;1,0>:q     r35.0<4;4,1>:q   // id:2
         add      (8|M24)       r25.0<1>:q    r1.0<0;1,0>:q     r37.0<4;4,1>:q   // id:2
// read in f values at addresses in r19, r20, r21, r22, r23, r24, r25, r26 into r27, r28, r29, r30
         send     (8|M0)        r27           r19               0xC       0x4146EFF                 // id:2
         send     (8|M8)        r28           r21               0xC       0x4146EFF                 // id:2
         send     (8|M16)       r29           r23               0xC       0x4146EFF                 // id:2
         send     (8|M24)       r30           r25               0xC       0x4146EFF                 // id:2

// float temp = 0.5f * f; - 0.5f is 0x3F000000:f
//     We multiply 16 values in r27, r28 by 0.5f and place them in r39, r40
//     We multiply 16 values in r29, r30 by 0.5f and place them in r47, r48
         mul      (16|M0)       r39.0<1>:f    r27.0<8;8,1>:f    0x3F000000:f     // id:3
         mul      (16|M16)      r47.0<1>:f    r29.0<8;8,1>:f    0x3F000000:f     // id:3

//     out[i] = temp;
         add      (8|M0)        r41.0<1>:q    r3.0<0;1,0>:q     r31.0<4;4,1>:q   // id:2
         add      (8|M8)        r44.0<1>:q    r3.0<0;1,0>:q     r33.0<4;4,1>:q   // id:2
         add      (8|M16)       r49.0<1>:q    r3.0<0;1,0>:q     r35.0<4;4,1>:q   // id:2
         add      (8|M24)       r52.0<1>:q    r3.0<0;1,0>:q     r37.0<4;4,1>:q   // id:3

         mov      (8|M0)        r43.0<1>:ud   r39.0<8;8,1>:ud  {Compacted}                 // id:3
         mov      (8|M8)        r46.0<1>:ud   r40.0<8;8,1>:ud  {Compacted}                 // id:3
         mov      (8|M16)       r51.0<1>:ud   r47.0<8;8,1>:ud                  // id:3
         mov      (8|M24)       r54.0<1>:ud   r48.0<8;8,1>:ud                  // id:3

// write into addresses r41, r42 the values in r43, etc.
         send     (8|M0)        null          r41               0xC       0x6066EFF                 // id:3
         send     (8|M8)        null          r44               0xC       0x6066EFF                 // id:3
         send     (8|M16)       null          r49               0xC       0x6066EFF                 // id:3
         send     (8|M24)       null          r52               0xC       0x6066EFF                 // id:4

// End of thread
(W)      mov      (8|M0)        r127.0<1>:ud  r0.0<8;8,1>:ud   {Compacted}                 // id:
         send     (8|M0)        null          r127              0x27      0x2000010 {EOT}                 // id:4

How to Read an Assembly Instruction

Typically, all instructions have the following form:

[(pred)] opcode (exec-size|exec-offset) dst src0 [src1] [src2]

(pred) is the optional predicate. We are going to skip it for now.

opcode is the symbol of the instruction, like add or mov (we have a full table of opcodes below).

exec-size is the SIMD width of the instruction, which on our architecture can be 1, 2, 4, 8, or 16. In SIMD32 compilation, typically two instructions of execution size 8 or 16 are grouped into one.

exec-offset is the part that tells the EU which part of the execution mask in the ARF registers to read or write; e.g., (8|M24) consults bits 24-31 of the execution mask. When emitting SIMD16 or SIMD32 code like the following:

         mov  (8|M0)   r11.0<1>:q   r5.0<8;8,1>:d   // id:1
         mov  (8|M8)   r13.0<1>:q   r6.0<8;8,1>:d   // id:1
         mov  (8|M16)  r15.0<1>:q   r9.0<8;8,1>:d   // id:1
         mov  (8|M24)  r17.0<1>:q   r10.0<8;8,1>:d  // id:1

the compiler has to emit four 8-wide operations due to a limitation of how many bytes can be accessed per operand in the GRF.

dst is the destination register.

src0 is a source register.

src1 is an optional source register. Note that it could also be an immediate value, like 0x3F000000:f (0.5) or 0x2A:ud (42).

src2 is an optional source register.

General Register File (GRF) Registers

Each thread has a dedicated space of 128 registers, r0 through r127. Each register is 256 bits, or 32 bytes, wide, which gives each thread 4 KB of register space in total.

Architecture Register File (ARF) Registers

In the assembly code above, we only saw one of these special registers, the null register, which is typically used as a destination for send instructions used for writing and indicating end of thread. Here is a full table of other architecture registers:

Since our registers are 32 bytes wide and are byte addressable, our assembly has a register region syntax, to be able to access values stored in these registers.

Below, we have a series of diagrams explaining how register region syntax works.

Here we have the register region r4.1<16;8,2>:w. The w at the end of the region indicates that we are dealing with word (two-byte) values. The full table of allowable integer and floating point data types is below. The origin is r4.1, which means that we start with the second word of register r4. The vertical stride of 16 means that we skip 16 elements to reach the start of the second row. The width parameter of 8 is the number of elements in a row, and the horizontal stride of 2 means that we take every second element. Note that the region refers to the contents of both r4 and r5. The picture below summarizes the result:

In this example, let’s consider the register region r5.0<1;8,2>:w. The region starts at the first element of r5. We have 8 elements in a row, each row taking every second element, so the first row is {0, 1, 2, 3, 4, 5, 6, 7}. The second row starts at an offset of 1 word, or at r5.2, and so it contains {8, 9, 10, 11, 12, 13, 14, 15}. The picture below summarizes the result:

Consider the following assembly instruction

add(16) r6.0<1>:w r1.7<16;8,1>:b r2.1<16;8,1>:b

The src0 starts at r1.7 and has 8 consecutive bytes in the first row, followed by the second row of 8 bytes, which starts at r1.23.

The src1 starts at r2.1 and has 8 consecutive bytes in the first row, followed by the second row of 8 bytes, which starts at r2.17.

The dst starts at r6.0, stores the values as words, and since the instruction Add(16) will operate on 16 values, stores 16 consecutive words into r6.

Let’s consider the following assembly instruction:

add(16) r6.0<1>:w r1.14<16;8,0>:b r2.17<16;8,1>:b

Src0 is r1.14<16;8,0>:b, which means that the first byte-sized value is at r1.14. A horizontal stride of 0 means that we repeat that value for the width of the region, which is 8; the region then continues at r1.30, and we repeat the value stored there 8 times as well, so we are talking about the following values: {1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8}.

Src1 is r2.17<16;8,1>:b, so we actually start with 8 bytes starting from r2.17 and end up with the second row of 8 bytes starting from r3.1.
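If it helps to see the arithmetic spelled out, the following small C sketch (my own illustration, not something generated by the tools) enumerates the elements selected by a region origin<VertStride;Width,HorzStride>:type, assuming 32-byte registers. Running it for the operands above reproduces the rows just described; the function name print_region is made up for this example.

#include <stdio.h>

/* Enumerate the elements selected by a Gen register region rREG.SUB<VERT;WIDTH,HORZ>:type,
 * assuming 32-byte GRF registers. elem is the data type size in bytes
 * (1 for :b, 2 for :w, 4 for :d or :f, 8 for :q). */
static void print_region(int reg, int sub, int vert, int width, int horz,
                         int elem, int rows)
{
    int origin = reg * 32 + sub * elem;   /* byte address of the region origin */
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < width; c++) {
            int byte = origin + (r * vert + c * horz) * elem;
            printf("r%d.%d ", byte / 32, (byte % 32) / elem);
        }
        printf("\n");
    }
}

int main(void)
{
    print_region(1, 7, 16, 8, 1, 1, 2);   /* r1.7<16;8,1>:b - rows start at r1.7 and r1.23 */
    print_region(2, 1, 16, 8, 1, 1, 2);   /* r2.1<16;8,1>:b - rows start at r2.1 and r2.17 */
    print_region(4, 1, 16, 8, 2, 2, 2);   /* r4.1<16;8,2>:w - the region spans r4 and r5   */
    return 0;
}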

The letter after : in the register  region signifies the data type stored there. Here are two tables summarizing the available integer and floating point types:

The following tables summarize available assembly instructions:

References:

Volume 7 of Intel Graphics documentation is available here:

Full set of Intel Graphics Documentation is available here:

 https://01.org/linuxgraphics/documentation/hardware-specification-prms

About the Author

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel Iris and Intel Iris Pro Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and is in the process of recording a third video on Nested Parallelism.

You might also be interested in the following:

GPU-Quicksort in OpenCL 2.0: Nested Parallelism and Work-Group Scan Functions

Sierpiński Carpet in OpenCL 2.0

Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization

Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization


Intel® RealSense™ SDK Voice Command Sample Application


Download Code Sample [Zip: 23 KB]


Introduction

Thinking about exploring speech recognition in your code? Do you want more detailed information on the inner workings of the Intel® RealSense™ SDK and voice commands? In this article, we’ll show you a sample application that uses the speech recognition feature of the Intel RealSense SDK, using C# and Visual Studio* 2015, the Intel RealSense SDK R4 or above, and an Intel® RealSense™ camera F200.

Project Structure

In this sample application, I separated out the Intel RealSense SDK functionality from the GUI layer code to make it easier for a developer to focus on the SDK’s speech functionality. I’ve done this by creating a C# wrapper class (RSSpeechEngine) around the Intel RealSense SDK Speech module. Additionally, this sample app is using the “command” mode from the Intel RealSense speech engine.

The Windows* application uses a standard Windows Form class for the GUI controls and interaction with the RSSpeechEngine class. The form class makes use of delegates as well as multithreaded technology to ensure a responsive application.

I am not trying to make a bullet-proof application. I have added some degree of exception handling, but it’s up to you to ensure that proper engineering practices are in place to ensure a stable, user friendly application.

Requirements

Hardware requirements:

  • 4th generation Intel® Core™ processors based on the Intel microarchitecture code name Haswell
  • 8 GB free hard disk space
  • Intel RealSense camera F200 (required to connect to a USB 3 port)

Software requirements:

  • Microsoft Windows* 8.1/Win10 OS 64-bit
  • Microsoft Visual Studio 2010–2015 with the latest service pack
  • Microsoft .NET* 4.0 (or higher) Framework for C# development
  • Unity* 5.x or higher for Unity game development

WordEventArg.CS

WordEventArg derives from the C# EventArgs class. It’s a small wrapper that has one private data member added to it. The private string _detectedWord holds the word that was detected by the speech engine.

This class is used as an event argument when the RSSpeechEngine class dispatches an event back to the Form class indicating the word that was detected.

RSSpeechEngine.CS

RSSpeechEngine is a wrapper class, an engine so to speak, around the speech module’s command mode. I wrote the class with the following goals in mind:

  • Cleanly and clearly isolate as much of the Intel RealSense SDK functionality away from the client application.
  • Isolate each of the steps needed to get the command mode up and running in easy-to-understand function blocks.
  • Try to provide comments in the code to help the reader understand what the code is doing.

Below, I describe functions that comprise the RSSpeechEngine class.

public event EventHandler<WordEventArg>     OnWordDetected;

The OnWordDetected Event triggers a message back to the client application letting it know that a given word was detected. The client creates an event handler to handle the WordEventArg object.

public RSSpeechEngine( )

RSSpeechEngine is the constructor for the class, and it takes no parameters. The constructor creates the global session object, which is needed in many different areas of the class for initialization.

Next I create the speech recognition module itself. If that succeeds, it creates the speech implementation object, followed by the grammar module, the audio source, and the speech event handler. If none of those functions fail, _initialized is set to true, and the client application has the green light to try to use the class.

You might wonder whether the private _initialized variable is worth having, given that each function returns a Boolean value and, if it’s false, I manually throw an error. In this example, the only real benefit of this variable is in the StartSpeechRecognition() function, where _initialized acts as a gate, allowing recognition to start or not.

public void Dispose( )

This is a cleanup function that the client application calls to ensure memory is properly cleaned up.

public bool Initialized

This property exposes the private _initialized variable for the client application to use if they so choose. This example uses it as a gate in the StartSpeechRecognition() function.

private bool CreateSpeechRecognitionModule( )

This sets up the module itself by calling the session object’s CreateImpl function specifying the name of the module to be created. This function has one “out” parameter, which is a PXCMSpeechRecognition object.

private bool CreateSpeechModuleGrammar( )

This function ensures the speech module has a grammar to work with. A grammar for command mode is a set of words that speech recognition will recognize. In this example I’ve used the words “Fire,” “Bomb,” and “Lazer” as a grammar.

It should be noted that it’s possible to have more than one grammar loaded up into the speech recognition engine at one time. As an example, you could do the following which would load up three different grammars into the speech recognition module all waiting to be used.

_speechRecognition.BuildGrammarFromStringList( 1, _commandWordsList1, null );
_speechRecognition.BuildGrammarFromStringList( 2, _commandWordsList2, null );
_speechRecognition.BuildGrammarFromStringList( 3, _commandWordsList3, null );

Then when you want to use them, you would use the following

_speechRecognition.SetGrammar( 1 );
_speechRecognition.SetGrammar( 2 );
_speechRecognition.SetGrammar( 3 );

While you can build multiple grammars for speech recognition, only one grammar can be active (set) at a time.

When might you want to use multiple grammars? Maybe you have a game where each level has a different grammar. You can load one list for one level, and set its grammar. Then on a different level, you can use a different word list that you’ve already loaded. To use it, you simply set that level’s grammar.

As the sample code shows, I created one array that contains three words. I use the BuildGrammarFromStringList function to load that grammar, then use the SetGrammar function to ensure it’s active.

private bool CreateAudioSource( )

The CreateAudioSource function finds and connects to the Intel RealSense camera’s internal microphone. Even though I am specifically targeting this microphone, you can use any microphone attached to your computer. As an example, I’ve even used my Plantronics headset, and it works fine.

The first thing the function does is initialize the _audioSource PXCMAudioSource object by calling the session’s CreateAudioSource() function, and then check to ensure it was successfully created by checking it against null.

If I have a valid audio source, the next step is to create the device information for the audio source, which is covered in the next function description, CreateDeviceInfo(). For now, let’s assume that valid audio device information was created. I set the volume of the microphone, which controls the level at which the microphone records the audio signal. Then I set the audio source’s device information that was created in the CreateDeviceInfo() function.

private bool CreateDeviceInfo( )

This function queries all the audio devices on a computer. First it instructs the audio source created in the previous function to scan for all devices on the computer by calling ScanDevices(). After the scan, the next step, an important one, is to iterate over all the devices found. This step is important because ALL audio devices connected to your computer will be detected by the Intel RealSense SDK. For example, I have a Roland OctaCapture* connected to my computer at home via USB. When I run this function on my computer, I get eight different audio devices listed for just this one Roland unit.

There are several ways to do this, but I’ve chosen what seems to be the standard in all the examples I’ve seen in the SDK: I loop over the devices, querying the system and populating the _audioDeviceInfo object with the i-th audio device detected on the system. If the current audio device’s name matches the name of the audio device I want, I set the created variable to true and break out of the loop.

DEVICE_NAME was created and initialized at the top of the source code file as

string DEVICE_NAME = "Microphone Array (2- Creative VF0800)";

How did I know this name? I had to run the for loop, set a break point, and look at all the different devices as I iterated through them. It was obvious on my computer which one was the Intel RealSense camera in comparison to the other devices.

Once we have a match, we can stop looking. The global PXCMAudioSource.DeviceInfo object _audioDeviceInfo will now contain the proper device information to be used back in the calling function CreateAudioSource().

NOTE: I have seen situations where the device name is different. On one computer the Intel RealSense camera’s name will be "Microphone Array (2- Creative VF0800)" and on my other computer with the same Intel RealSense camera but a different physical device, the name will be "Microphone Array (4- Creative VF0800)". I’m not sure why this is but it’s something to keep in mind.

private bool CreateSpeechHandler( )

The CreateSpeechHandler function tells the speech recognition engine what to do after a word has been detected. I create a new PXCMSpeechRecognition.Handler object, ensuring it’s not null, and if not, I assign the OnSpeechRecognition function to the onRecognition delegate.

private void OnSpeechRecognition( PXCMSpeechRecognition.RecognitionData data )

OnSpeechRecognition is the event handler for the _handler.onRecognition delegate when a word has been detected. It accepts a single RecognitionData parameter. This parameter contains things like a list of scores, which is used when we need to maintain a certain confidence level in the word that was detected, as can be seen in the if(..) statement. I want to be sure that the confidence level of the currently detected word is at least 50, as defined here:

int CONFIDENCE_LEVEL = 50;

If the confidence level is at least 50, I raise the OnWordDetected event, passing in a new instance of the WordEventArg object. WordEventArg takes a single parameter, which is the word that was detected.

At this point, OnWordDetected sends a message to all subscribers (in this case, the Form), informing it that the speech module detected one of the words in the list.

public void StartSpeechRecognition( )

StartSpeechRecognition tells the speech recognition functionality to start listening for speech. First it checks to see whether everything was initialized properly. If not, it returns out of the function.

Next, I tell it to start recording, passing in the audio source I want to listen to and the handler object that has the function to call when a word is detected. And for good measure, I check whether any error occurred.

public void StopSpeechRecognition( )

The function calls the module’s StopRec() function to stop the processing and kill the internal thread. This can take a few milliseconds. As such, if you immediately call the Dispose function on this class, there is a strong chance the Dispose() code will cause an exception: if _speechRecognition is set to null and its dispose method is called before StopRec() has completed, the application will crash. This is why I added the Thread.Sleep() call; I want execution to halt just long enough to give StopRec() time to complete before moving on.

MainForm.CS

MainForm is the Windows GUI client to RSSpeechEngine. As mentioned previously, I designed this entire application so that the Form class only handles GUI operations and kicks off the RSSpeechEngine engine.

The RSSpeechEngineSample application itself is not a multithreaded application per se. However, because the PXCMSpeechRecognition module has a function that runs its own internal thread and we need data from that thread, we have to use some multithreaded constructs in the main form. This can be seen in the function that updates the label and the list box.

To start with, I create a global RSSpeechEngine _RSSpeechEngine object that will get initialized in the constructor. After that I declare two delegates. These delegates do the following

  • SetStatusLabelDelegate. Sets the application’s status label from stopped to running.
  • AddWordToListDelegate. Adds a detected word to the list on the form.

public MainForm( )

This is the form’s constructor. It initializes the _RSSpeechEngine object and assigns the OnWordDetected event to the AddWordToListEventHandler function.

private void btnStart_Click( object sender, EventArgs e )

This is the start button click event handler, which updates the label to “Running” and starts the RSSpeechEngine’s StartSpeechRecognition functionality.

private void btnStop_Click( object sender, EventArgs e )

This is the stop button click event handler, which tells the RSSpeechEngine to stop processing and sets the label to “Not Running.”

private void AddWordToListEventHandler( object source, WordEventArg e )

This is the definition of the AddWordToListEventHandler that gets called when _RSSpeechEngine has detected a word. It calls the AddWordToList function, which knows how to deal with the multithreaded functionality.

private void AddWordToList( string s )

The AddWordToList takes one parameter, the word that was detected by the RSSpeechEngine engine. Due to the nature of multithreaded applications in Windows forms, this function looks a little strange.

When dealing with multithreaded applications in Windows where form elements/controls need to be updated, you must check a control’s InvokeRequired property. If true, this property indicates that a delegate must be used. The delegate turns around and calls the exact same function: a new instance of the AddWordToListDelegate is created, specifying the name of the function it is to call, which is a call back into the same function.

Once the delegate has been initialized, I tell the form object to invoke the delegate with the original “s” parameter that came in.

private void SetLabel( string s )

This function works exactly like AddWordToList in the multithreaded area.

private void FormIsClosing( object sender, FormClosingEventArgs e )

This is the event that gets triggered when the application is closing. I check to ensure that _RSSpeechEngine is not null, and if _RSSpeechEngine has been successfully initialized, I call StopSpeechRecognition(), forcing the processing to stop, and then call the engine’s Dispose function so it can clean up after itself.

Conclusion

I hope this article and sample code have helped you gain a better understanding of how to use Intel RealSense SDK speech recognition. The same principles apply if you are using Unity. The intent was to show how to use Intel RealSense SDK speech recognition in an easy-to-understand, simple application, covering everything you need to be successful in implementing a new solution.

If you think I have left out any explanation or haven’t been clear in a particular area, shoot me an email at rick.blacker@intel.com or make a comment below.

About Author

Rick Blacker is a seasoned software engineer who spent many of his years authoring solutions for database driven applications. Rick has recently moved to the Intel RealSense technology team and helps users understand the technology.

Putting Your Data and Code in Order: Data and layout - Part 2


This pair of articles on performance and memory covers basic concepts that provide guidance to developers seeking to improve software performance. The articles specifically address memory and data layout considerations. Part 1 addressed register use and tiling or blocking algorithms to improve data reuse. This paper begins by considering data layout for general parallelism – shared memory programming with threads – and then considers distributed computing via MPI as well. It expands those concepts to cover parallelism in the form of vectorization (single instruction multiple data, SIMD) and shared memory parallelism (threading), as well as distributed memory computing. Lastly, it considers array of structures (AoS) versus structure of arrays (SoA) data layouts.

The basic performance principle emphasized in Part 1 is: reuse data in register or cache before it is evicted.  The performance principles emphasized in this paper are: place data close to where it is most commonly used, place data in contiguous access mode, and avoid data conflicts.

 

Shared Memory Programming with Threads

Let's begin by considering shared memory programming with threads. Threads all share the same memory in a process. There are many popular threading models. The most well-known are Posix* threads and Windows* threads. The work involved in properly creating and managing threads is error prone. Modern software with numerous modules and large development teams makes it easy to make errors in parallel programming with threads. Several packages have been developed to ease thread creation, management and best use of parallel threads. The two most popular are OpenMP* and Intel® Threading Building Blocks. A third threading model, Intel® Cilk™ Plus, has not gained the adoption levels of OpenMP and Threading Building blocks. All of these threading models create a thread pool which is reused for each of the parallel operations or parallel regions. OpenMP has an advantage of incremental parallelism through the use of directives. Often OpenMP directives can be added to existing software with minimal code changes in a step-wise process. Allowing a thread runtime library to manage much of the thread maintenance eases development of threaded software. It also provides a consistent threading model for all code developers to follow, reduces the likelihood of some common threading errors, and provides an optimized threaded runtime library produced by developers dedicated to thread optimization.

The basic parallel principles mentioned in the introductory paragraphs are: place data close to where it will be used, and avoid moving the data. In threaded programming the default model is that data is shared globally in the process and may be accessed by all threads. Introductory articles on threading emphasize how easy it is to begin threading by applying OpenMP to do loops (Fortran*) or for loops (C). These methods typically show good speedup when run on two to four cores, and they frequently scale well to 64 threads or more. Just as frequently, though, they do not, and in some of those cases the difference comes down to following a good data decomposition plan – that is, designing an architecture for good parallel code.

It is important to explore parallelism at a higher level in the code call stack than where a parallel opportunity is initially identified by the developer or by software tools. When a developer recognizes that tasks or data can be operated on in parallel, consider these questions in light of Amdahl's law: “Can I begin the parallel operations higher up in the call stack, before I get to this point? If I do, do I increase the parallel region of my code and thereby provide better scalability?”

The placement of data, and what data must be shared through messages, must be carefully considered. Data is laid out so that it is placed where it is used most and then sent to other systems as needed. For applications represented on a grid, or a physical domain with specific partitions, it is common practice in MPI software to add a row of “ghost” cells around the subgrid or sub-domain. The ghost cells store the values of data sent by the MPI process that updates those cells. Typically ghost cells are not used in threaded software, but just as you minimize the length of the edge along the partition for message passing, it is desirable to minimize the edge along partitions for threads using shared memory. This minimizes the need for thread locks (or critical sections) and the cache penalties associated with cache-line ownership.

Large multi-socketed systems, although they share a global memory address space, typically have non-uniform memory access (NUMA) times. Data in a memory bank closest to another socket takes more time to retrieve (has longer latency) than data located in the bank closest to the socket where the code is running.


Figure 1. Latency memory access, showing relative time to access data.

If one thread allocates and initializes data, that data is usually placed in the bank closest to the socket on which the allocating and initializing thread is running (Figure 1). You can improve performance by having each thread allocate and first reference the memory it will predominantly use. This is usually sufficient to ensure that the memory is closest to the socket the thread is running on. Once a thread is created and active, the OS typically leaves the thread on the same socket. Sometimes it is beneficial to explicitly bind a thread to a specific core to prevent thread migration. When data has a certain pattern, it is beneficial to assign, bind, or set the affinity of the threads to specific cores to match that pattern. The Intel OpenMP runtime library (part of Intel® Parallel Studio XE 2016) provides explicit mapping attributes that have proven useful for the Intel® Xeon Phi™ coprocessor.

These types are compact, scatter, and balanced. 

  • The compact attribute allocates consecutive or adjacent threads to the symmetric multithreading contexts (SMTs) on a single core before beginning to assign threads to other cores. This is ideal where threads share data with consecutively numbered (adjacent) threads.
  • The scatter affinity assigns a thread to each core before going back to the initial cores to schedule more threads on the SMTs.
  • The balanced affinity assigns threads with consecutive or neighboring IDs to the same core in a balanced fashion. Balanced is the recommended starting affinity for those seeking to optimize thread affinity, according to the Intel 16.0 C++ compiler documentation. The balanced setting is only available for the Intel® Xeon Phi™ product family; it is not a valid option for general CPUs. When all the SMTs on a Xeon Phi platform are utilized, balanced and compact behave the same. When only some of the SMTs are utilized, the compact method fills up all the SMTs on the first cores and leaves some cores idle at the end.

Taking the time to place thread data close to where it is used is important when working with dozens of threads. Just as data layout is important for MPI programs it can be important for threaded software as well.  
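To make the first-touch placement described above concrete, here is a minimal OpenMP sketch (an illustration under stated assumptions, not code from this article): the arrays are initialized inside a parallel region with the same static distribution the compute loop later uses, so each thread's pages tend to land in the memory bank nearest its socket.

#include <stdlib.h>

#define N (1 << 26)   /* roughly 67 million doubles per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (a == NULL || b == NULL)
        return 1;

    /* First touch: each thread initializes the chunk it will later compute on,
     * so the OS places those pages in the memory bank near that thread's socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }

    /* The compute loop uses the same static distribution, so each thread
     * mostly reads and writes memory that is local to its socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    free(a);
    free(b);
    return 0;
}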

There are two short items to consider regarding memory and data layout. Both are relatively easy to address, but can have significant impact. The first is false sharing and the second is data alignment. One of the interesting performance issues with threaded software is false sharing. The data each thread operates on is independent; there is no sharing, but the cache line containing both data points is shared. This is why it is called false sharing, or false data sharing: the data isn't shared, but the performance behavior is as though it is.

Consider a case where each thread increments its own counter, but the counters are elements of a one-dimensional array. To increment its counter, a core must own the cache line. For example, thread A on socket 0 takes ownership of the cacheline and increments iCount[A]. Meanwhile thread A+1 on socket 1 increments iCount[A+1]; to do this, the core on socket 1 takes ownership of the cacheline and thread A+1 updates its value. Since a value in the cacheline is altered, the cacheline for the processor on socket 0 is invalidated. At the next iteration, the processor in socket 0 takes ownership of the cacheline back from socket 1 and alters the value in iCount[A], which in turn invalidates the cacheline in socket 1. When the thread on socket 1 is ready to write, the cycle repeats. So many cycles are spent invalidating cachelines, regaining ownership, and synchronizing with memory to maintain cache coherency that performance can suffer noticeably.

The best solution is to avoid invalidating the cache in the first place. For example, at the entrance to the loop, each thread can read its count and store it in a local variable on its stack (reading does not invalidate the cache). When the work is completed, the thread can copy this local value back into the permanent location (see Figure 2). Another alternative is to pad the data so that data used predominantly by a specific thread sits in its own cacheline.

int iCount[nThreads] ;
      .
      .
      .
      for (some interval){
       //some work . . .
       iCount[myThreadId]++ // may result in false sharing
     }

Potential false sharing

int iCount[nThreads*16] ;// memory padding to avoid false sharing
      .
      .
      .
      for (some interval){
       //some work . . .
       iCount[myThreadId*16]++ //no false sharing, unused memory
     }

No false sharing, unused memory

int iCount[nThreads] ; // make temporary local copy

      .
      .
      .
      // every thread creates its own local variable local_count
      int local_Count = iCount[myThreadID] ;
      for (some interval){
       //some work . . .
       local_Count++ ; //no false sharing
     }
     iCount[myThreadId] = local_Count ; //preserve values
     // potential false sharing at the end,
     // but outside of inner work loop much improved
     // better just preserve local_Count for each thread

Figure 2. Padding the counter array or keeping a thread-local copy avoids false sharing.
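For readers who want to experiment, a runnable variant of the padding idea from Figure 2 might look like the sketch below (assumptions: OpenMP and a 64-byte cache line; the struct name padded_count is mine).

#include <stdio.h>
#include <omp.h>

#define CACHELINE 64
#define NTHREADS  8

/* Each counter gets its own cache line, so threads never contend for a line. */
struct padded_count {
    long value;
    char pad[CACHELINE - sizeof(long)];
};

int main(void)
{
    struct padded_count iCount[NTHREADS] = {{0}};

    omp_set_num_threads(NTHREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        /* Alternatively, accumulate into a stack-local variable and store it
         * once at the end, as in the last snippet of Figure 2. */
        for (long i = 0; i < 10000000; i++)
            iCount[id].value++;
    }

    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d counted %ld\n", t, iCount[t].value);
    return 0;
}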

The same false sharing can happen with scalars assigned to adjacent memory locations. This last case is shown in the code snippet below:

int data1, data2 ;                // data1 and data2 may be placed in memory
                                  // such that false sharing could occur
__declspec(align(64)) int data3;  // data3 and data4 will be
__declspec(align(64)) int data4;  // on separate cache lines,
                                  // no false sharing
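The __declspec syntax above is specific to the Microsoft and Intel compilers on Windows. On Linux the same effect can be obtained with a GCC-style attribute or, portably in C11, with alignas; a short sketch follows (data5 and data6 are hypothetical names added for illustration):

#include <stdalign.h>

/* GCC, Clang, and the Intel compiler on Linux accept the attribute syntax. */
int data5 __attribute__((aligned(64)));

/* C11 provides a standard spelling through <stdalign.h>. */
alignas(64) int data6;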

When a developer designs parallelism from the beginning and minimizes shared data usage, false sharing is typically avoided.   If your threaded software is not scaling well, even though there is plenty of independent work going on and there are few barriers (mutexes, critical sections), it may make sense to check for false sharing.

 

Data Alignment

Software performance is optimal when the data being operated on in a SIMD fashion (AVX-512, AVX, SSE4, …) is aligned on cacheline boundaries. The penalty for unaligned data access varies according to processor family. The Intel® Xeon Phi™ coprocessors are particularly sensitive to data alignment, so on Intel Xeon Phi platforms data alignment is very important. The difference is not as pronounced on other Intel® Xeon® platforms, but performance still improves measurably when data is aligned to cache line boundaries. For this reason it is recommended that the software developer always align data on 64-byte boundaries. On Linux* and Mac OS X* this can be done with an Intel compiler option – no source code changes – just use the command-line option /align:rec64byte.

For dynamically allocated memory in C, malloc() can be replaced by _mm_malloc(datasize, 64). When _mm_malloc() is used, _mm_free() should be used in place of free(). A complete article specifically on data alignment is found here: https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
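A minimal sketch of the heap-allocation variant is shown below (assuming _mm_malloc and _mm_free are available through immintrin.h, as they are with the Intel, GCC, and Clang compilers; the 4000x4000 size simply mirrors the matrix sizes discussed in this article).

#include <immintrin.h>   /* _mm_malloc / _mm_free */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t n = 4000 * 4000;

    /* Allocate the matrix on a 64-byte (cache line) boundary. */
    double *A = _mm_malloc(n * sizeof(double), 64);
    if (A == NULL)
        return 1;

    printf("64-byte aligned: %d\n", (int)((uintptr_t)A % 64 == 0));

    /* Memory obtained with _mm_malloc must be released with _mm_free. */
    _mm_free(A);
    return 0;
}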

Please check the compiler documentation as well. To show the effect of data alignment, two matrices of the same size were created, and both ran the blocked matrix multiply code used in Part 1 of this series. For the first case matrix A was aligned; for the second case matrix A was intentionally offset by 24 bytes (3 doubles). Performance decreased by 56 to 63% using the Intel 16.0 compiler for matrices ranging from size 1200x1200 to 4000x4000. In Part 1 of this series I showed a table of loop-ordering performance using different compilers; when one matrix was offset, there was no longer any performance benefit from using the Intel compiler. It is recommended that developers check their compiler documentation about data alignment and the options available, so that when data is aligned the compiler makes the best use of that information. The code for evaluating performance for a matrix offset from the cacheline is embedded in the code for Part 1 – the code for this experiment is at: https://github.com/drmackay/samplematrixcode


 

Array of Structure vs. Structure of Array

Processors do well when memory is streamed in contiguously. It is very efficient when every element of a cacheline is moved into the SIMD registers. If contiguous cachelines are also loaded the processors prefetch in an orderly fashion. In an array of structures, data may be laid out something like this:

struct {
   uint r, g, b, w ; // a possible 2D color rgb pixel layout
} MyAoS[N] ;

In this layout the rgb values are laid out contiguously. If the software is working on data across a color plane, then the whole structure is likely to be pulled into cache, but only one value, g (for example), will be used each time. If data is stored in a structure of arrays, the layout might be something like:

struct {
   uint r[N] ;
   uint g[N] ;
   uint b[N] ;
   uint w[N] ;
} MySoA ;

When data is organized as a structure of arrays and the software operates on all of the g values (or r, or b), then when a cacheline is brought into cache, the entire cacheline is likely to be used in the operations. Data is more efficiently loaded into the SIMD registers; this improves efficiency and performance. In many cases software developers take the time to temporarily move data into a structure of arrays, operate on it, and then copy it back as needed. When possible it is best to avoid this extra copying, as it takes execution time.
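As a small illustration of the difference, compare the two loops below (a sketch based on the structures above, with uint spelled out as unsigned int and with made-up function names): the SoA version walks g with unit stride, while the AoS version touches only one field of every 16-byte structure.

#define N 1024

struct pixel { unsigned int r, g, b, w; };

struct pixel MyAoS[N];                                   /* array of structures */
struct { unsigned int r[N], g[N], b[N], w[N]; } MySoA;   /* structure of arrays */

/* AoS: consecutive g values are 16 bytes apart, so only a quarter of each
 * cache line is useful and the accesses are strided rather than unit-stride. */
void brighten_g_aos(void)
{
    for (int i = 0; i < N; i++)
        MyAoS[i].g += 10;
}

/* SoA: the g values are contiguous, every byte of each cache line is used,
 * and the loop vectorizes with simple unit-stride loads and stores. */
void brighten_g_soa(void)
{
    for (int i = 0; i < N; i++)
        MySoA.g[i] += 10;
}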

Intel (Vectorization) Advisor 2016 “Memory Access Pattern” (MAP) analysis identifies loops with contiguous (“unit-stride”), non-contiguous and “irregular” access patterns:

The “Strides Distribution” column provides aggregated statistics about how frequently each pattern occurs in a given source loop. In the picture above, the left two-thirds of the bar is colored blue, indicating a contiguous access pattern, while the right one-third is colored red, which means non-contiguous memory access. For codes with a pure AoS pattern, Advisor can also automatically give a specific “Recommendation” to perform the AoS -> SoA transformation.

The Access Pattern and more generally Memory Locality Analysis is simplified in Advisor MAP by additionally providing memory “footprint” metrics and by mapping each “stride” (i.e. access pattern) diagnostic to particular C++ or Fortran* objects/array names. Learn more about Intel Advisor at

https://software.intel.com/en-us/get-started-with-advisor and https://software.intel.com/en-us/intel-advisor-xe

Structure of array and array of structure data layout are relevant for many graphics programs as well as nbody (e.g. molecular dynamics), or anytime data/properties (e.g. mass, position, velocity, charge), may be associated with a point or specific body. Generally, the Structure of Arrays is more efficient and yields better performance.

Starting with Intel Compiler 2016 Update 1, the AoS -> SoA transformation is made simpler by Intel® SIMD Data Layout Templates (Intel® SDLT). Using SDLT, the AoS container can simply be redefined in this style:

SDLT_PRIMITIVE(Point3s, x, y, z)
sdlt::soa1d_container<Point3s> inputDataSet(count);  

making it possible to access the Point3s instances in SoA fashion. Read more about SDLT here.

There are several articles written specifically to address the topic of AoS vs SoA. The reader is directed to read one of these specific articles:

https://software.intel.com/en-us/articles/a-case-study-comparing-aos-arrays-of-structures-and-soa-structures-of-arrays-data-layouts

and

https://software.intel.com/en-us/articles/how-to-manipulate-data-structure-to-optimize-memory-use-on-32-bit-intel-architecture
http://stackoverflow.com/questions/17924705/structure-of-arrays-vs-array-of-structures-in-cuda

While in most cases a structure of arrays matches the access pattern and provides the best performance, there are a few cases where the data reference and usage more closely match an array of structures layout, and in that case the array of structures provides better performance.

 

Summary

In summary, here are the basic principles to observe regarding data layout and performance. Structure your code to minimize data movement. Reuse data while it is in registers or in cache; this also helps minimize data movement. Loop blocking can help minimize data movement, especially for software with a 2D or 3D layout. Consider layout for parallelism – how tasks and data are distributed for parallel computation. Good domain decomposition practices benefit both message passing (MPI) and shared memory programming. A structure of arrays usually moves less data than an array of structures and performs better. Avoid false sharing: create truly local variables, or provide padding so that each thread references a value in a different cache line. Lastly, align data so that it begins on a cacheline boundary.

The complete code is available for download here: https://github.com/drmackay/samplematrixcode

In case you missed Part 1 it is located here.

Apply these techniques and see how your code performance improves.

Intel® Tamper Protection Toolkit Helps Protect the Scrypt Encryption Utility against Reverse Engineering



Introduction

This article describes how the Intel® Tamper Protection Toolkit can help protect critical code and valuable data in a password-based encryption utility (the Scrypt Encryption Utility) [3] against static and dynamic reverse engineering and tampering. Scrypt [4] is a modern, secure, password-based key derivation function that is widely used in security-conscious software. There is a potential threat to scrypt, described in [2], in which an attacker can force generation of weak keys by forcing the use of specific parameters; the Intel® Tamper Protection Toolkit can be used to help mitigate this threat. We explain how to refactor the relevant code and apply tamper protection to the utility.

In this article we discuss the following components of the Intel Tamper Protection Toolkit:

  • Iprot. An obfuscation tool that creates self-modifying and self-encrypted code
  • crypto library. A library that provides iprot-compatible implementations of basic crypto operations: cryptographic hash function, keyed-hash message authentication code (HMAC), and symmetric ciphers.

You can download the Intel Tamper Protection Toolkit at https://software.intel.com/en-us/tamper-protection.

Scrypt Encryption Utility Migration to Windows

Since the Scrypt Encryption Utility is targeted at Linux* and we want to show how to use the Intel Tamper Protection Toolkit on Windows*, our first task is to port the Scrypt Encryption Utility to Windows. Platform-dependent code will be framed with the following conditional directive:

#if defined(WIN_TP)
// Windows-specific code
#else
// Linux-specific code
#endif  // defined(WIN_TP)

Example 1: Basic structure of a conditional directive

The WIN_TP preprocessing symbol localizes Windows-specific code. WIN_TP should be defined for a Windows build, otherwise reference code is chosen for the build.

We use Microsoft Visual Studio* 2013 for building and debugging the utility. There are differences between Windows and Linux in various categories, such as process, thread, memory, file management, infrastructure services, and user interfaces. We had to address these differences for the migration, described in detail below.

  1. The utility uses getopt() to handle command-line arguments. See a list of the program arguments in the Scrypt Encryption Utility section in [2]. The function getopt() is declared in the unistd.h POSIX OS header file. We used the getopt() implementation from an open source project, getopt_port [1]. Two new files, getopt.h and getopt.c, taken from this project were added into our source code tree.
  2. Another function, gettimeofday(), present in the POSIX API, helps the utility measure salsa opps, the number of salsa20/8 operations per second performed on the user’s platform. The utility needs the salsa opps metric to pick a secure configuration of N, r, and p for the input parameters, so that the Scrypt algorithm executes at least the desired minimal number of salsa20/8 operations to avoid brute force attacks. We added a gettimeofday() implementation [5] to the scryptenc_cpuperf.c file (a sketch of one possible Windows implementation is shown after this list).
  3. Before the utility starts configuring the algorithm, it asks the OS for the amount of available RAM that is allowed to be occupied for the derivation by calling the POSIX system function getrlimit(RLIMIT_DATA, …). For Windows, both the soft and hard limits for the maximum size of the process’s data segment (initialized data, uninitialized data, and heap) are established to be equal to 4 GB:
    /* ... RLIMIT_DATA... */
    #if defined(WIN_TP)
    rl.rlim_cur = 0xFFFFFFFF;
    rl.rlim_max = 0xFFFFFFFF;
    if((uint64_t)rl.rlim_cur < memrlimit) {
    	memrlimit = rl.rlim_cur;
    }
    #else
    if (getrlimit(RLIMIT_DATA, &rl))
    	return (1);
    if ((rl.rlim_cur != RLIM_INFINITY) &&
         ((uint64_t)rl.rlim_cur < memrlimit))
    	memrlimit = rl.rlim_cur;
    #endif  // defined(WIN_TP)

    Example 2: RLIMIT_DATA limiting the process to 4 GB.

  4. Additionally, the MSVS-specific __inline compiler directive is applied to the inline functions in sysendian.h:
    #if defined(WIN_TP)
    static __inline uint32_t
    #else
    static inline uint32_t
    #endif  // WIN_TP
    be32dec(const void *pp);

    Example 3: Adding sysendian.h inline functions

  5. We migrated the tarsnap_readpass(…) function, which handles password entry through a terminal. The function turns off echoing and masks the password with blanks in the terminal. The password is stored in a memory buffer and passed on to the next functions:
    /* If we're reading from a terminal, try to disable echo. */
    #if defined(WIN_TP)
    if ((usingtty = _isatty(_fileno(readfrom))) != 0) {
    	GetConsoleMode(hStdin, &mode);
    	if (usingtty)
    		mode &= ~ENABLE_ECHO_INPUT;
    	else
    		mode |= ENABLE_ECHO_INPUT;
    	SetConsoleMode(hStdin, mode);
    }
    #else
    if ((usingtty = isatty(fileno(readfrom))) != 0) {
    	if (tcgetattr(fileno(readfrom), &term_old)) {
    		warn("Cannot read terminal settings");
    		goto err1;
    	}
    	memcpy(&term, &term_old, sizeof(struct termios));
    	term.c_lflag = (term.c_lflag & ~ECHO) | ECHONL;
    	if (tcsetattr(fileno(readfrom), TCSANOW, &term)) {
    		warn("Cannot set terminal settings");
    		goto err1;
    	}
    }
    #endif  // defined(WIN_TP)

    Example 4: Password control via terminal

  6. In the original getsalt(), a salt is built from pseudorandom numbers read from the Linux special file /dev/urandom. On Windows we suggest using the rdrand instruction to read from the hardware random number generator available on Intel® Xeon® and Intel® Core™ processor families starting with the Ivy Bridge microarchitecture. The C standard pseudorandom generator is not used because it would make getsalt() incompatible with the Intel Tamper Protection Toolkit obfuscation tool. The function getsalt() should be protected with the obfuscator against static and dynamic tampering and reverse engineering, since the salt produced by this function is categorized as sensitive in the Scrypt Encryption Utility section in [2]. The example below shows both the original and the ported random number generation code used to fill a salt:
    #if defined(WIN_TP)
    	uint8_t i = 0;
    
    	for (i = 0; i < buflen; i++, buf++)
    	{
    		_rdrand32_step(buf);
    	}
    #else
    	/* Open /dev/urandom. */
    	if ((fd = open("/dev/urandom", O_RDONLY)) == -1)
    		goto err0;
    	/* Read bytes until we have filled the buffer. */
    	while (buflen > 0) {
    		if ((lenread = read(fd, buf, buflen)) == -1)
    			goto err1;
    		/* The random device should never EOF. */
    		if (lenread == 0)
    			goto err1;
    		/* We're partly done. */
    		buf += lenread;
    		buflen -= lenread;
    	}
    	/* Close the device. */
    	while (close(fd) == -1) {
    		if (errno != EINTR)
    			goto err0;
    	}
    #endif  // defined(WIN_TP)

    Example 5: Original and ported random number generation code

Utility Protection with the Intel® Tamper Protection Toolkit

Now we will make changes in the utility design and code to help protect sensitive data identified in the threat model in the Password-Based Key Derivation section in [2]. The protection of the sensitive data is achieved by code obfuscation using iprot, the obfuscating compiler included in the Intel Tamper Protection Toolkit. It is reasonable to obfuscate only those functions that create, handle, and use sensitive data.

From the Code Obfuscation section in [2] we know that iprot takes as input a dynamic library (.dll) and produces a binary with only obfuscated export functions specified in the command line. So we put all functions working with sensitive data into a dynamic library to be obfuscated, leaving others, like command-line parsing and password reading, in the main executable.

Figure 1 shows the new design for the protected utility. The utility is split into two parts: the main executable and a dynamic library to be obfuscated. The main executable is responsible for parsing a command line, and reading a passphrase and input file into a memory buffer. The dynamic library includes export functions such as scryptenc_file and scryptdec_file that work with sensitive data (N, r, p, salt).

The key data structure used by the dynamic library is the Scrypt context, which stores HMAC-digested information about the Scrypt parameters N, r, p, and salt. The HMAC digest in the context is used to determine whether the latest changes to the context were made by trusted functions such as scrypt_ctx_enc_init, scrypt_ctx_dec_init, scryptenc_file, and scryptdec_file, which hold an HMAC key to re-sign and verify the context. These trusted functions are resistant to modification since we obfuscate them with the obfuscation tool. Two new functions, scrypt_ctx_enc_init and scrypt_ctx_dec_init, are introduced to initialize the Scrypt context for the encryption and decryption modes, respectively.
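
For reference, the sketch below shows one way such a context could be laid out. The field names and sizes are illustrative assumptions rather than the utility's actual definitions; they are chosen only so that the HMAC computation in Examples 6 and 7 covers every field except the trailing digest.

#include <stdint.h>

/* Hypothetical layout of the Scrypt context; illustrative only. */
typedef struct scrypt_ctx_t {
    uint64_t N;              /* CPU/memory cost parameter                      */
    uint32_t r;              /* block size parameter                           */
    uint32_t p;              /* parallelization parameter                      */
    uint8_t  salt[32];       /* random salt produced by getsalt()              */
    struct {
        uint8_t *B0;         /* scratch buffer pointers set during second init */
    } addrs;
    uint8_t  hmac[32];       /* HMAC-SHA256 over all fields above; kept last   */
} scrypt_ctx;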


Figure 1: Design for protected Scrypt Encryption Utility.

Encryption Flow

  1. The utility uses getopt() to handle command-line arguments. See a list of the program arguments in the Password-Based Key Derivation Function section in [2].
  2. Input file for encryption and a passphrase are read into the memory buffer.
  3. The main executable calls scrypt_ctx_enc_init to initialize the Scrypt context, computing secure Scrypt parameters (N, r, p, and salt) from the CPU time and RAM size allowed for the key derivation through command-line options such as maxmem, maxmemfrac, and maxtime. At the end of this call the initialization function creates an HMAC digest over the newly updated state to prevent tampering after the function returns. The initialization function also returns the amount of memory the application must allocate to proceed with encryption.
  4. The utility in the main executable dynamically allocates memory based on the size returned by the initialization function.
  5. The executable calls scrypt_ctx_enc_init a second time. The function verifies the integrity of the Scrypt context using the HMAC digest. If integrity verification passes, the function stores the allocated buffer location in the context and updates the HMAC. File reading and dynamic memory allocation are done in the executable to keep iprot-incompatible code out of the dynamic library: code containing system calls and C standard function calls generates indirect jumps and relocations that are not supported by the obfuscator.
  6. The executable calls scryptenc_file to encrypt the file using the user-supplied passphrase. The function verifies the integrity of the Scrypt context holding the parameters (N, r, p, and salt) used for the key derivation. If verification passes, it calls the Scrypt algorithm to derive a key, which is then used for encryption. The export function produces the same output format as the original Scrypt utility, including the hash values used during decryption to verify the integrity of the encrypted data and the correctness of the passphrase. (A condensed call-sequence sketch follows this list.)
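
The following is a condensed sketch of that call sequence as seen from the main executable. The parameter lists and surrounding variable declarations are assumptions made for illustration; only the function names come from the design above.

scrypt_ctx ctx;
size_t     buf_size = 0;

/* First call: pick N, r, p, and salt, sign the context, report required memory. */
if (scrypt_ctx_enc_init(&ctx, NULL, &buf_size, maxmem, maxmemfrac, maxtime) != 0)
	return 1;

/* The executable, not the obfuscated library, allocates the scratch memory. */
uint8_t *buf = (uint8_t *)malloc(buf_size);
if (buf == NULL)
	return 1;

/* Second call: verify the HMAC, record the buffer location, re-sign the context. */
if (scrypt_ctx_enc_init(&ctx, buf, &buf_size, maxmem, maxmemfrac, maxtime) != 0)
	return 1;

/* Derive the key inside the obfuscated library and encrypt the file. */
if (scryptenc_file(&ctx, infile, outfile, passwd, passwdlen) != 0)
	return 1;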

Decryption Flow

  1. The utility uses getopt() to handle command-line arguments. See a list of the program arguments in the Password-Based Key Derivation section in [2].
  2. Input file for decryption and a passphrase are read into a memory buffer.
  3. The main executable calls scrypt_ctx_dec_init to check whether the provided parameters in the encrypted file data are valid and whether the key derivation function can be computed within the allowed memory and CPU time.
  4. The utility in the main executable dynamically allocates memory based on the size returned by the initialization function.
  5. The executable calls scrypt_ctx_dec_init a second time. The function does the same as in the encryption case.
  6. The executable calls scryptdec_file to decrypt the file using the password. The function verifies the integrity of the Scrypt context holding the parameters (N, r, p, and salt) used for the key derivation. If verification passes, it calls the Scrypt algorithm to derive a key. Using the hash values in the encrypted data, the function verifies the correctness of the password and the integrity of the encrypted data.

In the protected utility we replace the OpenSSL* implementations of the AES cipher in CTR mode and the keyed hash function with those from the Intel Tamper Protection Toolkit crypto library. Unlike OpenSSL, the crypto library satisfies all code restrictions required for obfuscation by iprot and can be used from within obfuscated code without further modification. The AES cipher is called inside scryptenc_file and scryptdec_file to encrypt/decrypt the input file using a key derived from the password. The keyed hash function is called by the export functions (scrypt_ctx_enc_init, scrypt_ctx_dec_init, scryptenc_file, and scryptdec_file) to verify the data integrity of a Scrypt context before using it. In the protected utility all the exported functions of the dynamic library are obfuscated with iprot. The Intel Tamper Protection Toolkit thus helps us mitigate the threats defined in the Password-Based Key Derivation section in [2].

Our solution is a redesigned utility with an iprot-obfuscated dynamic library. It is resistant to the attacks described above: the Scrypt context can be updated only by the export functions, because only they hold the HMAC key needed to recalculate the HMAC digest in the context. Also, these functions and the HMAC key are protected against tampering and reverse engineering by the obfuscator. In addition, other sensitive data, such as the key produced by Scrypt, is protected since it is derived inside the obfuscated exported functions scryptenc_file and scryptdec_file. The obfuscation compiler produces code that is self-encrypted at runtime and protected against tampering and debugging.

Let us consider how the code in scrypt_ctx_enc_init protects the Scrypt context. The main executable passes buf_p as a pointer argument when scrypt_ctx_enc_init is called. If the pointer is null, the function is being called for the first time; otherwise it is the second call. During the first call the initialization picks the Scrypt parameters, calculates the HMAC digest, and returns the amount of memory required for the Scrypt computation, as shown below:

// Execute for the first call when it returns memory size required by scrypt
if (buf_p == NULL) {
	// Pick parameters for scrypt and initialize the scrypt context
	// <...>

	// Compute HMAC
	itp_res = itpHMACSHA256Message((unsigned char *)ctx_p,
			sizeof(scrypt_ctx) - sizeof(ctx_p->hmac),
			hmac_key, sizeof(hmac_key),
			ctx_p->hmac, sizeof(ctx_p->hmac));

	*buf_size_p = (r << 7) * (p + (uint32_t)N) + (r << 8) + 253;
}

Example 6: The first call of code protecting the Scrypt context

During the second call, buf_p points to the allocated memory passed to the scrypt_ctx_enc_init function. Using the HMAC digest in the context, the function verifies the integrity of the context and makes sure that no one has changed it between the first and second calls. After that it initializes the buffer addresses inside the context from buf_p and recomputes the HMAC digest, since the context has changed, as shown below:

// Execute for the second call when memory for scrypt is allocated
if (buf_p != NULL) {
	// Verify HMAC
	itp_res = itpHMACSHA256Message((unsigned char *)ctx_p,
			sizeof(scrypt_ctx) - sizeof(ctx_p->hmac),
			hmac_key, sizeof(hmac_key),
			hmac_value, sizeof(hmac_value));
	if (memcmp(hmac_value, ctx_p->hmac, sizeof(hmac_value)) != 0) {
		return -1;
	}

	// Initialize pointers to buffers for scrypt computation:
	// ctx_p->addrs.B0 = …

	// Recompute HMAC
	itp_res = itpHMACSHA256Message((unsigned char *)ctx_p,
			sizeof(scrypt_ctx) - sizeof(ctx_p->hmac),
			hmac_key, sizeof(hmac_key),
			ctx_p->hmac, sizeof(ctx_p->hmac));
}

Example 7: Second call of code protecting the Scrypt context

From [2] we know that iprot imposes some restrictions on input code for it to be obfuscatable: it demands no relocations and no indirect jumps. C constructs involving global variables, system functions, and C standard function calls can generate relocations and indirect jumps. The code in Example 7 calls one C standard function, memcmp, which makes it incompatible with iprot. For this reason we implemented our own versions of the C standard functions used by the utility, such as memcmp, memset, and memmove. Also, all global variables in the dynamic library were transformed into local variables, taking care that their data is initialized on the stack.
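
A minimal sketch of what such self-contained replacements might look like is shown below. The tp_-prefixed names are ours, not part of the toolkit; the point is simply that plain byte loops compile to code with no CRT calls, relocations, or indirect jumps.

static int tp_memcmp(const void *a, const void *b, size_t n)
{
	const unsigned char *pa = (const unsigned char *)a;
	const unsigned char *pb = (const unsigned char *)b;
	size_t i;
	for (i = 0; i < n; i++) {
		if (pa[i] != pb[i])
			return (pa[i] < pb[i]) ? -1 : 1;
	}
	return 0;
}

static void *tp_memset(void *dst, int c, size_t n)
{
	unsigned char *p = (unsigned char *)dst;
	while (n--)
		*p++ = (unsigned char)c;
	return dst;
}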

In addition, we encountered a problem with obfuscating code that contains double constants, which is not covered by the tutorials and is not documented in the Intel Tamper Protection Toolkit user guide. As shown below, in the pickparams function the salsa20/8 core operation limit has double type and equals 32768. This value is not initialized on the stack, so the compiler places it in a data segment of the binary, which generates a relocation in the code.

	double opslimit;
#if defined(WIN_TP)
	// unsigned char d_32768[] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xE0, 0x40};
	unsigned char d_32768[sizeof(double)];
	d_32768[0] = 0x00;
	d_32768[1] = 0x00;
	d_32768[2] = 0x00;
	d_32768[3] = 0x00;
	d_32768[4] = 0x00;
	d_32768[5] = 0x00;
	d_32768[6] = 0xE0;
	d_32768[7] = 0x40;
	double *var_32768_p = (double *) d_32768;
#endif

	/* Allow a minimum of 2^15 salsa20/8 cores. */
#if defined(WIN_TP)
	if (opslimit < *var_32768_p)
		opslimit = *var_32768_p;
#else
	if (opslimit < 32768)
		opslimit = 32768;
#endif

Example 8: Code for iprot-compatible double variable

We solved this problem by initializing a byte sequence on the stack with a hex dump that matches the in-memory representation of this double value, and by creating a double pointer to this sequence.

To obfuscate the dynamic library with iprot, we use the following command:

iprot scrypt-dll.dll scryptenc_file scryptdec_file scrypt_ctx_enc_init scrypt_ctx_dec_init -c 512 -d 2600 -o scrypt_obf.dll

The interface of the protected utility is not changed. Let us compare the unobfuscated code with the obfuscated version. The following shows disassembled code where the two versions differ significantly.

# non-obfuscated code
scrypt_ctx_enc_init PROC NEAR
        push    ebp                              ; 10030350 _ 55
        mov     ebp, esp                         ; 10030351 _ 8B. EC
        sub     esp, 100                         ; 10030353 _ 83. EC, 64
        mov     dword ptr [ebp-4H], 0  ; 10030356 _ C7. 45, FC, 00000000
        mov     eax, 1                           ; 1003035D _ B8, 00000001
        imul    ecx, eax, 0                      ; 10030362 _ 6B. C8, 00
        mov     byte ptr [ebp+ecx-1CH], 1 ; 10030365 _ C6. 44 0D, E4, 01
        mov     edx, 1                           ; 1003036A _ BA, 00000001
        shl     edx, 0                           ; 1003036F _ C1. E2, 00
        mov     byte ptr [ebp+edx-1CH], 2 ; 10030372 _ C6. 44 15, E4, 02
        mov     eax, 1                           ; 10030377 _ B8, 00000001
        shl     eax, 1                           ; 1003037C _ D1. E0
        mov     byte ptr [ebp+eax-1CH], 3 ; 1003037E _ C6. 44 05, E4, 03
        mov     ecx, 1                           ; 10030383 _ B9, 00000001<…>
# obfuscated code with default parameters
scrypt_ctx_enc_init PROC NEAR
        mov     ebp, esp                     ; 1000100E _ 8B. EC
        sub     esp, 100                     ; 10001010 _ 83. EC, 64
        mov     dword ptr [ebp-4H], 0        ; 10001013 _ C7. 45, FC, 00000000
        mov     eax, 1                       ; 1000101A _ B8, 00000001
        imul    ecx, eax, 0                  ; 1000101F _ 6B. C8, 00
        mov     byte ptr [ebp+ecx-1CH], 1    ; 10001022 _ C6. 44 0D, E4, 01
        push    eax                          ; 10001027 _ 50
        pop     eax                          ; 1000102D _ 58
        lea     eax, [eax+3FFFD3H]           ; 1000102E _ 8D. 80, 003FFFD3
        mov     dword ptr [eax], 608469404   ; 10001034 _ C7. 00, 2444819C
        mov     dword ptr [eax+4H], -124000508 ; 1000103A _ C7. 40, 04, F89BE704
        mov     dword ptr [eax+8H], -443981569 ; 10001041 _ C7. 40, 08, E58960FF
        mov     dword ptr [eax+0CH], 1633409 ; 10001048 _ C7. 40, 0C, 0018EC81
        mov     dword ptr [eax+10H], -477560832 ; 1000104F _ C7. 40, 10, E3890000<…>

Example 9: Disassembled codes for non-obfuscated and obfuscated versions

Obfuscation degrades performance and significantly increases the dynamic library size. The obfuscator allows developers to balance security against performance using the cell size and mutation distance parameters. The current obfuscation uses a 512-byte cell size and a 2600-byte mutation distance. A cell is an instruction subsequence from the original binary. A cell in obfuscated code stays encrypted until the instruction pointer is about to enter it, and a decrypted cell is encrypted again once it has been fully executed.

The source code for the utility that the Intel Tamper Protection Toolkit helps protect will soon be available at GitHub.

Acknowledgments

We thank Raghudeep Kannavara for originating the idea of applying the Intel Tamper Protection Toolkit to the Scrypt Encryption Utility and Andrey Somsikov for many helpful discussions.

References

  1. K. Grasman. getopt_port on GitHub https://github.com/kimgr/getopt_port/
  2. R. Kazantsev, D. Katerinskiy, and L. Thaddeus. Understanding Intel® Tamper Protection Toolkit and Scrypt Encryption Utility, Intel Developer Zone, 2016.
  3. C. Percival. The Scrypt Encryption Utility. http://www.tarsnap.com/scrypt/scrypt-1.1.6.tgz
  4. C. Percival and S. Josefsson (2012-09-17). The Scrypt Password-Based Key Derivation Function. IETF.
  5. W. Shawn. Freebsd sources on GitHub https://github.com/lattera/freebsd

About the Authors

Roman Kazantsev works in the Software & Services Group at Intel Corporation. Roman has 7+ years of professional experience in software engineering. His professional interests are focused on cryptography, software security, and computer science. He currently works as a Software Engineer, where his ongoing mission is to deliver cryptographic solutions and expertise for content protection across all Intel platforms. He received his Bachelor's and Master's degrees in Computer Science with honors from Nizhny Novgorod State University, Russia.

Denis Katerinskiy works in the Software & Services Group at Intel Corporation. He has 2 years of experience in software development. His main interests are programming, performance optimization, algorithm development, mathematics, and cryptography. In his current role as a Software Development Engineer, Denis develops software simulators for Intel architecture. He is currently pursuing a Bachelor's degree in Computer Science at Tomsk State University.

Thaddeus Letnes works in the Software & Services Group at Intel Corporation. He has 15+ years of professional experience in software development. His main interests are low-level systems, languages, and engineering practices. In his current role as a Software Engineer developing software development tools, Thaddeus works closely with software developers, architects, and project managers to produce high-quality development tools. Thaddeus holds a Bachelor's degree in Computer Science from Knox College.

RealPerspective: Head Tracking with Intel® RealSense™ Technology


Code Sample

Introduction

RealPerspective utilizes Intel® RealSense™ technology to create a unique experience. This code sample utilizes head tracking to perform a monoscopic technique for better 3D fidelity.

Using a system equipped with an Intel® RealSense™ camera, the user can move their head around and have the game’s perspective correctly computed. The effect is best described as looking through a window into another world. Traditionally this has been done with an RGB camera or IR trackers [3], but with the Intel RealSense camera’s depth information, the developer gets accurate face tracking without any additional hardware on the user.

The sample accomplishes the effect by implementing an off-axis perspective projection described by Kooima [1]. The inputs are the face’s spatial X, Y position and the face’s average depth.

Build and Deploy

The Intel® RealSense™ SDK and the Intel RealSense Depth Camera Manager are required for development.

For deploying the project to end users, the matching SDK Runtime Redistributable must be installed.

To download the SDK, SDK Runtime, and Depth Camera Manager, go to: https://software.intel.com/en-us/intel-realsense-sdk/download

Unity

For the Unity project, to ensure compatibility with the SDK installed on the system, please replace libpxcclr.unity.dll and libpxccpp2c.dll in Libraries\x64 and Libraries\x86 of the project with the DLLs in bin\x64 and bin\x86 of the Intel RealSense SDK respectively.

Method

Initialize Intel RealSense Camera

During start up, the Sense Manager initializes and configures the face module for face detection (bounding rectangle and depth). Once completed the Sense Manager pipeline is ready for data.

Process Input

The process input function returns a Vector3 normalized to 0 to 1 containing the face’s 3D spatial position from the Intel RealSense camera. If the Intel RealSense camera is not available, mouse coordinates are used.

The face’s x, y, z location comes from the Intel RealSense SDK’s face module. The face’s XY planar position comes from the center of the face’s bounding rectangle detection in pixel units. The face’s z comes from the face’s average depth in millimeters. The function is non-blocking, so if data is not available the Update function is not delayed and the previous perspective and view matrices remain unchanged.

Calculate Off-Axis Parameters

pa, pb, and pc are points that define the screen extents and determine the screen size, aspect ratio, position, and orientation in space. These screen extents are scaled based on the screen and come from the application’s window. Finally, n and f are the near and far planes; in Unity, these values come from the Camera class.

For example, if the room is 16 by 9 units with an aspect ratio of 16:9, then pa, pb, pc can be set so the room covers the screen. The distance from pa to pb will be the width of the room, 16 units, and the distance from pa to pc will be the height of the room, 9 units. For additional examples, see Kooima [1].

Off-Axis Parameters

Calculate Off-Axis Matrices

The goal of this function is to return the off-axis matrix. The projection matrix is essentially based on the OpenGL* standard glFrustum. The final step is aligning the eye with the XY plane and translating to the origin. This is similar to what the camera or view matrix does for the graphics pipeline.

Projection matrix

First, the orthonormal basis vectors ( vr, vu, vn) are computed based on the screen extents. The orthonormal basis vectors will later help project the screen space onto the near plane and create the matrix to align the tracker space with the XY plane.

Off-Axis Matrices

Next, screen extents vectors, va, vb, and vc, are created from the screen plane.

Screen extents vectors

Next, the frustum extents l, r, b, and t are computed from the screen extents by projecting the basis vectors onto the screen-extent vectors to get their location on the plane, then scaling back by the distance from the screen plane to the near plane. This is done because the frustum extents define the frustum on the near plane.

frustum created on the near plane

Finally, once the frustum extents are computed, the values are plugged into the glFrustum function to produce the perspective projection matrix. The field of view can be computed from the frustum extents [2]. A sketch of this computation is shown below.
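
The following condenses the basis-vector and frustum-extent computation described above into plain C++ (outside Unity), following Kooima's formulation [1]. The Vec3 helpers and the OffAxisFrustum signature are assumptions made for illustration, not part of the sample's source.

#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  normalize(Vec3 a)     { float l = std::sqrt(dot(a, a)); return {a.x/l, a.y/l, a.z/l}; }

// pa, pb, pc: lower-left, lower-right, upper-left screen corners; pe: tracked eye position.
void OffAxisFrustum(Vec3 pa, Vec3 pb, Vec3 pc, Vec3 pe, float n,
                    float& l, float& r, float& b, float& t)
{
    // Orthonormal basis of the screen plane.
    Vec3 vr = normalize(sub(pb, pa));     // right
    Vec3 vu = normalize(sub(pc, pa));     // up
    Vec3 vn = normalize(cross(vr, vu));   // normal, toward the viewer

    // Vectors from the eye to the screen corners.
    Vec3 va = sub(pa, pe);
    Vec3 vb = sub(pb, pe);
    Vec3 vc = sub(pc, pe);

    // Distance from the eye to the screen plane.
    float d = -dot(va, vn);

    // Frustum extents on the near plane; these feed a glFrustum-style projection matrix.
    l = dot(vr, va) * n / d;
    r = dot(vr, vb) * n / d;
    b = dot(vu, va) * n / d;
    t = dot(vu, vc) * n / d;
}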

Projection plane orientation

The foreshortening effect of the perspective projection works only when the view position is at the origin. Thus the first step is to align the screen with the XY plane. The matrix M is constructed from the basis vectors (vr, vu, and vn) to transform Cartesian coordinates into screen-local coordinates. However, it is the screen space that needs to be aligned with the XY plane, so the transpose of matrix M is used.

View point offset

Similarly, the tracker eye position, pe, must be translated to the frustum origin. This is done with a translation matrix T.

Composition

The computed matrices are fed back into Unity’s Camera data structure.

Performance

The test system was a GIGABYTE Technology BRIX* Pro with an Intel® Core™ i7-4770R processor (65W TDP).

In general, the average performance overhead is very low. The entire Update() function completes in less than 1 ms: about 0.50 ms for frames with a detected face and 0.20 ms for frames with no detected face. New data is available about every 33 ms.

Use Case and Future work

The technique discussed in the sample can be used seamlessly in games when RealSense hardware is available on an Intel® processor-based system. The provided auxiliary input system adds an extra level of detail that improves the game’s immersion and 3D fidelity.

A few possible use cases are RTS (real-time strategy), MOBA (multiplayer online battle arena), and tabletop games, which let the user see the action as if they were playing a game of chess. In simulation and sandbox games the user sees the action, can get the perfect view of his or her virtual minions, and can lean in to see what they’re up to.

The technique is not limited to retrofitting current and previous games, or even to gaming. For gaming, new uses can include dodging, lean-in techniques, and full-screen HUD movement (e.g., a crisis helmet HUD). Non-gaming uses include digital displays such as picture frames and multiple-monitor support. The technique can also be considered a point on the virtual reality spectrum that does not require a bulky and expensive head-mounted display.

References

[1]: Kooima, Robert. Generalized Perspective Projection. 2009.

[2]: OpenGL.org. Transformations.

[3]: Johnny Lee. Head Tracking for Desktop VR Displays using the WiiRemote. 2007.

Appendix

Requirements

  • Intel RealSense enabled system or SR300 developer camera
  • Intel RealSense SDK version 6.0+
  • Intel RealSense Depth Camera Manager SR300 version 3.0+
  • Microsoft Windows 8.1* or newer
  • Unity 5.3+ 

Intel® Math Kernel Library (Intel® MKL) 11.3 Update 2 for Windows*


Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance. Intel MKL 11.3 Update 2 packages are now ready for download. Intel MKL is available as part of the Intel® Parallel Studio XE and Intel® System Studio. Please visit the Intel® Math Kernel Library Product Page.

Intel® MKL 11.3 Update 2 Bug fixes

New Features in MKL 11.3 Update 2

  • Introduced the mkl_finalize function to facilitate usage models in which Intel MKL dynamic libraries, or third-party dynamic libraries statically linked with Intel MKL, are loaded and unloaded explicitly
  • Compiler offload mode now allows using Intel MKL dynamic libraries
  • Added Intel TBB threading for all BLAS level-1 functions
  • Intel MKL PARDISO:
    • Added support for block compressed sparse row (BSR) matrix storage format
    • Added optimization for matrices with variable block structure
    • Added support for mkl_progress in Parallel Direct Sparse Solver for Clusters
    • Added cluster_sparse_solver_64 interface
  • Introduced sorting algorithm in Summary Statistics

Check out the latest Release Notes for more updates

Contents

  • File: w_mkl_11.3.2.180_online.exe

    Online Installer for Windows

  • File: w_mkl_11.3.2.180.exe

    A File containing the complete product installation for Windows* (32-bit/x86-64bit development)

A Developer’s Guide To Intel® RealSense™ Camera Detection Methods


Abstract

The arrival of a new and improved front-facing camera, the SR300, has necessitated changes to the Intel® RealSense™ SDK and the Intel® RealSense™ Depth Camera Manager that may prevent legacy applications from functioning. This paper provides an overview of some key aspects in developing camera-independent applications that are portable across the different front-facing cameras: Intel® RealSense™ cameras F200 and SR300. It also details several methods for detecting the set of front- and rear-facing camera devices featured in the Intel RealSense SDK. These methods include how to use the installer scripts to detect the local capture device as well as how to use the Intel RealSense SDK to detect the camera model and its configuration at runtime. This paper is intended for novice and intermediate developers who have either previously developed F200 applications and want to ensure compatibility on SR300-equipped systems or want to develop new Intel® RealSense™ applications targeting SR300’s specific features.

Introduction

The arrival of the new and improved front-facing Intel RealSense camera SR300 has introduced a number of changes to the Intel RealSense SDK as well as new considerations to maintain application compatibility across multiple SDK versions. As of the R5 2015 SDK release for Windows*, three different camera models are supported including the rear-facing Intel RealSense camera R200 and two front-facing cameras: the Intel RealSense camera F200 and the newer SR300. The SR300 brings a number of technical improvements over the legacy F200 camera, including improved tracking range, motion detection, color stream and IR sensors, and lower system resource utilization. Developers are encouraged to create new and exciting applications that take advantage of these capabilities.

However, the presence of systems with different front-facing camera models presents several unique challenges for developers. There are certain steps that should be taken to verify the presence and configuration of the Intel RealSense camera to ensure compatibility. This paper outlines the best-known methods to develop a native SR300 application and successfully migrate an existing F200 application to a SR300 platform while maintaining compatibility across both cameras models.

Detecting The Intel® RealSense™ camera During Installation

In order to ensure support, first verify which camera model is present on the host system during application install time. The Intel RealSense SDK installer script provides options to check for the presence of any of the camera models using command-line options. Unless a specific camera model is required, we recommend that you use the installer to detect orientation (front or rear facing) to maintain portability across platforms with different camera models. If targeting specific features, you can check for specific camera models (Intel RealSense cameras F200, SR300, and R200) by specifying the appropriate options. If the queried camera model is not detected, the installer will abort with an error code. The full SDK installer command list can be found on the SDK documentation website under the topic Installer Options. For reference, you can find the options related to detecting the camera as well as sample commands below.

Installer Command Options

  • --f200, --sr300, --r200: Force a camera model check such that the runtime is installed only when the requested camera model is detected. If the camera model is not detected, the installer aborts with status code 1633.

  • --front, --rear: The --front option checks for any front-facing camera, and the --rear option checks for any rear-facing camera.

Examples

Detect presence of any rear-facing camera and install the 3D scan runtime silently via web download:

intel_rs_sdk_runtime_websetup_YYYY.exe --rear --silent --no-progress --acceptlicense=yes --finstall=core,3ds --fnone=all

Detect presence of an F200 camera and install the face runtime silently:

intel_rs_sdk_runtime_YYYY.exe --f200 --silent --no-progress --acceptlicense=yes --finstall=core,face3d --fnone=all

Detecting The Intel RealSense Camera Configuration at Runtime

After verifying proper camera setup at install time, verify the capture device and driver version (that is, Intel RealSense Depth Camera Manager (DCM) version) during the initialization of your application. To do this, use the provided mechanisms in the Intel RealSense SDK such as DeviceInfo and the ImplDesc structures. Note that the device information is only valid after the Init function of the SenseManager interface.

Checking the Camera Model

To check the camera model at startup, use the QueryDeviceInfo function, which returns a DeviceInfo structure. The DeviceInfo structure includes a DeviceModel member variable that includes all supported camera models available. Note that the values enumerated by the DeviceModel include predefined camera models that will change as the SDK evolves. You will want to verify that the SDK version on which you are compiling your application is recent enough to include the appropriate camera model that your application requires.

Code sample 1 illustrates how to use the QueryDeviceInfo function to retrieve the currently connected camera model in C++. Note that the device information is only valid after the Init function of the SenseManager interface.

Code Sample 1: Using DeviceInfo to check the camera model at runtime.

// Create a SenseManager instance
PXCSenseManager *sm=PXCSenseManager::CreateInstance();
// Other SenseManager configuration (say, enable streams or modules)
...
// Initialize for starting streaming.
sm->Init();
// Get the camera info
PXCCapture::DeviceInfo dinfo={};
sm->QueryCaptureManager()->QueryDevice()->QueryDeviceInfo(&dinfo);
printf_s("camera model = %d\n", dinfo.model);
// Clean up
sm->Release();

Checking The Intel RealSense Depth Camera Manager Version At Runtime

The Intel RealSense SDK also allows you to check the DCM version at runtime (in addition to the SDK runtime and individual algorithm versions). This is useful to ensure that the required Intel® RealSense™ technologies are installed. An outdated DCM may result in unexpected camera behavior, non-functional SDK features (that is, detection, tracking, and so on), or reduced performance. In addition, having the latest Gold DCM for the Intel RealSense camera SR300 is necessary to provide backward compatibility for apps designed on the F200 camera (the latest SR300 DCM must be downloaded on Windows 10 machines using Windows Update). An application developed on an SDK earlier than R5 2015 for the F200 camera should verify both the camera model and the DCM at startup to ensure compatibility on an SR300 machine.

In order to verify the camera driver version at runtime, use the QueryModuleDesc function, which returns the specified module’s descriptor in the ImplDesc structure. To retrieve the camera driver version, specify the capture device as the input argument to the QueryModuleDesc and retrieve the version member of the ImplDesc structure. Code sample 2 illustrates how to retrieve the camera driver version in the R5 version of the SDK using C++ code. Note that if the DCM is not installed on the host system, the QueryModuleDesc call returns a STATUS_ITEM_UNAVAILABLE error. In the event of a missing DCM or version mismatch, the recommendation is to instruct the user to download the latest version using Windows Update. For full details on how to check the SDK, camera, and algorithm versions, please reference the topic titled Checking SDK, Camera Driver, and Algorithm Versions on the SDK documentation website.

Code Sample 2: Using ImplDesc to get the algorithm and camera driver versions at runtime.

PXCSession::ImplVersion GetVersion(PXCSession *session, PXCBase *module) {
    PXCSession::ImplDesc mdesc={};
    session->QueryModuleDesc(module, &mdesc);
    return mdesc.version;
}
// sm is the PXCSenseManager instance
PXCSession::ImplVersion driver_version=GetVersion(sm->QuerySession(), sm->QueryCaptureManager()->QueryCapture());
PXCSession::ImplVersion face_version=GetVersion(sm->QuerySession(), sm->QueryFace());
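
If the DCM is missing, the same query can be used to fail gracefully. The sketch below extends Code Sample 2 under the assumption that the status constant and version field names match the installed SDK headers; the message text is only an example.

PXCSession::ImplDesc mdesc = {};
pxcStatus sts = sm->QuerySession()->QueryModuleDesc(
    sm->QueryCaptureManager()->QueryCapture(), &mdesc);
if (sts == PXC_STATUS_ITEM_UNAVAILABLE) {
    // No DCM installed: direct the user to install the latest driver via Windows Update.
    printf_s("Camera driver (DCM) not found; please update via Windows Update.\n");
} else if (sts >= PXC_STATUS_NO_ERROR) {
    printf_s("camera driver version = %d.%d\n",
             mdesc.version.major, mdesc.version.minor);
}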

Developing For Multiple Front-Facing Camera Models

Starting with the R5 2015 SDK release for Windows, a new front-facing camera model, named the Intel RealSense camera SR300, has been added to the list of supported cameras. The SR300 improves upon the Intel RealSense camera model F200 in several key ways, including increased tracking range, lower power consumption, better color quality in low light, increased SNR for the IR sensor, and more. Applications that take advantage of the SR300 capabilities can result in improved tracking quality, speed, and enhanced responsiveness over F200 applications. However, with the addition of a new camera in the marketplace comes increased development complexity in ensuring compatibility and targeting specific features in the various camera models.

This section summarizes the key aspects developers must know in order to write applications that take advantage of the unique properties of the SR300 camera or run in backward compatibility mode with only F200 features. For a more complete description of how to migrate F200 applications to SR300 applications, please read the section titled Working with Camera SR300 on the SDK documentation website.

Intel RealSense camera F200 Compatibility Mode

In order to allow older applications designed for the F200 camera to function on systems equipped with an SR300 camera, the SR300 DCM (gold or later) implements an F200 compatibility mode. It is automatically activated when a streaming request is sent by a pre-R5 application, and it allows the DCM to emulate F200 behavior. In this mode if the application calls QueryDeviceInfo, the value returned will be “F200” for the device name and model. Streaming requests from an application built on the R5 2015 or later SDK are processed natively and are able to take advantage of all SR300 features as hardware compatibility mode is disabled.

It is important to note that only one mode (native or compatibility) can be run at a time. This means that if two applications are run, one after the other, the first application will determine the state of the F200 compatibility mode. If the first application was compiled on an SDK version earlier than R5, the F200 compatibility mode will automatically be enabled regardless of the SDK version of the second application. Similarly, if the first application is compiled on R5 or later, the F200 compatibility mode will automatically be deactivated and any subsequent applications will see the camera as an SR300. Thus if the first application is R5 or later (F200 compatibility mode disabled) but a subsequent application is pre-R5, the second application will not see a valid Intel RealSense camera on the system and thus will not function. This is because the pre-R5 application requires an F200 camera but the DCM is running in native SR300 mode due to the earlier application. There is currently no way to override the F200 compatibility state for the later application, nor is it possible for the DCM to emulate both F200 and SR300 simultaneously.

Table 1 summarizes the resulting state of the compatibility mode when multiple Intel RealSense applications are running on the same system featuring an SR300 camera (application 1 is started before application 2 on the system):

Table 1: Intel RealSense camera F200 Compatibility Mode State Summary with Multiple Applications Running

  • Application 1: Pre-R5 compilation; Application 2: Pre-R5 compilation; F200 compatibility mode: ACTIVE. App1 is run first; the DCM sees a pre-R5 app and enables F200 compatibility mode.

  • Application 1: Pre-R5 compilation; Application 2: R5 or later compilation; F200 compatibility mode: ACTIVE. App1 is run first; the DCM sees a pre-R5 app and enables F200 compatibility mode.

  • Application 1: R5 or later compilation; Application 2: Pre-R5 compilation; F200 compatibility mode: NOT ACTIVE. App1 is run first; the DCM sees an SR300 native app and disables F200 compatibility mode. App2 will not see a valid camera and will not run.

  • Application 1: R5 or later compilation; Application 2: R5 or later compilation; F200 compatibility mode: NOT ACTIVE. App1 is run first; the DCM sees an R5 or later app and disables F200 compatibility mode. Both apps will use native SR300 requests.

Developing Device-Independent Applications

To accommodate the arrival of the Intel RealSense camera SR300, many of the 2015 R5 Intel RealSense SDK components have been modified to maintain compatibility and to maximize the efficiency of the SR300’s capabilities. In most cases, developers should strive to develop camera-agnostic applications that will run on any front-facing camera to ensure maximum portability across various platforms. The SDK modules and stream interfaces provide the capability to handle all of the platform differentiation if used properly. However, if developing an application that uses unique features of either the F200 or SR300, the code must identify the camera model and handle cases where the camera is not capable of those functions. This section outlines the key details to keep in mind when developing front-facing Intel RealSense applications to ensure maximum compatibility.

SDK Interface Compatibility

To maintain maximum compatibility between the F200 and SR300 cameras, use the built-in algorithm modules (face, 3DS, BGS, and so on) and the SenseManager interface to read raw streams without specifying any stream resolutions or pixel formats. This approach allows the SDK to handle the conversion automatically and minimize necessary code changes. Keep in mind that the maturity levels of the algorithms designed for SR300 may be less than those designed for F200 given that the SR300 was not supported until the 2015 R5 release. Be sure to read the SDK release notes thoroughly to understand the maturity of the various algorithms needed for your application.

In summary, the following best practices are recommended to specify a stream and read image data:

  • Avoid enabling streams using specific configuration (resolution, frame rate):

    sm->EnableStream(PXCCapture::STREAM_TYPE_COLOR, 640, 480, 60);

    Instead, let the SenseManager select the appropriate configuration based on the available camera model:

    sm->EnableStream(PXCCapture::STREAM_TYPE_COLOR);

  • Use the Image functions (such as AcquireAccess and ExportData) to force pixel format conversion.

    PXCImage::ImageData data;

    image->AcquireAccess(PXCImage::ACCESS_READ, PXCImage::PIXEL_FORMAT_RGB32,&data);

  • If a native pixel format is desired, be sure to handle all cases so that the code will work independent of the camera model (see SoftwareBitmapToWriteableBitmap sample in the appendix of this document).

  • When accessing the camera device properties, use the device-neutral device properties as listed in Device Neutral Device Properties.
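
Putting the earlier model check together with these practices, a hedged sketch of keeping camera-specific work behind a single guard might look like the following. The helper functions are hypothetical, and the enum and field names are assumed to match the installed SDK headers.

PXCCapture::DeviceInfo dinfo = {};
sm->QueryCaptureManager()->QueryDevice()->QueryDeviceInfo(&dinfo);

if (dinfo.orientation == PXCCapture::DEVICE_ORIENTATION_FRONT_FACING) {
    // Common, device-neutral path that works on both F200 and SR300.
    ConfigureFrontFacingPipeline(sm);          // hypothetical helper
    if (dinfo.model == PXCCapture::DEVICE_MODEL_SR300) {
        // Guarded, SR300-only feature; skipped cleanly on an F200 system.
        EnableSr300OnlyFeature(sm);            // hypothetical helper
    }
} else {
    // Rear-facing (R200) or unknown camera: fall back to a generic configuration.
    ConfigureDefaultPipeline(sm);              // hypothetical helper
}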

Intel RealSense SDK Incompatibilities

As of the R5 2015 Intel RealSense SDK release, there remain several APIs that will exhibit some incompatibilities between the F200 and SR300 cameras. Follow the mitigation steps outlined in Table 2 to write camera-independent code that works for any camera:

Table 2: Mitigation Steps for Front-Facing Camera Incompatibilities

  • Feature: Camera name. Compatibility issue: The friendly name and device model ID differ between the F200 and SR300. Recommendations: Do not use the friendly name string as a unique ID; use it only to display the device name to the user. Use the device model to perform camera-specific operations, or use the front-facing/rear-facing orientation value from DeviceInfo if that is sufficient.

  • Feature: SNR. Compatibility issue: The IR sensor in the SR300 has a much higher SNR and a native 10-bit data type (up from 8-bit on the F200); as a result the IR_RELATIVE pixel format is no longer exposed. Recommendation: Use the AcquireAccess function to force a pixel format of Y16 when accessing SR300 IR stream data.

  • Feature: Depth stream scaling factor. Compatibility issue: The native depth stream data representation has changed from 1/32 mm on the F200 to 1/8 mm on the SR300; if accessing native depth data with pixel format DEPTH_RAW, a proper scaling factor must be used (this does not affect apps using pixel format DEPTH). Recommendation: Retrieve the proper scaling factor using QueryDepthUnit, or force a pixel format conversion from DEPTH_RAW to DEPTH using the AcquireAccess function.

  • Feature: Device properties. Compatibility issue: Several of the device properties outlined in the F200 & SR300 Member Functions document differ between the two cameras: the filter option definition table differs because of the cameras' different range capabilities, and the SR300 supports only the FINEST option for the SetIVCAMAccuracy function. Recommendation: Avoid camera-specific properties so that camera-level feature changes do not affect the application; use the Intel RealSense SDK algorithm modules to have the SDK automatically apply the best settings for the given algorithm.

Conclusion

This paper outlined several best-known practices to ensure high compatibility across multiple Intel RealSense camera models. The R5 2015 SDK release for Windows features built-in functions to mitigate compatibility. It is generally good practice to design applications to use only common features across all cameras to facilitate development time and ensure portability. If an application uses features unique to a particular camera, be sure to verify the system configuration both at install time and during runtime initialization. In order to facilitate migration of applications developed for the F200 camera to SR300 cameras, the SR300 DCM includes an F200 compatibility mode that will allow legacy applications to run seamlessly on the later-model camera. However, be aware that not updating legacy apps (pre-R5) may result in failure to run on SR300 systems running other R5 or later applications simultaneously. Finally, it is important to read all supporting SDK documentation thoroughly to understand the varying behavior of certain SDK functions with different camera models.

Resources

Intel RealSense SDK Documentation

https://software.intel.com/sites/landingpage/realsense/camera-sdk/v1.1/documentation/html/index.html?doc_devguide_introduction.html

SR300 Migration Guide

https://software.intel.com/sites/landingpage/realsense/camera-sdk/v1.1/documentation/html/index.html?doc_mgsr300_working_with_sr300.html

Appendix

SoftwareBitmapToWriteableBitmap Code Sample

// SoftwareBitmap is the UWP data type for images.
public SoftwareBitmapToWriteableBitmap(SoftwareBitmap bitmap,
    WriteableBitmap bitmap2)
{
    switch (bitmap.BitmapPixelFormat)
    {
        default:
            using (var converted = SoftwareBitmap.Convert(bitmap,
                BitmapPixelFormat.Rgba8))
                converted.CopyToBuffer(bitmap2.PixelBuffer);
            break;
        case BitmapPixelFormat.Bgra8:
            bitmap.CopyToBuffer(bitmap2.PixelBuffer);
            break;
        case BitmapPixelFormat.Gray16:
        {
            // See the UWP StreamViewer sample for all the code.
            ....
            break;
        }
    }
}

About the Author

Tion Thomas is a software engineer in the Developer Relations Division at Intel. He helps deliver leading-edge user experiences with optimal performance and power for all types of consumer applications with a focus on perceptual computing. Tion has a passion for delivering positive user experiences with technology. He also enjoys studying gaming and immersive technologies.

2016 Release: What's New in Intel® Media Server Studio


Achieve Real-Time 4K HEVC Encode, Ensure AVC & MPEG-2 Decode Robustness

Intel® Media Server Studio 2016 is now available! With a 1.1x performance and 10% quality improvement in its HEVC encoder, Intel® Media Server Studio helps transcoding solution providers achieve real-time 4K HEVC encode with broadcast quality on Intel® Xeon® E3-based Intel® Visual Compute Accelerator and select Xeon® E5 processors.1 Robustness enhancements give extra confidence for AVC and MPEG-2 decode scenarios through handling of broken content seamlessly. See below for more details about new features to accelerate media transcoding.

As a leader in media processing acceleration and cloud-based technologies, Intel, through the power of Intel® processors and Intel® Media Server Studio, helps media solution providers, broadcasting companies, and media/infrastructure developers innovate and deliver advanced performance, efficiency, and quality for media applications and OTT/live video broadcasting.

Download Media Server Studio 2016 Now

Current Users (login required)  New Users: Get Free Community version, Pro Trial or Buy Now


 

Improve HEVC (H.265) Performance & Quality by 10%, Use Advanced GPU Analysis, Reduce Bandwidth

Professional Edition

  • With a 1.1x performance and 10% quality increase (compared to the previous release), media solution developers can achieve real-time 4K HEVC encode with broadcast quality on select Intel Xeon E5 platforms1 using the Intel HEVC software solution, or on the Intel® Visual Compute Accelerator (Intel® VCA)1 by leveraging the GPU-accelerated HEVC encoder.

  • Improve HEVC GPU-accelerated performance by offloading in-loop filters such as the deblocking filter (DBF) and sample adaptive offset (SAO) to the GPU (in prior releases these filters executed on the CPU).

Figure 1. The 2016 edition continues the rapid cadence of innovation with up to 10% improved video coding efficiency over the 2015 R7 version. In addition to delivering real-time 4K30 encode on select Intel® Xeon® E5 processors, this edition now provides real-time 1080p50 encode on previous generation Intel® Core™ i7 and Xeon E3 platforms.** HEVC Software/GPU Accelerated Encode Quality vs. Performance on 4:2:0, 8-bit 1080p content. Quality data is baseline to ISO HM14 (“0 %”) and computed using Y-PSNR BDRATE curves. Performance is an average across 4 bitrates ranging from low bitrate (avg 3.8Mbps) to high bitrate (avg 25 Mbps). For more information, please refer to Deliver High Quality and Performance HEVC whitepaper.

  • With Intel® VTune™ Amplifier advancements, developers can more easily get and interpret graphics processor usage and performance of OpenCL* and Intel® Media SDK-optimized applications. Includes CPU and GPU concurrency analysis, GPU usage analysis using hardware metrics, GPU architecture diagram, and much more.

  • Reduce bandwidth when using the HEVC codec by running Region of Interest (ROI) based encoding, where the ROI is compressed less than its surroundings to preserve detail. This feature improves video conferencing applications. It is achieved by setting the mfxExtEncoderROI structure in the application to specify different ROIs during encoding, either at initialization or at runtime (a rough sketch follows this list).

  • Video Conferencing - Connect business meetings and people together more quickly via video conferencing with specially tuned low-delay HEVC mode.

  • Innovate for 8K - Don't limit your application to encoding streams at 4K resolution; Intel's HEVC codec in Media Server Studio 2016 now supports 8K in both the software and GPU-accelerated encoders
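
As a rough illustration of the ROI control mentioned above, the sketch below attaches a single region to the encoder initialization parameters. Field availability varies by Intel® Media SDK API version, the rectangle and priority values are illustrative only, and the surrounding codec settings are omitted.

#include "mfxvideo.h"
#include <string.h>

void AttachSingleRoi(mfxVideoParam &par, mfxExtEncoderROI &roi, mfxExtBuffer **extBuf)
{
    memset(&roi, 0, sizeof(roi));
    roi.Header.BufferId = MFX_EXTBUFF_ENCODER_ROI;
    roi.Header.BufferSz = sizeof(roi);
    roi.NumROI          = 1;
    roi.ROI[0].Left     = 0;        // pixel coordinates, typically block-aligned
    roi.ROI[0].Top      = 0;
    roi.ROI[0].Right    = 640;
    roi.ROI[0].Bottom   = 360;
    roi.ROI[0].Priority = 3;        // higher priority = compress this region less

    extBuf[0]       = &roi.Header;  // caller-provided storage for the buffer list
    par.ExtParam    = extBuf;
    par.NumExtParam = 1;
    // par is then passed to MFXVideoENCODE_Init (or Reset to change the ROI at runtime).
}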

Advance AVC (H.264) & MPEG-2 Decode & Transcode

Community, Essentials, Pro Editions

  • Advanced 5th-generation graphics and media accelerators, plus custom drivers, unlock real-time, high-quality transcoding of up to 16 HD AVC streams per socket on Intel Xeon E3 v4 processors (or via Intel VCA) by taking advantage of hardware acceleration.

  • Achieve up to 12 HD AVC streams on Intel® Core™ 4th generation processors with Intel® Iris graphics**. 

  • Utilize improved AVC encode quality for BRefType MFX_B_REF_PYRAMID.

  • The AVC and MPEG-2 decoders are more robust than ever in handling corrupted streams and returning failure errors. Get extra confidence for AVC and MPEG-2 decode scenarios with increased robustness, recovery from corrupted content, and seamless handling of broken streams. Advanced error reporting allows developers to better find and analyze decode errors.

Figure 2: In the 2016 version 40% performance gains are achieved in H.264 scenarios from improved hardware scheduling algorithms compared to the 2015 version.** This figure illustrates results of multiple H.264 encodes from a single H.264 source file accelerated using Intel® Quick Sync Video using sample multi_transcode (avail. in code samples). Each point is an average of 4 streams and 6 bitrates with error bars showing performance variation across streams and bitrates. Target Usage 7 (“TU7”) is the highest speed (and lowest quality) operating point. [1080p 50 content was obtained from media.xiph.org/video/derf/: crowd_run, park_joy (30mbps input; 5, 7.1, 10.2, 14.6, 20.9, 30 mbps output; in_to_tree, old_town_cross 15 mbps input, 2.5, 3.5, 5.1, 7.3, 10.4, 15 mbps output]. Configuration: AVC1→N Multi-Bitrate concurrent transcodes, 1080p, TU7 preset, Intel® Core™ i7-4770K CPU @ 3.50GHz ** Number of 1080p Multi-bitrate channels.

Other New and Improved Features

  • Improvements in Intel® SDK for OpenCL™ Applications for Windows includes new features for kernel development.

  • Added support for CTB-level delta QP for all quality presets (Target Usage 1 through 7), all rate control modes (CBR, VBR, AVBR, ConstQP), and all profiles (MAIN, MAIN10, REXT).

  • Support for encoding IPPP..P streams (i.e., no B frames) using the Generalized P and B control, for applications where B frames are dropped to meet bandwidth limitations

  • H.264 encode natively consumes ARGB surfaces (captured from screen/game) and YUY2 surfaces, which reduces preprocessing overhead (i.e. color conversion from RGB4 to NV12 for the Intel® Media SDK to process), and increases screen capture performance.
     

Save Time by Using Updated Code Samples 

  • Major features were added to sample_multi_transcode by extending the pipeline to multiple VPP filters such as composition, denoise, detail (edge detection), frame rate control (FRC), deinterlace, and color space conversion (CSC).

  • sample_decode in the Linux sample package has DRM-based rendering, which can be used with the input argument "-rdrm". sample_decode and sample_decvpp are now merged into the decode sample, with new VPP filters such as deinterlace and color space conversion added.
     

For More Information

The above notes are just the top level features and enhancements in Media Server Studio 2016. Access the product site and review the various edition Release Notes for more details.


1 See Technical Specifications for more details.

**Baseline configuration: Intel® Media Server Studio 2016 Essentials vs. 2015 R7, R4 running on Microsoft Windows* 2012 R2. Intel Customer Reference Platform with Intel® Core-i7 4770k (84W, 4C,3.5GHz, Intel® HD Graphics 4600). Intel Z87KL Desktop board with Intel Z87LPC, 16 GB (4x4GB DDR3-1600MHz UDIMM), 1.0TB 7200 SATA HDD, Turbo Boost Enabled, and HT Enabled. Source: Intel internal measurements as of January 2016.

 

 

 


Read Intel® RealSense™ Camera Streams with New MATLAB® Adaptor Code Sample


Download Code Sample 

Introduction

The downloadable code sample demonstrates the basics of acquiring raw camera streams from Intel® RealSense™ cameras (R200 and F200) in the MATLAB® workspace using the Intel® RealSense™ SDK and MATLAB’s Image Acquisition Toolbox™ Adaptor Kit. This code sample creates possibilities for MATLAB developers to develop Intel® RealSense™ applications for Intel® platforms and has the following features:

  • Multi-stream synchronization. Color stream and depth stream can be acquired simultaneously (see Figure 1).
  • Multi-camera support. Raw streams can be acquired from multiple cameras simultaneously.
  • User adjustable properties. This adaptor supports video input with different camera-specific properties.
Figure 1. Raw Intel® RealSense™ camera (F200) color and depth streams in the MATLAB* figure.


Software Development Environment

The code sample was created on Windows 8* using Microsoft Visual Studio* 2013. The MATLAB version used in this project was MATLAB R2015a.

The SDK and Depth Camera Manager (DCM) version used in this project were:

  • Intel RealSense SDK V7.0.23.8048
  • Intel RealSense Depth Camera Manager F200 V1.4.27.41944
  • Intel RealSense Depth Camera Manager R200 V2.0.3.53109

Hardware Overview

We used the Intel® RealSense™ Developer Kit (F200) and Intel RealSense Developer Kit (R200).

About the Code

This code sample can be built into a dynamic link library (DLL) that implements the connection between the MATLAB Image Acquisition Toolbox™ and Intel RealSense cameras via the Intel RealSense SDK. Figure 2 shows the relationship of this adaptor to the MATLAB and Intel RealSense cameras. The Image Acquisition Toolbox™ is a standard interface provided by MATLAB to acquire images and video from imaging devices.

Figure 2. The relationship of the adaptor to the MATLAB* and Intel® RealSense™ cameras.


The MATLAB installation path I used was C:\MATLAB and the SDK installation path was C:\Program Files (x86)\Intel\RSSDK. Note that the include directories and library directories will need to be changed if your SDK and MATLAB installation paths are different. You will also need to set an environment variable MATLAB in system variables that contains the name of your MATLAB installation folder.

I placed the entire RealSenseImaq code sample in C:\My_Adaptor\RealSenseImaq. The RealSenseImaq solution can be found under this directory and consists of two projects:

  • The imaqadaptorkit is an adaptor kit project provided by MATLAB to make it easier to refer to some adaptor kit files in MATLAB. The file location of this project is: <your_matlab_installation_directory>\R2015a\toolbox\imaq\imaqadaptors\kit
  • The RealSenseImaq is an adaptor project that acquires the raw camera streams. The color and depth data from multiple cameras can be acquired simultaneously. It also contains functions to support video input with different camera-specific properties.

How to Run the Code

To build the DLL from this code sample:

  • First run Microsoft Visual Studio as administrator and open the RealSenseImaq solution. You must ensure that “x64” is specified under the platform setting in the project properties.
  • To build this code sample, right-click the RealSenseImaq project in Solution Explorer, set it as the startup project, and build it.
  • For users who are MATLAB developers and not interested in the source code, a pre-built DLL can be found in the C:\My_Adaptor\RealSenseImaq\x64\Debug\ folder. Note that the DLL directory will need to be changed if you put the code sample in a different location.

To register the DLL in the MATLAB:

  • You must inform the Image Acquisition Toolbox software of the DLL's existence by registering it with the imaqregister function. The DLL can be registered by using the following MATLAB code:

imaqregister('<your_directory>\RealSenseImaq.dll');

  • Start MATLAB and call the imaqhwinfo function. You should be able to see the RealSenseImaq adaptor included in the adaptors listed in the InstalledAdaptors field.

To run the DLL in the MATLAB:

Three MATLAB scripts that I created have been put under the code sample directory C:\My_Adaptor\RealSenseImaq\matlab.

To start to run the DLL in MATLAB, use the scripts as follows:

  • MATLAB script “test1” can be used to acquire raw F200 color streams in MATLAB.
  • Raw color and depth streams from the Intel RealSense camera (F200) can be acquired simultaneously by using the MATLAB script “test2” (see Figure 1).
  • You can also use this adaptor to adjust the camera-specific property and retrieve the current value of the property. For example, the MATLAB script “test3” in the code sample file can be used to retrieve the current value of color brightness and adjust its value.

Check It Out

Follow the download link to get the code.

About Intel® RealSense™ Technology

To get started and learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk.

About MATLAB®

MATLAB is a high-level language and interactive environment that lets you explore and visualize ideas and collaborate across disciplines. To learn more about MATLAB, go to http://www.mathworks.com/products/matlab/.

About the Author

Jing Huang is a software application engineer in the Developer Relations Division at Intel. She is currently focused on the performance of Intel RealSense SDK applications on Intel platforms, but has an extensive background in video and image processing and computer vision, mostly applied to medical imaging and multi-camera applications such as video tracking and video classification.

Get Amazing Intel GPU Acceleration for Media Pipelines


Online webinar: March 19, 9 a.m. (Pacific time)

Register NOW 

Media application developers unite! Accessing the heterogeneous capabilities of Intel® Core™ and Intel® Xeon® processors1 unlocks amazing opportunities for faster performance utilizing some of the most disruptive and rapidly improving aspects of Intel processor design.

Ensure that your media applications and solutions aren't leaving performance options untapped. Learn tips and tricks for adding hardware acceleration to your media code with advanced Intel media software tools in this webinar:

Get Amazing Intel GPU Acceleration for Media Pipelines
March 30, 9 a.m. (Pacific time) - Sign up today

And what’s even better than that? Many of these options and tools are FREE - and already integrated into popular open source frameworks like FFmpeg and OpenCV (more details are below). 

Intel’s amazing GPU capabilities are easy to use, with an awesome set of tools to help you capture the best performance, quality, and efficiency from your media workloads. This overview includes:

  • Intel GPU capabilities and architecture
  • Details on Intel's hardware accelerated codecs 
  • How to get started with rapid application development using FFmpeg and OpenCV (it can be easy!)
  • How to get even better performance by programming directly to Intel® Media SDK and Intel® SDK for OpenCL™ Applications
  • H.264 (AVC) and H.265 (HEVC) capabilities
  • Brief tools introduction, and more!

Register Now

Figure 1.  CPU/GPU Evolution

Figure 1 shows how Intel's graphics processor (GPU) has gained increasing importance and placement with each generation of Intel Core processor. With potential video performance measured by the number of execution units (EUs), you can see how quickly the processor graphics has grown from only 12 EUs to 72.

 

Advanced Media Software Tools & Free Downloads  

 

Webinar Speakers



Future Webinars, Connect with Intel at Upcoming Events

More webinar topics are planned later this year; watch our site for updates. See Intel media acceleration tools and technologies in action, and meet with Intel technical experts at upcoming events.

 

1See hardware requirements for technical specifications.

A Comparison of Intel® RealSense™ Front Facing Camera SR300 and F200


Introduction

The SR300 is the second generation front-facing Intel® RealSense™ camera that supports Microsoft Windows* 10. Similar to the F200 camera model, the SR300 uses coded light depth technology to create a high quality 3D depth video stream at close range. The SR300 camera implements an infrared (IR) laser projector system, Fast VGA infrared (IR) camera, and a 2MP color camera with integrated ISP. The SR300 model uses Fast VGA depth mode instead of native VGA depth mode that the F200 model uses. This new depth mode reduces exposure time and allows dynamic motion up to 2m/s. This camera enables new platform usages by providing synchronized color, depth, and IR video stream data to the client system. The effective range of the depth solution from the camera is optimized from 0.2 to 1.2m for use indoors.


Figure 1: SR300 camera model.

The SR300 camera can use the Intel® RealSense™ SDK for Windows. The version that adds support for the SR300 is SDK 2015 R5 or later. The SR300 will become available in 2016, built into form factors including PCs, all-in-ones, notebooks, and 2-in-1s. The SR300 model adds new features and has a number of improvements over the F200 model as follows:

  • Support for the new Hand Tracking Cursor Mode
  • Support for the new Person Tracking Mode
  • Increased Range and Lateral Speed
  • Improved Color Quality under Low-light Capture and Improved RGB Texture for 3D Scan
  • Improved Color and Depth Stream Synchronization
  • Decreased Power Consumption

Product Highlights    SR300                              F200
Orientation           Front facing                       Front facing
Technology            Coded Light; Fast VGA 60fps        Coded Light; native VGA 60fps
Color Camera          Up to 1080p 30 fps, 720p 60 fps    Up to 1080p 30 fps
SDK                   SDK 2015 R5 or later               SDK R2 or later
DCM version           DCM 3.0.24.51819*                  DCM 1.4.27.41994*
Operating System      Windows 10 64-bit RTM              Windows 10 64-bit RTM, Windows 8 64-bit
Range                 Indoors; 20 – 120 cm               Indoors; 20 – 120 cm

* As of Feb 19th, 2016.

New Features only Supported by SR300

Cursor Mode

The standout feature of the SR300 camera model is Cursor Mode. This tracking mode returns a single point on the hand, allowing accurate and responsive 3D cursor point tracking and basic gestures. Cursor Mode also improves power and performance by more than 50% compared to Full Hand mode, with no added latency and no calibration required. It also increases the range to 85 cm and tracks hand motion at speeds up to 2 m/s. Cursor Mode includes the Click gesture to simulate a mouse click using the index finger.


Figure 2: Click gesture.

Person Tracking

Another new feature provided for the SR300 model is Person Tracking. Person Tracking is also supported on the rear-facing R200 camera, but is not available for the F200. Person Tracking supports real-time 3D body motion tracking. It has three main tracking modes: body movement, skeleton joints, and facial recognition.

  • Body movement: Locates the body, head and body contour.
  • Skeleton joints: Returns the positions of the body’s joints in 2D and 3D data.
  • Facial recognition: Compares the current face against the database of registered users to determine the user’s identification.

Person Tracking    SR300        F200
Detection          50-250 cm    NA
Tracking           50-550 cm    NA
Skeleton           50-200 cm    NA

Increased Range and Lateral Speed

The SR300 camera model introduces a new depth mode called Fast VGA. It captures frames at HVGA and interpolates them to VGA before transmitting to the client. This new depth mode reduces exposure time and allows hand motion speeds up to 2 m/s, while the F200's native VGA mode supports hand motion only up to 0.75 m/s. The SR300 model also provides a significant improvement in range over the F200. Using hand tracking, the SR300 achieves up to 85 cm, while the F200 only reaches 60 cm. Hand segmentation range increases to 110 cm for the SR300, up from 100 cm for the F200 model.

Hand Tracking Mode       SR300                 F200
Cursor Mode - general    20-120 cm (2 m/s)     NA
Cursor Mode - kids       20-80 cm (1-2 m/s)    NA
Tracking                 20-85 cm (1.5 m/s)    20-60 cm (0.75 m/s)
Gesture                  20-85 cm (1.5 m/s)    20-60 cm (0.75 m/s)
Segmentation             20-120 cm (1 m/s)     20-100 cm (1 m/s)

The range for face recognition increases from 80 cm for the F200 up to 150 cm for the SR300 model.

Face Tracking Mode    SR300        F200
Detection             30-100 cm    25-100 cm
Landmark              30-100 cm    30-100 cm
Recognition           30-150 cm    30-80 cm
Expression            30-100 cm    30-100 cm
Pulse                 30-60 cm     30-60 cm
Pose                  30-100 cm    30-100 cm

The SR300 model improves RGB texture mapping and achieves a more detailed 3D scan. The range for 3D scanning increases up to 70 cm while also capturing more detail. Blob tracking speed increases to 2 m/s and its range increases to 150 cm in the SR300 model.

Others Tracking Mode    SR300                F200
3D scanning             25-70 cm             25-54 cm
Blob Tracking           20-150 cm (2 m/s)    30-85 cm (1.5 m/s)
Object Tracking         30-180 cm            30-180 cm

The depth range of the SR300 model improved by 50%-60%. At the 80 cm range, both the SR300 and F200 cameras detect the hand clearly. Beyond 120 cm, the SR300 can still detect the hand while the F200 cannot detect it at all.


Figure 3: SR300 vs F200 depth range.

Improved Color Quality Under Low-light Capture and Improved RGB Texture for 3D Scan

The new auto exposure feature is only available with the SR300 model. Exposure compensation allows images taken in low-light or high-contrast conditions to achieve better color quality. The color stream frame rate might be lower in low-light conditions when color stream auto exposure is enabled.

Function                         SR300    F200
Color EV Compensation Control    Yes      No

Improved Color and Depth Stream Synchronization

The F200 model only supports multiple depth and color applications running at the same frame rate. The SR300 supports multiple depth and color applications running at different frame rates, within an integer interval, while maintaining temporal synchronization. This allows software to switch between different frame rates without having to start or stop the video stream.

Camera Temporal Synchronization                        SR300    F200
Sync different stream types of same frame rate         Yes      Yes
Sync different stream types of different frame rate    Yes      No

Decreased Power Consumption

The SR300 camera model enables additional power gear modes that can operate at lower frame rates. This allows the imaging system to reduce the power consumption of the camera while still maintaining awareness. With the power gear modes, the SR300 can process the scene autonomously while the system is in standby mode.

Backward Compatibility with F200 Applications

The Intel RealSense Depth Camera Manager (DCM) 3.x enables the SR300 camera to function as an F200 camera to provide backward compatibility for applications developed for the F200 camera model. The DCM emulates the capabilities of the F200 camera so that existing SDK applications work seamlessly on the SR300 model. SR300 features are supported in SDK 2015 R5 or later.

When a streaming request comes from an SDK application compiled with an SDK version earlier than 2015 R5, the DCM will automatically activate compatibility mode and send calls through the F200 pipe instead of the SR300 pipe. Most applications should work on the new SR300 model without any configuration.

Infrared Compatibility

The SR300 supports a 10-bit native infrared data format while the F200 supports an 8-bit native infrared data format. The DCM driver provides compatibility by removing or padding 2 bits of data to fit the requested infrared data size.
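
Conceptually, the conversion amounts to a 2-bit shift per sample. The snippet below is an illustrative sketch only (the function names are made up; the DCM driver performs the equivalent work internally):

#include <cstdint>

// Map a 10-bit SR300 IR sample to an 8-bit F200-style sample, and back.
inline uint8_t ir10_to_ir8(uint16_t sample10)
{
    return static_cast<uint8_t>(sample10 >> 2);   // remove the 2 least significant bits
}

inline uint16_t ir8_to_ir10(uint8_t sample8)
{
    return static_cast<uint16_t>(sample8) << 2;   // pad with 2 zero bits
}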

Physical Connector

The motherboard and cable design for F200 and SR300 are identical. The F200 cable plug fits into an SR300 receptacle. Therefore, an F200 cable can be used for an SR300 camera model. Both models require fully powered USB 3.0.

SDK APIs

Most SDK APIs are shared between the SR300, F200, and in some cases even the R200, and the SDK modules provide the proper interface depending on the camera found at runtime. Similarly, simple color and depth streaming that does not request specific resolutions or pixel formats will run without any change required.

And by using the SenseManager to read raw streams and letting it pick the stream resolution, frame rate, and pixel format rather than hardcoding them, no code change is needed.
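
For illustration, a minimal native (C++) raw-stream loop with the SenseManager might look like the sketch below. This is an assumption-level sketch, not code from a shipping sample; passing zeros for width, height, and frame rate asks the SDK to negotiate values supported by whichever camera (SR300 or F200) is present:

PXCSenseManager *sm = PXCSenseManager::CreateInstance();

// Request color and depth without hardcoding resolution, frame rate, or pixel format.
sm->EnableStream(PXCCapture::STREAM_TYPE_COLOR, 0, 0, 0);
sm->EnableStream(PXCCapture::STREAM_TYPE_DEPTH, 0, 0, 0);

if (sm->Init() >= PXC_STATUS_NO_ERROR) {
    while (sm->AcquireFrame(true) >= PXC_STATUS_NO_ERROR) {
        PXCCapture::Sample *sample = sm->QuerySample();
        // sample->color and sample->depth hold the current PXCImage frames.
        sm->ReleaseFrame();
    }
}
sm->Release();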

Even with this automatic adaptation, it is important for every app to check the camera model and configuration at runtime. See Installer Options in the SDK documentation.

DCM

As of this writing, the gold DCM version for SR300 is DCM 3.0.24.59748 and updates will be provided by Windows Update. Visit https://software.intel.com/en-us/intel-realsense-sdk/download to download the latest DCM. For more information on the DCM, go to Intel® RealSense™ Cameras and DCM Overview.

Camera Type              SR300    F200    R200
DCM Installer Version    3.x      1.x     2.x

Hardware Requirements

To support the bandwidth needed by the Intel RealSense camera, a USB 3 port is required in the client system. For details on system requirements and supported operating systems for SR300 and F200, see https://software.intel.com/en-us/RealSense/Devkit/

Summary

This document summarizes the new features and enhancements available with the front-facing Intel RealSense 3D camera SR300 beyond those available with the F200. These new features are supported in SDK 2015 R5 and DCM 3.0.24.51819 or later. This new camera is available to order at http://click.intel.com/realsense.html.

Helpful References

Here is a collection of useful references for the Intel® RealSense™ DCM and SDK, including release notes and how to download and update the software.

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

DIY Pan-Tilt Person Tracking with Intel® RealSense™ Camera


Download Code Sample

Introduction

In the spirit of the Maker Movement and “America’s Greatest Makers” TV show coming this spring, this article describes a project I constructed and programmed: a pan-tilt person-tracking camera rig using an Intel® RealSense™ camera (R200) and a few inexpensive electronic components. The goal of this project was to devise a mechanism that extends the viewing range of the camera for tracking a person in real time.

The camera rig (Figure 1) consists of two hobby-grade servo motors that are directly coupled using tie wraps and double-sided tape, and a low-cost control board.


Figure 1. DIY pan-tilt camera rig.

The servos are driven by a control board connected to the computer’s USB port. A Windows* C# app running on a PC or laptop controls the camera rig. The app uses the Face Tracking and Person Tracking APIs contained in the Intel® RealSense™ SDK for Windows*.

The software, which you can download using the link on this page, drives the two servo motors in real time to physically move the rig nearly 180 degrees in two axes to center the tracked person in the field of view of the R200 camera. You can see a video of the camera rig in action here: https://youtu.be/v2b8CA7oHPw

Why?

The motivation to build a device like this is twofold: first, it presents an interesting control systems problem wherein the camera that’s used to track a moving person is also moving at the same time. Second, a device like this can be employed in interesting use cases such as:

  • Enhanced surveillance – monitoring areas over a wider range than is possible with a fixed camera.
  • Elderly monitoring – tracking a person from a standing position to lying on the floor.
  • Robotic videography – controlling a pan-tilt system like this for recording presentations, seminars, and similar events using a mounted SLR or video camera.
  • Companion robotics – controlling a mobility platform and making your robot follow you around a room.

Scope (and Disclaimer)

This article is not intended to serve as a step-by-step “how-to” guide, nor is the accompanying source code guaranteed to work with your particular rig if you decide to build something similar. The purpose of this article is to chronicle one approach for building an automated person-tracking camera rig.

From the description and pictures provided in this document, it should be fairly evident how to fasten two servo motors together in a pan-tilt arrangement using tie wraps and double-sided tape. Alternatively, you can use a kit like this to simplify the construction of a pan-tilt rig.

Note: This is not a typical (or recommended) usage of the R200 peripheral camera. If you decide to build your own rig, make certain you securely fasten the camera and limit the speed and range of the servo motors to prevent damaging it. If you are not completely confident in your maker skills, you may want to pass on building something like this.

Software Development Environment

The software developed for this project runs on Windows 10 and was developed with Microsoft Visual Studio* 2015. The code is compatible with the Intel® RealSense™ SDK version 2016 R1.

This software also requires installation of the Pololu USB Software Development Kit, which can be downloaded here. The Pololu SDK contains the drivers, Control Center app, and samples that are useful for controlling servo motors over a computer’s USB port. (Note: this third-party software is not part of the code sample that can be downloaded from this page.)

Computer System Requirements

The basic hardware requirements for running the person-tracking app are:

  • 4th generation (or later) Intel® Core™ processor
  • 150 MB free hard disk space
  • 4GB RAM
  • Intel® RealSense™ camera (R200)
  • Available USB3 port for the R200 camera
  • Additional USB port for the servo controller board

Code Sample

The software developed for this project was written in C#/WPF using Microsoft Visual Studio 2015. The user interface (Figure 2) provides the color camera stream from the R200 camera, along with real-time updates of the face and person tracking parameters.


Figure 2. Custom software user interface.

The software attempts to track the face and torso of a single person using both the Face Tracking and Person Tracking APIs. Face tracking alone is performed by default, as it currently provides more accurate and stable tracking. If the tracked person’s face goes out of view of the camera, the software will resort to tracking the whole person. (Note that the person tracking algorithm is under development and will be improved in future releases of the RSSDK.)

To keep the code sample as simple as possible, it attempts tracking only if a single instance of a face or person is detected. The displacement of a bounding rectangle’s center to the middle of the image plane is used to drive the servos. The movements of the servos will attempt to center the tracked person in the image plane.

Servo Control Algorithm

The first cut at controlling the servos in software was to derive linear equations that effectively scale the servo target positions to the coordinate system shared by the face rectangle and image, as shown in the following code snippet.

Servo.cs
public class Servo
{
   public const int Up = 1152;
   public const int Down = 2256;
   public const int Left = 752;
   public const int Right = 2256;
   .
   .
   .
}

MainWindow.xaml.cs
private const int ImageWidthMin = 0;
private const int ImageWidthMax = 640;
private const int ImageHeightMin = 0;
private const int ImageHeightMax = 480;
.
.
.
ushort panScaled = Convert.ToUInt16((Servo.Right - Servo.Left) * (faceX -
    ImageWidthMin) / (ImageWidthMax - ImageWidthMin) + Servo.Left);

ushort tiltScaled = Convert.ToUInt16((Servo.Down - Servo.Up) * (faceY -
    ImageHeightMin) / (ImageHeightMax - ImageHeightMin) + Servo.Up);

MoveCamera(panScaled, tiltScaled);

Although this approach came close to accomplishing the goal of centering the tracked person in the image plane, it resulted in oscillations as the servo target position and face rectangle converged. These oscillations could be dampened by reducing the speed of the servos, but this made the camera movements too slow to effectively keep up with the person being tracked. A PID algorithm or similar solution could have been employed to tune out the oscillations, or inverse kinematics could have been used to determine the camera position parameters, but I decided to use a simpler approach instead.

The chosen solution simply compares the center of the face (faceRectangle) or person (personBox) to the center of the image plane in a continuous thread and then increments or decrements the camera position in both x and y axes to find a location that roughly centers the person in the image plane. Deadband regions (Figure 3) are defined in both axes to help ensure the servos stop “hunting” for the center position when the camera is approximately centered on the person.


Figure 3. Incremental tracking method.
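
For reference, here is a minimal sketch of that incremental centering logic. It is written in C++ for brevity (the actual sample is C#), and the deadband size, step size, and function name are illustrative assumptions to be tuned for a real rig:

const int IMAGE_CENTER_X  = 640 / 2;
const int IMAGE_CENTER_Y  = 480 / 2;
const int DEADBAND_PIXELS = 40;   // no movement while the target is near center
const int SERVO_STEP      = 4;    // servo target increment per control-loop pass

void UpdateServoTargets(int faceCenterX, int faceCenterY, int &panTarget, int &tiltTarget)
{
    int dx = faceCenterX - IMAGE_CENTER_X;
    int dy = faceCenterY - IMAGE_CENTER_Y;

    // Nudge the pan/tilt targets only when the face is outside the deadband.
    if (dx > DEADBAND_PIXELS)       panTarget  += SERVO_STEP;
    else if (dx < -DEADBAND_PIXELS) panTarget  -= SERVO_STEP;

    if (dy > DEADBAND_PIXELS)       tiltTarget += SERVO_STEP;
    else if (dy < -DEADBAND_PIXELS) tiltTarget -= SERVO_STEP;

    // The new targets are then sent to the servo controller (MoveCamera in the sample).
}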

Building the Code Sample

The code sample has two dependencies that are not redistributed in the downloadable zip file, but are contained in the Pololu USB Software Development Kit:

  • UsbWrapper.dll (located in pololu-usb-sdk\UsbWrapper_Windows\)
  • Usc.dll (located in pololu-usb-sdk\Maestro\Usc\precompiled_obj\)

These files should be copied to the ServoInterface project folder (C:\PersonTrackingCodeSample\ServoInterface\), and then added as references as shown in Figure 4.


Figure 4. Third-party dependencies referenced in Solution Explorer.

Note that this project uses an explicit path to libpxcclr.cs.dll (the managed RealSense DLL): C:\Program Files (x86)\Intel\RSSDK\bin\win32. This reference will need to be changed if your installation path is different. If you have problems building the code samples, try removing and then re-adding this library reference.

Control Electronics

This project incorporates a Pololu Micro Maestro* 12-channel USB servo controller (Figure 5) to control the two servo motors. This device includes a fairly comprehensive SDK for developing control applications targeting different platforms and programming languages. To see how a similar model of this board was used, refer to the robotic hand control experiment article.


Figure 5. Pololu Micro Maestro* servo controller.

I used Parallax Standard Servo motors in this project; however, similar devices are available that should work equally well for this application. The servos are connected to channels 0 and 1 of the control board as shown in Figure 5.

Servo Controller Settings

I configured the servo controller board settings before starting construction of the camera rig. The Pololu Micro Maestro SDK includes a Control Center app (Figure 6) that allows you to configure firmware-level parameters and save them to flash memory on the control board.


Figure 6. Control Center channel settings.

Typically, you should set the Min and Max settings in Control Center to match the control pulse width of the servos under control. According to the Parallax Standard Servo data sheet, these devices are controlled using “pulse-width modulation, 0.75–2.25 ms high pulse, 20 ms intervals.” The Control Center app specifies units in microseconds, so Min would be set to 750 and Max set to 2250.

However, the construction of this particular device resulted in some hard-stops (i.e., positions that result in physical binding of the servo horn that can potentially damage the component). The safe operating range of each servo was determined experimentally, and these values were entered for channels 0 and 1 to help prevent it from inadvertently being driven to a binding position.

Summary

This article gives an overview of one approach to building an automated camera rig capable of tracking a person’s movements around a wide area. Beyond presenting an interesting control systems programming challenge, practical applications for a device like this include enhanced surveillance, elderly monitoring, etc. Hopefully, this project will inspire other makers to create interesting things with the Intel RealSense cameras and SDK for Windows.

Watch the Video

To see the pan-tilt camera rig in action, check out the YouTube video here: https://youtu.be/v2b8CA7oHPw

Check Out the Code

Follow the Download link to get the sample code for this project.

About Intel® RealSense™ Technology

To learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk.

About the Author

Bryan Brown is a software applications engineer at Intel Corporation in the Software and Services Group. 

Peel the onion (optimization techniques)


This paper is a more formal response to an IDZ Forum posting. See: (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/590710).

The issue, as expressed by the original poster, was that the code did not scale well using OpenMP on an 8-core E5-2650 V2 processor with 16 hardware threads. I took some time on the forum to aid the poster by giving him some pointers, but did not take sufficient time to fully optimize the code. This article will address additional optimizations beyond those laid out in the IDZ forum.

I have to say it is unclear what the experience level of the original poster is; I am going to assume he has recently graduated from an institution that may have taught parallel programming with an emphasis on scaling. In the outside world, the practicalities are: systems have a limited amount of processing resources (threads), and the emphasis should be on efficiency as well as scaling. The original sample code on that forum posting provides the foundation of a learning tool for how to address efficiency, in the greater sense, and scaling, in the lesser sense.

In order to present code for this paper, I took the liberty to re-work the sample code, while keeping with the overall design and spirit of the original code. This means, I kept the fundamental algorithm intact as the example code was taken from an application that may have had additional functionality requiring the given algorithm. The provided code sample used an array of LOGICALs (mask) for flow control. While the sample code could have been written without the logical array(s), the sample code provided may have been an abbreviated excerpt of a larger application, and these mask arrays may have been required for reasons not obvious in the sample code. Therefore the masks were kept.

Upon inspection of the code, and the poster’s first attempt at parallelization, it was determined that the place chosen to create the parallel region (parallel DO) had too short a run. The original code can be sketched like this:

bid = 1 ! { not stated in original posting, but would appeared to be in a DO bid=1,65 }
do k=1,km-1  ! km = 60
    do kk=1,2
        !$OMP PARALLEL PRIVATE(I) DEFAULT(SHARED)
        !$omp do 
        do j=1,ny_block     ! ny_block = 100
            do i=1,nx_block ! nx_block = 81
... {code}
            enddo
        enddo
        !$omp end do
        !$OMP END PARALLEL
    enddo
enddo

For the user’s first attempt at parallelization, he placed the parallel do on the do j= loop. While this is the “hottest” loop level, it is not the appropriate loop level for this problem on this platform.

The number of threads involved was 16. With 16 threads, and the inner two loops performing a combined 8100 iterations, each thread would perform about 506 iterations. However, the parallel region would be entered 120 times (60*2). The work performed in the innermost loop, while not insignificant, was also not significant. This resulted in the cost of the parallel region being a significant portion of the application. With 16 threads, and an outer loop count of 60 iterations (120 if the loops were fused), a better choice may be to raise the parallel region to the do k loop.

The code was modified to execute the do k loop many times and compute the average time to execute the entire do k loop. As optimization techniques are applied, we can then use the ratios of average times of original code to revised code as a measurement of improvement. While I did not have an 8 core E5-2650 v2 processor available for testing, I do have a 6 core E5-2620 v2 processor available.  The slightly reworked code presented the following results:

OriginalSerialCode
Average time 0.8267E-02
Version1_ParallelAtInnerTwoLoops
Average time 0.1746E-02,  x Serial  4.74

Perfect scaling on a 6-core E5-2620 v2 processor would have been somewhere between 6x and 12x (7x if you assume an additional 15% for HT). A scaling of 4.74x is significantly less than an expected 7x.

The following sections of this paper will walk you through four additional optimization techniques.

OriginalSerialCode
Average time 0.8395E-02
ParallelAtInnerTwoLoops
Average time 0.1699E-02,  x Serial  4.94
ParallelAtkmLoop
Average time 0.6905E-03,  x Serial 12.16,  x Prior  2.46
ParallelAtkmLoopDynamic
Average time 0.5509E-03,  x Serial 15.24,  x Prior  1.25
ParallelNestedRank1
Average time 0.3630E-03,  x Serial 23.13,  x Prior  1.52

Note, the ParallelAtInnerTwoLoops report in the second run illustrates a different multiplier factor than the first run. The principal cause for this is fortuitous code placement or lack thereof. The code did not change between runs. The only difference was the addition of the extra code and the insertion of the call statements to run those subroutines. It is important to bear in mind that code placement of tight loops can significantly affect the performance of those loops. Even adding or removing a single statement can significantly affect some code run times.

To facilitate ease of reading of the code changes, the body of the inner 3 loops was encapsulated into a subroutine. This makes the code easier to study as well as easier to diagnose with a program profiler (VTune). Example from the ParallelAtkmLoop subroutine:

bid = 1
!$OMP PARALLEL DEFAULT(SHARED)
!$omp do 
do k=1,km-1 ! km = 60
    call ParallelAtkmLoop_sub(bid, k)
end do
!$omp end do
!$OMP END PARALLEL
endtime = omp_get_wtime()
...
subroutine ParallelAtkmLoop_sub(bid, k)
     ...
    do kk=1,2
        do j=1,ny_block     ! ny_block = 100
            do i=1,nx_block ! nx_block = 81
...
            enddo
        enddo
    enddo
end subroutine ParallelAtkmLoop_sub               

The first optimization I performed was to make two changes:

1) Move the parallelization up two loop levels to the do k loop level, thus reducing the number of entries into the parallel region by a factor of 120. And,

2) The application used an array of LOGICALs as a mask for code selection. I reworked the code used to generate the values to reduce unnecessary manipulation of the mask array.

These two changes resulted in an improvement of 2.46x over the initial parallelization attempt. While this improvement is great, is this as good as you can get?

In looking at the code of the inner most loop we find:

  ... {construct masks}
  if ( LMASK1(i,j) ) then
     ... {code}
  endif

  if ( LMASK2(i,j) ) then
     ... {code}
  endif

  if( LMASK3(i,j) ) then
     ... {code}
  endif

This means the filter masks result in the workload per iteration being unequal. Under this circumstance, it is often better to use dynamic scheduling. This next optimization is performed in ParallelAtkmLoopDynamic. This is the same code as ParallelAtkmLoop but with schedule(dynamic) added to the !$omp do.

This simple change added an additional 1.25x. Note, dynamic scheduling is not your only scheduling option. There are others that might be worth exploring, and note that the type of scheduling often includes a modifier clause (chunk size).

The next level of optimization, which provides an additional 1.52x boost in performance, is what one would consider aggressive optimization. The extra 52% does require significant programming effort (but not unmanageable). The opportunity for this optimization comes from an observation that can be made by looking at the assembly code that you can view using VTune.

I would like to stress that you do not have to understand the assembly code when you look at it. In general you can assume:

more assembly code == slower performance

What you can do is infer, from the complexity (volume) of the assembly code, whether the compiler has missed optimization opportunities. And, when missed opportunities are detected, you can use a simple technique to aid the compiler with code optimization.

When looking at the body of main work we find:

subroutine ParallelAtkmLoopDynamic_sub(bid, k)
  use omp_lib
  use mod_globals
  implicit none
!-----------------------------------------------------------------------
!
!     dummy variables
!
!-----------------------------------------------------------------------
  integer :: bid,k

!-----------------------------------------------------------------------
!
!     local variables
!
!-----------------------------------------------------------------------
  real , dimension(nx_block,ny_block,2) :: &
        WORK1, WORK2, WORK3, WORK4   ! work arrays

  real , dimension(nx_block,ny_block) :: &
        WORK2_NEXT, WORK4_NEXT       ! WORK2 or WORK4 at next level

  logical , dimension(nx_block,ny_block) :: &
        LMASK1, LMASK2, LMASK3       ! flags
   
  integer  :: kk, j, i    ! loop indices
   
!-----------------------------------------------------------------------
!
!     code
!
!-----------------------------------------------------------------------
  do kk=1,2
    do j=1,ny_block
      do i=1,nx_block
        if(TLT%K_LEVEL(i,j,bid) == k) then
          if(TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)) then
            LMASK1(i,j) = TLT%ZTW(i,j,bid) == 1
            LMASK2(i,j) = TLT%ZTW(i,j,bid) == 2
            if(LMASK2(i,j)) then
              LMASK3(i,j) = TLT%K_LEVEL(i,j,bid) + 1 < KMT(i,j,bid)
            else
              LMASK3(i,j) = .false.
            endif
          else
            LMASK1(i,j) = .false.
            LMASK2(i,j) = .false.
            LMASK3(i,j) = .false.
          endif
        else
          LMASK1(i,j) = .false.
          LMASK2(i,j) = .false.
          LMASK3(i,j) = .false.
        endif
        if ( LMASK1(i,j) ) then
          WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
            * SLX(i,j,kk,kbt,k,bid) * dz(k)
                           
          WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &
            - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
            * dz(k+1) )

          WORK2_NEXT(i,j) = c2 * ( &
            KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
            KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

          WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
            * SLY(i,j,kk,kbt,k,bid) * dz(k)

          WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &
            - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
            * dz(k+1) )

          WORK4_NEXT(i,j) = c2 * ( &
            KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

          if( abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) then
            WORK2(i,j,kk) = WORK2_NEXT(i,j)
          endif

          if ( abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk ) ) ) then
            WORK4(i,j,kk) = WORK4_NEXT(i,j)
          endif
        endif

        if ( LMASK2(i,j) ) then
          WORK1(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     &
            * SLX(i,j,kk,ktp,k+1,bid)

          WORK2(i,j,kk) =  c2 * ( WORK1(i,j,kk)                 &
            - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
            * SLX(i,j,kk,kbt,k+1,bid) ) )

          WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)

          WORK3(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     &
            * SLY(i,j,kk,ktp,k+1,bid)

          WORK4(i,j,kk) =  c2 * ( WORK3(i,j,kk)                 &
            - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
            * SLY(i,j,kk,kbt,k+1,bid) ) )

          WORK3(i,j,kk) = WORK3(i,j,kk) * dz(k+1)
        endif
 
        if( LMASK3(i,j) ) then
          if (k.lt.km-1) then ! added to avoid out of bounds access
            WORK2_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLX(i,j,kk,ktp,k+2,bid) * dz(k+2))

            WORK4_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLY(i,j,kk,ktp,k+2,bid) * dz(k+2))
          end if
          if( abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) &
            WORK2(i,j,kk) = WORK2_NEXT(i,j)
          if( abs(WORK4_NEXT(i,j)) < abs(WORK4(i,j,kk)) ) &
            WORK4(i,j,kk) = WORK4_NEXT(i,j)
          endif  
        enddo
      enddo
  enddo
end subroutine ParallelAtkmLoopDynamic_sub

Making an Intel Amplifier run (VTune), and looking at line 540 as an example:

We have part of a statement that performs the product of two numbers. For this partial statement you would expect:

                Load value at some index of SLX
                Multiply by value at some index of dz

Clicking on the Assembly button in amplifier:

Then, sorting by source line number:

And locating source line 540, we find:

We find a total of 46 assembler instructions used to multiply two numbers.

Now comes the inference part.

The two numbers are cells of two arrays. The array SLX has six subscripts; the other has one subscript. You can also observe that the last two assembly instructions are vmovss from memory and vmulss from memory. We were expecting fully optimized code to produce something similar to our expectations. The code above shows that 44 out of 46 assembly instructions are associated with computing the array indexes of these two variables. Granted, we might expect a few instructions to obtain the indexes into the arrays, but not 44 instructions. Can we do something to reduce this complexity?

In looking at the source code (most recent above) you will note that the last four subscripts of SLX, and the one subscript of dz, are loop invariant for the innermost two loops. In the case of SLX, the leftmost two indices, the innermost two loop control variables, represent a contiguous array section. The compiler optimization failed to recognize the unchanging (rightmost) array indices as candidates for loop invariant code that could be lifted out of the loop. Additionally, the compiler also failed to identify the leftmost two indexes as candidates for collapse into a single index.

This is a good example of what future compiler optimization efforts could address under these circumstances. In this case, the next optimization, which performs a lifting of loop invariant subscripting, illustrates a 1.52x performance boost.

Now that we know that a goodly portion of the “do work” code involves contiguous array sections with several subscripts, can we somehow reduce the number of subscripts without rewriting the application?

The answer to this is yes, if we encapsulate smaller array slices represented by fewer array subscripts. How do we do this for this example code?

The choice made was for two nest levels:

  1. at the outermost bid level (the module data indicates the actual code uses 65 bid values)

  2. at the next-to-outermost level, the do k loop level. In addition to this, we consolidate the first two indexes into one.

The outermost level passes bid level array sections:

        bid = 1 ! in real application bid may iterate
        ! peel off the bid
        call ParallelNestedRank1_bid( &
            TLT%K_LEVEL(:,:,bid), &
            KMT(:,:,bid), &
            TLT%ZTW(:,:,bid), &
            KAPPA_THIC(:,:,:,:,bid),  &
            SLX(:,:,:,:,:,bid), &
            SLY(:,:,:,:,:,bid))
…
subroutine ParallelNestedRank1_bid(K_LEVEL_bid, KMT_bid, ZTW_bid, KAPPA_THIC_bid, SLX_bid, SLY_bid)
    use omp_lib
    use mod_globals
    implicit none
    integer, dimension(nx_block , ny_block) :: K_LEVEL_bid, KMT_bid, ZTW_bid
    real, dimension(nx_block,ny_block,2,km) :: KAPPA_THIC_bid
    real, dimension(nx_block,ny_block,2,2,km) :: SLX_bid, SLY_bid
…

Note, for non-pointer (allocatable or fixed-dimensioned) arrays, the arrays are contiguous. This provides you with the opportunity to peel off the rightmost indexes to pass on a contiguous array section, and do so by merely computing the offset to the subsection of the larger array. Peeling indexes other than the rightmost would require creating a temporary array, and should be avoided, though there may be some cases where it might be beneficial to do so.

And the second nested level peeled off an additional array index of the do k loop, as well as compressed the first two indexes into one:

    !$OMP PARALLEL DEFAULT(SHARED)
    !$omp do 
    do k=1,km-1
        call ParallelNestedRank1_bid_k( &
            k, K_LEVEL_bid, KMT_bid, ZTW_bid, &
            KAPPA_THIC_bid(:,:,:,k), &
            KAPPA_THIC_bid(:,:,:,k+1),  KAPPA_THIC_bid(:,:,:,k+2),&
            SLX_bid(:,:,:,:,k), SLY_bid(:,:,:,:,k), &
            SLX_bid(:,:,:,:,k+1), SLY_bid(:,:,:,:,k+1), &
            SLX_bid(:,:,:,:,k+2), SLY_bid(:,:,:,:,k+2), &
            dz(k),dz(k+1),dz(k+2),dzwr(k),dzwr(k+1))
    end do
    !$omp end do
    !$OMP END PARALLEL
end subroutine ParallelNestedRank1_bid   

subroutine ParallelNestedRank1_bid_k( &
    k, K_LEVEL_bid, KMT_bid, ZTW_bid, &
    KAPPA_THIC_bid_k, KAPPA_THIC_bid_kp1, KAPPA_THIC_bid_kp2, &
    SLX_bid_k, SLY_bid_k, &
    SLX_bid_kp1, SLY_bid_kp1, &
    SLX_bid_kp2, SLY_bid_kp2, &
    dz_k,dz_kp1,dz_kp2,dzwr_k,dzwr_kp1)
    use mod_globals
    implicit none
    !-----------------------------------------------------------------------
    !
    !     dummy variables
    !
    !-----------------------------------------------------------------------
    integer :: k
    integer, dimension(nx_block*ny_block) :: K_LEVEL_bid, KMT_bid, ZTW_bid
    real, dimension(nx_block*ny_block,2) :: KAPPA_THIC_bid_k, KAPPA_THIC_bid_kp1
    real, dimension(nx_block*ny_block,2) :: KAPPA_THIC_bid_kp2
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_k, SLY_bid_k
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_kp1, SLY_bid_kp1
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_kp2, SLY_bid_kp2
    real :: dz_k,dz_kp1,dz_kp2,dzwr_k,dzwr_kp1
... ! next note index (i,j) compression to (ij)
    do kk=1,2
        do ij=1,ny_block*nx_block
            if ( LMASK1(ij) ) then

Note that at the point of the call, a contiguous array section (reference) is passed. The dummy arguments of the called routine specify a same sized contiguous chunk of memory with a different number of indexes.  As long as you are careful in Fortran, you can do this.

The coding effort was mostly a copy and paste, then a find and replace operation. Other than this, there were no code flow changes. A meticulous junior programmer could have done this with proper instructions.

While future versions of compiler optimization may make this unnecessary, a little bit of “unnecessary” programming effort now can, at times, yield substantial performance gains (52% in this case).

The equivalent source code statement is now:

And the assembly code is now:

We are now down from 46 instructions to 6 instructions, a 7.66x reduction. This illustrates that by reducing the number of array subscripts, the compiler optimization can reduce the instruction count.

Introducing a 2-level nest with peel yielded a 1.52x performance boost. Whether a 52% boost in performance is worth the additional effort is a subjective measure for you to decide. I anticipate that future compiler optimizations will perform loop-invariant array subscript lifting as performed manually above. But until then, you can use the index peel and compress technique.

I hope that I have provided you with some useful tips.

Jim Dempsey
Quickthread Programming, LLC
A software consulting company.

Chat Heads with Intel® RealSense™ SDK Background Segmentation Boosts e-Sport Experience


Intel® RealSense™ Technology can be used to improve the e-sports experience for both players and spectators by allowing them to see each other on-screen during the game. Using background segmented video (BGS), players’ “floating heads” can be overlaid on top of a game using less screen real estate than full widescreen video, and game graphics can be displayed behind them (like a meteorologist on television). Giving players the ability to see one another while they play, in addition to speaking to one another, will enhance their overall in-game communication experience. And spectators will get a chance to see their favorite e-sports competitors in the middle of all the action.

In this article, we will discuss how this technique is made possible by the Intel® RealSense™ SDK. This sample will help you understand the various pieces in the implementation (using the Intel RealSense SDK for background segmentation, networking, video compression and decompression), the social interaction, and the performance of this use case. The code in this sample is written in C++ and uses DirectX*.

Figure 1: Screenshot of the sample with two players, with a League of Legends* video clip playing in the background.

Figure 2: Screenshot of the sample with two players, with a Hearthstone* video clip playing in the background.

Installing, Building, and Running the Sample

Download the sample at: https://github.com/GameTechDev/ChatHeads

The sample uses the following third-party libraries: 
(i) RakNet for networking 
(ii) Theora Playback Library to play back ogg videos 
(iii) ImGui for the UI 
(iv) Windows Media Foundation* (WMF) for encoding and decoding the BGS video streams

(i) and (ii) are dynamically linked (the required DLLs are present in the source repo), while (iii) is statically linked with source included.
(iv) is dynamically linked, and is part of the WMF runtime, which should be installed by default on a Windows* 8 or greater system. If it is not already present, please install the Windows SDK.

Install the Intel RealSense SDK (2015 R5 or higher) prior to building the sample. The header and library include paths in the Visual Studio project use the RSSDK_DIR environment variable, which is set during the RSSDK installation.

The solution file is at ChatheadsNativePOC\ChatheadsNativePOC and should build successfully with VS2013 and VS2015.

Install the Intel® RealSense™ Depth Camera Manager, which includes the camera driver, before running the sample. The sample has been tested on Windows® 8.1 and Windows® 10 using both the external and embedded Intel® RealSense™ cameras.

When you start the sample, the option panel shown in Figure 3 displays:

Figure 3: Option panel at startup.

  • Scene selection: Select between League of Legends* video, Hearthstone* video and a CPUT (3D) scene. Click the Load Scene button to render the selection. This does not start the Intel RealSense software; that happens in a later step.
  • Resolutions: The Intel RealSense SDK background segmentation module supports multiple resolutions. Setting a new resolution results in a shutdown of the current Intel RealSense SDK session and initializes a new one.
  • Is Server / IP Address: If you are running as the server, check the box labeled Is Server. 
    If you are running as a client, leave the box unchecked and enter the IP address you want to connect to.

    Hitting Start initializes the network and Intel RealSense SDK and plays the selected scene. The maximum number of connected machines (server plus client(s)) is hardcoded to 4 in the file NetworkLayer.h

    Note: While a server and client can be started on the same system, they cannot use different color stream resolutions. Attempting to do so will crash the Intel RealSense SDK runtime since two different resolutions can’t run simultaneously on the same camera.

    After the network and Intel RealSense SDK initialize successfully, the panels shown in Figure 4 display:

Figure 4: Chat Heads option panels.

The Option panel has multiple sections, each with their own control settings. The sections and their fields are:

  • BGS/Media controls
    • Show BGS Image – If enabled, the background segmented image (i.e., color stream without the background) is shown. If disabled, the color stream is simply used (even though BGS processing still happens). This affects the remote chat heads as well (that is, if both sides have the option disabled, you’ll see the remote players’ background in the video stream).

      Figure 5: BGS on (left) and off (right). The former blends into Hearthstone*, while the latter sticks out.

    • Pause BGS - Pause the Intel RealSense SDK BGS module, suspending segmentation processing on the CPU
    • BGS frame skip interval - The frequency at which the BGS algorithm runs. Enter 0 to run every frame, 1 to run once in two frames, and so on. The limit exposed by the Intel RealSense SDK is 4.
    • Encoding threshold – This is relevant only for multiplayer scenarios. See the Implementation section for details.
    • Decoding threshold - This is relevant only for multiplayer scenarios. See the Implementation section for details.
  • Size/Pos controls
    • Size - Click/drag within the boxes to resize the sprite. Use it with different resolutions to compare quality.
    • Pos - Click/drag within the boxes to reposition the sprite.
  • Network control/information (This section is shown only when multiple players are connected)
    • Network send interval (ms) - how often video update data is sent.
    • Sent - Graph of data sent by a client or server.
    • Rcvd - Graph of data received by a client or server. Clients send their updates to the server, which then broadcasts it to the other clients. For reference, to stream 1080p Netflix* video, the recommended b/w required is 5 Mbps (625 KB/s).
  • Metrics
    • Process metrics
      • CPU Used - The BGS algorithm runs on several Intel® Threading Building Blocks threads and in the context of a game, can use more CPU resources than desired. Play with the Pause BGS and BGS frame skip interval options and change the Chat Head resolution to see how it affects the CPU usage.

Implementation

Internally, the Intel RealSense SDK does its processing on each new frame of data it receives from the Intel RealSense camera. The calls used to retrieve that data are blocking, making it costly to execute this processing on the main application thread. Therefore, in this sample, all of the Intel RealSense SDK processing happens on its own dedicated thread. This thread and the application thread never attempt to write to the same objects, making synchronization trivial.

There is also a dedicated networking thread that handles incoming messages and is controlled by the main application thread using signals. The networking thread receives video update packets and updates a shared buffer for the remote chat heads with the decoded data.

The application thread takes care of copying the updated image data to the DirectX* texture resource. When a remote player changes the camera resolution, the networking thread sets a bool for recreation, and the application thread takes care of resizing the buffer, recreating the DirectX* graphics resources (Texture2D and ShaderResourceView) and reinitializing the decoder.
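For reference, recreating the graphics resources for a new resolution is a standard D3D11 texture rebuild. The sketch below shows the general shape of it; the variable names and flag choices are assumptions for illustration, not the sample's exact code:

// Recreate the dynamic texture used for a remote chat head after a resolution change.
D3D11_TEXTURE2D_DESC desc = {};
desc.Width            = newWidth;
desc.Height           = newHeight;
desc.MipLevels        = 1;
desc.ArraySize        = 1;
desc.Format           = DXGI_FORMAT_B8G8R8A8_UNORM_SRGB;
desc.SampleDesc.Count = 1;
desc.Usage            = D3D11_USAGE_DYNAMIC;              // updated from the CPU each frame
desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE;
desc.CPUAccessFlags   = D3D11_CPU_ACCESS_WRITE;

ID3D11Texture2D          *tex = nullptr;
ID3D11ShaderResourceView *srv = nullptr;
device->CreateTexture2D(&desc, nullptr, &tex);            // device is the app's ID3D11Device
device->CreateShaderResourceView(tex, nullptr, &srv);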

Figure 6 shows the post-initialization interaction and data flow between these systems (threads).

Figure 6: Interaction flow between local and remote Chat Heads.

Color Conversion

The Intel RealSense SDK uses 32-bit BGRA (8 bits per channel) to store the segmented image, with the alpha channel set to 0 for background pixels. This maps directly to the DirectX texture format DXGI_FORMAT_B8G8R8A8_UNORM_SRGB for rendering the chat heads. In this sample, we convert the BGRA image to YUYV, wherein every pair of BGRA pixels is combined into one YUYV pixel. However, YUYV does not have an alpha channel, so to preserve the alpha from the original image, we set the Y, U, and V channels all to 0 in order to represent background segmented pixels.

The YUYV bit stream is then encoded using WMF’s H.264 encoder. This also ensures better compression, since more than half the image is generally comprised of background pixels.

When decoded, the YUYV values meant to represent background pixels can be non-zero due to the lossy nature of the compression. Our workaround is to use 8 bit encoding and decoding thresholds, exposed in the UI. On the encoding side, if the alpha of a given BGRA pixel is less than the encoding threshold, then the YUYV pixel will be set to 0. Then again, on the decoding side, if the decoded Y, U, and V channels are all less than the decoding threshold, then the resulting BGRA pixel will be assigned an alpha of 0.
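
A rough sketch of the two threshold checks follows. The struct layouts and function names are assumptions for illustration, not the sample's actual types, and the YUYV packing itself is omitted:

#include <cstdint>

struct BgraPixel { uint8_t b, g, r, a; };
struct YuyvMacro { uint8_t y0, u, y1, v; };   // one macropixel covers two image pixels

// Encoding side: zero the macropixel when both source pixels are background.
YuyvMacro ApplyEncodingThreshold(const BgraPixel &p0, const BgraPixel &p1,
                                 YuyvMacro converted, uint8_t encodingThreshold)
{
    if (p0.a < encodingThreshold && p1.a < encodingThreshold)
        converted.y0 = converted.u = converted.y1 = converted.v = 0;
    return converted;
}

// Decoding side: if every decoded channel is below the threshold, treat the
// pixel pair as background by clearing the alpha of both output pixels.
void ApplyDecodingThreshold(const YuyvMacro &m, uint8_t decodingThreshold,
                            BgraPixel &p0, BgraPixel &p1)
{
    if (m.y0 < decodingThreshold && m.u < decodingThreshold &&
        m.y1 < decodingThreshold && m.v < decodingThreshold) {
        p0.a = 0;
        p1.a = 0;
    }
}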

When the decoding threshold is set to 0, you may notice green pixels (shown below) highlighting the background segmented image(s). This is because in YUYV, 0 corresponds to the color green and not black as in BGRA (with non-zero alpha).

Figure 7: Green silhouette edges around the remote player when a 0 decoding threshold is used

Bandwidth

The amount of data sent over the network depends on the network send interval and the local camera resolution. The maximum send rate is limited by the 30 fps camera frame rate, and is thus once every 33.33 ms. At this send rate, a 320x240 resolution video feed consumes 60-75 KBps (kilobytes per second) with minimal motion and 90-120 KBps with more motion. Note that the bandwidth figures depend on the number of pixels covered by the player. Increasing the resolution to 1280x720 doesn’t impact the bandwidth cost all that much; the net increase is around 10-20 KBps, since a sizable chunk of the image is the background (YUYV set to 0), which results in much better compression.
Increasing the send interval to 70 ms reduces bandwidth consumption to ~20-30 KBps.

Performance

The sample uses Intel® Instrumentation and Tracing Technology (Intel® ITT) markers and Intel® VTune™ Amplifier XE to help measure and analyze performance. To enable them, uncomment the following line in ChatheadsNativePOC\itt\include\VTuneScopedTask.h and rebuild:

//#define ENABLE_VTUNE_PROFILING // uncomment to enable marker code

With the instrumentation code enabled, an Intel® VTune concurrency analysis of the sample can help understand the application’s thread profile. The platform view tab shows a colored box (whose length is based on execution time) for every instrumented section, and can help locate bottlenecks. The following capture was taken on an Intel® Core™ i7-4770R processor (8 logical cores) with varying BGS work. The “CPU Usage” row on the bottom shows the cost of executing the BGS algorithm every frame, every alternate frame, once in three frames and when suspended. As expected, the TBB threads doing the BGS work have lower CPU utilization when frames are skipped.

Figure 8: VTune concurrency analysis platform view with varying BGS work

A closer look at the RealSense thread shows the RSSDK AcquireFrame() call taking ~29-35ms on average, which is a result of the configured frame capture rate of 30 fps.

Figure 9: Closer look at the RealSense thread. The thread does not spin, and is blocked while trying to acquire the frame data

The CPU usage info can be seen via the metrics panel of the sample as well, and is shown in the table below:

BGS frequency              Chat Heads CPU Usage (approx.)
Every frame                23%
Every alternate frame      19%
Once in three frames       16%
Once in four frames        13%
Suspended                   9%

Doing the BGS work every alternate frame, or once in three frames, results in a fairly good experience when the subject is a gamer because of minimal motion. The sample currently doesn’t update the image for the skipped frames – it would be interesting to use the updated color stream with the previous frame’s segmentation mask instead.

Conclusion

The Chat Heads usage enabled by Intel RealSense technology can make a game truly social and improve both the in-game and e-sport experience without sacrificing the look, feel, and performance of the game. Current e-sport broadcasts generally show full video (i.e., with the background) overlays of the professional player and/or team in empty areas of the bottom UI. Using the Intel RealSense SDK's background segmentation, each player's segmented video feed can be overlaid near that player's character without obstructing the game view. Combined with Intel RealSense SDK face tracking, it allows for powerful and fun social experiences in games.

Acknowledgements

A huge thanks to Jeff Laflam for pair-programming the sample and reviewing this article. 
Thanks also to Brian Mackenzie for the WMF based encoder/decoder implementation, Doug McNabb for CPUT clarifications and Geoffrey Douglas for reviewing this article.

 

Good Performance: Three Developers’ Behaviors that Prevent It!

$
0
0

Download [PDF 1.2MB]

Introduction

Performance is regarded as one of the most valuable non-functional requirements of an application. If you are reading this, you are probably using an application like a browser or document reader, and understand how important performance is. In this article, I will talk about applications’ good performance and three developers’ behaviors that prevent it.

Behavior #1: Lack of understanding of the development technologies

It doesn’t matter whether you are someone who just graduated from school or have years of experience; when you have to develop something, you will probably look for something that was already developed. Hopefully in the same programming language.

This is not a bad thing. In fact, it often speeds up development. But, on the other hand, it also might prevent you from learning something, because this approach only rarely involves taking the time to inspect the code and understand not only the algorithm but also the inner workings of each line of code.

That is one example of us, as developers, falling into behavior number one. But there are other ways too. For example, when I was younger and just starting my journey in software development, my boss at the time was my role model, and whatever he did was the best someone could do. Whenever I had to do something, I looked at how he did it and replicated it as closely as possible. Many times, I did not understand why his approach worked, but who cares, right? It worked!

There is a kind of developer that I call a “4x4.” He or she is someone who, when asked to do something, works as hard as possible to complete it. These developers usually look for building blocks, or pieces of things already done, put them all together, and “voilà!” The thing is done! Rarely does this kind of developer spend any time understanding all the pieces he or she found, and they don’t consider or investigate scalability, maintainability, or performance.

There is one more situation that leads to developers not understanding how things actually work: never running into problems! When you use a technology for the first time and you run into problems, you dig into the details of the technology, and you end up understanding how it works.

At this point, let’s look at some examples that will help us understand the difference between understanding the technology and simply using it. Since I am, for the most part, a .NET* web developer, I will focus on that.

JavaScript* and the Document Object Model (DOM)

Let’s look at the code snippet below. Pretty plain. The code just updates the style of an element in the DOM. The problem (which is less of a problem with modern browsers but is included to illustrate the point) is that it traverses the DOM tree three times. If this code is repeated and the document is large and complex, there will be a performance hit in the application.

Fixing such a problem is easy. Look at the following code snippet. A direct reference is held in the variable myField prior to working on the object. This new code is less verbose, quicker to read and understand, and has better performance since there is only one access to the DOM tree.

Let’s look at another example. This example was taken from: http://code.tutsplus.com/tutorials/10-ways-to-instantly-increase-your-jquery-performance--net-5551

In the following figure, there are two equivalent code snippets. Each code creates a thousand list item li elements. The code on the right adds an id attribute to each li element, whereas the code on the left adds a class attribute to each li element.

As you can see, the second part of each code snippet simply accesses each of the thousand li elements that were created. In my benchmarking in Internet Explorer* 10 and Chrome* 48, the average time taken was 57 ms for the code on the left and only 9 ms for the code on the right. The difference is huge in this case when just accessing the elements in one way or the other.

This example has to be taken very carefully! There are so many additional things to understand that might make this example look wrong, like the order in which the selectors are evaluated, which is from right to left. If you are using jQuery*, read about the DOM context as well. For general CSS Selectors’ performance concepts, see the following article: https://smacss.com/book/selectors

Let’s provide a final example in JavaScript code. This example is more related to memory but will help you understand how things really work. High memory consumption in browsers will cause a performance problem as well.

The next image shows two different ways of creating an object with two properties and one method. On the left, the class’s constructor adds the two properties to the object and the additional method is added through the class’s prototype. On the right, the constructor adds the properties and the method at once.

After both versions are defined, a thousand objects are created using each technique. If you compare the memory used by the objects, you’ll see differences in memory usage in both the Shallow Size and the Retained Size in Chrome. The prototype approach uses about 20 percent less memory in the Shallow Size (20 Kbytes versus 24 Kbytes) and retains up to 66 percent less memory (20 Kbytes versus 60 Kbytes).

For a better understanding of how Shallow Size and Retained Size memory work, see:

https://developers.google.com/web/tools/chrome-devtools/profile/memory-problems/memory-101?hl=en

You can create objects by knowing how to use the technology. But understanding how the technology actually works gives you tools to improve the application in areas like memory management and performance.

LINQ

When I was preparing my conference presentation on this topic, I wanted to provide an example with server-side code. I decided to use LINQ*, since LINQ has become a go-to tool in the .NET world for new development and is one of the most promising areas to look for performance improvements.

Consider this common scenario. In the following image there are two functionally equivalent sections of code. The purpose of the code is to list all departments and all courses for each department in a school. In the code titled Select N+1, we list all the departments and for each department list its courses. This means that if there are 100 departments, we will make 1+100 calls to the database.

There are many ways to solve this. One simple approach is shown in the code on the right side of the image. By using the Include method (in this case I am using a hardcoded string for ease of understanding) there will be a single database call in which all the departments and their courses are brought back at once. In this case, when the second foreach loop is executed, all the Courses collections for each department will already be in memory.

Improvements in performance on the order of hundreds of times faster are possible simply by avoiding the Select N+1 problem.
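
As an illustration of the two approaches described above, here is a minimal Entity Framework sketch with hypothetical Department, Course, and SchoolContext types (illustrative only, not the article's actual sample):

using System;
using System.Collections.Generic;
using System.Data.Entity;   // Entity Framework 6

public class Department
{
    public int Id { get; set; }
    public string Name { get; set; }
    public virtual ICollection<Course> Courses { get; set; }   // lazy-loaded navigation property
}

public class Course
{
    public int Id { get; set; }
    public string Title { get; set; }
    public int DepartmentId { get; set; }
}

public class SchoolContext : DbContext
{
    public DbSet<Department> Departments { get; set; }
    public DbSet<Course> Courses { get; set; }
}

public static class SelectNPlusOneDemo
{
    public static void Run()
    {
        // Select N+1: one query for the departments, plus one query per department for its courses.
        using (var db = new SchoolContext())
        {
            foreach (var dept in db.Departments)                     // 1 query
                foreach (var course in dept.Courses)                 // +1 query per department (lazy loading)
                    Console.WriteLine("{0} - {1}", dept.Name, course.Title);
        }

        // Eager loading: a single query brings the departments and their courses back together.
        using (var db = new SchoolContext())
        {
            foreach (var dept in db.Departments.Include("Courses"))  // 1 query total
                foreach (var course in dept.Courses)                 // already in memory
                    Console.WriteLine("{0} - {1}", dept.Name, course.Title);
        }
    }
}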

Let’s consider a less obvious example.

In the image below, there is only one difference between the two code snippets: the data type of the target list in the second line. You might ask, what difference does the target type make? When you understand how the technology works, you will realize that the target data type actually defines the exact moment when the query is executed against the database. That, in turn, defines when the filters of each query are applied.

In the case of the Code #1 sample where an IEnumerable is expected, the query is executed right before Take<Employee>(10) is executed. This means that if there are 1,000 employees, all of them will be retrieved from the database and then only 10 will be taken.

In the case of the Code #2 sample, the query is not executed until Take<Employee>(10) is evaluated; because the source is an IQueryable, the Take operation is composed into the database query itself, so only 10 records are retrieved from the database.
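
The difference can be sketched as follows, again with hypothetical types (an Employee entity with an IsActive flag and an EF context exposing an Employees set); this is a minimal illustration, not the article's screenshots:

using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

public class Employee
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
}

public class CompanyContext : DbContext
{
    public DbSet<Employee> Employees { get; set; }
}

public static class DeferredExecutionDemo
{
    public static void Run(CompanyContext db)
    {
        // Code #1 style: the Where clause runs in the database, but because the static type is
        // IEnumerable<Employee>, Take(10) runs in memory, after every matching row has been fetched.
        IEnumerable<Employee> code1 = db.Employees.Where(e => e.IsActive);
        List<Employee> firstTen1 = code1.Take(10).ToList();

        // Code #2 style: the static type stays IQueryable<Employee>, so Take(10) is composed into
        // the SQL sent to the server (e.g., SELECT TOP (10) ...), and only 10 rows come back.
        IQueryable<Employee> code2 = db.Employees.Where(e => e.IsActive);
        List<Employee> firstTen2 = code2.Take(10).ToList();
    }
}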

The following article has an in-depth explanation of the differences in using multiple types of collections.

http://www.codeproject.com/Articles/832189/List-vs-IEnumerable-vs-IQueryable-vs-ICollection-v

SQL Server*

In SQL, there are many concepts to understand in order to get the best performance possible out of your database. SQL Server is complex because it requires an understanding of how the data is being used, and what tables are queried the most and by which fields.

Nevertheless, you can still apply some general concepts to improve performance, such as:

  • Clustered versus non-clustered indexes
  • Properly ordered JOINs
  • Understanding when to use #temp tables and table variables
  • Use of views versus indexed views
  • Use of pre-compiled statements

For the sake of brevity, I won’t provide a specific use case, but these are the types of concepts that you can use, understand, and make the most of.

Mindset Change

So, what are the mindset changes we, as developers, must have in order to avoid behavior #1?

  • Stop thinking “I am a front-end or back-end developer!” You probably are an engineer and you may become an expert in one area, but don’t use that as a shield to avoid learning more about other areas.
  • Stop thinking “Let’s let the expert do it because he’s faster!” In the current world where agile is all over the place, we must be fungible resources, and we must learn about the areas we are weak in.
  • Stop telling yourself “I don’t get it!” Of course! If it was easy then we all would be experts! Spend your time reading, asking, and understanding. It’s not easy, but it pays off by itself.
  • Stop saying “I don’t have time!” OK, I get this one. It does happen. But once an Intel fellow told me “when you are passionate about something, your bandwidth is infinite.” And here I am, writing this article at 12:00 a.m. on a Saturday!

Behavior #2: Bias on specific technologies

I have developed in .NET since version 1.0. I knew every single little detail of how Web Forms worked as well as a lot of the .NET client-side libraries (I customized some of them). When I saw that Model View Controller (MVC) was coming out, I was reluctant to use it because “we didn’t need it.”

I won’t continue with the list of things that I didn’t like at the beginning but now use extensively. But this illustrates my point: bias for or against specific technologies can prevent us from getting better performance.

One of the discussions I often hear is either about LINQ-to-Entities in Entity Framework, or about SQL Stored Procedures when querying data. People are so used to one or the other that they try to continue using them.

Another aspect that makes people biased toward a particular technology is whether they are open source lovers or haters. This makes people not think about what is best for their current situation but rather what best aligns to their philosophy.

Sometimes external factors (for instance, deadlines) push us to make decisions. In order to choose the best technology for our applications, we require time to read, play, compare, and conclude. When we start developing a new product or version of an existing product, it’s not uncommon that we are already late. Two ways come to mind on how to solve this situation: stand up and ask for that time or work extra hours to educate ourselves.

Mindset Change

So what are the mindset changes we, as developers, must have in order to avoid behavior #2:

  • Stop saying “This has always worked,” “This is what we have always used,” and so on. We need to identify and use other options, especially if there is data that supports those options.
  • Stop fixing the solution! There are times when people want to use a specific technology that doesn’t provide the expected results. Then they spend hours and hours trying to tweak that technology. What they are doing in this case is “fixing the solution” instead of focusing on the problem and maybe finding a quicker, more elegant solution somewhere else.
  • “I don’t have time!” Of course, we don’t have time to learn or try new stuff. Again, I get this one.

Behavior #3: Not understanding the application’s infrastructure

After we have put a lot of effort into creating the best application, it is time to deploy it! We tested everything. Everything worked beautifully in our machines. All the 10 testers were so happy with it and its performance. So, what could go wrong after all?

Well, everything could go wrong!

Did you ask yourself any of the following questions?

  • Was the application expected to work in a load-balanced environment?
  • Is the application going to be hosted in the cloud with many instances of it?
  • How many other applications are running on my target production machine?
  • What else is running on that server? SQL Server? Reporting Services? Some SharePoint* extensions?
  • Where are my end users located? Are they all over the world?
  • How many final users will my application have in the next five years?

I understand that not all of these questions refer to the infrastructure but bear with me here. More often than not, the final conditions under which our application will run are not the same as our staging servers.

Let’s pick some of the possible situations that could affect the performance of our application. We will start with users around the world. Maybe our application is very fast and we hear no complaints from our customers in America, but our customers in Malaysia don’t have the same speedy experience.

There are many options to solve this situation. For one, we could use Content Delivery Networks (CDNs) to host static files so that pages load faster from different locations. The following image shows what I am talking about.

Picking another potential situation, let’s consider a server that hosts both SQL Server and the web server. In this case we have two CPU-intensive server processes on the same machine. So, how can we solve this? Still assuming you are running a .NET application on an Internet Information Services (IIS) server, we could take advantage of CPU affinity. CPU affinity ties one process to one or more specific cores in the machine.

For example, let’s say that we have SQL Server and Web Server (IIS) in a machine with four CPUs.

If we leave it to the operating system to determine which CPUs IIS or SQL Server uses, there could be various setups. We could have two CPUs assigned to each server.

Or we could have all processors assigned to only one server!

In this case, we could effectively starve SQL Server: IIS might be handling so many requests that it occupies all four processors, and some of those requests will require access to SQL Server, which, of course, won’t get the CPU time it needs. This is admittedly an unlikely scenario, but it illustrates my point.

There is one additional issue: a process will not run on the same CPUs all the time, so there will be a lot of context switching. This context switching causes performance degradation on the server and, in turn, in the applications running on that server.

One way to minimize this is by using processor affinity for IIS and SQL Server. In this way, we can determine how many processors we need for SQL Server and how many for IIS. This is done by changing the Processor Affinity settings in the CPU category in IIS and the "affinity mask" setting in SQL Server. Both cases are shown in the following images.
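
To make the idea of an affinity mask concrete, here is a small, illustrative C# sketch (this is not how IIS or SQL Server are configured, which is done through the settings described above; the Windows-only ProcessorAffinity property simply shows what the bitmask means):

using System;
using System.Diagnostics;

class AffinityMaskDemo
{
    static void Main()
    {
        // Each bit selects one logical processor: bit 0 = CPU 0, bit 1 = CPU 1, and so on.
        // 0x3 (binary 0011) pins the current process to the first two logical processors.
        Process current = Process.GetCurrentProcess();
        current.ProcessorAffinity = (IntPtr)0x3;

        Console.WriteLine("Affinity mask is now 0x{0:X}", (long)current.ProcessorAffinity);
    }
}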


I could continue with other options at the infrastructure level to improve applications’ performance, like the use of Web Gardens and Web Farms.

Mindset Change

What are the mindset changes we, as developers, must have in order to avoid behavior #3?

  • Stop thinking “That is not my job!” We, as engineers, must broaden our knowledge as much as possible in order to provide the best integral solution to our customers.
  • “I don’t have time!” Of course, we never have time. This is the common thread in all of these mindset changes. Making time is what differentiates a professional who succeeds, excels, and stands out!

Don’t feel guilty!

But, do not feel guilty! It is not all on you! Really, we don’t have time! We have family, we have hobbies, and we have to rest!

The important thing here is to realize that sometimes there is more to performance than just writing good code. We all have shown and will show some or all of these behaviors during our lifetime.

Let me give you some tips to avoid these behaviors.

  1. Make time. When asked for estimates for your projects, make sure you estimate for researching, testing, concluding, and making decisions.
  2. Try to create a personal test application along the way. This application spares you from having to try things out in the application under development, a mistake we all make at some point.
  3. Look for people that already know and do pair-programming. Work with your infrastructure person when he or she is deploying the application. This is time well spent.
  4. Stack Overflow is evil!!! Actually, I do help there and a big percentage of my problems are already solved there. But, if you use it for “copy and paste” answers, you will end up with incomplete solutions.
  5. Stop being the front-end person. Stop being the back-end person too. Become a subject matter expert if you will, but make sure you can hold a smart discussion when talking about the areas where you are not an expert.
  6. Help out! This is probably the best way to learn. When you spend time helping people with their problems, you are in the long run saving yourself time by not encountering the same or similar situations.

See Also

Funny and interesting blog from Scott Hanselman about being a googler or a developer.

http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGoodGoogler.aspx

More about objects and prototypes:

http://thecodeship.com/web-development/methods-within-constructor-vs-prototype-in-javascript/

About the Author

Alexander García is a computer science engineer from Intel Costa Rica. Alex has over 14 years of professional experience in software engineering. His professional interests range from software engineering practices, software security, and performance to data mining and related fields. Alex is currently pursuing a Master’s degree in Computer Science.

 


Visual Studio IntelliSense stopped recognizing many of the AVX, AVX2 intrinsics

$
0
0

Problem description:

After installing Intel® Parallel Studio XE 2016 Update 1 on a Windows* operating system, with Visual Studio* 2013 or 2015 used for integration with the Intel compiler, Visual Studio IntelliSense stopped recognizing many of the AVX and AVX2 intrinsics. The list of affected intrinsics is long; for example, _mm256_load_ps, _mm256_add_ps, _mm256_sub_ps, _mm256_mul_ps, _mm256_store_ps, etc. are not recognized by IntelliSense.

Reason:

The reason for this behavior of Visual Studio IntelliSense is that there are no declarations for these intrinsics in the compiler's header files.

Workaround:

Define the following macro before including the intrinsics header:

#define __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES 1
#include <mmintrin.h>

This will add all the intrinsic prototypes back into the header files.

Advantages of header-less recognition:

This is an Intel compiler 15.0 to 16.0 regression. Because the intrinsic header files for AVX-512 were taking a long time to compile, it was decided that the compiler should recognize the intrinsics without header files or prototypes. In 16.0, the compiler does not require prototypes for intrinsic functions. However, when the compiler recognizes functions without prototypes, it uses the old C rules, which treat enums and integers as the same. We don't see an acceptable fix for this situation in the Visual Studio IDE.

By default, we don't want to include every AVX-512 intrinsic, as there are so many of them that it noticeably slows down parsing of the user code. This is why we changed to header-less recognition in 16.0.

Note in future header files:

We have added a comment in mmintrin.h for 17.0/18.0 to explain the compiler's use of these header files. The user will need to manually enable the function prototypes as per the instructions.

Here is excerpt from the latest mmintrin.h file.
/*
 * Many of these function declarations are not visible to the
 * compiler; they are for reference only. The compiler recognizes the
 * function names as "builtins", without requiring the
 * declarations. This improves compile-time.  If user code requires
 * the actual declarations, they can be made visible like
 * this:
 * #define __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES 1
 * #include <mmintrin.h>
 */
 

How to Migrate Intel® RealSense™ SDK R4 (v6.0) Hand Mode Functionality to Intel RealSense SDK 2016 R1 Cursor Mode

$
0
0

Abstract

With the dual arrivals of the Intel® RealSense™ camera (SR300) and the Intel® RealSense™ SDK 2016 R1 comes a new mode of gesture interaction called Cursor Mode, for use with the SR300 only. This tutorial sets out to highlight code changes that developers will have to make in order to exploit the new capability.

Introduction

Prior to the release of Intel RealSense SDK 2016 R1, applications wanting to effect cursor movement and detect click-actions were reliant on using the Hand Mode and detecting the “click action” through gesture recognition. That functionality existing in the Hand Mode has now been decoupled into a new feature called Cursor Mode.  As such, applications that relied on the previous functionality can now change their code to take advantage of the refinements and upgrades to cursor control with the new Cursor Mode.

Please note that Cursor Mode is available only to devices and peripherals that use the Intel RealSense camera (SR300). As a developer of Intel® RealSense™ applications looking to use the SR300, you must upgrade to Windows* 10, and we require that you use version 2016 R1 of the Intel RealSense SDK.

Tutorial

More than likely you already have an application that is written for the F200 camera with the Intel RealSense SDK R4 (v6.0) and would like to know how to move forward and use the new Cursor Mode functionality. This tutorial presents the following:

Part 1

Initialization of the processing pipeline must occur in a manner similar to the previous version of the Intel RealSense SDK.  Thus, you must instantiate the Sense Manager and check for no errors in the process.

PXCSenseManager *pSenseMgr = PXCSenseManager::CreateInstance();
if( pSenseMgr ) {
    < continue on to creating the modes >
}

Part 2

Previously for the F200 Hand Mode, to get anything resembling cursor actions you had to rely on the Hand Module and track the hand set to various configurations. Your code might have looked like this (note that the following code is for reference purposes and will not compile directly as written below):

PXCHandModule *pHandModule;
PXCHandData *pHandData;
int confidence;
. . . <additional library and variables setup> . . .
pxcStatus status;
if( pSenseMgr ) {
    status = pSenseMgr->EnableHand();
    if(status == pxcStatus::PXC_STATUS_NO_ERROR) {
        // Get an instance of PXCHandModule
        pHandModule = pSenseMgr->QueryHand();
        // Get an instance of PXCHandConfiguration
        PXCHandConfiguration *pHandConfig = pHandModule->CreateActiveConfiguration();
        pHandConfig->EnableGesture("cursor_click");
        pHandConfig->ApplyChanges();
        . . . <additional configuration options> . . .
    }
}

Part 3

Beginning with the Intel RealSense SDK 2016 R1 a new Cursor Mode has been implemented, and cursor actions have been decoupled from the Hand Mode. This means that previous code paths that queried the Hand Mode in the Sense Manager must change. The new code will take the following form:

PXCHandCursorModule *pCursorModule;
PXCCursorData::BodySideType bodySide;
// please note that the Confidence values no longer exist
. . . <additional library and variables setup> . . .
pxcStatus status;
if( pSenseMgr ) {
    // Enable hand cursor tracking
    status = pSenseMgr->EnableHandCursor();
    if(status == pxcStatus::PXC_STATUS_NO_ERROR) {
        // Get an instance of PXCHandCursorModule
        pCursorModule = pSenseMgr->QueryHandCursor();
        // Get an instance of the cursor configuration
        PXCCursorConfiguration *pCursorConfig = pCursorModule->CreateActiveConfiguration();

        // Make configuration changes and apply them
        pCursorConfig->EnableEngagement(true);
        pCursorConfig->EnableAllGestures();
        pCursorConfig->ApplyChanges();
        . . . <additional configuration options> . . .
    }
}

Part 4

Implementation examples of the main processing loops for synchronous and asynchronous functions can be found in the Intel RealSense™ SDK 2016 R1 Documentation in the Implementing the Main Processing Loop subsection of the Cursor Module [SR300] section.

A summary of the asynchronous—and preferred—approach is as follows:

class MyHandler: public PXCSenseManager::Handler {
public:
    virtual pxcStatus PXCAPI OnModuleProcessedFrame(pxcUID mid, PXCBase *module, PXCCapture::Sample *sample) {
       // check if the callback is from the hand cursor tracking module
       if (mid==PXCHandCursorModule::CUID) {
           PXCHandCursorModule *cursorModule=module->QueryInstance<PXCHandCursorModule>();
               PXCCursorData *cursorData = cursorModule->CreateOutput();
           // process cursor tracking data
       }

       // return NO_ERROR to continue, or any error to abort
       return PXC_STATUS_NO_ERROR;
    }
};
. . . <SenseManager declaration> . . .
// Initialize and stream data
MyHandler handler; // Instantiate the handler object

// Register the handler object
pSenseMgr->Init(&handler);

// Initiate SenseManager’s processing loop in blocking mode
// (function exits only when processing ends)
pSenseMgr->StreamFrames(true);

// Release SenseManager resources
pSenseMgr->Release();

Conclusion

Though the Intel RealSense SDK 2016 R1 has changed the implementation of and access to the hand cursor, it is worth noting that the changes have a consistency that allows for an easy migration of your code. The sample code above demonstrates that ease by showing that your general program structure during initialization, setup, and per-frame execution can remain unchanged while still harnessing the improved capabilities of the new Cursor Mode.

It is worth repeating that the new Cursor Mode is only available to systems that are enabled with the SR300 camera, either integrated or as a peripheral, and using RealSense™ SDK 2016 R1. The ability to detect and branch your code to support dual F200 and SR300 cameras, with either one as peripherals, during development will be discussed in other tutorials.

New, Exciting Intel Media Software Capabilities to Showcase at NAB Show 2016

$
0
0

Intel Media Tools at NAB

Visit with Intel at NAB Show (National Association of Broadcasters Show) in Las Vegas, Apr. 18-21. Intel is a world leader in computing, and Intel® architecture, media accelerators, and software are at the heart of innovative, advanced media solutions for video service providers and broadcasters. See how you can stay competitive and deliver the next generation of brilliant media experiences with Intel: innovate media experiences; get fast performance and efficiency for your media solutions; and reduce infrastructure and development costs.

Attend NAB with a free passcode LV6579 for entry to the Exhibit Hall. Visit Intel at SU621 (South Upper Hall).
 

Technology Showcase: Media Transcoding Software Preview

At NAB, we’re showing some exciting demos on new media software capabilities that bring tremendous performance and productivity, along with high-quality across AVC, HEVC, VP9, MPEG-2 and AVS 2.0 formats. Come check them out. Below is only a partial list—some are so special, that we can’t write about them yet!

  • Accelerate fast, dense, high-quality video transcoding, visualize CPU/GPU and memory usage with Intel® Media Server Studio. See also Intel® Visual Compute Accelerator in action.
  • Accelerate transitions to HEVC, 4K, or even 8K
  • Debug decode and encode, ensure encoder compliances with Intel® Video Pro Analyzer, a complete toolset for advanced video analysis
  • Develop robust decoders, accelerate media validation and debug
  • Innovative media demos with Ateme and Sharp
  • And more...

Learn how Intel® Media SDK, Intel® Media Server Studio, Intel® Video Pro Analyzer, and Intel® Stress Bitstreams and Encoder can help video solution providers and broadcasters innovate, speed media processing, and improve quality - not to mention save time and costs.

Feel free to also contact us for a private meeting on how to optimize your media solutions.

Other activities in the Intel booth include demonstrations of the latest architecture platforms, devices and technologies for awesome media and broadcasting, and cloud delivery, along with a suite of industry-leading customers showing how they are using Intel media acceleration technologies to innovate today.

 

Create a Virtual Joystick Using the Intel® RealSense™ SDK Hand Cursor Module

$
0
0

Abstract

This article describes a code walkthrough for creating a virtual joystick app (see Figure 1) that incorporates the new Hand Cursor Module in the Intel® RealSense™ SDK. This project is developed in C#/XAML and can be built using Microsoft Visual Studio* 2015.


Figure 1: RS Joystick app controlling Google Earth* flight simulator.

Introduction

Support for the new Intel® RealSense™ camera, model SR300, was introduced in R5 of the Intel RealSense SDK. The SR300 is the successor to the F200 model and provides a set of improvements along with a new feature known as the Hand Cursor Module.

As described in the SDK documentation, the Hand Cursor Module returns a single point on the hand that allows accurate and responsive tracking. Its purpose is to facilitate the hand-based UI control use case, along with supporting a limited set of gestures.

RS Joystick, the joystick emulator app described in this article, maps 3D hand data provided by the SDK to virtual joystick controls, resulting in a hands-free way to interact with software applications that work with joystick controllers.

The RS Joystick app leverages the following Hand Cursor Module features:

  • Body Side Type – The app notifies the user which hand is controlling the virtual joystick, based on a near-to-far access order.
  • Cursor-Click Gesture – The user can toggle the ON-OFF state of button 1 on the virtual joystick controller by making a finger-click gesture.
  • Adaptive Point Tracking – The app displays the normalized 3D point inside the imaginary “bounding box” defined by the Hand Cursor Module and uses this data to control the x-, y-, and z-axes of the virtual joystick.
  • Alert Data – The app uses Cursor Not Detected, Cursor Disengaged, and Cursor Out Of Border alerts to change the joystick border from green to red when the user’s hand is out of range of the SR300 camera.

(For more information on the Hand Cursor Module check out “What could you do with Intel RealSense Cursor Mode?”)

Prerequisites

You should have some knowledge of C# and understand basic operations in Visual Studio like building an executable. Previous experience with adding third-party libraries to a custom software project is helpful, but this walkthrough provides detailed steps, if this is new to you. Your system needs a front-facing SR300 camera, the latest versions of the SDK and Intel® RealSense™ Depth Camera Manager (DCM) installed, and must meet the hardware requirements listed here. Finally, your system must be running Microsoft Windows* 10 Threshold 2.

Third-Party Software

In addition to the Intel RealSense SDK, this project incorporates a third-party virtual joystick device driver called vJoy* along with some dynamic-link libraries (DLLs). These software components are not part of any distributed code associated with this custom project, so details on downloading and installing the device driver are provided below.

Install the Intel RealSense SDK

Download and install the required DCM and SDK at https://software.intel.com/en-us/intel-realsense-sdk/download. At the time of this writing the current versions of these components are:

  • Intel RealSense Depth Camera Manager (SR300) v3.1.25.1077
  • Intel RealSense SDK v8.0.24.6528

Install the vJoy Device Driver and SDK

Download and install the vJoy device driver: http://vjoystick.sourceforge.net/site/index.php/download-a-install/72-download. Reboot the computer when instructed to complete the installation.

Once installed, the vJoy device driver appears under Human Interface Devices in Device Manager (see Figure 2).


Figure 2: Device Manager.

Next, open the Windows 10 Start menu and select All apps. You will find several installed vJoy components, as shown in Figure 3.


Figure 3: Windows Start menu.

To open your default browser and go to the download page, click the vJoy SDK button.

Once downloaded, copy the .zip file to a temporary folder, unzip it, and then locate the C# DLLs in \SDK\c#\x86.

We will be adding these DLLs to our Visual Studio project once it is created, as described in the next step.

Create a New Visual Studio Project

  • Launch Visual Studio 2015.
  • From the menu bar, select File, New, Project….
  • In the New Project screen, expand Templates and select Visual C#, Windows.
  • Select WPF Application.
  • Specify the location for the new project and its name. For this project, our location is C:\ and the name of the application is RsJoystick.

Figure 4 shows the New Project settings used for this project.


Figure 4: Visual Studio* New Project settings.

Click OK to create the project.

Copy Libraries into the Project

Two DLLs are required for creating Intel® RealSense™ apps in C#:

  • libpxcclr.cs.dll – the managed C# interface DLL
  • libpxccpp2c.dll – the unmanaged C++ P/Invoke DLL

Similarly, there are two DLLs required to allow the app to communicate with the vJoy device driver:

  • vJoyInterface.dll – the C-language API library
  • vJoyInterfaceWrap.dll – the C# wrapper around the C-language API library

To simplify the overall structure of our project, we’re going to copy all four of these files directly into the project folder:

  • Right-click the RsJoystick project and select Add, Existing Item…
  • Navigate to the location of the vJoy DLLs (that is, \SDK\c#\x86) and select both vJoyInterface.dll and vJoyInterfaceWrap.dll. Note: you may need to specify All Files (*.*) for the file type in order for the DLLs to become visible.
  • Click the Add button.

Similarly, copy the Intel RealSense SDK DLLs into the project:

  • Right-click the RsJoystick project and then select Add, Existing Item…
  • Navigate to the location where the x86 libraries reside, which is C:\Program Files (x86)\Intel\RSSDK\bin\win32 in a default SDK installation.
  • Select both libpxcclr.cs.dll and libpxccpp2c.dll.
  • Click the Add button.

All four files should now be visible in Solution Explorer under the RsJoystick project.

Create References to the Libraries

Now that the required library files have been physically copied to the Visual Studio project, you must create references to the managed (.NET) DLLs so they can be used by your app. Right-click References (which is located under the RsJoystick project) and select Add Reference… In the Reference Manager window, click the Browse button and navigate to the project folder (c:\RsJoystick\RsJoystick). Select both the libpxcclr.cs.dll and vJoyInterfaceWrap.dll files, and then click the Add button. Click the OK button in Reference Manager.

In order for the managed wrapper DLLs to work properly, you need to ensure the unmanaged DLLs get copied into the project’s output folder before the app runs. In Solution Explorer, click libpxccpp2c.dll to select it. The Properties screen shows the file properties for libpxccpp2c.dll. Locate the Copy to Output Directory field and use the drop-down list to select Copy Always. Repeat this step for vJoyInterface.dll. This ensures that the unmanaged DLLs get copied to the project output folder when you build the application.

At this point you may see a warning about a mismatch between the processor architecture of the project being built and the processor architecture of the referenced libraries. Clear this warning by doing the following:

  • Locate the link to Configuration Manager in the drop-down list in the menu bar (see Figure 5).
  • Select Configuration Manager.
  • In the Configuration Manager screen, expand the drop-down list in the Platform column, and then select New.
  • Select x86 as the new platform and then click OK.
  • Close the Configuration Manager screen.


Figure 5: Configuration Manager.

At this point the project should build and run without any errors or warnings. Also, if you examine the contents of the output folder (c:\RsJoystick\RsJoystick\bin\x86\Debug) you should find that all four of the DLLs got copied there as well.

The User Interface

The user interface (see Figure 6) displays the following information:

  • The user’s hand that is controlling the virtual joystick, based on a near-to-far access order (that is, the hand that is closest to the camera is the controlling hand).
  • The ON-OFF state of Button 1 on the virtual joystick controller, which is controlled by making a finger-click gesture.
  • An ellipse that tracks the relative position of the user’s hand in the x- and y-axes, and changes diameter based on the z-axis to indicate the hand’s distance from the camera.
  • The x-, y-, and z-axis Adaptive Point data from the SDK, which is presented as normalized values in the range of zero to one.
  • A colored border that changes from green to red when the user’s hand is out of range of the SR300 camera.
  • Slider controls that allow the sensitivity to be adjusted for each axis.


Figure 6: User Interface.

The complete XAML source listing is presented in Table 1. This can be copied and pasted directly over the MainWindow.xaml code that was automatically generated when the project was created.

Table 1: XAML Source Code Listing: MainWindow.xaml.

<Window x:Class="RsJoystick.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:RsJoystick"
        mc:Ignorable="d"
        Title="RSJoystick" Height="420" Width="420" Background="#FF222222" Closing="Window_Closing"><Window.Resources><Style x:Key="TextStyle" TargetType="TextBlock"><Setter Property="Foreground" Value="White"/><Setter Property="FontSize" Value="14"/><Setter Property="Text" Value="-"/><Setter Property="Margin" Value="4"/><Setter Property="HorizontalAlignment" Value="Center"/></Style></Window.Resources><StackPanel VerticalAlignment="Center" HorizontalAlignment="Center" Width="320"><TextBlock x:Name="uiBodySide" Style="{StaticResource TextStyle}"/><TextBlock x:Name="uiButtonState" Style="{StaticResource TextStyle}"/><Border x:Name="uiBorder" BorderThickness="2" Width="200" Height="200" BorderBrush="Red" Margin="4"><Canvas x:Name="uiCanvas" ClipToBounds="True"><Ellipse x:Name="uiCursor" Height="10" Width="10" Fill="Yellow"/><Ellipse Height="50" Width="50" Stroke="Gray" Canvas.Top="75" Canvas.Left="75"/><Rectangle Height="1" Width="196" Stroke="Gray" Canvas.Top="100"/><Rectangle Height="196" Width="1" Stroke="Gray" Canvas.Left="100"/></Canvas></Border><StackPanel Orientation="Horizontal" HorizontalAlignment="Center"><TextBlock x:Name="uiX" Style="{StaticResource TextStyle}" Width="80"/><Slider x:Name="uiSliderX" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/></StackPanel><StackPanel Orientation="Horizontal" HorizontalAlignment="Center"><TextBlock x:Name="uiY" Style="{StaticResource TextStyle}" Width="80"/><Slider x:Name="uiSliderY" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/></StackPanel><StackPanel Orientation="Horizontal" HorizontalAlignment="Center"><TextBlock x:Name="uiZ" Style="{StaticResource TextStyle}" Width="80"/><Slider x:Name="uiSliderZ" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/></StackPanel></StackPanel></Window>

Program Source Code

The complete C# source listing for the RSJoystick app is presented in Table 2. This can be copied and pasted directly over the MainWindow.xaml.cs code that was automatically generated when the project was created.

Table 2: C# Source Code Listing: MainWindow.xaml.cs

//--------------------------------------------------------------------------------------
// Copyright 2016 Intel Corporation
// All Rights Reserved
//
// Permission is granted to use, copy, distribute and prepare derivative works of this
// software for any purpose and without fee, provided, that the above copyright notice
// and this statement appear in all copies.  Intel makes no representations about the
// suitability of this software for any purpose.  THIS SOFTWARE IS PROVIDED "AS IS."
// INTEL SPECIFICALLY DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, AND ALL LIABILITY,
// INCLUDING CONSEQUENTIAL AND OTHER INDIRECT DAMAGES, FOR THE USE OF THIS SOFTWARE,
// INCLUDING LIABILITY FOR INFRINGEMENT OF ANY PROPRIETARY RIGHTS, AND INCLUDING THE
// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  Intel does not
// assume any responsibility for any errors which may appear in this software nor any
// responsibility to update it.
//--------------------------------------------------------------------------------------
using System;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Media;
using vJoyInterfaceWrap;
using System.Threading;
using System.Windows.Shapes;

namespace RsJoystick
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        private PXCMSenseManager sm;
        private PXCMHandCursorModule cursorModule;
        private PXCMCursorConfiguration cursorConfig;
        private vJoy joystick;
        private Thread update;
        private double joySensitivityX;
        private double joySensitivityY;
        private double joySensitivityZ;
        private const uint joyID = 1;
        private const uint MaxSensitivity = 16384;

        public MainWindow()
        {
            InitializeComponent();

            // Configure the sensitivity controls
            uiSliderX.Maximum = MaxSensitivity;
            uiSliderY.Maximum = MaxSensitivity;
            uiSliderZ.Maximum = MaxSensitivity;
            joySensitivityX = uiSliderX.Value = MaxSensitivity / 2;
            joySensitivityY = uiSliderY.Value = MaxSensitivity / 2;
            joySensitivityZ = uiSliderZ.Value = MaxSensitivity / 2;

            // Create an instance of the joystick
            joystick = new vJoy();
            joystick.AcquireVJD(joyID);

            // Configure the cursor mode module
            ConfigureRealSense();

            // Start the Update thread
            update = new Thread(new ThreadStart(Update));
            update.Start();
        }

        public void ConfigureRealSense()
        {
            // Create an instance of the SenseManager
            sm = PXCMSenseManager.CreateInstance();

            // Enable cursor tracking
            sm.EnableHandCursor();

            // Get an instance of the hand cursor module
            cursorModule = sm.QueryHandCursor();

            // Get an instance of the cursor configuration
            cursorConfig = cursorModule.CreateActiveConfiguration();

            // Make configuration changes and apply them
            cursorConfig.EnableEngagement(true);
            cursorConfig.EnableAllGestures();
            cursorConfig.EnableAllAlerts();
            cursorConfig.ApplyChanges();

            // Initialize the SenseManager pipeline
            sm.Init();
        }

        private void Update()
        {
            bool handInRange = false;
            bool joyButton = false;

            // Start AcquireFrame-ReleaseFrame loop
            while (sm.AcquireFrame(true).IsSuccessful())
            {
                PXCMCursorData cursorData = cursorModule.CreateOutput();
                PXCMPoint3DF32 adaptivePoints = new PXCMPoint3DF32();
                PXCMCursorData.BodySideType bodySide;

                // Retrieve the current cursor data
                cursorData.Update();

                // Check if alert data has fired
                for (int i = 0; i < cursorData.QueryFiredAlertsNumber(); i++)
                {
                    PXCMCursorData.AlertData alertData;
                    cursorData.QueryFiredAlertData(i, out alertData);

                    if ((alertData.label == PXCMCursorData.AlertType.CURSOR_NOT_DETECTED) ||
                        (alertData.label == PXCMCursorData.AlertType.CURSOR_DISENGAGED) ||
                        (alertData.label == PXCMCursorData.AlertType.CURSOR_OUT_OF_BORDERS))
                    {
                        handInRange = false;
                    }
                    else
                    {
                        handInRange = true;
                    }
                }

                // Check if click gesture has fired
                PXCMCursorData.GestureData gestureData;

                if (cursorData.IsGestureFired(PXCMCursorData.GestureType.CURSOR_CLICK, out gestureData))
                {
                    joyButton = !joyButton;
                }

                // Track hand cursor if it's within range
                int detectedHands = cursorData.QueryNumberOfCursors();

                if (detectedHands > 0)
                {
                    // Retrieve the cursor data by order-based index
                    PXCMCursorData.ICursor iCursor;
                    cursorData.QueryCursorData(PXCMCursorData.AccessOrderType.ACCESS_ORDER_NEAR_TO_FAR,
                                               0,
                                               out iCursor);

                    adaptivePoints = iCursor.QueryAdaptivePoint();

                    // Retrieve controlling body side (i.e., left or right hand)
                    bodySide = iCursor.QueryBodySide();

                    // Control the virtual joystick
                    ControlJoystick(adaptivePoints, joyButton);
                }
                else
                {
                    bodySide = PXCMCursorData.BodySideType.BODY_SIDE_UNKNOWN;
                }

                // Update the user interface
                Render(adaptivePoints, bodySide, handInRange, joyButton);

                // Resume next frame processing
                cursorData.Dispose();
                sm.ReleaseFrame();
            }
        }

        private void ControlJoystick(PXCMPoint3DF32 points, bool buttonState)
        {
            double joyMin;
            double joyMax;

            // Scale x-axis data
            joyMin = MaxSensitivity - joySensitivityX;
            joyMax = MaxSensitivity + joySensitivityX;
            int xScaled = Convert.ToInt32((joyMax - joyMin) * points.x + joyMin);

            // Scale y-axis data
            joyMin = MaxSensitivity - joySensitivityY;
            joyMax = MaxSensitivity + joySensitivityY;
            int yScaled = Convert.ToInt32((joyMax - joyMin) * points.y + joyMin);

            // Scale z-axis data
            joyMin = MaxSensitivity - joySensitivityZ;
            joyMax = MaxSensitivity + joySensitivityZ;
            int zScaled = Convert.ToInt32((joyMax - joyMin) * points.z + joyMin);

            // Update joystick settings
            joystick.SetAxis(xScaled, joyID, HID_USAGES.HID_USAGE_X);
            joystick.SetAxis(yScaled, joyID, HID_USAGES.HID_USAGE_Y);
            joystick.SetAxis(zScaled, joyID, HID_USAGES.HID_USAGE_Z);
            joystick.SetBtn(buttonState, joyID, 1);
        }

        private void Render(PXCMPoint3DF32 points,
                            PXCMCursorData.BodySideType bodySide,
                            bool handInRange,
                            bool buttonState)
        {
            Dispatcher.Invoke(delegate
            {
                // Change drawing border to indicate if the hand is within range
                uiBorder.BorderBrush = (handInRange) ? Brushes.Green : Brushes.Red;

                // Scale cursor data for drawing
                double xScaled = uiCanvas.ActualWidth * points.x;
                double yScaled = uiCanvas.ActualHeight * points.y;
                uiCursor.Height = uiCursor.Width = points.z * 100;

                // Move the screen cursor
                Canvas.SetRight(uiCursor, (xScaled - uiCursor.Width / 2));
                Canvas.SetTop(uiCursor, (yScaled - uiCursor.Height / 2));

                // Update displayed data values
                uiX.Text = string.Format("X Axis: {0:0.###}", points.x);
                uiY.Text = string.Format("Y Axis: {0:0.###}", points.y);
                uiZ.Text = string.Format("Z Axis: {0:0.###}", points.z);
                uiBodySide.Text = string.Format("Controlling Hand: {0}", bodySide);
                uiButtonState.Text = string.Format("Button State (use 'Click' gesture to toggle): {0}",
                                                    buttonState);
            });
        }

        private void Window_Closing(object sender, System.ComponentModel.CancelEventArgs e)
        {
            update.Abort();
            cursorConfig.Dispose();
            cursorModule.Dispose();
            sm.Dispose();
            joystick.ResetVJD(joyID);
            joystick.RelinquishVJD(joyID);
        }

        private void sldSensitivity_ValueChanged(object sender,
                                                 RoutedPropertyChangedEventArgs<double> e)
        {
            var sliderControl = sender as Slider;

            switch (sliderControl.Name)
            {
                case "uiSliderX":
                    joySensitivityX = sliderControl.Value;
                    break;
                case "uiSliderY":
                    joySensitivityY = sliderControl.Value;
                    break;
                case "uiSliderZ":
                    joySensitivityZ = sliderControl.Value;
                    break;
            }
        }
    }
}

Code Details

To keep this code sample as simple as possible, all methods are contained in a single class. As shown in the source code presented in Table 2, the MainWindow class is composed of the following methods:

  • MainWindow() – Several private objects and member variables are declared at the beginning of the MainWindow class. These objects are instantiated and variables initialized in the MainWindow constructor.
  • ConfigureRealSense()– This method handles the details of creating the SenseManager object and hand cursor module, and configuring the cursor module.
  • Update()– As described in the Intel RealSense SDK Reference Manual, the SenseManager interface can be used either by procedural calls or by event callbacks. In the RSJoystick app we are using procedural calls as the chosen interfacing technique. The acquire/release frame loop runs in the Update() thread, independent of the main UI thread. This thread runs continuously and is where hand cursor data, gestures, and alert data is acquired.
  • ControlJoystick()– This method is called from the Update() thread when the user’s hand is detected. Adaptive Point data is passed to this method, along with the state of the virtual joystick button (toggled by the CURSOR_CLICK gesture). The Adaptive Point data is scaled using values from the sensitivity slider controls. The slider controls and scaling calculations allow the user to select the full-scale range of values that are sent to the vJoy SetAxis() method, which expects values in the range of 0 to 32768. With a sensitivity slider set to its maximum setting, the corresponding cursor data point will be converted to a value in the range of 0 to 32768. Lower sensitivity settings will narrow this range for the same hand trajectory. For example: 8192 to 24576.
  • Render()– This method is called from the Update() thread and uses the Dispatcher.Invoke() method to perform operations that will be executed on the UI thread. This includes updating the position of the ellipse on the canvas control and data values shown in the TextBlock controls.
  • sldSensitivity_ValueChanged() – This event handler is called whenever any of the slider controls is adjusted.

Using the Application

You can test the app by running vJoy Monitor from the Windows 10 Start menu (see Figure 3). As shown in Figure 7, you can monitor the effects of moving your hand in three axes and performing the click gesture to toggle button 1.


Figure 7: Testing the app with vJoy Monitor.

For a more fun and practical usage, you can run the flight simulator featured in Google Earth* (see Figure 1). According to their website, “Google Earth lets you fly anywhere on Earth to view satellite imagery, maps, terrain, 3D buildings, from galaxies in outer space to the canyons of the ocean.” (https://www.google.com/earth).

After downloading and installing Google Earth, refer to the instructions located here to run the flight simulator. Start by reducing the x- and y-axis sensitivity controls in RSJoystick to minimize the effects of hand motions on the airplane, and set the z-axis slider to its maximum position. After some experimentation you should be able to control the airplane using subtle hand motions.

Summary

This article provided a simple walkthrough describing how to create an Intel RealSense SDK-enabled joystick emulator app from scratch, and how to use the Hand Cursor Module supported by the SR300 camera.

About Intel RealSense Technology

To learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk.

About the Author

Bryan Brown is a software applications engineer at Intel Corporation in the Software and Services Group. 

Introducing the Intel® RealSense™ Camera SR300

$
0
0

Introduction

The Intel® RealSense™ camera SR300 is the latest front-facing camera in our product lineup. Intel has added a number of new features and significant improvements to the SR300 over the first-generation Intel® RealSense™ camera F200. The SR300 improves the depth range of the camera to 1.5 meters and provides dynamic motion capture with higher-quality depth data, decreased power consumption, and increased middleware quality and robustness. With 1080p full HD video image quality at up to 30 frames per second (FPS), or 720p HD video image quality at up to 60 FPS, the SR300 model provides improved Skype* support. The SR300 supports legacy Intel® RealSense™ camera F200 applications and RGB usage. The Intel® RealSense™ SDK has added a new 3D Cursor mode, improved background segmentation, and 3D object scanning for the SR300 camera. The article A Comparison of Intel® RealSense™ Front-Facing Camera SR300 and F200 shows the differences between the SR300 and F200 models,  and motivations to move to SR300.


Figure 1: SR300 camera model.

The dimensions of the SR300 camera are approximately 110 mm x 12.6 mm x 3.8–4.1 mm, and its weight is 9.4 grams. Its size and weight allow it to be clipped onto a mobile platform lid or desktop monitor and provide stable video output. The SR300 will be built into multiple form factors in 2016, including PCs, all-in-ones, notebooks, and 2-in-1s. The SR300 camera can use the Intel RealSense SDK for Windows* or the librealsense software. SDK support for the SR300 was added in version 2016 R1.

New Features and Improvements

New Features
  • Cursor mode
  • Person tracking
Improvements
  • Increased range and lateral speed
  • Improved color quality under low-light capture and improved RGB texture for 3D scan
  • Improved color and depth stream synchronization
  • Decreased power consumption

Visit A Comparison of Intel® RealSense™ Front-Facing Camera SR300 and F200 to learn more about Fast VGA and other new features and improvements.

Additional Intel® RealSense™ SDK Features Planned for the SR300 Camera

Future releases of the Intel® RealSense™ SDK will include great new updates and features: Auto-range, High Dynamic Range (HDR) mode, and Confidence Map.

Planned for Second Half of 2016

Auto-Range

Auto-range improves image quality, especially at close range, by controlling laser gain at close range and exposure at long range.

High Dynamic Range (HDR) Mode

High Dynamic Range (HDR) is a technique used to add more dynamic range to the image; the dynamic range is the ratio of light to dark in the image. With HDR mode enabled, images will be reproduced with more detail. HDR mode is useful in low-light or backlit conditions and can be used by applications that tolerate frame-rate variation.

With HDR mode enabled, images reveal more details of regular and highlighted hair:


Figure 2: Reveal more hair details.


Figure 3: Improve highlighted hair.

HDR mode will resolve confusing scenarios such as a black foreground over a black background, providing a significant improvement in background segmentation (BGS). HDR will be available only for BGS and initially may not be used at the same time as any other middleware. More information will be available in a future Intel RealSense SDK release.


Figure 4: Black hair over a black background.

Confidence Map

The Confidence Map feature will provide a confidence value, in the range of 0–15, associated with the depth map. The low range of 0–4 will provide more depth accuracy, while the full range will be helpful for blob segmentation, edge detection, and edge gap-fills.

SR300 Targeted Usages

  • Full-hand skeletal tracking and gesture recognition
  • Cursor mode
  • Head tracking
  • 3D segmentation and background removal
  • Depth-enhanced augmented reality
  • Voice command and control
  • 3D scanning: face and small object
  • Facial recognition

Camera Applications

Dynamic BGS

The User Segmentation module masks out the background when a user is in front of the camera so you can place the user’s face in front of a new background. This module is being integrated into video conferencing applications. With HDR mode enabled, the SR300 model provides high-quality masking and significantly improved color quality in low-light conditions.

3D Scanning

The SR300 model provides significantly improved color quality in low-light conditions, resulting in improved RGB texture that can be applied to the mesh to create a more appealing visualization than the F200 model. Either front-facing camera can scan the user’s face or a small object. However, with the SR300, the capture range increases to 70 cm at 50 FPS while capturing more detail than the F200 model. You can use the Intel RealSense SDK to create a 3D scan and then use Sketchfab* to share it on Facebook*. For more information on Sketchfab, visit Implementing Sketchfab Login in your app and Sketchfab Integration. The 3D scan module is being integrated into AAA games in order to capture and use end-user face scans on in-game characters.

Hand Gesture Recognition – Cursor Mode Only Available in SR300

There are three main tracking modes supported by the hand module: cursor mode, extremities, and full-hand mode. Cursor mode is a new feature that is only available with the SR300 camera. This mode returns a single point on the hand, allowing accurate and responsive tracking and gestures. Cursor mode is appropriate when faster, lighter-weight, more accurate hand tracking combined with a few highly robust gestures is sufficient. Cursor mode includes hand tracking movement and a click gesture. Compared with full-hand mode, it provides twice the range and speed, with low latency and low power consumption. A minimal initialization sketch follows the figure below.


Figure 5: Cursor mode.
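
To make the description above concrete, here is a rough C# initialization sketch for cursor mode using the Intel RealSense SDK. The names used (EnableHandCursor, PXCMHandCursorModule, PXCMCursorConfiguration, QueryAdaptivePoint, and related types) reflect our understanding of the SDK’s hand cursor module and may differ between SDK releases; treat them as assumptions rather than a definitive API listing.

public static class CursorModeSketch
{
    public static void Run()
    {
        // Names below are assumed from the SDK's hand cursor module; verify against
        // your SDK version. Error handling is omitted for brevity.
        PXCMSenseManager sm = PXCMSenseManager.CreateInstance();
        sm.EnableHandCursor();                               // request the hand cursor module
        sm.Init();

        PXCMHandCursorModule cursorModule = sm.QueryHandCursor();
        PXCMCursorConfiguration config = cursorModule.CreateActiveConfiguration();
        config.EnableAllGestures();                          // includes CURSOR_CLICK
        config.ApplyChanges();
        PXCMCursorData cursorData = cursorModule.CreateOutput();

        while (sm.AcquireFrame(true) >= pxcmStatus.PXCM_STATUS_NO_ERROR)
        {
            cursorData.Update();

            PXCMCursorData.ICursor cursor;
            if (cursorData.QueryCursorData(PXCMCursorData.AccessOrderType.ACCESS_ORDER_BY_TIME,
                                           0, out cursor) >= pxcmStatus.PXCM_STATUS_NO_ERROR)
            {
                // The adaptive point is a smoothed, normalized 3D position of the cursor.
                PXCMPoint3DF32 point = cursor.QueryAdaptivePoint();
                // Feed the point into joystick scaling, UI rendering, and so on.
            }

            sm.ReleaseFrame();
        }
        sm.Dispose();
    }
}

In the joystick app described earlier, the adaptive point retrieved in this loop is the value that gets scaled and forwarded to vJoy.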

Dual-Array Microphones

The Intel RealSense camera SR300 has a microphone array consisting of two microphones that provide audio input to the client system. Using the two microphones improves the voice module’s robustness in noisy environments.

Intel® RealSense™ Camera SR300 Details

Intel® RealSense™ camera SR300 specifications:
  • Range**: 0.2 meters to 1.2 meters, indoors and indirect sunlight
  • Depth/IR: 640x480 resolution at 60 FPS
  • Color camera**: up to 1080p at 30 FPS, 720p at 60 FPS
  • Depth camera**: up to 640x480 at 60 FPS (Fast VGA, VGA), HVGA at 110 FPS
  • IR camera**: up to 640x480 at 200 FPS
  • Motherboard interfaces: USB 3.0, 5V, GND
  • Developer kit dimensions**: 110 mm x 12.6 mm x 3.8–4.1 mm
  • Weight**: 9.4 grams
  • Required OS: Microsoft Windows* 10 64-bit RTM
  • Languages: C++, C#, Visual Basic*, Java*, JavaScript*

DCM Driver

The Intel® RealSense™ Depth Camera Manager (DCM) 3.x is required for the SR300 camera. As of this writing, the gold DCM version for the SR300 is DCM 3.0.24.59748, and updates will be provided through Windows Update on Windows 10. Visit the Intel RealSense SDK download page to download the latest DCM. For more information on the DCM, go to Intel RealSense Cameras and DCM Overview.

Firmware Updates

The Intel RealSense camera supports firmware updates provided by the DCM driver. If a firmware update is required, the DCM driver prompts the user, who must accept the update before it proceeds.

Hardware Requirements

To support the bandwidth needed by the Intel RealSense camera, a powered USB 3.0 port is required on the client system. The SR300 camera requires Windows 10 and a 6th generation Intel® Core™ processor or later. For details on system requirements and supported operating systems for SR300 and F200, visit the page Buy a Dev Kit.

Summary

This document summarized the new features of the front-facing Intel RealSense camera SR300 as provided by current and future versions of the Intel RealSense SDK. Go here to download the latest Intel RealSense SDK. You can order the new camera at http://click.intel.com/intel-realsense-developer-kit.html

Helpful References

Here is a collection of useful references for the Intel RealSense DCM and SDK, including release notes and instructions for how to download and update the software.

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

**Distances are approximate.
