
Intel® Software Development tools integration to Microsoft* Visual Studio 2017 issue


Issue: Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails on some systems. The problem is intermittent and not reproducible on every system. Any attempt to repair it fails with the message "Incomplete installation of Microsoft Visual Studio* 2017 is detected". Note that in some cases the installation may complete successfully with no errors or crashes; however, the integration with VS2017 is not installed. The issue may be observed with Intel® Parallel Studio XE 2017 Update 4, Intel® Parallel Studio XE 2018 Beta and later versions, as well as with Intel® System Studio installations.

Environment: Microsoft* Windows, Visual Studio 2017

Root Cause: A root cause was identified and reported to Microsoft*. Note that there may be different reasons for integration failures. We are documenting all cases and providing them to Microsoft for further root-cause analysis.

Workaround:

Note that with Intel Parallel Studio XE 2017 Update 4 there is no workaround for this integration problem. The following workaround is expected to be implemented in Intel Parallel Studio XE 2017 Update 5. It is implemented in Intel Parallel Studio XE 2018 Beta Update 1.

Integrate the Intel Parallel Studio XE components manually. You need to run all of the .vsix files from the corresponding folders:

  • C++/Fortran Compiler IDE: <installdir>/ide_support_2018/VS15/*.vsix
  • Amplifier: <installdir>/VTune Amplifier 2018/amplxe_vs2017-integration.vsix
  • Advisor: <installdir>/Advisor 2018/advi_vs2017-integration.vsix
  • Inspector: <installdir>/Inspector 2018/insp_vs2017-integration.vsix
  • Debugger: <InstallDir>/ide_support_2018/MIC/*.vsix
                      <InstallDir>/ide_support_2018/CPUSideRDM/*.vsix

If this workaround doesn't work and the installation still fails, please report the problem to Intel through the Intel® Developer Zone Forums or the Online Service Center. You will need to supply the installation log file and the error message from the Microsoft installer.


Coarray Fortran 32-bit doesn't work on 64-bit Microsoft* Windows


Version : Intel® Visual Fortran Compiler 17.0, 18.0

Operating System : Microsoft* Windows 10 64-bit, Microsoft* Windows Server 2012 R2 64-bit

Problem Description : Coarray Fortran 32-bit doesn't work on Microsoft* Windows 10 or Microsoft* Windows Server 2012 R2 (64-bit OS only) because the required utilities “mpiexec.exe” and “smpd.exe” do not work properly.

Resolution Status :

It is a compatibility issue. You need to change the compatibility properties in order to run “mpiexec.exe” and “smpd.exe” correctly. The following workaround should resolve the problem:

1. Go to the folder where your “mpiexec.exe” and “smpd.exe” files are located.
2. For both files, follow these steps:

  • Right click > Properties > Compatibility Tab
  • Make sure the “Run this program in compatibility mode for:” box is checked and Windows Vista (Service Pack 2) is chosen.
  • Click Apply and close the Properties window.

The Coarray Fortran 32-bit application should work fine if all steps are followed carefully.

Intel® Data Analytics Acceleration Library - Decision Trees


Introduction

The decision tree method is one of the most popular approaches in machine learning. Decision trees can easily be used to solve various classification and regression tasks. They are often appealing because of their versatility and because the model obtained by learning a decision tree is easy to interpret, even for a non-expert.

The versatility of decision trees is a consequence of two main factors. First, the decision tree method is a non-parametric machine learning method: using it does not require knowing or assuming the probabilistic characteristics of the data it is supposed to work with. Second, the decision tree method naturally incorporates mixtures of variables with different levels of measurement [1].

At the same time, the decision tree model is a white box: it is clear for which particular data a particular class (for the classification problem), or a particular value of the dependent variable (for the regression problem), will be predicted, and which features have an impact on this and how.

This article describes the decision trees algorithm and how Intel® Data Analytics Acceleration Library (Intel® DAAL) [2] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.

What is a Decision tree?

Decision trees partition the feature space into a set of hypercubes and then fit a simple model in each one. Such a simple model can be a prediction model that ignores all predictors and predicts the majority (most frequent) class (or the mean of the dependent variable for regression), also known as the 0-R or constant classifier.

Decision tree induction constructs a tree-like graph structure, as shown in the figure below, where each internal (non-leaf) node denotes a test on features, each branch descending from a node corresponds to an outcome of the test, and each external (leaf) node denotes the simple model mentioned above.

The test is a rule that depends on feature values and performs the partitioning of the feature space: each outcome of the test represents an appropriate hypercube associated with both the test and one of the descending branches. If the test is a Boolean expression (e.g. f < c or f = c, where f is a feature and c is a constant fitted during decision tree induction), the induced decision tree is a binary tree, and each of its non-leaf nodes has exactly two branches (“true” and “false”) according to the result of the Boolean expression. In this case, the left branch is often implicitly assumed to be associated with the “true” outcome, while the right branch is implicitly assumed to be associated with the “false” outcome.

Test selection is performed as a search through all reasonable tests to find the best one according to some criterion, called the split criterion. There are many widely used split criteria, including the Gini index [3] and Information Gain [4] for classification, and Mean-Squared Error (MSE) [3] for regression.
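As a concrete illustration, here is a small sketch (not Intel DAAL code) of one of these criteria: the Gini index of a candidate partition, computed from the per-class counts of the training observations that fall into each outcome of the test. The split with the lowest weighted impurity is preferred.

#include <vector>

/* Weighted Gini impurity of a candidate partition: countsPerOutcome[o][c] is the
   number of training observations of class c routed to outcome o of the test. */
double giniIndex(const std::vector<std::vector<int> > &countsPerOutcome)
{
    double total = 0.0, weighted = 0.0;
    for (size_t o = 0; o < countsPerOutcome.size(); ++o)
    {
        double n = 0.0;
        for (size_t c = 0; c < countsPerOutcome[o].size(); ++c) n += countsPerOutcome[o][c];
        if (n == 0.0) continue;
        double impurity = 1.0;
        for (size_t c = 0; c < countsPerOutcome[o].size(); ++c)
        {
            const double p = countsPerOutcome[o][c] / n;
            impurity -= p * p;
        }
        weighted += n * impurity;
        total    += n;
    }
    return total > 0.0 ? weighted / total : 0.0; /* lower is better */
}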

To improve prediction, a decision tree can be pruned [5]. Pruning techniques that are embedded in the training process are called pre-pruning, because they stop further growth of the decision tree. There are also post-pruning techniques that replace an already fully trained decision tree with another one [5].

For instance, Reduced Error Pruning (REP), described in [5], assumes the existence of a separate pruning dataset, each observation of which is used to obtain a prediction from the original (unpruned) tree. For every non-leaf subtree, the change in mispredictions over the pruning dataset that would occur if the subtree were replaced by the best possible leaf is examined:

ΔE = E_leaf − E_subtree

where E_subtree and E_leaf are the numbers of errors (for classification) or the MSE values (for regression) of the given subtree and of the best possible leaf that replaces it. If the new tree would give an equal or smaller number of mispredictions (ΔE ≤ 0) and the subtree contains no subtree with the same property, the subtree is replaced by the leaf. The process continues until any further replacement would increase mispredictions over the pruning dataset. The final tree is the most accurate subtree of the original tree with respect to the pruning dataset and is the smallest tree with that accuracy. The pruning dataset can be some fraction of the original training dataset (e.g. a randomly chosen 20% of the observations), but in that case those observations must be excluded from the training dataset.
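A minimal sketch (not Intel DAAL code) of this replacement test for the classification case: count the mispredictions of the subtree and of the best possible leaf on the pruning dataset, and replace the subtree if ΔE ≤ 0.

#include <vector>

/* Decide whether to replace a subtree with its best possible leaf, given the
   predictions of both on the pruning dataset and the true labels. */
bool shouldReplaceWithLeaf(const std::vector<int> &subtreePredictions,
                           const std::vector<int> &leafPredictions,
                           const std::vector<int> &trueLabels)
{
    int eSubtree = 0, eLeaf = 0;
    for (size_t i = 0; i < trueLabels.size(); ++i)
    {
        eSubtree += (subtreePredictions[i] != trueLabels[i]);
        eLeaf    += (leafPredictions[i]    != trueLabels[i]);
    }
    return (eLeaf - eSubtree) <= 0; /* Delta E = E_leaf - E_subtree */
}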

The prediction is performed by starting at the root node of the tree, applying the test specified by that node to the features of the given example, and then moving down the branch corresponding to the outcome of the test. This process is repeated for the subtree rooted at the new node. The final result of the prediction is the prediction of the simple model at the leaf node.
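The traversal can be sketched as follows (for illustration only; these are not the Intel DAAL data structures), for a binary decision tree whose tests have the form f < c:

#include <vector>

struct Node
{
    int    featureIndex; /* index of the feature f tested at this node          */
    double cutPoint;     /* the constant c of the Boolean test f < c            */
    double leafValue;    /* majority class (or mean response) stored in leaves  */
    Node  *left;         /* "true" branch  */
    Node  *right;        /* "false" branch */
};

/* Walk from the root to a leaf, following the branch chosen by each test. */
double predict(const Node *node, const std::vector<double> &x)
{
    while (node->left != NULL && node->right != NULL)
    {
        node = (x[node->featureIndex] < node->cutPoint) ? node->left : node->right;
    }
    return node->leafValue;
}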

Applications of Decision trees

Decision trees can be used in many real-world applications [6]:

  • Agriculture
  • Astronomy (e.g. for filtering noise from Hubble Space Telescope images)
  • Biomedical Engineering
  • Control Systems
  • Financial analysis
  • Manufacturing and Production
  • Medicine
  • Molecular biology
  • Object recognition
  • Pharmacology
  • Physics (e.g. for the detection of physical particles)
  • Plant diseases (e.g. to assess the hazard of mortality to pine trees)
  • Power systems (e.g. power system security assessment and power stability prediction)
  • Remote sensing
  • Software development (e.g. to estimate the development effort of a given software module)
  • Text processing (e.g. medical text classification)
  • Personal learning assistants
  • Classifying sleep signals

Advantages and disadvantages of Decision trees

Using Decision trees has advantages and disadvantages [7]:

  • Advantages
    • Simple to understand and interpret; the model is a white box.
    • Able to handle both numerical and categorical data.
    • Require little data preparation.
    • Non-statistical approach that makes no assumptions about the training data or prediction residuals; e.g., no distributional, independence, or constant variance assumptions.
    • Perform well even with large datasets.
    • Mirror human decision making more closely than other approaches.
    • Robust against collinearity.
    • Have built-in feature selection.
    • Have value even with small datasets.
    • Can be combined with other techniques.
  • Disadvantages
    • Trees do not tend to be as accurate as other approaches.
    • Trees can be very non-robust. A small change in the training data can result in a big change in the tree, and thus a big change in final predictions.
    • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node.
    • Decision-tree learners can create over-complex trees that do not generalize well from the training data. Mechanisms such as pruning are necessary to avoid this problem.
    • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large.

Intel® Data Analytics Acceleration Library

Intel® DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel® DAAL can be found in [2]. Intel® DAAL provides decision tree classification and regression algorithms.

Using Decision trees in Intel® Data Analytics Acceleration Library

This section shows how to invoke Decision trees classification and regression using Intel® DAAL.

Do the following steps to invoke the Decision tree classification algorithm from Intel® DAAL:

1.	Ensure that you have Intel® DAAL installed and the environment prepared. See details in [8, 9, 10] according to your operating system.
2.	Include header file daal.h into your application:
#include <daal.h>
3.	To simplify the usage of Intel® DAAL namespaces, we will use the following using directives:
using namespace daal;
using namespace daal::algorithms;
4.	We will assume that the training, pruning, and testing datasets are in appropriate .csv files. If so, we must read the first and second of them into Intel® DAAL numeric tables:
const size_t nFeatures = 5; /* Number of features in training and testing data sets */

/* Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv
   file */
FileDataSource<CSVFeatureManager> trainDataSource("train.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

/* Create Numeric Tables for training data and labels */
NumericTablePtr trainData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr trainGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr mergedData(new MergedNumericTable(trainData, trainGroundTruth));

/* Retrieve the data from the input file */
trainDataSource.loadDataBlock(mergedData.get());

/* Initialize FileDataSource<CSVFeatureManager> to retrieve the pruning input data from a
   .csv file */
FileDataSource<CSVFeatureManager> pruneDataSource("prune.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

/* Create Numeric Tables for pruning data and labels */
NumericTablePtr pruneData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr pruneGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr pruneMergedData(new MergedNumericTable(pruneData, pruneGroundTruth));

/* Retrieve the data from the pruning input file */
pruneDataSource.loadDataBlock(pruneMergedData.get());
5.	Create an algorithm object to train the model:
const size_t nClasses = 5;  /* Number of classes */
/* Create an algorithm object to train the Decision tree model */
decision_tree::classification::training::Batch<> algorithm1(nClasses);
6.	Pass the training data and labels with pruning data and labels to the algorithm:
/* Pass the training data set, labels, and pruning dataset with labels to the algorithm */
algorithm1.input.set(classifier::training::data, trainData);
algorithm1.input.set(classifier::training::labels, trainGroundTruth);
algorithm1.input.set(decision_tree::classification::training::dataForPruning, pruneData);
algorithm1.input.set(decision_tree::classification::training::labelsForPruning,
    pruneGroundTruth);
7.	Train the model:
/* Train the Decision tree model */
algorithm1.compute();
where algorithm1 is the variable defined in step 5.
8.	Store the result of training in a variable:
decision_tree::classification::training::ResultPtr trainingResult =
    algorithm1.getResult();
9.	Read testing dataset from appropriate .csv file:
/* Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from a .csv
   file */
FileDataSource<CSVFeatureManager> testDataSource("test.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

 /* Create Numeric Tables for testing data and labels */
NumericTablePtr testData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr testGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));

/* Retrieve the data from input file */
testDataSource.loadDataBlock(testMergedData.get());
10.	Create an algorithm object to test the model:
/* Create algorithm objects for Decision tree prediction with the default method */
decision_tree::classification::prediction::Batch<> algorithm2;
11.	Pass the testing data and trained model to the algorithm:
/* Pass the testing data set and trained model to the algorithm */
algorithm2.input.set(classifier::prediction::data, testData);
algorithm2.input.set(classifier::prediction::model,
    trainingResult->get(classifier::training::model));
12.	Test the model:
/* Compute prediction results */
algorithm2.compute();
13.	Retrieve the results of the prediction:
/* Retrieve algorithm results */
classifier::prediction::ResultPtr predictionResult = algorithm2.getResult();
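As an optional extra step, the predicted labels can be inspected through the numeric table block-access API. The following is a minimal sketch that assumes the variables from the steps above and adds #include <iostream> for printing:

/* Print the first few predicted labels next to the ground truth */
NumericTablePtr predictedLabels = predictionResult->get(classifier::prediction::prediction);

daal::data_management::BlockDescriptor<double> predBlock, truthBlock;
const size_t nRowsToPrint = 10;
predictedLabels->getBlockOfRows(0, nRowsToPrint, daal::data_management::readOnly, predBlock);
testGroundTruth->getBlockOfRows(0, nRowsToPrint, daal::data_management::readOnly, truthBlock);

const double *pred  = predBlock.getBlockPtr();
const double *truth = truthBlock.getBlockPtr();
for (size_t i = 0; i < nRowsToPrint; ++i)
{
    std::cout << "predicted: " << pred[i] << "   actual: " << truth[i] << std::endl;
}

predictedLabels->releaseBlockOfRows(predBlock);
testGroundTruth->releaseBlockOfRows(truthBlock);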

For decision tree regression, steps 1-4, 7, 9, and 12 are the same, while the others are very similar:

1.	Ensure that you have Intel® DAAL installed and the environment prepared. See details in [8, 9, 10] according to your operating system.
2.	Include header file daal.h into your application:
#include <daal.h>
3.	To simplify the usage of Intel® DAAL namespaces, we will use the following using directives:
using namespace daal;
using namespace daal::algorithms;
4.	We will assume that the training, pruning, and testing datasets are in appropriate .csv files. If so, we must read the first and second of them into Intel® DAAL numeric tables:
const size_t nFeatures = 5; /* Number of features in training and testing data sets */

/* Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv
   file */
FileDataSource<CSVFeatureManager> trainDataSource("train.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

/* Create Numeric Tables for training data and labels */
NumericTablePtr trainData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr trainGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr mergedData(new MergedNumericTable(trainData, trainGroundTruth));

/* Retrieve the data from the input file */
trainDataSource.loadDataBlock(mergedData.get());

/* Initialize FileDataSource<CSVFeatureManager> to retrieve the pruning input data from a
   .csv file */
FileDataSource<CSVFeatureManager> pruneDataSource("prune.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

/* Create Numeric Tables for pruning data and labels */
NumericTablePtr pruneData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr pruneGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr pruneMergedData(new MergedNumericTable(pruneData, pruneGroundTruth));

/* Retrieve the data from the pruning input file */
pruneDataSource.loadDataBlock(pruneMergedData.get());
5.	Create an algorithm object to train the model:
/* Create an algorithm object to train the Decision tree model */
decision_tree::regression::training::Batch<> algorithm;
6.	Pass the training data and labels with pruning data and labels to the algorithm:
/* Pass the training data set, dependent variables, and pruning dataset with dependent
   variables to the algorithm */
algorithm.input.set(decision_tree::regression::training::data, trainData);
algorithm.input.set(decision_tree::regression::training::dependentVariables,
    trainGroundTruth);
algorithm.input.set(decision_tree::regression::training::dataForPruning, pruneData);
algorithm.input.set(decision_tree::regression::training::dependentVariablesForPruning,
    pruneGroundTruth);
7.	Train the model:
/* Train the Decision tree model */
algorithm.compute();
where algorithm is the variable defined in step 5.
8.	Store the result of training in a variable:
decision_tree::regression::training::ResultPtr trainingResult =
    algorithm.getResult();
9.	Read testing dataset from appropriate .csv file:
/* Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from a .csv
   file */
FileDataSource<CSVFeatureManager> testDataSource("test.csv",
    DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

 /* Create Numeric Tables for testing data and labels */
NumericTablePtr testData(new HomogenNumericTable<>(nFeatures, 0,
    NumericTable::notAllocate));
NumericTablePtr testGroundTruth(new HomogenNumericTable<>(1, 0,
    NumericTable::notAllocate));
NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));

/* Retrieve the data from input file */
testDataSource.loadDataBlock(testMergedData.get());
10.	Create an algorithm object to test the model:
/* Create algorithm objects for Decision tree prediction with the default method */
decision_tree::regression::prediction::Batch<> algorithm2;
11.	Pass the testing data and trained model to the algorithm:
/* Pass the testing data set and trained model to the algorithm */
algorithm2.input.set(decision_tree::regression::prediction::data, testData);
algorithm2.input.set(decision_tree::regression::prediction::model,
    trainingResult->get(decision_tree::regression::training::model));
12.	Test the model:
/* Compute prediction results */
algorithm2.compute();
13.	Retrieve the results of the prediction:
/* Retrieve algorithm results */
decision_tree::regression::prediction::ResultPtr predictionResult =
    algorithm2.getResult();

Conclusion

The decision tree is a powerful method that can be used for both classification and regression. Intel® DAAL provides an optimized implementation of the decision tree algorithm. By using Intel® DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications; they only need to link their applications to the latest version of Intel® DAAL.

References

  1. https://en.wikipedia.org/wiki/Level_of_measurement
  2. https://software.intel.com/en-us/blogs/daal
  3. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone. Classification and Regression Trees. Chapman & Hall. 1984.
  4. J. R. Quinlan. Induction of Decision Trees. Machine Learning, Volume 1 Issue 1. pp. 81-106. 1986.
  5. J. R. Quinlan. Simplifying decision trees. International journal of Man-Machine Studies, Volume 27 Issue 3. pp. 221-234. 1987.
  6. http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node32.html
  7. https://en.wikipedia.org/wiki/Decision_tree_learning
  8. https://software.intel.com/en-us/get-started-with-daal-for-linux
  9. https://software.intel.com/en-us/get-started-with-daal-for-windows
  10. https://software.intel.com/en-us/get-started-with-daal-for-macos

Intel(R) Math Kernel Library - Introducing Vectorized Compact Routines


Introduction     

    Many high performance computing applications depend on matrix operations performed on large groups of matrices of small sizes. Intel® Math Kernel Library (Intel® MKL) 2018 and later versions provide new compact routines that include optimizations for problems of this type.

The main idea behind these compact routines is to create true SIMD computations, in which subgroups of matrices are operated on with kernels that abstractly appear as scalar kernels while registers are filled by cross-matrix vectorization. Intel MKL compact routines provide significant performance benefits compared to batched techniques (see https://software.intel.com/en-us/articles/introducing-batch-gemm-operations for more detailed information about Intel MKL Batch functions), while maintaining ease-of-use through the inclusion of compact service functions that facilitate the reformatting of matrix data for use in these routines.

Compact routines operate on matrices that have been packed into a contiguous segment of memory in an interleaved format, called compact format. Six compact routines have been introduced in Intel MKL 2018: general matrix-multiply (mkl_?gemm_compact), triangular matrix equation solve (mkl_?trsm_compact), inverse calculation (mkl_?getrinp_compact), LU factorization (mkl_?getrfnp_compact), Cholesky decomposition (mkl_?potrf_compact), and QR decomposition (mkl_?geqrf_compact). These routines can only be used for groups of matrices of identical dimensions, where the layout (row-major or column-major) and the stride are identical throughout the group. 

Compact Format

    In compact format, for real precisions, matrices are organized into packs of size V, where V is related to the SIMD vector length of the underlying architecture. Each pack is a 3D tensor with the matrix index incrementing the fastest. These packs can then be loaded into registers and operated on using SIMD instructions.

The picture below demonstrates the packing of a set of 4, 3 x 3, real-precision matrices into compact format. The pack length for this example is V = 2, resulting in 2 compact packs.

 

   Figure 1: Compact format for 4, 3 x 3, real precision matrices with pack length V = 2
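To make the layout concrete, the following sketch packs a group of column-major matrices into the interleaved format described above, with the matrix index varying fastest. It is for illustration only; in practice the mkl_?gepack_compact service routine should be used so that the architecture-optimal format returned by mkl_get_format_compact is applied.

#include <vector>

/* Pack nmat column-major m x n matrices (a[t] points to matrix t) into packs of
   length V: element (i, j) of matrix t lands in pack t / V at offset
   ((j * m + i) * V + t % V), so the matrix index increments the fastest. */
std::vector<double> packCompact(const std::vector<const double *> &a, int m, int n, int V)
{
    const int nmat   = (int)a.size();
    const int npacks = (nmat + V - 1) / V;
    std::vector<double> compact((size_t)npacks * m * n * V, 0.0);
    for (int t = 0; t < nmat; ++t)
    {
        double *pack = &compact[(size_t)(t / V) * m * n * V];
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                pack[(j * m + i) * V + (t % V)] = a[t][j * m + i];
    }
    return compact;
}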

The particular form of the packs for each architecture and problem precision is specified by an MKL_COMPACT_PACK enum type.

Before calling a BLAS or LAPACK compact function, the input data must be packed in compact format. After execution, the output data should be unpacked from this compact format, unless another compact routine will be called immediately following the first. Two service functions, mkl_?gepack_compact and mkl_?geunpack_compact, facilitate the process of storing matrices in compact format. It is recommended that the user call mkl_get_format_compact before calling the mkl_?gepack_compact routine to obtain the optimal format for performance, but advanced users can pack and unpack the matrices themselves and still use Intel MKL compact kernels on the packed set.

For more details, including a description of the compact format of complex-type arrays, see <Compact Format> in the Intel MKL User’s guide.

A SIMPLE VISUAL EXAMPLE

A simple compact version of a matrix multiplication is illustrated in this section, performing the operation C = A * B for a set of 4, 3 x 3, real-precision matrices. Generic (or batched) routines require 4 matrix-matrix multiplications to be performed for a problem of this type, as illustrated in Figure 2.

                               Figure 2: Generic GEMM for a set of 4, 3 x 3 matrices

Assuming that the matrices have been packed into compact format using a pack length of V = 2, the compact version of this problem involves two matrix-matrix multiplications, as illustrated in Figure 3.

                                Figure 3: Compact GEMM for a set of 4, 3 x 3 matrices

The elements of the matrices involved in these two multiplications are vectors of length V, which are loaded into registers and operated on as if they were a scalar element in an ordinary matrix-matrix multiplication. Clearly, it is optimal to have pack length V equal to the length of the SIMD registers of the architecture.
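The idea of Figure 3 can be written as an ordinary triple loop in which every "scalar" access touches V consecutive values, one per matrix in the pack. The sketch below (not the actual Intel MKL kernel) accumulates C += A * B for one compact pack of column-major m x k and k x n matrices packed as shown above; the innermost loop over v is what becomes a single SIMD operation when V matches the register length.

/* One compact pack: ap, bp, cp each hold V interleaved matrices (see packCompact). */
void compactGemmPack(const double *ap, const double *bp, double *cp,
                     int m, int k, int n, int V)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            for (int l = 0; l < k; ++l)
                for (int v = 0; v < V; ++v) /* vectorizable: V independent matrices */
                    cp[(j * m + i) * V + v] +=
                        ap[(l * m + i) * V + v] * bp[(j * k + l) * V + v];
}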

NUMERICAL LIMITATIONS

Compact routines are subject to a set of numerical limitations, and they skip most of the checks present in regular BLAS and LAPACK routines in order to allow effective vectorization. Error checking is the responsibility of the user. For more information on the limitations of compact routines, see <MKL User Guide Numerical Limitations>.

BLAS COMPACT ROUTINES

Intel MKL BLAS provides compact routines for general matrix-matrix multiplication and solving triangular matrix equations. The following table provides a brief description of the new routines. For detailed information on usage for these routines, see the Intel MKL User’s Guide.

  • mkl_?gemm_compact: General matrix-matrix multiply. Performs the operation

    C = alpha*op(A)*op(B) + beta*C

    where op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H, alpha and beta are scalars, and A, B, and C are matrices stored in compact format.

  • mkl_?trsm_compact: Solves a triangular matrix equation. Computes the solution of one of the following matrix equations:

    op(A) * X = alpha * B, or X*op(A) = alpha*B

    where alpha is a scalar, X and B are m x n matrices stored in compact format, and A is a unit (or non-unit) triangular matrix stored in compact format.

LAPACK COMPACT ROUTINES

Intel MKL LAPACK provides compact functions to calculate QR, LU, and Cholesky decompositions, as well as inverses, in Intel MKL 2018 (and later versions). The compact routines for LAPACK follow the same optimization principles as the compact BLAS routines. The following table provides a brief description of the new routines. For detailed information on these routines, see the Intel MKL User’s Guide.

  • mkl_?geqrf_compact: QR decomposition. Computes the QR factorization of a set of general m x n matrices stored in the compact format.

  • mkl_?getrfnp_compact: LU decomposition, without pivoting. Computes the LU factorization, without pivoting, of a set of general m x n matrices A, which are stored in the array ap in the compact format (see Compact Format).

  • mkl_?getrinp_compact: Inverse, without pivoting. Computes the inverse of a set of LU-factorized (without pivoting) general matrices A, which are stored in the compact format (see Compact Format).

  • mkl_?potrf_compact: Cholesky decomposition. Computes the Cholesky factorization of a set of symmetric (Hermitian), positive-definite matrices stored in the compact format.

 

Example

The following example uses Intel MKL compact routines to calculate first the LU factorizations, then the inverses (from the LU factorizations), of a group of 2048, 8x8 matrices. Within this example, the same calculations are made using an OpenMP loop on the group of matrices. The time that each routine takes is printed so that the user can verify the performance improvement when using compact routines.

Notice that the routines mkl_dgetrfnp_compact and mkl_dgetrinp_compact are called between the mkl_dgepack_compact and mkl_dgeunpack_compact functions. Because the mkl_?gepack_compact and mkl_?geunpack_compact functions add overhead, users who call multiple compact routines on the same group of matrices will see the greatest performance benefit from using compact routines.

The complex compact routines are executed similarly, but it is important to note that for complex precisions, all input parameters are of real type. For more details, see <Compact Format> in the Intel MKL User’s guide. Examples of the calling sequences for each individual routine can be found in the Intel MKL 2018 product.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "mkl.h"

#define N                        8
#define NMAT                  2048

#define NITER_WARMUP            10

void test(double *t_compact, double *t_omp) {
    MKL_INT i, j;

    MKL_LAYOUT layout = MKL_COL_MAJOR;
    MKL_INT m = N;
    MKL_INT n = N;
    MKL_INT lda = m;

    MKL_INT info;
    MKL_COMPACT_PACK format;
    MKL_INT nmat = NMAT;

    /* Set up standard arrays in P2P (pointer-to-pointer) format */
    MKL_INT a_size = lda * n;
    MKL_INT na = a_size * nmat;
    double *a_ref = (double *)mkl_malloc(na * sizeof(double), 128);
    double *a = (double *)mkl_malloc(na * sizeof(double), 128);
    double *a_array[NMAT];
    double *a_compact;

    /* For random generation of matrices */
    MKL_INT idist = 1;
    MKL_INT iseed[] = { 0, 1, 2, 3 };
    double diag_offset = (double)n;

    /* For workspace calculation */
    MKL_INT imone = -1;
    MKL_INT lwork;
    double work_query[1];
    double *work_compact;

    /* For threading */
    MKL_INT nthr = omp_get_max_threads();
    MKL_INT ithr;
    MKL_INT lwork_i;
    double *work_omp;
    double *work_i = NULL;

    /* For setting up compact arrays */
    MKL_INT a_buffer_size;
    MKL_INT ldap = lda;
    MKL_INT sdap = n;

    /* Random generation of matrices */
    dlarnv(&idist, iseed, &na, a);

    for (i = 0; i < nmat; i++) {
        /* Make matrix diagonal dominant to avoid accuracy issues
                 in the non-pivoted LU factorization */
        for (j = 0; j < m; j++) {
            a[i * a_size + j + j * lda] += diag_offset;
        }
        a_array[i] = &a[i * a_size];
    }
    /* Set up a_ref to use in OMP version */
    for (i = 0; i < na; i++) {
        a_ref[i] = a[i];
    }

    /* -----Start Compact----- */

    /* Set up Compact arrays */
    format = mkl_get_format_compact();

    a_buffer_size = mkl_dget_size_compact(ldap, sdap, format, nmat);

    a_compact = (double *)mkl_malloc(a_buffer_size, 128);

    /* Workspace query */
    mkl_dgetrinp_compact(layout, n, a_compact, ldap, work_query, imone, &info, format, nmat);
    lwork = (MKL_INT)work_query[0];
    work_compact = (double *)mkl_malloc(sizeof(double) * lwork, 128);

    /* Start timing compact */
    *t_compact = dsecnd();

    /* Pack from P2P to Compact format */
    mkl_dgepack_compact(layout, n, n, a_array, lda, a_compact, ldap, format, nmat);

    /* Perform Compact LU Factorization */
    mkl_dgetrfnp_compact(layout, n, n, a_compact, ldap, &info, format, nmat);

    /* Perform Compact Inverse Calculation */
    mkl_dgetrinp_compact(layout, n, a_compact, ldap, work_compact, lwork, &info, format, nmat);

    /* Unpack from Compact to P2P format */
    mkl_dgeunpack_compact(layout, n, n, a_array, lda, a_compact, ldap, format, nmat);

    /* End timing compact */
    *t_compact = dsecnd() - *t_compact;
    /* -----End Compact----- */

    /* -----Start OMP----- */
    for (i = 0; i < nmat; i++) {
        a_array[i] = &a_ref[i * a_size];
    }

    /* Workspace query */
    mkl_dgetrinp(&n, a_array[0], &lda, work_query, &imone, &info);
    lwork = (MKL_INT)work_query[0] * nthr;
    work_omp = (double *)mkl_malloc(sizeof(double) * lwork, 128);

    /* Start timing OMP */
    *t_omp = dsecnd();

    /* OpenMP loop */
    #pragma omp parallel for private(ithr, lwork_i, work_i, info)
    for (i = 0; i < nmat; i++) {
        /* Set up workspace for thread */
        ithr = omp_get_thread_num();
        lwork_i = lwork / nthr;
        work_i = &work_omp[ithr * lwork_i];

        /* Perform LU Factorization */
        mkl_dgetrfnp(&n, &n, a_array[i], &lda, &info);

        /* Perform Inverse Calculation */
        mkl_dgetrinp(&n, a_array[i], &lda, work_i, &lwork_i, &info);
    }

    /* End timing OMP */
    *t_omp = dsecnd() - *t_omp;
    /* -----End OMP----- */

    /* Deallocate arrays */
    mkl_free(a_compact);
    mkl_free(a);
    mkl_free(a_ref);
    mkl_free(work_compact);
    mkl_free(work_omp);
}

int main() {
    MKL_INT i = 0;
    double t_compact;
    double t_omp;
    double flops = NMAT * ((2.0 / 3.0 + 4.0 / 3.0) * N * N * N);
    for (i = 0; i < NITER_WARMUP; i++) {
        test(&t_compact, &t_omp);
    }
    test(&t_compact, &t_omp);
    printf("N = %d, NMAT = %d\n", N, NMAT);
    printf("Compact time = %fs, GFlops = %f\n", t_compact, flops / t_compact / 1e9);
    printf("OMP     time = %fs, GFlops = %f\n", t_omp,     flops / t_omp / 1e9);
    return 0;
}

PERFORMANCE RESULTS

The following four charts demonstrate the performance improvement for the following operations: general matrix-matrix multiplication (GEMM), triangular matrix equation solve (TRSM), non-pivoting LU-factorization of a general matrix (GETRFNP), and inverse calculation of an LU-factorized (without pivoting) general matrix (GETRINP). The results were measured against calls to the generic BLAS and LAPACK functions, as in the above example.

 

Introducing Batch GEMM Operations


The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not optimally use all the cores. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires a significant programming effort that involves threads creation/termination, synchronization, and load balancing. That is, until now. 

Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:

Example

Let A0, A1 be two real double precision 4x4 matrices; Let B0, B1 be two real double precision 8x4 matrices. We'd like to perform these operations:

C0 = 1.0 * A0 * B0^T, and C1 = 1.0 * A1 * B1^T

where C0 and C1 are two real double precision 4x8 result matrices. 

Again, let X0, X1 be two real double precision 3x6 matrices; Let Y0, Y1 be another two real double precision 3x6 matrices. We'd like to perform these operations:

Z0 = 1.0 * X0 * Y0^T + 2.0 * Z0, and Z1 = 1.0 * X1 * Y1^T + 2.0 * Z1

where Z0 and Z1 are two real double precision 3x3 result matrices.

We could accomplish these multiplications using four individual calls to the standard DGEMM API. Instead, here we use a single "Batch GEMM" call for the same operations, with potentially improved overall performance. We illustrate this using the "cblas_dgemm_batch" function in the example below.

#define    GRP_COUNT    2

MKL_INT    m[GRP_COUNT] = {4, 3};
MKL_INT    k[GRP_COUNT] = {4, 6};
MKL_INT    n[GRP_COUNT] = {8, 3};

MKL_INT    lda[GRP_COUNT] = {4, 6};
MKL_INT    ldb[GRP_COUNT] = {4, 6};
MKL_INT    ldc[GRP_COUNT] = {8, 3};

CBLAS_TRANSPOSE    transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE    transB[GRP_COUNT] = {CblasTrans, CblasTrans};

double    alpha[GRP_COUNT] = {1.0, 1.0};
double    beta[GRP_COUNT] = {0.0, 2.0};

MKL_INT    size_per_grp[GRP_COUNT] = {2, 2};

// Total number of multiplications: 4
double    *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;

// Call cblas_dgemm_batch
cblas_dgemm_batch (
        CblasRowMajor,
        transA,
        transB,
        m,
        n,
        k,
        alpha,
        a_array,
        lda,
        b_array,
        ldb,
        beta,
        c_array,
        ldc,
        GRP_COUNT,
        size_per_grp);



The "Batch GEMM" interface resembles the GEMM interface: it is simply a matter of passing arguments as arrays of pointers to matrices and parameters, instead of as matrices and the parameters themselves. We see that it is possible to batch multiplications of different shapes and parameters by packaging them into groups. Each group consists of multiplications with the same matrix shape (same m, n, and k) and the same parameters.

Performance

While this example does not show performance advantages of "Batch GEMM", when you have thousands of independent small matrix multiplications then the advantages of "Batch GEMM" become apparent. The chart below shows the performance of 11K small matrix multiplications with various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel Xeon processor (Haswell). The performance metric is Gflops, and higher bars mean higher performance or a faster solution.

The second chart shows the same benchmark running on a 61-core Intel Xeon Phi coprocessor (KNC). Because "Batch GEMM" is able to exploit parallelism using many concurrent threads, its advantages are more evident on architectures with a larger core count.

Summary

This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation.  


Wrong Intel® Fortran compiler version displayed in Microsoft* Visual Studio 2012


Issue: Microsoft* Visual Studio 2012 is supported by Intel® Parallel Studio XE 2017 but not by Intel® Parallel Studio XE 2018. The wrong Intel® Fortran compiler version is displayed in Microsoft* Visual Studio 2012 when both Intel® Parallel Studio XE 2017 and Intel® Parallel Studio XE 2018 are installed on the same system with Microsoft* Visual Studio 2012.

It may be observed while opening "Tools > Options > Intel Compilers and Tools > Visual Fortran > Compilers".
The 'selected compiler' may be shown as "Intel(R) Visual Fortran Compiler 18.0", which is not correct.

Once the compilation process is invoked, the correct compiler version is used, but the output window shows the wrong compiler name "Intel(R) Visual Fortran Compiler 18.0". For example, with both the 17.0 Update 4 and 18.0 compiler versions installed:

1>------ Rebuild All started: Project: Console8, Configuration: Debug Win32 ------
1>Deleting intermediate files and output files for project 'Console8', configuration 'Debug|Win32'.
1>Compiling with Intel(R) Visual Fortran Compiler 18.0.0.118 [IA-32]...
1>Console8.f90
1>Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on IA-32, Version 17.0.4.210 Build 20170411
1>Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
1>Linking...

Environment: Both Intel(R) Parallel Studio XE 2017 Update 4 and Intel(R) Parallel Studio XE 2018 are installed, Microsoft* Visual Studio 2012 is installed

Root Cause: A root cause was identified and will be fixed in an upcoming compiler version.

Workaround:

The user should select "Intel(R) Visual Fortran Compiler 17.0" in the 'Select compiler' option at "Tools > Options > Intel Compilers and Tools > Visual Fortran > Compilers". Then the correct name and compiler will be displayed as expected:

1>------ Rebuild All started: Project: Console8, Configuration: Debug Win32 ------
1>Deleting intermediate files and output files for project 'Console8', configuration 'Debug|Win32'.
1>Compiling with Intel(R) Visual Fortran Compiler 17.0.4.210 [IA-32]...
1>Console8.f90
1>Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on IA-32, Version 17.0.4.210 Build 20170411
1>Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
1>Linking...

An update to the integration of Intel® Media SDK and FFmpeg


Introduction

Intel® GPUs contain fixed-function hardware to accelerate video encode, decode, and frame processing, which can now be used through a variety of interfaces.  Media SDK and Media Server Studio provide great performance with an API designed around delivering full hardware capabilities, and the API is portable between OSes.  However, there is a big limitation: the Media SDK API only processes video elementary streams.  FFmpeg is one of the most popular media frameworks.  It is open source and easily extendable; because of this it has a very wide range of functionality beyond just codecs: muxing and demuxing (splitting), audio, network streaming, and more.  It is straightforward to extend FFmpeg with wrappers for Intel® HW acceleration.  Various forms of these wrappers have existed for many years, and they provide important ease-of-use benefits compared to writing encode/decode code directly with the Media SDK API.  However, the tradeoff for this ease of use is that performance is left on the table.  To get the best of both worlds – full performance and access to the full range of capabilities in FFmpeg – a hybrid approach is recommended.

Intel® provides several ways for you to use hardware acceleration in FFmpeg.  

  • FFmpeg wrappers for lower-level APIs "underneath" Media SDK in the stack: libva (Linux) and DXVA (Windows).
  • FFmpeg supports the default Media SDK plugin; this article describes the transcoding performance of the plugin, and the detailed installation and validation guide is here.
  • The Intel® FFmpeg plug-in project is a fork of FFmpeg that attempts to explore additional options to improve performance for Intel hardware within the FFmpeg framework.
  • A 2012 article by Petter Larsson began exploring how to use the FFmpeg libav* APIs and Media SDK APIs together in the same application.

This article provides important updates to the 2012 article.  It describes the process of using the FFmpeg libraries on Ubuntu 16.04.  The example code is based on our tutorial code, so the user will have a better view of how the FFmpeg API is integrated with the media pipeline.  The example code also updates the deprecated FFmpeg APIs so that it is in sync with the latest FFmpeg releases.

Build FFmpeg libraries and run the tutorial code

Requirements

  • Hardware: An Intel® hardware platform with Intel® Quick Sync Video capability. It is recommended to use the latest hardware generation, since it has better support. For Linux, a computer with a 5th or 6th generation Intel® Core™ processor; for Windows®, 5th generation or later.
  • Linux OS: The sample code was tested on Ubuntu 16.04.3 LTS, but the user can try other Linux distributions such as CentOS.
  • Intel® Media Server Studio: For the hardware you have, go to the MSS documentation page to check the release notes and identify the right MSS version. For the latest release, click the Linux link under "Essential/Community Edition"; for previous releases, click the link under "Historical release notes and blogs".
  • FFmpeg: Use the latest release from the FFmpeg website; for this article, version 3.4 is used.
  • Video File: Any MP4 container with H.264 video content; for testing purposes, we use BigBuckBunny_320x180.mp4.

Project File Structure

The project to run the tutorial has the following file structure:

Folder: simple_decode_ffmpeg
Content: src/simple_decode_ffmpeg.cpp, Makefile
Notes: simple_decode_ffmpeg.cpp is the Media SDK application that creates a simple decode pipeline and calls the functions defined in ffmpeg_utils.h to hook up the demux APIs of the FFmpeg library.

Folder: simple_encode_ffmpeg
Content: src/simple_encode_ffmpeg.cpp, Makefile
Notes: simple_encode_ffmpeg.cpp is the Media SDK application that creates a simple encode pipeline and calls the FFmpeg adapter functions defined in ffmpeg_utils.h to hook up the mux APIs of the FFmpeg library.

Folder: common
Content: ffmpeg_utils.h, ffmpeg_utils.cpp
Notes: These files define and implement the API to initialize, execute, and close the mux and demux functions of the FFmpeg library.

Folder: $(HOME)/ffmpeg_build
Content: include, lib
Notes: The built FFmpeg libraries; the libraries involved are libavformat.so, libavcodec.so, and libavutil.so.

 

How to build and execute the workload

  1. Download the Media Server Studio and validate a successful installation
    • Based on the hardware platform, identify the right Media Server Studio version.
    • Go to the Media Server Studio landing page to download the release package.
    • Follow this instruction to install the Media Server Studio on Ubuntu 16.04; follow this instruction if you install on CentOS 7.3 (the instructions can also be found in the release package).
    • Follow the above instructions to validate the installation before the next step.
  2. Download the FFmpeg source code package and build the libraries.
    • Follow the instructions in the generic compilation guide of FFmpeg. In the guide, select Linux and the distribution you are working on, for example, the Ubuntu build instructions. This project requires the shared FFmpeg libraries; refer to the following instruction to build the final FFmpeg library.
    • After building the requested FFmpeg modules, build the shared libraries. Several arguments should be appended to the general instructions: when configuring the final build, append the following arguments to the original "./configure ..." command: "--enable-shared --enable-pic --extra-cflags=-fPIC", for example,
      PATH="$HOME/bin:$PATH" PKG_CONFIG_PATH="$HOME/ffmpeg_build/lib/pkgconfig" ./configure \
        --prefix="$HOME/ffmpeg_build" \
        --pkg-config-flags="--static" \
        --extra-cflags="-I$HOME/ffmpeg_build/include" \
        --extra-ldflags="-L$HOME/ffmpeg_build/lib" \
        --extra-libs=-lpthread \
        --bindir="$HOME/bin" \
        --enable-gpl \
        --enable-libass \
        --enable-libfdk-aac \
        --enable-libfreetype \
        --enable-libmp3lame \
        --enable-libopus \
        --enable-libtheora \
        --enable-libvorbis \
        --enable-libvpx \
        --enable-libx264 \
        --enable-libx265 \
        --enable-nonfree \
        --enable-shared \
        --enable-pic \
        --extra-cflags=-fPIC

       

    • Note: the general instructions download the latest (snapshot) package with the command "wget http://ffmpeg.org/releases/ffmpeg-snapshot.tar.bz2". There might be build/configure mistakes with this package since it is not an official release; please download your favorite release package if the build fails. For this tutorial, version 3.4 is used.
    • Set LD_LIBRARY_PATH to point to $HOME/ffmpeg_build/lib. It is recommended to set it as a system environment variable, for example, by adding it to /etc/environment; or the user can use the following command as a temporary way to set it in the current environment:
      # export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/ffmpeg_build/lib"
  3. Download the sample code attached to this article and uncompress it in a local directory.
  4. Build the source code
    • Add the library path to LD_LIBRARY_PATH. Check the Makefile in the directories "simple_decode_ffmpeg" and "simple_encode_ffmpeg"; notice "FFMPEG_BUILD=$(HOME)/ffmpeg_build". The directory "$(HOME)/ffmpeg_build" is the default build directory if you followed the general FFmpeg compilation instructions; if you have built the library in a different directory, you have to change the $(FFMPEG_BUILD) variable to that directory.
    • At the root directory of the project, run "make"; the binaries will be built in the "_build" directory.
    • The user can disable the audio code and check video only as follows:
      Remove "-DDECODE_AUDIO" from the Makefile in the simple_decode_ffmpeg project
      Remove "-DENCODE_AUDIO" from the Makefile in the simple_encode_ffmpeg project
    • The user can also turn off the debug build by changing the Makefile to switch the definition of "CFLAG".
  5. Run the binary with the video workload
    • Download the BigBuckBunny320x180.mp4 and save it locally.
    • Decode the video file with the following command:
      # _build/simple_decode_ffmpeg ~/Downloads/BigBuckBunny_320x180.mp4 out.yuv

      The command generates two output files: out.yuv, the raw video stream, and audio.dat, the raw 32-bit PCM audio stream.

    • Encode the result of the decoding with the following command:
      # _build/simple_encode_ffmpeg -g 320x180 -b 20000 -f 24/1 out.yuv out.mp4

      The command reads the raw audio with the name "audio.dat" by default.

 

Known Issue

  • When running the sample to validate the MSS installation, there is a failure when the patched kernel has not been applied to the platform. Run the following command to check (taking the patched 4.4 kernel as an example):
    uname -r
    4.4.0

    In the installation instructions, kernel 4.4 was patched; this provides the driver update needed to access the media fixed-function hardware. If the command doesn't show the expected kernel version, the user has to switch kernels at boot time from the grub option menu; to show the grub menu, refer to this page.

  • The following table shows all the video clips that were tested successfully; for other codecs and containers, please feel free to extend the current code.

    Tested containers and codecs (sample_decode_ffmpeg / sample_encode_ffmpeg):
    • .mp4: (h.264/hevc/MPEG2, aac) / (h.264, aac)
    • .mkv: (h.264/hevc/MPEG2, ac3) / (h.264, ac3)
    • .ts: (h264/hevc, ac3) / (MPEG2, aac)
    • .mpg, .mpeg: (MPEG2, ac3) / (MPEG2, aac)
  • The audio codecs use FFmpeg's libraries. Among the audio codecs, only AAC is well tested; Vorbis and AC3 have encoding errors, so the default audio for the containers ".mkv", ".mpeg", ".mpg", and ".ts" is forced to another audio codec.
  • To validate the successful installation of the Media Server Studio, after installing it, download the Media SDK sample from this page and run the following command:
    ./sample_multi_transcode -i::h264 test_stream.264 -o::h264 out.264
    Multi Transcoding Sample Version 8.0.24.698
    
    libva info: VA-API version 0.99.0
    libva info: va_getDriverName() returns 0
    libva info: User requested driver 'iHD'
    libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
    libva info: Found init function __vaDriverInit_0_32
    libva info: va_openDriver() returns 0
    Pipeline surfaces number (DecPool): 20
    MFX HARDWARE Session 0 API ver 1.23 parameters:
    Input  video: AVC
    Output video: AVC
    
    Session 0 was NOT joined with other sessions
    
    Transcoding started
    ..
    Transcoding finished
    
    Common transcoding time is 0.094794 sec
    -------------------------------------------------------------------------------
    *** session 0 PASSED (MFX_ERR_NONE) 0.094654 sec, 101 frames
    -i::h264 test_stream.264 -o::h264 out.264
    
    -------------------------------------------------------------------------------
    
    The test PASSED
    

The design of the mux/demux functions with the FFmpeg library APIs

The sample code is modified based on our original tutorial code, simple_decode and simple_encode. The calls for the FFmpeg integration are added to the original source code; the modified areas are wrapped by the following comment lines:

// =========== ffmpeg splitter integration ============
......

// =========== ffmpeg splitter integration end ============

Demux functions

The structure demuxControl keeps the control parameters of the demux process. The function openDemuxControl() initializes and configures the demuxControl structure, which is then used for the demux and decoding process. During decoding, the function ffmpegReadFrame() reads a video frame after demuxing; finally, the function closeDemuxControl() releases the system resources.

In the code, "DECODE_AUDIO" turns on audio decoding: it demuxes the audio stream and uses the FFmpeg audio decoder to uncompress it into the raw audio file "Audio.dat".
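These helpers wrap standard libavformat calls. As a point of reference, a stripped-down, standalone demux loop with the same flow (open the container, find the video stream, read its packets) is sketched below; it is a simplification, not the sample's actual ffmpeg_utils.cpp, and the packet data would be appended to the mfxBitstream that feeds the Media SDK decoder.

extern "C" {
#include <libavformat/avformat.h>
}
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    av_register_all();                           /* still required in FFmpeg 3.4 */

    AVFormatContext *fmt = NULL;
    if (avformat_open_input(&fmt, argv[1], NULL, NULL) < 0) return 1;
    if (avformat_find_stream_info(fmt, NULL) < 0) return 1;

    int video = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    if (video < 0) return 1;

    AVPacket pkt;
    av_init_packet(&pkt);
    while (av_read_frame(fmt, &pkt) >= 0)        /* one demuxed packet at a time */
    {
        if (pkt.stream_index == video)
            printf("video packet, size = %d\n", pkt.size);
        av_packet_unref(&pkt);
    }
    avformat_close_input(&fmt);
    return 0;
}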

Mux functions

The structure muxControl keeps the control parameters of the mux process. The function openMuxControl() initializes and configures the muxControl structure, which is then used for the encoding and mux process. During encoding, the function ffmpegWriteFrame() writes the encoded stream into the output container via the FFmpeg muxer; finally, the function closeMuxControl() releases the system resources.

In the code, "ENCODE_AUDIO" turns on audio encoding: it compresses and muxes the raw audio data from "Audio.dat" into the video container.
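On the mux side, the corresponding libavformat flow is sketched below (again a simplification, not the sample's implementation): create the output context, add a video stream whose parameters come from the Media SDK encoder, write the header, feed encoded packets, and finalize the container. In the real sample, the stream's codec parameters, including the SPS/PPS extradata produced by the encoder, must be filled in before the header is written.

extern "C" {
#include <libavformat/avformat.h>
}

int main()
{
    av_register_all();                                        /* FFmpeg 3.4 */

    AVFormatContext *ofmt = NULL;
    if (avformat_alloc_output_context2(&ofmt, NULL, NULL, "out.mp4") < 0) return 1;

    AVStream *st = avformat_new_stream(ofmt, NULL);
    st->codecpar->codec_type = AVMEDIA_TYPE_VIDEO;
    st->codecpar->codec_id   = AV_CODEC_ID_H264;
    st->codecpar->width      = 320;
    st->codecpar->height     = 180;
    st->time_base.num        = 1;
    st->time_base.den        = 24;
    /* st->codecpar->extradata (SPS/PPS from the encoder) is also needed for MP4 */

    if (avio_open(&ofmt->pb, "out.mp4", AVIO_FLAG_WRITE) < 0) return 1;
    if (avformat_write_header(ofmt, NULL) < 0) return 1;

    /* For each encoded frame from the Media SDK encoder, fill an AVPacket
       (data, size, pts/dts in st->time_base) and call:
           av_interleaved_write_frame(ofmt, &pkt);                           */

    av_write_trailer(ofmt);
    avio_closep(&ofmt->pb);
    avformat_free_context(ofmt);
    return 0;
}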

Reference

FFmpeg: examples

FFmpeg: build with shared libraries

Luca Barbato's blog about the bitstream filtering

Luca Barbato's blog about the new AVCodec API

 

Build an Autonomous Mobile Robot with the Intel® RealSense™ Camera, ROS*, and SAWR


Overview

The Simple Autonomous Wheeled Robot (SAWR) project defines the hardware and software required for a basic "example" robot capable of autonomous navigation using the Robot Operating System* (ROS*) and an Intel® RealSense™ camera. In this article, we give an overview of the SAWR project and also offer some tips for building your own robot using the Intel RealSense camera and SAWR projects.

Mobile Robots – What They Need

Mobile robots require the following capabilities:

  • Sense a potentially dynamic environment. The environment surrounding robots is not static. Obstacles, such as furniture, humans, or pets, are sometimes moving, and can appear or disappear.
  • Determine current location. For example, imagine that you are driving a car. You need to specify "Where am I?" in the map or at least know your position relative to a destination position.
  • Navigate from one location to another. For example, to drive your car to your destination, you need both driver (deciding on how much power to apply and how to steer) and navigator (keeping track of the map and planning a route to the destination) skills.
  • Interact with humans as needed. Robots in human environments need to be able to interact appropriately with humans. This may mean the ability to recognize an object as a human, follow him or her, and respond to voice or gesture commands.

The SAWR project, based on ROS and the Intel RealSense camera, covers the first three of these requirements. It can also serve as a platform to explore how to satisfy the last requirement: human interaction.

A Typical Robot Software Stack

To fulfill the above requirements, a typical robot software stack consists of many modules (see Figure 1). At the bottom of the stack, sensor hardware drivers, including those for the Intel RealSense camera in the case of the SAWR, deliver environmental information to a set of sensing modules. These modules recognize environmental information as well as human interaction. Several sources of information are fused to create various models: a world model, an estimate of the robot state (including position in the world), and command inputs (for example, voice recognition).

The Plan module decides how the robot will act in order to achieve a goal. For mobile robotics, the main purpose is navigating from one place to another, for which it is necessary to calculate obstacle-free paths given the current world model and state.

Based on the calculated plan, the Act module manages the actual movement of the robot. Typically, motor control is the main function of this segment, but other actions are possible, such as speech output. When carrying out an action, a robot may also be continuously updating its world model and replanning. For example, if an unexpected obstacle arises, the robot may have to update its model of the world and also replan its path. The robot may even make mistakes (for example, its estimate of its position in the world might be incorrect), in which case it has to figure out how to recover.

Autonomous navigation requires a lot of computation to do the above tasks. Some tasks can be offloaded to the cloud, but due to connectivity and latency issues this is frequently not an option. The SAWR robot can do autonomous navigation using only onboard computational resources, but the cloud can still be useful for adding other capabilities, such as voice control (for example, using Amazon Voice Services*).


Figure 1. A typical robot software stack.

Navigation Capabilities - SLAM

Simultaneous localization and mapping (SLAM) is one of the most vital capabilities for autonomous mobile robots. In a typical implementation, the robot navigates (plans paths) through a space using an occupancy map. This map needs to be dynamically updated as the environment changes. In lower-end systems, this map is typically 2D, but more advanced systems might use a 3D representation such as a point cloud. This map is part of the robot’s world representation. The “localization” part of SLAM means that in addition to maintaining the map, the robot needs to estimate where it is located in the map. Normally this estimation uses a probabilistic method; rather than a single estimated location, the robot maintains a probability distribution and the most probable location is used for planning. This allows the robot to recover from errors and reason about uncertainty. For example, if the estimate for the current location is too uncertain, the robot could choose to acquire more information from the environment (for example, by rotating to scan for landmarks) to refine its estimate.
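To make the probabilistic idea concrete, here is a toy illustration in Python (it is not amcl or any part of the SAWR stack, and the numbers are invented): the robot keeps a belief over a few discrete cells instead of a single position, and each observation re-weights that belief.

# Toy 1-D Bayes update: keep a belief over 5 discrete cells rather than one guess.
belief = [0.2, 0.2, 0.2, 0.2, 0.2]              # start fully uncertain

def update(belief, likelihood):
    # Weight each cell by how well the latest sensor data matches it, then renormalize.
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]

# An observation that is most consistent with the robot being in cell 3:
belief = update(belief, [0.1, 0.1, 0.2, 0.9, 0.1])
print("most probable cell:", belief.index(max(belief)))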

In the default SAWR software stack, the open source slam_gmapping package is used to create and manage the map, although there are several other options available, such as cartographer and rgbd-slam. This module is continually integrating new sensor data into the map and clearing out old data if it is proven incorrect. Another module, amcl, is used to estimate the current location by matching sensor data against the map. These modules run in parallel to constantly update the map and the estimate of the robot’s position. Figure 2 shows a typical indoor environment and a 2D map created by this process.
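As a minimal sketch of how these results are consumed, the following Python (rospy) node subscribes to the occupancy grid published by slam_gmapping and the pose estimate published by amcl. It assumes a running ROS stack with those nodes and the standard map and amcl_pose topic names.

#!/usr/bin/env python
# Minimal sketch: watch the SLAM map and the amcl pose estimate.
import rospy
from nav_msgs.msg import OccupancyGrid
from geometry_msgs.msg import PoseWithCovarianceStamped

def on_map(grid):
    info = grid.info
    rospy.loginfo("map update: %dx%d cells at %.3f m/cell",
                  info.width, info.height, info.resolution)

def on_pose(estimate):
    p = estimate.pose.pose.position
    rospy.loginfo("amcl pose estimate: x=%.2f y=%.2f", p.x, p.y)

rospy.init_node("slam_monitor")
rospy.Subscriber("map", OccupancyGrid, on_map)
rospy.Subscriber("amcl_pose", PoseWithCovarianceStamped, on_pose)
rospy.spin()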


Figure 2. Simultaneous localization and mapping (SLAM) with 2D mapping.

Hardware for Robotics

Figure 3 shows the hardware architecture of the SAWR project. Like many robotics systems, the architecture consists of a master and slave system. The master takes care of high-level processing (such as SLAM and planning), and the slave takes care of real-time processing (such as motor speed control). This is similar to how the brain and spinal reflexes work together in animals. Several different options can be used for this model, but typically a Linux* system is used for the master and one or more microcontroller units (MCUs) are used for the slave.


Figure 3. Robot architecture.

In this article, Intel RealSense cameras are used as the primary environmental sensor. These cameras provide depth data and can be used as input to a SLAM system. The Intel® RealSense™ camera R200 or Intel® RealSense™ camera ZR300 is used in the current SAWR project. The Intel® RealSense™ camera D400 series, shown in Figure 4, will soon become a common depth camera of choice; since it provides similar data with improved range and accuracy, and uses the same driver, an upgrade is straightforward. As for drivers, the librealsense and realsense_ros_camera drivers are available on GitHub*. You can use any Intel RealSense camera with them.


Figure 4. Intel® RealSense™ Depth Camera D400 Series.

For the master computer, you can choose from various hardware, including Intel® NUC with Intel® Core™ i5 and Intel® Core™ i7 processors (see Figure 5). This choice provides maximum performance for robotics development. You can also use OEM boards for robotics, such as one of the Aaeon UP* boards, for rapid prototype-to-production for robotics development. Even the diminutive Aaeon UP Core* has enough performance to do SLAM. The main requirement is that the board runs Linux. The SAWR software stack uses ROS, which runs best under Ubuntu*, although it is possible to install it under other distributions, such as Debian* or Yocto*.


Figure 5. Intel® NUC.

SAWR Basic Mobile Robot

The following is a spec overview of the SAWR basic mobile robot, shown in Figure 6, which is meant to be an inexpensive reference design that is easy to reproduce (the GitHub site includes the files to laser-cut your own frame). The SAWR software stack can be easily adapted to other robot frames. For this design, the slave computers are actually embedded inside the Dynamixel servos. The MCUs in these smart motors take care of low-level issues like position sensing and speed control, making the rest of the robot much simpler.

Computer: Aaeon UP board

Camera: Intel RealSense camera

Actuation: Two Dynamixel MX-12W* smart servos with magnetic encoders

Software: Xubuntu* 16.04 and ROS Kinetic*

Frame: Laser-cut acrylic or POM, Pololu sphere casters, O-ring tires and belt transmission

Other: DFRobot 25W/5V power regulator

Extras: Jabra Speak* 510+ USB speakerphone (for voice I/O, if desired)

Instructions and software: https://github.com/01org/sawr


Figure 6. SAWR basic mobile robot.

One of the distinctive aspects of the SAWR project is that both the hardware and the software have been developed in an open source style. The software is based on modifying and simplifying the Open Source Robotics Foundation Turtlebot* stack, but adds a custom motor driver using the Dynamixel Linux* SDK. For the hardware, the frame is parametrically modeled using OpenSCAD*, and then converted to laser-cut files using Inkscape*. You can download all the data from GitHub, and then make your own frame using a laser cutter (or a laser-cutting service). Most of the other parts are available from a hardware store. Detailed instructions, assembly, and setup plans are available online.

Using an OEM Board for Robotics

When you choose an OEM board for robotics, such as an UP board for SAWR or any other robotics system, using active cooling to get higher performance is strongly recommended. Robotics middleware usually consumes a lot of CPU resources, and a lack of CPU headroom can translate into low-quality or slow autonomous movement. With active cooling, you can maintain the CPU's highest speed indefinitely; in particular, the UP board can sustain turbo frequencies with active cooling and run at a much higher clock rate than without it.

You may be concerned about power resources for active cooling and higher clock rates. However, power consumption is not usually a limiting factor in robotics, because the motors are usually the primary power load. In fact, instead of the basic UP board, you can select the UP Squared*, which has much better performance.

Another issue is memory. The absolute minimum is 2 GB, but 4 GB is highly recommended. The SLAM system uses a lot of memory to maintain the world state and position estimate. Remember that the OS needs memory too, and Ubuntu tends to use about 500 MB doing nothing. So a 4 GB system has roughly 7x the space available for applications compared to a 1 GB system, not just 4x.

ROS Overview

Despite its name, ROS is not an OS, but a middleware software stack that can run on top of various operating systems, although it is primarily used with Ubuntu. ROS supports a distributed, concurrent processing model based on a graph of communicating nodes. Thanks to this basic architecture, you can not only easily network together multiple processing boards on the same robot if you need to, but you can also physically locate boards away from the actual robot by using Wi-Fi* (with some loss of performance and reliability, however). From a knowledge base perspective, ROS has a large community with many existing open source nodes supporting a wide range of sensors, actuators, and algorithms. That and its excellent documentation are good reasons to choose ROS. From a development and debugging perspective, various powerful and attractive visualization tools and simulators are also available and useful.

Basic ROS Concepts

This section covers the primary characteristics of the ROS architecture. To learn more, refer to the ROS documentation and tutorials.

  • Messages and topics (see Figure 7). ROS uses a publish and subscribe system for sending and receiving data on uniquely named topics. Each topic can have multiple publishers and subscribers. Messages are typed and can carry multiple elements. Message delivery is asynchronous, and it's usually recommended to use this for most interprocess communication in ROS.


    Figure 7. Messages and topics.

  • Service calls (see Figure 8). Service calls use synchronous remote procedure call semantics, also known as “request/response.” When using service calls, the caller blocks communication until a response is received. Due to this behavior, which can lead to various problems such as deadlocks and hung processes, you should consider whether you really need to build your communication with service calls. They are primarily used for updating parameters, where the buffering for messages creates too much overhead (for example, for updating maps) or where synchronization between activities is actually needed.


    Figure 8.  Service calls.

  • Actions (see Figure 9). Actions are used to define long-running tasks with goals, the possibility of failure, and where periodic status reports are useful. In the SAWR software stack, actions are mainly used for setting the destination goal and monitoring the progress of navigation tasks. Actions generally support asynchronous goal-directed behavior control based on a standard set of topics. In the case of SAWR, you can trigger a navigation action by using Rviz (the visualizer) and the 2D Nav Goal button; a scripted equivalent is sketched after this list.


    Figure 9. Actions.

  • Parameters (see Figure 10). Parameters are used to set various values for each node. A parameter server provides typed constant data at startup, and the latest version of ROS also supports dynamic parameter update after node launch. Parameters can be specified in various ways, including through the command line, parameter files, or launch file parameters.


    Figure 10. Parameters.

  • Other ROS concepts. There are several other important concepts relevant to the ROS architecture.
    • Packages: Collections of files used to implement or specify a service or node in ROS, built together using the catkin build system (typically).
    • Universal Robot Description Format (URDF): XML files describing joints and transformations between joints in a 3D model of the robot.
    • Launch files: XML files describing a set of nodes and parameters for a ROS graph.
    • YAML ("Yet Another Markup Language"): Used for parameter specification on the command line and in files.
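As mentioned in the Actions item above, the following is a minimal Python sketch (using rospy and actionlib) of sending a navigation goal to move_base, the scripted equivalent of pressing the 2D Nav Goal button in Rviz. It assumes move_base is running; the goal coordinates are placeholders.

#!/usr/bin/env python
# Minimal sketch: send one navigation goal to move_base as a ROS action.
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node("send_nav_goal")
client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
client.wait_for_server()

goal = MoveBaseGoal()
goal.target_pose.header.frame_id = "map"
goal.target_pose.header.stamp = rospy.Time.now()
goal.target_pose.pose.position.x = 1.0     # placeholder destination, 1 m ahead in the map frame
goal.target_pose.pose.orientation.w = 1.0  # keep the current heading

client.send_goal(goal)                     # the goal is accepted asynchronously
client.wait_for_result()                   # block until the action finishes
rospy.loginfo("navigation finished with state %d", client.get_state())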

ROS Tools

A lot of powerful development and debug tools are available for ROS. The following tools are typically used for autonomous mobile robots.

  • Rviz (see Figure 11). Visualize various forms of dynamic 3D data in context: transforms, maps, point clouds, images, goal positions, and so on.


    Figure 11. Rviz.

  • Gazebo. Robot simulator, including collisions, inertia, perceptual errors, and so on.
  • Rqt. Visualize graphs of nodes and topics.
  • Command-line tools. Listen to and publish on topics, make service calls, initiate actions. Can filter and monitor error messages.
  • Catkin. Build system and package management.

ROS Common Modules for Autonomous Movement

The following modules are commonly used for autonomous mobile robots, and SAWR adopts them as well.

  • tf (tf2) (see Figure 12). The coordinate transform library, one of the most important packages in ROS. Thanks to tf, you can manage all coordinate values, including the position of the robot or the relation between the camera and the wheels. To handle the various categories of coordinates, tf introduces several concepts such as frames and the transform tree; a short usage sketch follows this list.


    Figure 12. tf frame example.

  • slam_gmapping. ROS wrapper for OpenSlam's Gmapping. gmapping is one of the most famous SLAM algorithms. While still popular, there are also several alternatives now for this function.
  • move_base. Core module for autonomous navigation. This package provides various functions, including planning a route, maintaining cost maps, and issuing speed and direction commands for motors.
  • Robot_state_publisher. Publishes the 3D poses of the robot links, which are important for a manipulator or humanoid. In the case of SAWR, the most important data maintained by this module is the position and orientation of the robot and the location of the camera relative to the robot’s position.
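As mentioned in the tf item above, the following minimal Python sketch asks tf where the robot base is in the map frame. It assumes the SLAM and robot_state_publisher nodes are running and publishing a map -> base_link transform chain; the frame names may differ on other robots.

#!/usr/bin/env python
# Minimal sketch: report the robot base pose in the map frame using tf.
import rospy
import tf

rospy.init_node("where_am_i")
listener = tf.TransformListener()
rate = rospy.Rate(1.0)
while not rospy.is_shutdown():
    try:
        trans, rot = listener.lookupTransform("map", "base_link", rospy.Time(0))
        rospy.loginfo("base_link in map frame: x=%.2f y=%.2f", trans[0], trans[1])
    except (tf.LookupException, tf.ConnectivityException, tf.ExtrapolationException):
        pass  # transform not available yet; try again on the next tick
    rate.sleep()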

Tips for Building a Custom Robot using the SAWR Stack

SAWR consists of the following subdirectories, which you can use as-is if you want to utilize the complete SAWR software and hardware package (see Figure 13). You can also use them as a starting point for your original robot with the Intel RealSense camera. Also below are tips for customizing the SAWR stack for use with other robot hardware.

  • sawr_master: Master package, launch scripts.
    • Modify if you change another ROS module.
  • sawr_description: Runtime physical description (URDF files).
    • Modify urdf and xacro files according to your robot’s dimension (check with tf tree/frame).
  • sawr_base: Motor controller and hardware interfacing.
    • Prepare your own motor controller and odometry libraries.
  • sawr_scan: Camera configuration.
  • sawr_mapping: SLAM configuration.
    • You can begin as-is if you use the same Intel RealSense camera configuration with SAWR.
  • sawr_navigation: Move-base configuration.
    • Modify and tune parameters of global/local costmap, move_base. This is the most difficult part of tuning your own hardware.


Figure 13. SAWR ROS node graph viewed by rqt_graph.

Conclusion

Autonomous mobile robotics is an emerging area, but the technology for mobile robotics is already relatively mature. ROS is a key framework for robot software development that provides a wide range of modules covering many areas of robotics; the latest version is Lunar, the 12th generation. Robotics involves all aspects of computer science and engineering, including artificial intelligence, computer vision, machine learning, speech understanding, the Internet of Things, networking, and real-time control, and the SAWR project is a good starting point for developing ROS-based robotics.

About the Author

Sakemoto is an application engineer in the Intel® Software and Services Group. He is responsible for software enabling and also works with application vendors in the area of embedded systems and robotics. Prior to his current job, he was a software engineer for various mobile devices including embedded Linux and Windows*.


How to create video wall with MSDK sample_multi_transcode


1. Download and install MSDK samples

Download and install the MSDK samples from https://software.intel.com/en-us/intel-media-server-studio/code-samples or https://software.intel.com/en-us/media-sdk/documentation/code-samples

2. Create par file for sample_multi_transcode for 4 sources

-i::h264 crowd_run_1080p.264 -vpp_comp_dst_x 0 -vpp_comp_dst_y 0 -vpp_comp_dst_w 960 -vpp_comp_dst_h 540     -join -o::sink
-i::h264 crowd_run_1080p.264 -vpp_comp_dst_x 960 -vpp_comp_dst_y 0 -vpp_comp_dst_w 960 -vpp_comp_dst_h 540   -join -o::sink
-i::h264 crowd_run_1080p.264 -vpp_comp_dst_x 0 -vpp_comp_dst_y 540 -vpp_comp_dst_w 960 -vpp_comp_dst_h 540   -join -o::sink
-i::h264 crowd_run_1080p.264 -vpp_comp_dst_x 960 -vpp_comp_dst_y 540 -vpp_comp_dst_w 960 -vpp_comp_dst_h 540 -join -o::sink
-vpp_comp_only 4 -join -i::source

3. About the parameters used in sample_multi_transcode

-o::sink   | Output of this session serves as input for all sessions using the -i::source.
-i::source | The session receives the output of the session using the -o::sink option at input.
-join      | Join the session to another session.

For more information about the parameters, please refer to readme-multi-transcode.pdf in the source code folder.
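If you need a wall larger than 2x2, a small helper script can generate a par file in the same format as the example above. This is only a sketch (in Python, not part of the MSDK samples) and assumes one H.264 source stream and a 1920x1080 composition canvas.

# Sketch: generate a sample_multi_transcode par file for an N x M video wall.
def make_par(path, source, cols=2, rows=2, width=1920, height=1080):
    tile_w, tile_h = width // cols, height // rows
    lines = []
    for r in range(rows):
        for c in range(cols):
            lines.append(
                "-i::h264 {src} -vpp_comp_dst_x {x} -vpp_comp_dst_y {y} "
                "-vpp_comp_dst_w {w} -vpp_comp_dst_h {h} -join -o::sink".format(
                    src=source, x=c * tile_w, y=r * tile_h, w=tile_w, h=tile_h))
    lines.append("-vpp_comp_only {n} -join -i::source".format(n=cols * rows))
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

make_par("wall_2x2.par", "crowd_run_1080p.264")   # reproduces the 2x2 layout above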

A script for performance test with MSDK samples


1. Introduce the script

When the MSDK runs on different platforms, a performance test is usually needed as part of the evaluation. The MSDK samples are very good tools for this: they support the classic media pipelines (decode, VPP, and encode) and report useful information for calculating performance, such as running time and frame count. During a performance test we also need to track resource usage, including CPU, memory, and GPU utilization, so an automated script is a better way to handle this. The following sections introduce such a script.

2. Download the script from GitHub.

$ git clone https://github.com/zchrzhou/mdk-perf-script.git

We highly recommend reading: https://github.com/zchrzhou/mdk-perf-script/blob/master/readme.txt

3. Features of MSDK performance script

  • Perl script, easy to extend, automatically collects performance data (FPS/GPU/CPU/MEM usage).
  • Batch-runs many test cases in order.
  • Loops test cases for stress testing.
  • Supports MSDK sample_decode, sample_encode, sample_vpp, and sample_multi_transcode.
  • Multi-OS support: works on both Windows and Linux.

4. How to use this tool

$ ./run.sh or ./main.pl
Welcome to Intel MediaSDK sample multiable process test tool.
Enjoy and good luck.
Performance test with Intel MSDK sample
Use example:
    main.pl [--test <item-1> --test <item-n>] [--all] [--loop n] [--start n1 --end n2]
    main.pl [--test A01 --test B1 --test C1] [--loop 2]
    main.pl [--start 1 --end 10] [--loop 2]
    main.pl [--test A01] [--loop -1]

    --loop:         loop run (-1 will run forever)
    --start|--end:  run with the range from --start to --end
                    refer to lib/config.pl -> %conf{"range_template"}
    --all:          for all test items in lib/config.pl -> %test_map
    --test:         test item, refter to lib/config.pl -> %test_map

    --input-dir:    set input file folder
    --output-dir:   set output file folder
    --sample-dir:   set sample binary folder

    --with-output:  save output of transcode
    --with-par:     only for transcode test
    --with-gpu:     only for linux
    --with-cpu-mem  only for linux

5. Configure your test cases

Please modify config.pm to customize your tests.

$ vim lib/config.pm
### Test Map
##Transcode: ITEM => [channel-num, test-type, input-codec, input-file, output-codex, output-ext, head-args, tail-args]
##Decode:    ITEM => [channel-num, test-type, input-param, input-file, output-param, output-ext, head-args, tail-args]
##Encode:    ITEM => [channel-num, test-type, input-param, input-file, output-param, output-ext, head-args, tail-args]
##VPP:       ITEM => [channel-num, test-type, input-param, input-file, output-param, output-ext, head-args, tail-args]
our %test_map = (
    "A01" => [1,  "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1920 -h 1080 -u 7 -b 6000"],
    "A02" => [2,  "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1920 -h 1080 -u 7 -b 6000"],
    "B1"  => [5,  "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1280 -h 720 -u 7 -b 2048"],
    "C5"  => [4,  "transcode",  "-i::mpeg2", "1080p.m2t", "-o::h264", "h264", "", "-hw -w 176  -h 144  -u 7 -b 80"],
    "D2"  => [6,  "decode", "-i", "JOY_1080.h264", "-o", "yuv",  "h264", "-hw"],
    "D3"  => [2,  "decode", "-i", "JOY_1080.h264", "-r", "",     "h264", "-hw"],
    "E2"  => [6,  "encode", "-i", "JOY_1080.yuv", "-o", "h264", "h264", "-hw -w 1920  -h 1080"],
    "F1"  => [2,  "vpp", "-i", "JOY_1080.yuv", "-o", "yuv", "-lib hw", "-scc nv12 -dcc nv12 -sw 1920  -sh 1080 -dw 1280  -dh 720 -n 5"],
    "V0"  => [16, "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1920 -h 1080 -u 7 -b 15000"],
    "V1"  => [16, "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1920 -h 1080 -u 7 -b 15000"],
    "V2"  => [16, "transcode",  "-i::h264", "1080p.h264", "-o::h264", "h264", "", "-hw -w 1920 -h 1080 -u 7 -b 15000"],
);

6. A demo for run this scripts

$ ./run.sh --test A02 --with-fps --with-cpu-mem --with-par --with-gpu
Welcome to Intel MediaSDK sample multiable process test tool.
Enjoy and good luck.
mkdir -p input
mkdir -p output/A02
rm -rf   output/A02/*
rm -f input/1080p.h264
cp -f /home/zhoujd/perf-script/stream/1080p.h264 input/1080p.h264
Start top by child process
Test --with-par is used
Start top by child process
/home/zhoujd/perf-script/binary/sample_multi_transcode  -par output/A02/multi-channel.par > output/A02/A02-with-par.log
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
[sudo] password for zhoujd:
top process num: zhoujd   18207 18200  0 22:20 pts/5    00:00:00 /bin/bash /home/zhoujd/perf-script/tools/cpu_mem/top.sh 1 sample_m

gpu process num: zhoujd   18208 18200  0 22:20 pts/5    00:00:00 /bin/bash /home/zhoujd/perf-script/tools/gpu/metrics_monitor.sh

0-time: 4.21 sec
0-frames: 500
0-fps: 118.69
1-time: 4.21 sec
1-frames: 500
1-fps: 118.75
AVG FPS: 118.72
Num Stream: 2
CPU: 13.1 %
MEM: 46.86 MB
GPU: 58.00% 95.75% 0.00%
mv /home/zhoujd/perf-script/tools/cpu_mem/cpu_mem.txt output/A02
mv /home/zhoujd/perf-script/tools/gpu/gpu.log output/A02
Wait 2s ...
run finished

Explore Unity Technologies ML-Agents* Exclusively on Intel® Architecture


Abstract

This article describes how to install and run Unity Technologies ML-Agents* in CPU-only environments. It demonstrates how to:

  • Train and run the ML-Agents Balance Balls example on Windows* without CUDA* and cuDNN*.
  • Perform a TensorFlow* CMake build on Windows optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2).
  • Create a simple Amazon Web Services* (AWS) Ubuntu* Amazon Machine Image* environment from scratch without CUDA and cuDNN, build a “headless” version of Balance Balls for Linux*, and train it on AWS.

Introduction

Unity Technologies released their beta version of Machine Learning Agents* (ML-Agents*) in September 2017, offering an exciting introduction to reinforcement learning using their 3D game engine. According to Unity’s introductory blog, this open SDK will potentially benefit academic researchers, industry researchers interested in “training regimes for robotics, autonomous vehicle, and other industrial applications,” and game developers.

Unity’s ML-Agents SDK leverages TensorFlow* as the machine learning framework for training agents using a Proximal Policy Optimization (PPO) algorithm. There are several example projects included in the GitHub* download, as well as a Getting Started example and documentation on how to install and use the SDK.

One downside of the SDK for some developers is the implied dependency on CUDA* and cuDNN* to get the ML-Agents environment up and running. As it turns out, it is possible not only to explore ML-Agents exclusively on a CPU, but also to perform a custom build of TensorFlow on a Windows* 10 computer that includes optimizations for Intel® architecture.

In this article we show you how to:

  • Train and run the ML-Agents Balance Balls (see Figure 1) example on Windows without CUDA and cuDNN.
  • Perform a TensorFlow CMake build on Windows* optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2).
  • Create a simple Amazon Web Services* (AWS) Ubuntu* Amazon Machine Image* (AMI) environment from scratch without CUDA and cuDNN, build a “headless” version of Balance Balls for Linux*, and train it on AWS.


Figure 1. Trained Balance Balls model running in Unity* software.

 

Target Audience

This article is intended for developers who have had some exposure to TensorFlow, Unity software, Python*, AWS, and machine learning concepts.

System Configurations

The following system configurations were used in the preparation of this article:

Windows Workstation

  • Intel® Xeon® processor E3-1240 v5
  • Microsoft Windows 10, version 1709

Linux Server (Training)

  • Intel® Xeon® Platinum 8180 processor @ 2.50 GHz
  • Ubuntu Server 16.04 LTS

AWS Cloud (Training)

  • Intel® Xeon® processor
  • Ubuntu Server 16.04 LTS AMI

In the section on training ML-Agents in the cloud we use a free-tier Ubuntu Server 16.04 AMI.

Install Common Windows Components

This section describes the installation of common software components required to get the ML-Agents environment up and running. The Unity ML-Agents documentation contains an Installation and Setup procedure that links to a webpage instructing the user to install CUDA and cuDNN. Although this is fine if your system already has a graphics processing unit (GPU) card that is compatible with CUDA and you don’t mind the extra effort, it is not a requirement. Either way, we encourage you to review the Unity ML-Agents documentation before proceeding. 

There are essentially three steps required to install the common software components:

  1. Download and install Unity 2017.1 or later from the package located here.
  2. Download the ML-Agents SDK from GitHub. Extract the files and move them to a project folder of your choice (for example, C:\ml-agents).
  3. Download and install the Anaconda* distribution for Python 3.6 version for Windows, located here.

Install Prebuilt TensorFlow*

This section follows the guidelines for installing TensorFlow on Windows with CPU support only. According to the TensorFlow website, “this version of TensorFlow is typically much easier to install (typically, in 5 or 10 minutes), so even if you have an NVIDIA* GPU, we recommend installing this version first.” Follow these steps to install prebuilt TensorFlow on your Windows 10 system:

  1. In the Start menu, click the Anaconda Prompt icon (see Figure 2) to open a new terminal.


    Figure 2. Windows* Start menu.

  2. Type the following commands at the prompt:

    > conda create -n tensorflow-cpu python=3.5
    > activate tensorflow-cpu
    > pip install --ignore-installed --upgrade tensorflow

  3. As specified in the TensorFlow documentation, ensure the installation worked correctly by starting Python and typing the following commands:

    > python
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello')
    >>> sess = tf.Session()
    >>> print (sess.run(hello))

  4. If everything worked correctly, 'Hello' should print to the terminal as shown in Figure 3.


    Figure 3. Python* test output.

    You may also notice a message like the one shown in Figure 3, stating “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2.” This message may vary depending on the Intel® processor in your system; it indicates TensorFlow could run faster on your computer if you build it from sources, which we will do in the next section.

  5. To close Python, at the prompt, press CTRL+Z.
     
  6. Navigate to the python subdirectory of the ML-Agents repository you downloaded earlier, and then run the following command to install the other required dependencies:

    > pip install .

  7. Refer to the Building Unity Environment section of the “Getting Started with Balance Ball Example” tutorial to complete the ML-Agents tutorial.

Install TensorFlow from Sources

This section describes how to build an optimized version of TensorFlow on your Windows 10 system.

The TensorFlow website states, “We don't officially support building TensorFlow on Windows; however, you may try to build TensorFlow on Windows if you don't mind using the highly experimental Bazel on Windows or TensorFlow CMake build.” However, don’t let this discourage you. In this section we provide instructions on how to perform a CMake build on your Windows system.

The following TensorFlow build guidelines complement the Step-by-step Windows build instructions shown on GitHub. To get a more complete understanding of the build process, we encourage you to review the GitHub documentation before continuing. 

  1. Install Microsoft Visual Studio* 2015. Be sure to check the programming options as shown in Figure 4.


    Figure 4. Visual Studio* programming options.

  2. Download and install Git from here. Accept all default settings for the installation.
     
  3. Download and extract swigwin from here. Change folders to C:\swigwin-3.0.12 (note that the version number may be different on your system).
     
  4. Download and install CMake version 3.6 from here. During the installation, be sure to check the option Add CMake to the system path for all users.
     
  5. In the Start menu, click the Anaconda Prompt icon (see Figure 2) to open a new terminal. Type the following commands at the prompt:

    > conda create -n tensorflow-custom36 python=3.6
    > activate tensorflow-custom36

  6. Run the following command to set up the environment:

    > "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"

    (Note: If vcvarsall.bat is not found, try following the instructions provided here.)
     
  7. Clone the TensorFlow repository and create a working directory for your build:

    cd /
    > git clone https://github.com/tensorflow/tensorflow.git
    > cd tensorflow\tensorflow\contrib\cmake
    > mkdir build
    > cd build

  8. Type the following commands (Note: Be sure to check the paths and library version shown below on your own system, as they may be different):

    > cmake .. -A x64 -DCMAKE_BUILD_TYPE=Release ^
    -DSWIG_EXECUTABLE=C:\swigwin-3.0.12/swig.exe ^
    -DPYTHON_EXECUTABLE=C:/Users/%USERNAME%/Anaconda3/python.exe ^
    -DPYTHON_LIBRARIES=C:/Users/%USERNAME%/Anaconda3/libs/python36.lib ^
    -Dtensorflow_WIN_CPU_SIMD_OPTIONS=/arch:AVX2

  9. Build the pip package, which will be created as a .whl file in the directory .\tf_python\dist (for example, C:\tensorflow\tensorflow\contrib\cmake\build\tf_python\dist\tensorflow-1.4.0-cp36-cp36m-win_amd64.whl).

    > C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild /p:Configuration=Release tf_python_build_pip_package.vcxproj

    (Note: Be sure to check the path to MSBuild on your own system as it may be different.)
     
  10. Install the newly created TensorFlow build by typing the following command:

    pip install C:\tensorflow\tensorflow\contrib\cmake\build\tf_python\dist\tensorflow-1.4.0-cp36-cp36m-win_amd64.whl

  11. As specified in the TensorFlow documentation, ensure the installation worked correctly by starting Python and typing the following commands:

    > python
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello')
    >>> sess = tf.Session()
    >>> print (sess.run(hello))

  12. If everything worked correctly, 'Hello' should print to the terminal. Also, we should not see any build optimization warnings like we saw in the previous section (see Figure 5).


    Figure 5. Python* test output.

  13. To close Python, at the prompt, press CTRL+Z.
     
  14. Navigate to the python subdirectory of the ML-Agents repository you downloaded earlier, and then run the following command to install the other required dependencies:

    > pip install .

  15. Refer to the Building Unity Environment section of the “Getting Started with Balance Ball Example” tutorial to complete the ML-Agents tutorial.

Train ML-Agents in the Cloud

The ML-Agents documentation provides a guide titled “Training on Amazon–Web Service” that contains instructions for setting up an EC2 instance on AWS for training ML-Agents. Although this guide states, “you will need an EC2 instance which contains the latest Nvidia* drivers, CUDA8, and cuDNN,” there is a simpler way to do cloud-based training without the GPU overhead.

In this section we perform the following steps:

  • Create an Ubuntu Server 16.04 AMI (free tier).
  • Install prerequisite applications on Windows for interacting with the cloud server.
  • Install Python and TensorFlow on the AMI.
  • Build a headless Linux version of the Balance Balls application on Windows.
  • Export the Python code in the PPO.ipynb Jupyter Notebook* to run as a stand-alone script in the Linux environment.
  • Copy the python directory from Windows to the Linux AMI.
  • Run a training session on AWS for the ML-Agents Balance Balls application.
  1. Create an account on AWS if you don’t already have one. You can follow the steps shown in this section with an AWS Free Tier account; however, we do not cover every detail of creating an account and configuring an AMI, because the website contains detailed information on how to do this.
  2. Create an Ubuntu Server 16.04 AMI. Figure 6 shows the machine instance we used for preparing this article.


    Figure 6. Linux* Server 16.04 LTS Amazon Machine Image*.

  3. Install PuTTY* and WinSCP* on your Windows workstation. Detailed instructions and links for installing these components, connecting to your Linux instance from Windows using PuTTY, and transferring files to your Linux instance using WinSCP are provided on the AWS website.
     
  4. Log in to the Linux Server AMI using PuTTY, and then type the following commands to install Python and TensorFlow:

    > sudo apt-get update
    > sudo apt-get install python3-pip python3-dev
    > pip3 install tensorflow
    > pip3 install image

    Note: The next steps assume you have already completed the ML-Agents Getting Started with Balance Ball Example tutorial. If not, be sure to complete these instructions and verify you can successfully train and run a model on your local Windows workstation before proceeding.
     
  5. Ensure your Unity software installation includes Linux Build Support. You need to explicitly specify this option during installation, or you can add it to an existing installation by running the Unity Download Assistant as shown in Figure 7.


    Figure 7. Unity* software Linux* build support.

  6. In Unity software, open File – Build Settings and make the following selections:
    • Target Platform: Linux
    • Architecture: x86_64
    • Headless Mode: Checked
  7. These settings are shown in Figure 8.


    Figure 8. Unity* software build settings for headless Linux operation.

  8. After clicking Build, create a unique name for the application and save it in the repository’s python folder (see Figure 9). In our example we named it Ball3DHeadless.x86_64 and will refer to it as such for the remainder of this article.


    Figure 9. Build Linux* application.

  9. In order to run through a complete training session on the Linux AMI we will export the Python code in the PPO.ipynb Jupyter Notebook so it can run as a stand-alone script in the Linux environment. To do this, follow these steps:

    - In the Start menu, to open a new terminal, click the Anaconda Prompt icon (Figure 2).
    - Navigate to the python folder, and then type Jupyter Notebook on the command line.
    - Open the PPO.ipynb notebook, and then click File – Download As – Python (.py). This will save a new file named “ppo.py” in the Downloads folder of your Windows computer.
    - Change the filename to “ppo-test.py” and then copy it to the python folder in your ML-Agents repository.
    - Open ppo-test.py in a text editor, and then change the env_name variable to “Ball3DHeadless”:
    - env_name = "Ball3DHeadless" # Name of the training environment file.
    - Save ppo-test.py, and then continue to the next step.

  10. Once the application has been built for the Linux environment and the test script has been generated, use WinSCP to copy the python folder from your ML-Agents repository to the Ubuntu AMI. (Details on transferring files to your Linux instance using WinSCP are provided on the AWS website.)
  11. In the PuTTY console, navigate to the python folder and run the following commands:

    > cd python
    > chmod +x Ball3DHeadless.x86_64
    > python3 ppo-test.py
    If everything went well you should see the training session start up as shown in Figure 10.


    Figure 10. Training session running on an Amazon Web Services* Linux* instance.

Summary

In the output shown in Figure 10, notice that the time (in seconds) is printed to the console after every model save. Code was added to the ppo-test.py script for this article in order to get a rough measure of the training time between model saves.

To instrument the code we made the following modifications to the Python script:

import numpy as np
import os
import tensorflow as tf
import time # New Code
.
.
.
trainer = Trainer(ppo_model, sess, info, is_continuous, use_observations, use_states)
timer_start = time.clock() # New Code
.
.
.
save_model(sess, model_path=model_path, steps=steps, saver=saver)
print(" %s seconds " % (time.clock() - timer_start)) # New Code
timer_start = time.clock() # New Code
.
.
.
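Note that time.clock() was deprecated in Python 3.3 and removed in Python 3.8. If you run the exported script on a newer interpreter, the same instrumentation can be written with time.perf_counter(), for example:

import time

timer_start = time.perf_counter()  # New Code
# ... training loop and save_model(...) calls unchanged ...
print(" %s seconds " % (time.perf_counter() - timer_start))  # New Code
timer_start = time.perf_counter()  # New Code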

Using this informal performance metric, we found that the average difference in training time between a prebuilt TensorFlow GPU binary and prebuilt CPU-only binary on the Windows workstation was negligible. The training time for the custom CPU-only TensorFlow build was roughly 19 percent faster than the prebuilt CPU-only binary on the Windows workstation. When training was performed in the cloud, the AWS Ubuntu Server AMI performed roughly 29 percent faster than the custom TensorFlow build on Windows.

Cannot Find "stdint.h" after Upgrade to Visual Studio 2017


Problem Description

When running the Visual Studio 2017 C++ compiler under the Intel® C++ Compiler environment, or when building a Visual Studio 2017 solution that contains mixed projects using both the Intel compiler and the Visual Studio C++ compiler, you may encounter:

fatal error C1083: Cannot open include file: '../../vc/include/stdint.h': No such file or directory

Root Cause

Some Intel C++ compiler header files need to include particular Microsoft VC++ header files by path. With Microsoft Visual Studio 2015 and older, a relative path such as "../vc" could be used. Starting with Microsoft Visual Studio 2017, the include directory name contains the full VC Tools version number.

For example, the Visual Studio 2017 stdint.h is here:

c:/Program files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.10.24930/include/stdint.h

For Visual Studio 2015, it is here:

c:/Program files (x86)/Microsoft Visual Studio 14.0/VC/INCLUDE/stdint.h

Solution

Workaround is to define __MS_VC_INSTALL_PATH macro in command line (-D option), e.g.:

-D__MS_VC_INSTALL_PATH="c:/Program files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.10.24930" 

Fully resolving the issue still relies on Microsoft's support. Please see the issue we registered on Microsoft's forum:

https://visualstudio.uservoice.com/forums/121579-visual-studio-ide/suggestions/30930367-add-a-built-in-precompiled-macro-to-vc-that-poin

If you have encountered this issue, you are encouraged to vote for this idea at the link above.

Intel® Facial Special Effects Filter


The Intel® Facial Special Effects Filter allows the camera pipeline to apply different adjustment effects to the video.

The following features for video adjustment are supported:

Face Special Effect Features

The Lip Color module takes YUV input data. With the facial landmark information fed into the module, the module identifies the lip area when there is a face within the input frame. For input frames with detected faces, the module further performs color modification for the lip area, with the user's preferred level of adjustment input from the application.

The Dark Circles block takes YUV input data. Facial landmark points and skin-tone likelihood are utilized to locate the under-eye regions for processing. Once the regions are identified, content-adaptive blending is performed on the original YUV values, using the user's preferred level of adjustment input from the application.

The Under Eye module takes YUV input data. Facial landmark points and skin-tone likelihood information are utilized jointly to locate the under eye regions for processing. 

The Aging module takes YUV input data. Facial landmark information, semi-circle regions of the detected face, and the skin-tone likelihood map fed from the analytic module are utilized to locate the area around the eyes for processing. Once the target area is identified, a process driven by the user's preferred level of adjustment input from the application is applied to the YUV values of all pixels within the area.

The Eye Size module takes YUV input data. With the facial landmark information fed into the module and the users’ preference of level of enlargement input from the application, the module internally derives the proper location within the face and the shape of the eyes users intend to have. 

The Face Shape module takes YUV input data. With the facial landmark information fed into the module and the users’ preference of level of face change-effect input from the application, the module internally derives the different shape version of the original face area and performs morphological warping to create the face special filters effect.

The Skin Foundation module takes YUV input data and blends the input with a user-selected foundation color, where the per-pixel skin-tone likelihood score serves as the blending factor.

The Nose feature module takes YUV input data. With the facial landmark information fed into the module and the users’ preference of level of adjustment input from the application, the module internally derives the modified shape of the nose area and performs morphological warping to create the special effect.

The Cheekbone Color block takes YUV input data. Facial landmark and skin-tone likelihood information are utilized to locate the cheekbone regions for processing. Once identifying the regions, a content-adaptive blending with the users’ preference of level of adjustment input from the application is performed to blend the original YUV values and a pre-defined red color value for pixels within the cheekbone region.

Emotions Special Effects Feature

The Emotions module takes YUV input data. With the facial landmark information fed into the module and the users’ preference of level of adjustment input from the application, the module internally derives the modified shape of the mouth area and performs morphological warping to create the smiling face effect via changing the shape of users’ mouths.

Face Tone Special Effects Feature 

The Face Tone block takes YUV input data and adjusts all 3-channel information to produce a smooth version of the input. The skin-tone likelihood map fed from the analytic module with the users’ preference of level of adjustment input from the application is utilized to blend the input image and bilateral filtered image to produce the output smoothed image.
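For illustration only, the likelihood-weighted blending described above can be sketched in a few lines of Python with OpenCV. This is not the filter's implementation: it works on BGR rather than YUV for simplicity, and the frame, likelihood map, and strength value are placeholders.

import cv2
import numpy as np

def smooth_skin(bgr, likelihood, strength=0.5):
    # Blend a bilateral-filtered copy with the original, weighted per pixel
    # by skin-tone likelihood and the user-selected strength.
    smoothed = cv2.bilateralFilter(bgr, 9, 75, 75)
    weight = (likelihood * strength)[..., np.newaxis]
    out = weight * smoothed.astype(np.float32) + (1.0 - weight) * bgr.astype(np.float32)
    return out.astype(np.uint8)

frame = np.full((480, 640, 3), 180, dtype=np.uint8)   # placeholder camera frame
likelihood = np.zeros((480, 640), dtype=np.float32)   # placeholder skin-tone map
likelihood[120:360, 160:480] = 0.9                    # pretend this region is skin
result = smooth_skin(frame, likelihood, strength=0.7)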

 

Control Panel

We provide a utility that can enable and adjust the strength of each feature. This utility is in the attachment FBContorl_20180208.7z

 

Sample Source Code

We provide DirectShow based and MFT0 based sample source code to demonstrate the usage. The two filter samples are described below. If you are a camera IHV, you are welcome to integrate this filter into your camera driver package.

 

DirectShow based filter

The DirectShow based filter is a combination of an enumerated capture filter and a custom source filter. The enumerated capture filter is based on the Microsoft AmCap Sample, while the custom source filter implementation is based on the Microsoft Push Source Filters Sample.

DirectShow based filter wrapping camera video input with picture enhancement

 

The Null Renderer filter is a renderer that discards every sample it receives, without displaying or rendering the sample data. You can think of it as the /dev/null of Linux. (Null Renderer Filter)

 

A DirectShow standard filter that works as an interface to configure the physical capture device (color format/resolution/FPS). It exposes PIN_CATEGORY_CAPTURE for downstream filters to receive sample data. (DirectShow Video Capture Filters)

A DirectShow standard filter designed to grab sample data from upstream; it exposes both an input and an output pin. In the virtual camera we connect its input pin to the Capture Filter's capture output pin, and its output pin to the Null Renderer filter. (Sample Grabber Filter)

The custom source filter is implemented by deriving from the DirectShow CSource class. (Custom Source Filter)

 

A module that performs the face special effects via the D3D11 VideoProcessorBlt interface and the video processing extension. It supports NV12 input/output; it is not literally a DirectShow filter, but performs the face special effects function after the CSC stage.

A module that performs color space conversion via the D3D11 VideoProcessorBlt interface. It is used to convert the YUY2/I420 format into NV12 as the feed-in to the FB stage. It is not literally a DirectShow filter.

 

Build and Installation:

In the attachment Intel®_Facial_Special_Effects_Filter_dshow_20170810.7z

  1. Install Visual Studio 2015 and Windows 10 SDK (10.0.14393)
  2. Build Win32 baseclasses\Release\strmbase.lib via baseclasses\baseclasses.sln
  3. Build Win32 filters\vcam\Release\x86\vcam_x86.dll via filters\vcam\vcam.sln
  4. Copy and register vcam_x86.dll via the admin command "regsvr32.exe vcam_x86.dll"
  5. Download and install vc_redist.x86.exe from Visual C++ Redistributable for Visual Studio 2015 
  6. Run 32-bit desktop camera app and select virtual camera device "Face Effects Camera"
  7. Change registry keys under [HKEY_CURRENT_USER\SOFTWARE\Intel\Display\DXVA\Intel Dxva Virtual Camera] or run virtual_camera_face_effects_setting.reg
       [0]: default strength
       [1-100]: face special effect strength
       [101]: disable face special effect
  8. Check the face effects.

Uninstall: regsvr32.exe /u vcam_x86.dll

Capabilities for DirectShow based filter:

  1. Only supports cameras with MJPG or YUY2
  2. Color format: the filter outputs YUY2
  3. FPS: 60 fps
  4. Three available resolutions: 1280x720, 640x480, and 640x360 *
  5. Does not support rotation mode, since DShow apps aren't available under Metro mode.
  6. Face special effects for up to 8 human faces.
  7. Only supports DShow apps, not MF-based apps

* The GPU can support up to 1920x1080. This filter software could be modified to support 1080p.

MFT0 based filter

Windows 10 offers IHVs and system OEMs the ability to create video processing plug-ins in the form of a Media Foundation Transform (MFT), known as the camera driver MFT. The driver MFT is also referred to as MFT0 to indicate that it's the first MFT to operate in the source reader. We add face special effects for video processing based on Driver MFT Sample. We implement capture and preview pin driver MFT on multi-pin cameras. For more info about creating a driver MFT, see Creating a camera driver MFT.

Notice:

  1. You need to install the camera driver before the following installation. If you install this MFT0 driver before installing the camera driver, the filter cannot be applied.
  2. If your camera has its own proprietary MFT0 filter, this MFT0 filter will replace it. Only one MFT0 will be effective.

Build and Installation:

In the attachment Intel®_Facial_Special_Effects_Filter_mft0_20170810.7z

  1. Install Visual Studio 2015 and Windows 10 SDK (10.0.14393)
  2. Build x64 Release\x64\IntelMft0_x64.dll via SampleMft0.sln
  3. Download and install vc_redist.x64.exe from Visual C++ Redistributable for Visual Studio 2015
  4. Run install_intel_mft0.bat with admin to install MFT0 as step mentioned in Installing and registering the driver MFT
  5. Run UWP (Metro mode) camera app.
  6. Change registry keys under [HKEY_LOCAL_MACHINE\SOFTWARE\Intel\Display\DXVA\Intel Dxva Virtual Camera] or run create_mft0_regkey_local_machine.reg
       [0]: default strength
       [1-100]: face special effect strength
       [101]: disable face special effect
  7. Check the face effects.

Uninstall: regsvr32 /u "C:\Program Files\Intel\Intel MFT0\IntelMft0_x64.dll"

Capabilities for MFT0 based filter:

  1. Only supports cameras with MJPG or YUY2
  2. Color format: MFT0 outputs NV12 (if the camera supports MJPG) or YUY2 (if the camera supports YUY2)
  3. FPS: 60 fps
  4. Max resolution of 1280x720, defined in config.h *
  5. Supports rotation modes of 0, 90, 180, and 270 degrees with cropping and black filler (1280x720 -> 1080x720)
  6. Face special effects for up to 8 human faces.
  7. Only for MF or UWP (Metro) apps on Windows 10 RS1; supports both MF and DShow apps on Windows 10 RS2 with DShow Bridge enabled

* The GPU can support up to 1920x1080. This filter software could be modified to support 1080p.

Try it via Installer

Notice:

  1. You need to install the camera driver before the following installation. If you install this MFT0 driver before installing the camera driver, the filter cannot be applied.
  2. If your camera has its own proprietary MFT0 filter, this MFT0 filter will replace it. Only one MFT0 will be effective.

In the attachment Intel®_Facial_Special_Effects_Filter_Installer_20170427.7z

  1. Run Setup.exe, it will install DirectShow based filter and MFT0 based filter.
  2. Run 32-bit desktop camera app for DirectShow based or UWP (Metro mode) camera app for MFT0 based filter.
  3. Check the face effects.

System Requirements

  • OS: Windows 10 RS1+
  • Platform: Skylake+
  • Intel® Graphics Driver for Windows: 15.46+
  • Apps: Skype (Desktop mode) / PotPlayer (Desktop mode) / Camera (Metro mode)

 

One Door VR: The First Proof of Concept on Un-Tethered VR Using MSI* Backpack PC


Corey Warning and Will Lewis are the co-founders of Rose City Games*, an independent game studio in Portland, Oregon.

Rose City Games was recently awarded a development stipend and equipment budget to create a VR Backpack Early Innovation Project. The challenge was to come up with something that could only be possible with an un-tethered VR setup. In this article, you’ll find documentation about concepting the project, what we learned, and where we hope to take it in the future. Below is the introductory video for the project.


Figure 1. Watch the introductory video of the One Door VR project.

Inspirations Behind Project: One Door

Earlier this year, our team attended the Resident Evil Escape Room in Portland, Oregon. Being huge fans of that franchise, experiencing that world in a totally new medium was really exciting, and it got us thinking about what other experiences could cross over in a similar fashion.

At the time, we were also trying out as many VR experiences as we could get our hands on. When we heard about the opportunity to work on an un-tethered VR experience, we knew there had to be something interesting we could bring to the table.

We’re currently operating out of a co-working space with some friends working on a variety of VR projects. The WILD crew had some experience in merging real space and VR, so I asked Gabe Paez if he remembered any specific challenges he encountered during that project. “Doors” was his response, and I decided to chase after creating a “VR Escape Room” experience, with the idea of moving through doors as the core concept!

Overview

The scope of this project is to create a proof of concept VR application using the MSI* One VR Backpack. We’re attempting to create a unique experience that’s only possible using this hardware, specifically, an un-tethered setup.

Right away, we knew this project would require an installation, and because of this, we’re not considering this product for mass market. This will likely be interesting content for exhibitions such as GDC Alt.Ctrl, Unite*, VR LA, etc.

One Door Game Concept

Players will be in a completely virtual space, interacting with a physical door installation. They will be wearing the MSI One VR Backpack, with a single HTC Vive* controller, and an HTC Vive headset. Each level will contain a simple puzzle or action the player must complete. Once completed, the player will be able to open the door and physically step through to the next level. At that point, they will be presented with a new puzzle or action, and the game will progress in this fashion.


Figure 2. The proof of concept setup for One Door

The player can open the door at any time. However, if a puzzle or action is incomplete, they will see the same level/door on the other side of the installation. We’re considering using an HTC Vive Tracker for the actual door handle, so that we can easily track and calibrate where the player needs to grab.


Figure 3. One Door front view


Figure 4. One Door top view

Installation Specifics

  • The door will need to be very light weight.
  • We’ll need support beams to make sure the wall doesn’t tip over.
    • Sandbags on the bases will be important.
  • We should use brackets or something similar that allows assembling and disassembling the installation quickly, without sacrificing the integrity each time.
  • The HTC Vive lighthouses will need to be set higher than the wall in order to capture the entire play area.
    • We’ll need quality stands, and likely more sandbags.
  • We may need something like bean bag chairs to place around the support beams/lighthouses to ensure people don’t trip into anything.
    • Another consideration is having someone attending to the installation at all times.


Figure 5. One Door field setup inside the HTC* lighthouses

Our Build-Out

  • Mobile, free-standing door with handle and base
  • MSI One VR Backpack and off-site computer for development
    • Additional DisplayPort-to-HDMI cable required
    • Mouse/keyboard/monitor
    • OBS to capture video
  • 2 lighthouses
    • Stands
    • Adjustable grips to point lighthouses at an angle
    • Placed diagonally on each side of the door
  • 1 Vive Tracker
    • Gaffe tape to attach it to the door, ripped at the charging port
    • Extension cables and charging cables run to the tracker for charging downtime
  • 2 Vive controllers
    • We didn’t need them, but showing hand positioning was helpful for recording video
  • iPhone* to capture real-world video


Figure 6. Door with HTC Vive*

This project was very much VR development training for us in many ways. This was our first time working with a Vive, and implementing the additional physical build out for a new interactive experience created a bit of a learning curve. I feel like a majority of our hang-ups were typical of any VR developer, but of course we created some unique challenges for ourselves that we're happy to have experience with now. I would definitely recommend that VR developers thoughtfully explore the topics below and learn from our assumptions and processes before kicking off a project of their own.

Our First Time with HTC Vive*

We've played with the Vive a ton, but this was our first time developing for it. Setting up the general developer environment and Unity* plugins didn't take much time, but we had to think very strategically about how to develop and test more seamlessly past that point. Very commonly, it saved us an immense amount of time to have two people on site at a time: One person tending to Unity, while the other moved controllers and trackers, re-adjusted lighthouses, adjusted room scale, and acted as a second pair of eyes.


Figure 7. One Door VR development and testing

With regard to hardware specifically as well as our project needing to use a physical prop, we went back and forth on many choreographies for how lighthouses were able to track devices, and even had quite a bit of trouble with hooking up a monitor. Since the MSI One VR Backpack has one HDMI output and one DisplayPort input, we had to borrow (and later buy) a DisplayPort-to-HDMI converter to both develop the application and use the Vive headset simultaneously. Luckily, this didn't delay development for too long, and was a better solution than our initial workaround — attaching the HDMI output to an HDMI switcher that we already had, and flipping between our monitor/dev environment and the headset. Continuing with this process for the duration of the project would have been very unrealistic and a huge waste of time.

We were introduced to more new experiences during this project, like being able to remotely work from home and use Unity's Collaborate feature, exploring how awesome it was to experience VR without being tethered, and becoming very familiar with how quickly we’re able to kick off a VR project.

Budget

Almost directly paired with testing new equipment and working with a physical build-out, our budget was another challenge we had to overcome. The recommended list of equipment provided by Intel was not totally covered by the allotted funding, so we had to pick and choose a bare minimum of what we might be able to use in our project, then consider how the leftover budget could cover the hours put in by an experienced developer. Luckily, because of our connections in the local game developer community, we were able to work with one of our friends who had been interested in experimenting on a project like this for some time. Still, if we were to do this project from scratch, we would very likely scope it with a higher budget in mind: at least two more trackers, converter cables, adjustable joints for the tops of the lighthouse stands, and a few other small items would have been part of our minimum requirements to complete this project on a tighter timeline with a more polished product in mind.

Location/Space

From a consumer standpoint, we know that room-scale VR is unrealistic for many, and we still ran into a few issues as we planned for and worked on this project. One of my biggest recommendations to other developers working in room-scale VR would be to buy a tape measure early and make sure you have space solely dedicated to your project for the entirety of its development. We share a co-working space with about 20 other local VR developers, artists, game makers, and web designers, so needing to push our build-out to the side of the room at the end of every dev session added to our overall setup time. It did give us a lot of practice with setup and familiarity with devices, but another interesting revelation was that we never would have been able to do this from any of our homes!

Unique Build-Out

Since our project involved a prop (a full-sized, free-standing door), we had to plan around moving it, storing it, and keeping it from occluding the lighthouses. When we think about taking our project beyond a prototype, many more issues become apparent. Thinking about how this project would likely continue in the future as a tech demo, festival/museum installation, or resume piece, we also had to consider that we would need to show it to more people than ourselves and our direct supporters. With this comes an additional consideration: safety. We definitely cut corners to build a functional prototype very quickly, but with polish and transportation readiness in mind, we would recommend spending more time and resources on creating a safer experience catered to those unfamiliar with VR.

As we prototyped, we were able to remember to pick our feet up in order to not trip, slowly move forward to avoid bashing into an outcropping in the door, and find the door handle without any problem. What we've made serves as an excellent tech demo, but we would definitely take another pass at the door prop before considering it any sort of consumable, public product, or experience. To make transportation easier, we would also build the door differently so that we could disassemble it on the fly.

Moving Forward

We're confident in what we have as a technical demo of how easy, interesting, and liberating it can be to use the MSI One VR Backpack, and we're also very proud of and excited about what we were able to learn and accomplish. So much so that we'd like to continue implementing simple puzzles, art, voiceover, and accessibility features to make it more presentable. After some additional testing and polish, we'd like to shop the prototype around, searching for a sponsor related to content and IP, VR tech, interactive installations, or trade shows so that we can share the project with a wider audience! Intel is a prime candidate for this collaboration, and we'd love to follow up after giving the demo another pass.

Thanks for letting us be a part of this!

Code Sample (Unity)

When using a peripheral as large as a door, the room choreography needs to be spot-on with regard to your lighthouse and tracker setup — particularly the tracker, which we affixed to our door to gauge its orientation at any given time (this mainly allowed us to tell whether the door was closed or open). We made a simple setup script to position the door, door frame, and door stand/stabilizers properly.

The Setup Helper is a simple tool that positions and rotates the door and door frame relative to the Vive Tracker position. Setup Helper runs in Editor mode, allowing it to be updated without having to be in Play mode, but it should be disabled after running the application so that the door can swing independently of the frame in game. Multiple Setup Helpers can be created to position any other geometry that needs to be spaced relative to the door, such as room walls, floors, and room decor, in order to avoid potential visual or collision-related gaps and clipping.

The Setup Helper hierarchy is shown above. The following applies to the areas highlighted in blue, including the tracker (attached to the door) and doorway.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

[ExecuteInEditMode]
public class SetupHelper : MonoBehaviour
{
    public bool setDoorFrameToTracker = false; // toggle in the Inspector to snap the frame to the tracker
    public GameObject doorFrameGo;             // door frame geometry to position
    public Transform trackerTransform;         // Vive Tracker attached to the door
    public bool trackRotation = false;         // also copy the tracker's rotation onto the frame
    public Vector3 doorframeShift;             // offset so the frame sits perfectly on the tracker position

#if UNITY_EDITOR
    // Runs in Edit mode thanks to [ExecuteInEditMode]; disable after setup so the
    // door can swing independently of the frame in game.
    void Update()
    {
        if (setDoorFrameToTracker)
            SetDoorFrameToTracker();
    }

    void SetDoorFrameToTracker()
    {
        doorFrameGo.transform.position = trackerTransform.position + doorframeShift;
        if (trackRotation)
            doorFrameGo.transform.rotation = trackerTransform.parent.rotation;
    }
#endif
}
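
The Setup Helper only positions the geometry; the tracker's orientation is what mainly told us whether the door was open or closed. As an illustrative follow-up (not taken from the project source; the class name DoorStateReader, its fields, and the 5-degree threshold are our own assumptions), here is a minimal sketch of how that check could be read off the tracker, assuming the door's hinge axis is vertical:

using UnityEngine;

// Hypothetical helper (not part of the original project): compares the tracker's
// forward axis against the door frame's forward axis to decide whether the door
// has swung open beyond a small threshold.
public class DoorStateReader : MonoBehaviour
{
    public Transform trackerTransform;    // Vive Tracker attached to the door
    public Transform doorFrameTransform;  // static frame positioned by SetupHelper
    [Range(1f, 45f)]
    public float openAngleThreshold = 5f; // degrees of swing treated as "open" (assumed value)

    public bool IsOpen { get; private set; }
    public float CurrentAngle { get; private set; }

    void Update()
    {
        // Signed angle between the frame's forward axis and the tracker's forward axis,
        // measured around the world up axis (the hinge is assumed to be vertical).
        CurrentAngle = Vector3.SignedAngle(doorFrameTransform.forward,
                                           trackerTransform.forward,
                                           Vector3.up);
        IsOpen = Mathf.Abs(CurrentAngle) > openAngleThreshold;
    }
}

In practice the threshold would need tuning against tracker jitter while the door is at rest, and the frame's forward axis must be aligned with the closed door for the angle to read zero.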

About the Authors

Corey Warning and Will Lewis are the co-founders of Rose City Games, an independent game studio in Portland, Oregon.

How to use the Intel® Advisor Collection Control APIs


Overview

Intel® Advisor collection can be sped up, and the size of the collected samples reduced, by using the Instrumentation and Tracing Technology (ITT) APIs. These ITT APIs have been supported in the Intel® Advisor Survey collection since the product's first release, and starting with Intel® Advisor 2018 you can also use them for the Trip Counts and FLOP collection. This can make the Roofline analysis an option for larger and longer-running applications.

In this article, we will show how to use the collection control APIs to tell Intel® Advisor when to start and stop collecting performance data during the execution of your target application.

Background

Intel® Advisor typically starts collecting data from the moment the analysis is started. As a result, Intel® Advisor may collect data for sections of a large codebase that you are not interested in. With the collection control ITT APIs, you can choose the sections of your source code for which Intel® Advisor should monitor and record performance data.

Usage example: Focus on a specific code section

The first step is to wrap the source code of interest between the resume and pause API calls and then start Intel Advisor in paused mode. When Intel Advisor hits the resume call, it starts collecting performance data, and it stops when it reaches the pause call.

Below is a series of detailed steps with a small code snippet to get you started:

  1. First, your C/C++ application needs to understand the ITT APIs. In your source code, include the "ittnotify.h" header file, located in the include directory where Intel Advisor has been installed. By default, the installation path on Windows is:
    C:\Program Files (x86)\IntelSWTools\Advisor 2018\include
    On Linux, the default path is:
    /opt/intel/advisor_2018/include

    Note: The "ittnotify.h" header file contains all the ITT API templates that you can use for instrumentation.

    Add the include path above to your project so that your compiler knows where to find the header. In Microsoft Visual Studio, for example, navigate to Property Pages > C/C++ > General > Additional Include Directories.

    [Screenshot: Additional Include Directories setting in Visual Studio (intel-advisor-visual-studio-itt-notify)]

  2. Next, link to the ITT library (libittnotify.lib) and recompile your application. In Visual Studio, navigate to the linker settings (Property Pages > Linker > Input > Additional Dependencies) and add the path to the library. By default, on Windows, the path is:
    C:\Program Files (x86)\IntelSWTools\Advisor 2018\lib64\libittnotify.lib
    On Linux, the default installation path is /opt/intel/advisor_2018/. Configure your build scripts to include the path to the library and link against the libittnotify.a library by passing -littnotify to your compiler.
  3. Finally, start Intel Advisor in Paused mode. Look for the Play icon with a Pause symbol, like the one below:

    [Screenshot: Intel Advisor start-paused button (Intel-advisor-start-paused-button)]

    In Intel Advisor, the Survey Analysis and the Trip Counts and FLOP Analysis support the collection control APIs.

Example:

#include "ittnotify.h"

int main(int argc, char* argv[])
{
        // Do initialization work here
        __itt_resume(); //Intel Advisor starts recording performance data
        for (int i=0;i<size;i++)
        {
               do_some_math();
        }
        __itt_pause(); // Intel Advisor stops recording performance data

        for (int i=0;i<size;i++)
        {
               do_some_other_math();
        }

       return 0;
}

In the scenario above, Intel Advisor will report performance data for the loop containing the do_some_math() call but not for the loop containing the do_some_other_math() call. If you draw the Roofline model for that analysis, you will see one dot on the graph, as opposed to the two you would see if you ran Intel Advisor without the collection control APIs.


Troubleshooting Visual Studio Command Prompt Integration Issue


Issue Description

nmake and ifort are not recognized from a command window; however, using Intel Fortran within Visual Studio works perfectly.

Troubleshooting

Follow the checklist below to troubleshoot Visual Studio command-environment issues:

1. Verify whether ifort and nmake are installed correctly:

    For Visual Studio 2017, nmake is installed at:

    C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.10.25017\bin\HostX64\x64\nmake.exe

    Find this by running the commands below on a system where ifort or nmake is set up correctly:

> where nmake
> where ifort

    Also check whether the location is included in the PATH environment variable:

> echo %PATH%

2. If nmake can be found, verify that the Visual Studio setup script runs properly.
    Start a cmd window and run the Visual Studio setup script manually:

> "C:\Program Files (x86)\Microsoft Visual Studio\2017 \Professional\VC\Auxiliary\Build\vcvars64.bat"

    The expected output looks like this:

    [Screenshot: expected vcvars64.bat output (vscmd_setup.png)]

3. If nmake cannot be found, your Visual Studio installation is likely incomplete. Try re-installing Visual Studio; instructions can be found in the articles below:
 
4. Did you get an error like the following in step 2?
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat"
  \Common was unexpected at this time.
If so, debug the setup script by setting the VSCMD_DEBUG environment variable:
> set VSCMD_DEBUG=3
Then run the setup script again and redirect the output to a log file:
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat" > setup.log 2>&1

5. If you got the same error as above, there are reports of this issue in the Visual Studio community:

    The suggested solution is to remove all quotes from the PATH environment variable value.
 
6. If you got a different error, again capture the expected output from a system that runs the script correctly and compare it with your current output. This will help you locate which command in the setup script triggers the error.
    You may also consider reporting the issue directly to the Visual Studio community.

 

Accelerating Media, Video & Computer Vision Processing: Which Tool Do I Use?


Intel has a multitude of awesome software development tools, including ones for innovating and optimizing media applications, immersive video including 360° and virtual reality, graphics, integrating visual understanding, and more. But sometimes, it's hard to figure out just which development tool or tools are best for your particular needs and usages.

Below you'll find a few insights to help you get to the right Intel software tool faster for media and video solutions, so you can focus on the really fun stuff - like building new competitive products and solutions, improving media application performance or video streaming quality for devices from edge to cloud, or even transitioning to more efficient formats like HEVC. 


Intel® Media SDK

Developing for:

  • Intel® Core™ or Intel® Core™ M processors 
  • Select SKUs of Intel® Celeron™, Intel® Pentium® and Intel® Atom® processors with Intel® HD Graphics supporting Intel® Quick Sync Video
  • Client, mobile and embedded devices - desktop or mobile media applications
  • OS - Windows* and Embedded Linux*
  • An Open Source version is also available at Github under the MIT license 

Uses & Needs

  • Fast video playback, encode, processing, media formats conversion or video conferencing
  • Accelerated processing of RAW video or images
  • Screen capture
  • Audio decode & encode support
  • Used with smart cameras across drones, phones, editors/players, network video recorders, and connected cars
  • Supports HEVC, AVC, MPEG-2 and audio codecs

Free Download


Intel® Media Server Studio

Three editions are available:

  • FREE Community
  • Essentials
  • Professional

Developing for:

Format Support: HEVC, AVC, MPEG-2, and MPEG-Audio

Uses & Needs

  • High-density and fast video decode, encode, transcode
  • Optimize performance of Media/GPU pipeline 
  • Enhanced graphics programmability or visual analytics (for use with OpenCL™ applications)
  • Low-level control over encode quality
  • Debug, analysis and performance/quality optimization tools
  • Speed transition to real-time 4K HEVC
  • Need to measure visual quality (Video Quality Caliper)
  • Looking for an enterprise-grade telecine interlace reverser (Premium Telecine Interlace Reverser)
  • Audio codecs
  • Screen capture

 Free Download & Paid Edition Options 


Intel® Collaboration Suite for WebRTC

This Client SDK builds on top of the W3C standard WebRTC APIs to accelerate development of real-time communications (RTC), including broadcast, peer-to-peer, conference mode communications, and online gaming/VR streaming. 

Use with Android*, web (JavaScript*-built), iOS*, and Windows* applications.

Free Download

Intel® SDK for OpenCL™ Applications

Developing for:

General purpose GPU acceleration on select Intel® processors (see technical specifications). OpenCL primarily targets execution units. An increasing number of extensions are being added to Intel® processors to make the benefits of Intel's fixed-function hardware blocks accessible to OpenCL applications.

Free Download


Intel® Computer Vision SDK

Accelerate computer vision solutions:

  • Easily harness the performance of computer vision accelerators from Intel
  • Add your own custom kernels into your workload pipeline
  • Quickly deploy computer vision algorithms with deep-learning support using the included Deep Learning Deployment Toolkit Beta
  • Create OpenVX* workload graphs with the intuitive and easy-to-use Vision Algorithm Designer

Free Download


Altera® Software (now part of Intel) 
Video & Image Processing Suite MegaCore Functions (part of Intel® Quartus® Prime Software Suite IP Catalog)

Developing for:

  • All Altera FPGA families
  • Video and image processing applications, such as video surveillance, broadcast, video conferencing, medical and military imaging and automotive displays

Uses & Needs

  • For design, simulation, verification of hardware bit streams for FPGA devices
  • Optimized building blocks for deinterlacing, color space conversion, alpha blending, scaling, and more

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Intel® Media SDK & Intel® Media Server Studio Historical Release Notes


The release notes for the Intel® Media SDK include important information such as system requirements, what's new, a feature table, and known issues since the previous release. Below is a list of release notes from previous releases of different products, so you can track new features and supported system requirements.

Intel Media SDK

  • 2017 R1 (released Jun. 9, 2017) | Release Notes: Windows | Platform Support: select SKUs of 6th & 7th generation Intel® Core™ processors (codename Skylake & Kaby Lake)

Intel® Media Server Studio

  • 2017 R3 (released Aug. 1, 2017) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 5th & 6th generation Intel® Xeon® & Core™ processors (codename Broadwell & Skylake)
  • 2017 R2 (released Jan. 4, 2017) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 5th & 6th generation Intel® Xeon® & Core™ processors (codename Broadwell & Skylake)
  • 2017 R1 (released Sept. 1, 2016) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 5th & 6th generation Intel® Xeon® & Core™ processors (codename Broadwell & Skylake)
  • Professional 2016 (released Feb. 18, 2016) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 4th & 5th generation Intel® Xeon® & Core™ processors (codename Haswell & Broadwell)
  • 2016 (released Feb. 18, 2016) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 4th & 5th generation Intel® Xeon® & Core™ processors (codename Haswell & Broadwell)
  • 2015 R6 (released July 2, 2015) | Release Notes: Windows*, Linux* | Platform Support: select SKUs of 4th & 5th generation Intel® Xeon® & Core™ processors (codename Haswell & Broadwell)

Intel Media SDK for Embedded Linux*

  • 2017 R1 (released Aug. 25, 2017) | Release Notes: Linux | Platform Support: select SKUs of Intel® Atom™ processors (codename Apollo Lake)


For the latest documents, getting started guide, and release notes, check the Intel Media SDK getting started webpage. If you have any issues, connect with us at the Intel Media forum.

Dynamically linked IMSL* Fortran Numerical Library can't work with Intel® Parallel Studio XE 2018 for Windows Update 2


Version : Intel® Parallel Studio XE 2018 for Windows Update 2, Intel® Math Kernel Library (Intel® MKL) 2018 Update 2

Operating System : Windows*

Architecture: Intel 64 Only


Problem Description :

An application built with Intel® Parallel Studio XE 2018 for Windows Update 2 and dynamically linked against the IMSL* Fortran Numerical Library will fail to start with an error message like:


"The procedure entry point mkl_lapack_ao_zgeqrf could not be located in the dynamic link library C:\Program Files(x86)\VNI\imsl\fnl701\Intel64\lib\imslmkl_dll.dll. "


The cause of the error is that the removal of symbols in Intel MKL 2018 Update 2 breaks backward compatibility with binaries that were dynamically linked against an older version of Intel MKL, such as the IMSL* Fortran Numerical Library.

Resolution Status:

It will be fixed in a future product update. When the fix is available, this article will be updated with the information.

There are three workarounds available to resolve the error:

  1. Link IMSL Fortran Numerical Library statically
  2. Link IMSL Fortran Numerical Library without making use of Intel® MKL, which may have some performance impact
  3. Use an older version of the Intel MKL DLLs, such as Intel MKL 2018 Update 1, by adding their location to the PATH environment variable at runtime

Intel® Math Kernel Library (Intel® MKL) 2019 System Requirements


Operating System Requirements

The Intel MKL 2019 release supports the IA-32 and Intel® 64 architectures. For a complete explanation of these architecture names, please read the following article:

Intel Architecture Platform Terminology for Development Tools

The lists below pertain only to the system requirements necessary to support developing applications with Intel MKL. Please review the hardware and software system requirements for your compiler (gcc*, Microsoft* Visual Studio*, or Intel® Compiler Pro) in the documentation provided with that product to determine the minimum development system requirements necessary to support your compiler.

Supported operating systems: 

  • Windows 10 (IA-32 / Intel® 64)
  • Windows 8.1* (IA-32 / Intel® 64)
  • Windows 7* SP1 (IA-32 / Intel® 64)
  • Windows HPC Server 2016 (Intel® 64)
  • Windows HPC Server 2012 (Intel® 64)
  • Windows HPC Server 2008 R2 (Intel® 64) 
  • Red Hat* Enterprise Linux* 6 (IA-32 / Intel® 64)
  • Red Hat* Enterprise Linux* 7 (IA-32 / Intel® 64)
  • Red Hat* Enterprise Linux* 7.5 (IA-32 / Intel® 64)
  • Red Hat Fedora* core 28 (IA-32 / Intel® 64)
  • Red Hat Fedora* core 27 (IA-32 / Intel® 64)
  • SUSE Linux Enterprise Server* 11 
  • SUSE Linux Enterprise Server* 12
  • SUSE Linux Enterprise Server* 15
  • openSUSE* 13.2
  • CentOS 7.1
  • CentOS 7.2
  • Debian* 8 (IA-32 / Intel® 64)
  • Debian* 9 (IA-32 / Intel® 64)
  • Ubuntu* 16.04 LTS (IA-32/Intel® 64)
  • Ubuntu* 17.10 LTS (IA-32/Intel® 64)
  • Ubuntu* 18.04 LTS (IA-32/Intel® 64)
  • WindRiver Linux 8
  • WindRiver Linux 9
  • WindRiver Linux 10
  • Yocto 2.3
  • Yocto 2.4
  • Yocto 2.5
  • Yocto 2.6
  • macOS* 10.13 (Xcode 6.x) and macOS* 10.14 (Xcode 6.x) (Intel® 64)

         Note: Intel® MKL is expected to work on many more Linux distributions as well. Let us know if you have trouble with the distribution you use.

Supported C/C++ and Fortran compilers for Windows*:

  • Intel® Fortran Composer XE 2019 for Windows* OS
  • Intel® Fortran Composer XE 2018 for Windows* OS
  • Intel® Fortran Composer XE 2017 for Windows* OS
  • Intel® Visual Fortran Compiler 19.0 for Windows* OS
  • Intel® Visual Fortran Compiler 18.0 for Windows* OS
  • Intel® Visual Fortran Compiler 17.0 for Windows* OS
  • Intel® C++ Composer XE 2019 for Windows* OS
  • Intel® C++ Composer XE 2018 for Windows* OS
  • Intel® C++ Composer XE 2017 for Windows* OS
  • Intel® C++ Compiler 19.0 for Windows* OS
  • Intel® C++ Compiler 18.0 for Windows* OS
  • Intel® C++ Compiler 17.0 for Windows* OS
  • Microsoft Visual Studio* 2017 - help file and environment integration
  • Microsoft Visual Studio* 2015 - help file and environment integration
  • Microsoft Visual Studio* 2013 - help file and environment integration

Supported C/C++ and Fortran compilers for Linux*:

  • Intel® Fortran Composer XE 2019 for Linux* OS
  • Intel® Fortran Composer XE 2018 for Linux* OS
  • Intel® Fortran Composer XE 2017 for Linux* OS
  • Intel® Fortran Compiler 19.0 for Linux* OS
  • Intel® Fortran Compiler 18.0 for Linux* OS
  • Intel® Fortran Compiler 17.0 for Linux* OS
  • Intel® C++ Composer XE 2019 for Linux* OS
  • Intel® C++ Composer XE 2018 for Linux* OS
  • Intel® C++ Composer XE 2017 for Linux* OS
  • Intel® C++ Compiler 19.0 for Linux* OS
  • Intel® C++ Compiler 18.0 for Linux* OS
  • Intel® C++ Compiler 17.0 for Linux* OS
  • GNU Compiler Collection 4.4 and later
  • PGI* Compiler version 2018
  • PGI* Compiler version 2017

Note: Using the latest version of the Intel® Manycore Platform Software Stack (Intel® MPSS) is recommended on the Intel® MIC Architecture. It is available from the Intel® Software Development Products Registration Center at http://registrationcenter.intel.com as part of your Intel® Parallel Studio XE for Linux* registration.

Supported C/C++ and Fortran compilers for OS X*:

  • Intel® Fortran Compiler 19.0 for macOS*
  • Intel® Fortran Compiler 18.0 for macOS*
  • Intel® Fortran Compiler 17.0 for macOS*
  • Intel® C++ Compiler 19.0 for macOS*
  • Intel® C++ Compiler 18.0 for macOS*
  • Intel® C++ Compiler 17.0 for macOS*
  • CLANG/LLVM Compiler 9.0
  • CLANG/LLVM Compiler 10.0

MPI implementations that Intel® MKL for Windows* OS has been validated against:

  • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
  • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MS MPI, CCE or HPC 2012 on Intel® 64 (http://www.microsoft.com/downloads)

MPI implementations that Intel® MKL for Linux* OS has been validated against:

  • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
  • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 3.1  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • Open MPI 1.8.x (Intel® 64) (http://www.open-mpi.org)

Note: Usage of MPI and linking instructions can be found in the Intel Math Kernel Library Developer Guide

Other tools supported for use with example source code:

  • uBLAS examples: Boost C++ library, version 1.x.x
  • JAVA examples: J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc.

Note: Parts of Intel® MKL have FORTRAN interfaces and data structures, while other parts have C interfaces and C data structures. The Intel Math Kernel Library Developer Guide  contains advice on how to link to Intel® MKL with different compilers and from different programming languages.

Deprecation Notices:

  • Dropped support for all MPI IA-32 implementations
  • Red Hat Enterprise Linux* 5.0 support is dropped
  • Windows XP* is no longer supported; support for Windows XP has been removed
  • Windows Server 2003* and Windows Vista* are not supported

 
