TensorFlow_Best_Practices_Intel_Xeon_AI-HPC_v1.0_Q3_2018
如果无法正常显示,请先停止浏览器的去广告插件。
1. Best Practices for Scaling Deep Learning Training and
Inference with TensorFlow* On Intel® Xeon® Processor
Based HPC Infrastructures
_____________________________________
Version: 1.0
Date of Issue: August 2018
Prepared By: Aishwarya Bhandare¶, Deepthi Karkada¶, Kushal Datta¶, Anupama Kurpad§,
Vamsi Sripathi¶, Sun Choi¶, Vikram Saletore¶,
§
Connectivity Group & ¶AI Products Group
Data Center Group, Intel Corp
Customer Solutions Technical Enabling/AIPG
2. Notices and Disclaimers:
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the
referenced web site and confirm whether referenced data are accurate.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the
performance of that product when combined with other products. For more complete information visit:
http://www.intel.com/performance.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations
that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or
configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your
purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service
activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your
system manufacturer or retailer or learn more at intel.com.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness
for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or
usage in trade.
Intel, the Intel logo, Xeon, Xeon Phi and Nervana are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
© 2018 Intel Corporation. All rights reserved.
3. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
Table of Contents
1.Best Practices for TensorFlow Over Intel® Xeon® ...............................................4
1.1
1.2
1.3
1.4
1.5
TensorFlow Setup and Installation .................................................................... 4
Install MPI, if not already installed ................................................................... 4
Install Uber’s Horovod Library ........................................................................ 5
Installing tf_cnn_benchmarks ........................................................................... 5
Preparing the ImageNet2012-1K Dataset........................................................... 5
Steps to download and prepare Dataset ....................................................................... 6
Already have the ImageNet-1K Dataset ....................................................................... 6
Dataset Striped on Lustre ............................................................................................ 6
1.5.1
1.5.2
1.5.3
1.6
Example: Running ResNet-50 with tf_cnn_benchmarks ..................................... 7
tf_cnn_benchmarks: ResNet-50 ................................................................................... 7
Training on Single-Node with Multiple Workers ......................................................... 8
Using OpenMPI........................................................................................................... 9
Using Intel® MPI ...................................................................................................... 10
Using MVAPICH2 .................................................................................................... 10
Training on Multiple Nodes with Multiple Workers .................................................. 11
Evaluating the Accuracy of the Trained Model .......................................................... 11
Multi-Stream Inference on the Trained Model .......................................................... 12
Running Inference on the Trained Model .................................................................. 13
1.6.1
1.6.2
1.6.3
1.6.4
1.6.5
1.6.6
1.6.7
1.6.8
1.6.9
2.Using Singularity ...............................................................................................15
2.1
2.2
2.3
Installing Singularity ...................................................................................... 15
Building Singularity Image ............................................................................. 15
Running TensorFlow With Singularity ............................................................ 15
3.Using NFS and SLURM .....................................................................................17
3.1
3.2
Using NFS Mounted File System ..................................................................... 17
Using SLURM Scheduler ................................................................................ 17
4.TensorFlow Build Instructions...........................................................................18
4.1
4.2
Building TensorFlow ...................................................................................... 18
Install TensorFlow using script ....................................................................... 19
5.Sample scripts ...................................................................................................20
5.1
5.2
5.3
TensorFlow build script .................................................................................. 20
SLURM scripts .............................................................................................. 22
Singularity scripts .......................................................................................... 24
Install script .............................................................................................................. 24
Recipe file for tensorflow wheel downloaded from a URL .......................................... 24
Recipe file for TensorFlow wheel on local file system ................................................. 25
Singularity run-script ................................................................................................ 27
5.3.1
5.3.2
5.3.3
5.3.4
6.Troubleshooting ................................................................................................28
6.1
TensorFlow Import Issues .............................................................................. 28
Importing TensorFlow .............................................................................................. 28
Run ldd to find the dynamically linked libraries ........................................................ 28
Check by running: ..................................................................................................... 29
Another Common Error when Importing TensorFlow .............................................. 30
Verify that TensorFlow is Using right the version of gcc ............................................ 30
Run ldd again after adding the correct version of gcc ................................................ 31
6.1.1
6.1.2
6.1.3
6.1.4
6.1.5
6.1.6
6.2
6.3
6.4
TensorFlow Build Issues ................................................................................. 32
Horovod Install Issues .................................................................................... 33
Verify Intel® Omni-Path Architecture (OPA) ................................................. 33
6.4.1
Version 0.94
Verify that OPA is Up and Running .......................................................................... 33
Page 2
9/15/2018
4. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
6.4.2
6.4.3
6.4.4
Version 0.94
Verify Install (Example of a good install) ................................................................... 33
Verify OPA Fabric Performance ............................................................................... 34
OPA Install Issues ..................................................................................................... 34
Page 3
9/15/2018
5. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1. BEST PRACTICES FOR TENSORFLOW OVER INTEL® XEON®
This document describes the setup, installation and procedure to run distributed Deep Learning training and inference
using TensorFlow with Uber Horovod library on Intel® Xeon® based infrastructure. The steps required to run the
benchmark can vary depending on the user’s environment. In case of a large cluster with the order of hundreds or
thousands of nodes, we provide sample scripts that use the SLURM scheduler. Alternatively, we also list out steps
for smaller systems that may not have such a scheduler configured. Furthermore, we also provide scripts to build a
singularity image for ease of deployment.
This document has been tested with TensorFlow 1.9.0, Horovod 0.13.11, OpenMPI version 1.10.7, Python 2.7 on
CentOS 7 and 7.5.
1.1 TensorFlow Setup and Installation
TensorFlow can either be built from source or installed from a pre-built wheel file. The steps to build TensorFlow
from source are clearly listed on the official TensorFlow documentation here:
https://www.tensorflow.org/install/install_sources .
Also refer to the Intel® Best Practices documentation at the end of this document to use the recommended build
flags to get the most optimized CPU build. We have consolidated these steps into a simple script
build_tensorflow_script, which can be run as follows:
./build_tf.sh <path/to/install/directory>
We recommend that the install directory be created on a shared filesystem so that the TensorFlow build is accessible
by all nodes in a multi-node training environment.
At the prompt, choose “y” to build from source. The script creates a virtual environment and installs TensorFlow in
this environment, as opposed to doing a system-wide install. The virtual environment is created in
<path/to/install/directory>/virtenv/ , and can be activated as follows:
source <path/to/install/directory>/virtenv/bin/activate
To deactivate the virtual environment, run:
deactivate
Alternatively, TensorFlow can be installed from a pre-built wheel available here:
https://software.intel.com/en-us/articles/intel-optimized-tensorflow-installation-guide
The script build_tf.sh can be used to do this by selecting ‘No’ when prompted about building TensorFlow from source.
Note: See section on Troubleshooting for some common problems encountered during the build process.
1.2 Install MPI, if not already installed
For OpenMPI: Below are steps to install OpenMPI. If sudo/root access is available:
sudo yum -y install openmpi openmpi-devel
module avail && module add mpi/openmpi-x86_64
Note: If you get an error ‘module:command not found’ , use the following command:
source /etc/profile.d/modules.sh
This will install to standard paths, so there is no need to change environment variables.
This command installs OpenMPI 1.10.7 as tested on Centos 7.5. If neither sudo/root access is available, build
OpenMPI from source. Follow the steps on the OpenMPI page to install and configure OpenMPI. After installation,
update PATH and LD_LIBRARY_PATH accordingly to point to this installation.
For MVAPICH2: refer to the user guide in the official documentation here: http://mvapich.cse.ohio-
state.edu/userguide/
For Intel-MPI: refer to https://software.intel.com/en-us/intel-mpi-library. Add Intel-MPI related variables to PATH
by running:
source <intel_mpi_install_dir>/bin/mpivars.sh intel64
Version 0.94
Page 4
9/15/2018
6. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1.3 Install Uber’s Horovod Library
Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch, which makes distributed Deep
Learning fast and easy to use.
Horovod is available as a standalone python package. The installer looks for an existing installation of MPI. Note: In
an environment where there are potentially multiple MPI libraries, is important to ensure that only the required MPI
libraries are enabled in Horovod environment. This can be done either with the mpi-selector utility, or by manually
updating the PATH and LD_LIBRARY_PATH variables to point to the required library.
Horovod also requires an existing installation of TensorFlow. In step 1, we show how to install TensorFlow in a
Python virtual environment. Before installing Horovod, ensure that the virtual environment is activated. Then install
Horovod using the pip installer:
pip install --no-cache-dir horovod
Verify the installation. If the following steps work without errors, we are ready to run the benchmark.
python
Python 2.7.13 (default, Aug 1 2017, 16:43:54)
[GCC Intel C++ gcc 5.4 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> import horovod.tensorflow as hvd
1.4 Installing tf_cnn_benchmarks
TensorFlow maintains a repository for performance benchmarking of few standard Convolutional Neural Network
topologies at https://github.com/tensorflow/benchmarks.git .These benchmarks can be obtained by running:
git clone https://github.com/tensorflow/benchmarks.git
1.5 Preparing the ImageNet2012-1K Dataset
The dataset needed for performance evaluation is the ImageNet2012-1K dataset available here http:/www.image-
net.org/challenges/LSVRC/2012/. The dataset can be obtained by first registering at the ImageNet website
(http://image-net.org/download-images) and obtaining the username and access key. The raw images must be
converted to the TensorFlow’s TFRecord format which stores the raw image data along with metadata like class info,
height, width, and so on.
SURFSara, one of our collaborators, identified a key requirement when preprocessing the dataset. In order to
achieve State-Of-The-Art (SOTA) convergence after training, shuffling the input classes is very important when
creating TFRecords from raw images. There is an existing script located in the tensorflow/tpu repository that can be
used to do this: https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py
The script contains several TPU references that must be removed when running on Intel® Xeon®. A version of the
original script with the required modifications for CPU and horovod and an accompanying Best Practices document
is
provided
by
SURFSara,
which
can
be
downloaded
from:
https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs
This script can be used to download the dataset as well as create the TFRecords. Please note that there should be
~400GB disc space available before starting the process, and it might take several hours to complete depending on
your network connectivity.
Version 0.94
Page 5
9/15/2018
7. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1.5.1 Steps to download and prepare Dataset
Follow the steps below for preparing the ImageNet-1K Dataset using scripts from SURFsara.
1. Untar the <tar_file>.tar.gz downloaded from SURFsara drive above with the command
tar –xzvf <tar_file>.tar.gz
cd tpu/tools/datasets
2. Download the dataset along with converting it to tf_records format
python imagenet_to_gcs.py --local_scratch_dir <path-to-resulting tfrecords> \
--project="TEST_PROJECT" \
--imagenet_username=<username> \
--imagenet_access_key=<access_key>
Where “imagenet_username” and “imagenet_access_key” are to be obtained by registering on the Imagenet
website as mentioned above.
The script does the following to download the data:
Training images ILSVRC2012_img_train.tar are downloaded from
http://www.image-net.org/challenges/LSVRC/2012/
This tarball contains multiple tarballs in it. The script then un-tars each one of them and saves it in the
form: Training images: train/n03062245/n03062245_4620.JPEG
NOTE: in the training dataset, the images are grouped by the labels, meaning that all images in the
folder train/n03062245 have the label n03062245.
The synset_labels (validation labels) are downloaded from:
https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/imagen
et_2012_validation_synset_labels.txt
Then, the validation images ILSVRC2012_img_val.tar are downloaded from
http://www.image-net.org/challenges/LSVRC/2012/
1.5.2 Already have the ImageNet-1K Dataset
If you have already downloaded the dataset, you can pass the path to the dataset in –raw_data_dir:
python imagenet_to_gcs.py --raw_data_dir <path-to-rawimage-dir> --
local_scratch_dir <path-to-resulting-tfrecords>
If raw data directory is provided, it should be in the format:
o Training images: train/n03062245/n03062245_4620.JPEG
o Validation Images: validation/ILSVRC2012_val_00000001.JPEG
o Validation Labels: synset_labels.txt
Note: If you have a pre-downloaded dataset, please make sure that the folder <path-to-rawimage-dir> has the directory
structure specified above. Please make sure the validation labels and validation images are from the same sources as
indicated earlier. A mismatch can give lower validation accuracies.
The TF_Records can either be then copied on to the local drives of all the nodes, or a high-speed shared file system
like Lustre can be used.
1.5.3 Dataset Striped on Lustre
While running multi-node experiments on large number of nodes, striping the dataset can give significance
performance gains in terms of scaling efficiency. We have found that it can improve scaling efficiency for a 32 node
run from ~40% to ~90%, for a stripe count of 64 and stripe size of 32M for ImageNet data. So, conclusion is that
Lustre striping makes a big difference, allowing us to have good scaling efficiency at 64-128 nodes, and possibly
beyond.
Version 0.94
Page 6
9/15/2018
8. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1.6 Example: Running ResNet-50 with tf_cnn_benchmarks
In this section, we describe the command line parameters used for both single and multi-node Training and Inference
with ResNet-50 using tf_cnn_benchmarks scripts.
1.6.1 tf_cnn_benchmarks: ResNet-50
Let us first look at the common arguments to tf_cnn_benchmarks for ResNet-50 model:
COMMON_ARGS="\
--batch_size=64 \
--model=resnet50 \
--num_inter_threads=2 \
--display_every=5 \
--data_format=NCHW \
--optimizer=momentum \
--device=cpu \
--mkl=TRUE \
--variable_update=horovod \
--horovod_device=cpu \
--local_parameter_device=cpu \
--kmp_blocktime=1"
The explanation of the values used are:
--batch_size=64; Number of images/worker to train in parallel
--model=resnet50. This is an ad-hoc pre-determined tag used by tf_cnn_benchmarks to define/use the ResNet-50
v1 model. The other preset model names are available in tf_cnn_benchmarks/model/model_config.py script.
--num_inter_threads=2 or 3; this sets the number of ops that can be executed in parallel.
--display_every=5 means after how many batches will the training loss will be printed in stdout. We have used 5,
but any value between 5 & 20 gives useful information on the progress of the training job.
--data_format=NCHW. NCHW format means the dimensions of the tensor containing the input images are in the
order of BATCHES, CHANNELS, HEIGHT and WIDTH. In other words, “channels first” as it appears before
height and width. Other possible value is NHWC where channels appear last.
--optimizer=momentum is the gradient descent optimizer used in training. In place of Momentum optimizer, you
can use adagrad, sgd (stochastic gradient descent), rmsprop, and others.
--device=cpu signifies CPU device to execute the operations
--mkl=True enables the use of Intel® Math Kernel Library® for optimized kernels for blocked matrix
multiplication, vector-matric multiplication and other operations
--variable_update=horovod; Use Horovod for gradient synchronization between the training workers.
--horovod_device=cpu; Use CPU to execute the gradient synchronization
--kmp_blocktime=0 or 1; optimal value may vary with topology. This is the time in milliseconds that a thread
waits before sleeping after completion of execution.
Version 0.94
Page 7
9/15/2018
9. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
For training the model to convergence, following additional arguments will need to be included in the COMMON_ARGS
for saving check-point and summaries in the appropriate path to directory.
--train_dir: directory where model checkpoints and graph are saved
--save_model_secs: time in seconds, recurrently save model after every given duration.
--init_learning_rate: initial learning rate
--num_epochs_per_decay: learning rate is decayed after these many epochs
--num_learning_rate_warmup_epochs: Run warmup for these many epochs
--learning_rate_decay_factor: learning rate decayed by this factor
--summary_verbosity=1: verbosity of summary statistics
--save_summaries_steps: save summaries after every given steps
1.6.2 Training on Single-Node with Multiple Workers
The key idea here is to use multiple training instances or workers even on a single CPU and divide the cores uniformly
across them. In addition to core-pin the workers also use NUMA-aware core affinity and data placement to exploit
local memory channels of the sockets. The detailed methodology and analysis are presented at
https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-
phi
One example is showcased below. In this scenario, we are running training ResNet-50 on 2S Intel® Xeon® CPU.
Platform Configuration: 2S Intel® Xeon® Gold 6148 CPU @ 2.40GHz, 20 cores/CPU, HT enabled, turbo disabled,
scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release
7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series.
For a single-node using Horovod, a ResNet-50 training job with 4 TensorFlow worker instances per node can be
launched as described below. Note: Several parameters including OMP_NUM_THREADS, affinity domains, etc.,
need to be set according to the number of physical cores available. To know which CPUs to bind the processes to, run
“numactl –H” and observe how the CPUs are mapped to sockets and nodes.
The MPI tasks partitioning on the sockets of the node in this case is shown in figure 1.
½-Skt-0 ½-Skt-1
½-Skt-0 ½-Skt-1
Figure 1. MPI tasks Partitioning across Dual-Socket Intel® Xeon® Platform
Version 0.94
Page 8
9/15/2018
10. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
The MPI task Layout is shown in figure2 below:
Compute Node 0
Socket 0
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 Core 8 Core 9 core 10 Core 11 Core 12 Core 13 Core 14 Core 15 Core 16 Core 17 Core 18 Core 19
Task 0 Task 0 Task 0 Task 0 Task 0 Task 0 Task 0 Task 0 Task 0 Task 0 Task 1 Task 1 Task 1 Task 1 Task 1 Task 1 Task 1 Task 1 Task 1 Task 1
Socket 1
Core 20 Core 21 Core 22 Core 23 Core 24 Core 25 Core 26 Core 27 Core 28 Core 29 Core 30 Core 31 Core 32 Core 33 Core 34 Core 35 Core 36 Core 37 Core 38 Core 39
Task 2 Task 2 Task 2 Task 2 Task 2 Task 2 Task 2 Task 2 Task 2 Task 2 Task 3 Task 3 Task 3 Task 3 Task 3 Task 3 Task 3 Task 3 Task 3 Task 3
Figure 2. MPI task layout on host0
1.6.3 Using OpenMPI
HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun -np 4 \
--map-by ppr:2:socket:pe=10 \
-H localhost \
--report-bindings \
--oversubscribe \
-x HOROVOD_FUSION_THRESHOLD \
-x OMP_NUM_THREADS=10 \
python /path/to/tf_cnn_benchmarks.py
$COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS \
--data_dir=/path/to/imagenet-1k-TFRecords \
--data_name=<dataset_name>
In the above command line, use localhost to run the MPI job locally. Otherwise, use the target hostname on which
you wish to run the workload. $COMMON_ARGS contains the arguments to the benchmark script. Rest of the
arguments and their recommended values are:
HOROVOD_FUSION_THRESHOLD=user defined size in bytes, default is 67108864(64Mbytes). This sets the
size of the fusion buffer for Horovod’s Tensor Fusion, which controls how tensors are batched to do allreduce.
For more information please see: https://github.com/uber/horovod/blob/master/docs/tensor-fusion.md
--np 4: runs 4 copies of the program on the given node(s).
--map-by ppr:2:socket:pe=10: starts two processes per socket and binds each of them to 10 processing elements
(pe), where “ppr” here stands for processes per resource, and “pe” should correspond to the number of physical
cores per socket. The number for “pe” should be the same as num_intra_threads as described later.
-H localhost: run the copies of the program on localhost
--report-bindings: flags provides a verbose output of the mapping of cores to processes.
--oversubscribe: is necessary to run multiple processes on a node.
OMP_NUM_THREADS specifies the number of threads to be used by the OpenMP (shared memory) library
used by Eigen and Intel® MKL. This should be set to number of physical cores (same as num_intra_threads).
python /path/to/tf_cnn_benchmarks.py $COMMON_ARGS: python program to run with CLI arguments
--num_intra_threads=set to (number of physical cores per node)/(number of workers per node). The idea is to
equally divide cores among the workers. This sets the number of threads an op can use to parallelize execution,
used by Eigen and Intel® Math Kernel Library®.
--data_dir: path to dataset TFRecords
--data_name: dataset name
Version 0.94
Page 9
9/15/2018
11. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1.6.4 Using Intel® MPI
mpiexec \
--machinefile <hostfile> \
-genv I_MPI_DEBUG 4 \
-np 4 \
--ppn 4 \
--genv OMP_NUM_THREADS $OMP_NUM_THREADS \
--genv I_MPI_PIN_DOMAIN 20:compact \
--genv HOROVOD_FUSION_THRESHOLD 134217728 \
python /path/to/tf_cnn_benchmarks.py
$COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS \
--data_dir=/path/to/imagenet-1k-TFRecords \
--data_name=<dataset_name>
Where hostfile is the file containing the hostname.
I_MPI_DEBUG=4 prints out the debug information about process pinning. Please refer
https://software.intel.com/en-us/mpi-developer-reference-linux-other-environment-variables for more details.
I_MPI_PIN_DOMAIN “20:compact” maps each process to 20 logical processors located as close to each other
in terms of common resources. This should be set according to the number of available physical cores on the
system. Please refer https://software.intel.com/en-us/mpi-developer-reference-linux-interoperability-with-
openmp-api for more details.
1.6.5 Using MVAPICH2
/usr/mpi/gcc/mvapich2-2.2-hfi/bin/mpirun_rsh \
--export -np 4 \
--hostfile <hostfile> \
MV2_SHOW_CPU_BINDING=1 \
OMP_NUM_THREADS=10 \
MV2_ENABLE_AFFINITY=1 \
MV2_CPU_MAPPING=0-9,40-49:10-19,50-59:20-29,60-69:30-39,70-79 \
HOROVOD_FUSION_THRESHOLD 134217728 \
python /path/to/tf_cnn_benchmarks.py
$COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS \
--data_dir=/path/to/imagenet-1k-TFRecords \
--data_name=<dataset_name>
Where hostfile is the file containing the hostname.
MV2_SHOW_CPU_BINDING=1 displays the CPU bindings
MV2_ENABLE_AFFINITY=1 and MV2_CPU_MAPPING are used to map each process to specific cores. This
ensures that there is no thread migration to different sockets. To know which CPUs to bind the processes to, run
“numactl –H” and observe how the CPUs are mapped to sockets and nodes.
Version 0.94
Page 10
9/15/2018
12. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
1.6.6 Training on Multiple Nodes with Multiple Workers
To scale to multiple nodes, we spawn multiple workers per node and utilize Horovod over MPI to synchronize
gradients (as shown in Section 3.2). The MPI command looks different now as it must include all the host names.
Example command line parameters are shown to train ResNet-50 model on multiple 2S Intel® Xeon® processors
saving model periodically in a model check-point directory.
Cluster Configuration: The 2x Intel® 2S Xeon® Gold 6148 CPU platforms are connected with Intel® Omni-Path
Architecture Host Fabric, Intel® OPA Interface Driver version 10.4.2.0.7. SSD: Intel® SSD DC S3700 Series.
For a training job on 2 nodes with 4 TensorFlow workers per node using OpenMPI is launched as follows:
HOROVOD_FUSION_THRESHOLD=134217728 \
mpirun --np 8 \
--hostfile <hostfile> \
--map-by ppr:2:socket:pe=10 \
--report-bindings \
--oversubscribe \
-x LD_LIBRARY_PATH \
-x HOROVOD_FUSION_THRESHOLD
-x OMP_NUM_THREADS=10 \
python /path/to/tf_cnn_benchmarks.py
$COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS \
--data_dir=/path/to/imagenet-1k-TFRecords \
--data_name=<dataset_name> \
--train_dir=/path/to/ model-checkpoint-dir \
--save_model_secs=<time-in-seconds-to-save-models-periodically>
It is important to ensure that mpirun binary, path to the training dataset and path to the tf_cnn_benchmarks are
consistent in all the nodes and is in the PATH environment variables.
1.6.7 Evaluating the Accuracy of the Trained Model
Inference is the process to evaluate the classification or (prediction) score of the trained model on a validation dataset
(images in case of ResNet-50). Due to artifacts of the model restore method in TensorFlow, the best practice is to
repeat all the command line parameters used during training for inference with the evaluation flag set to True. Hence,
we are going to use the horovod variable update command line parameters similar to training although they are not
relevant for inference. In our example for ResNet-50 model, evaluation is run with and rest of the training parameters
as shown below.
Version 0.94
Page 11
9/15/2018
13. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
Platform Configuration: 2S Intel® Xeon® Gold 6148 CPU @ 2.40GHz, 20 cores/CPU, HT enabled, turbo disabled,
scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release
7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series.
mpirun --np 8 \
--hostfile <hostfile>
\
--map-by ppr:2:socket:pe=10 \
--report-bindings \
--oversubscribe \
-x LD_LIBRARY_PATH \
-x HOROVOD_FUSION_THRESHOLD
-x OMP_NUM_THREADS=10 \
python /path/to/tf_cnn_benchmarks.py
$COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS \
--data_dir=/path/to/imagenet-1k-validation-TFRecords \
--train_dir=/path/to/model-checkpoint-dir \
--num_epochs=1 \
--data_name=<dataset_name> \
--eval=True
Ensure to use appropriate executable according to the flavor of MPI you are using. Note the path to the data directory
must point to validation dataset files, not training files. Also, the path to the checkpoint directory is passed as the
train_dir. The required files in the checkpoint directory are:
o checkpoint – contains the latest checkpoint number
o graph.pbtxt – the text version of the protocol buffer graph definition. This file contains the operation
names and its inputs and outputs
o model.ckpt – Model checkpoint
o model.ckpt.index – Index of the weights and tensors
o model.ckpt.data – Weights of the model checkpoint
The output of running InageNet2012-1K validation dataset (50K Images) for 1 epoch i.e. ONCE over the entire
validation records for ResNet-50 trained model is shown below:
Accuracy @ 1 = 0.7513 Accuracy @ 5 = 0.9234
1.6.8 Multi-Stream Inference on the Trained Model
We create multiple independent deep learning inference framework instances, and set affinity for each instance to a
partitioned set of cores and memory locality on single or multiple socket systems. Figure 2 shows an example of 8
framework instances, each concurrently processing a separate stream of input data on affinitized threads and memory
locality. Depending on the inference batch size and system memory capacity, one could have even larger number of
frameworks and streams, each mapped to different cores.
In this scenario, we have K workers per node. The performance is measured by the total throughput in images/sec per
node with K streams of input each at a given batch size BSize and processed by the K workers. The total number of
batches per node on K workers for inference would then be equal to K*BSize.
Version 0.94
Page 12
9/15/2018
14. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
Inference Stream-0 BS=X 1/4 Skt-0 1/4 Skt-1 Inference Stream-4 BS=X
Inference Stream-1 BS=X 1/4 Skt-0 1/4 Skt-1 Inference Stream-5 BS=X
1/4 Skt-0 1/4 Skt-1 1/4 Skt-0 1/4 Skt-1
Inference Stream-2 BS=X
Inference Stream-3 BS=X
Inference Stream-6 BS=X
Inference Stream-7 BS=X
Figure 2. Sub-socket Partitioning across Dual-Socket Intel® Xeon® Platforms for Multiple Inference Streams
1.6.9 Running Inference on the Trained Model
To run Inference on the trained model, we use the same command that was used for evaluating the accuracy with
following modifications. data_dir would need to be pointed to the inference dataset and train_dir needs to point to
the directory containing the checkpoints for inference model.
The commands below shows how one can run 8 Inference streams on 2S Intel® Xeon® Gold 6148 platform with 20
cores/CPU, each stream affinitized to 5 cores. Note the use of numactl and KMP_AFFINITY for improved
performance.
TF_CNN_BMARKS=/path/to/tf_cnn_benchmarks/
DATA_DIR=/path/to/imagenet-1k-validation-TFRecords/
TRAIN_DIR=/path/to/model-checkpoint-dir/
export OMP_NUM_THREADS=5 # 20C/skt * 2 Skt / Num_Streams
export KMP_BLOCKTIME=0
NUMA_DOMAIN0=0
NUMA_DOMAIN1=1
KMP_AFFINITY='granularity=thread,proclist=[0-4,40-44],explicit,verbose' \
taskset -c 0-4,40-44 numactl -m $NUMA_DOMAIN0 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[5-9,45-49],explicit,verbose' \
taskset -c 5-9,45-49 numactl -m $NUMA_DOMAIN0 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
Version 0.94
Page 13
9/15/2018
15. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
KMP_AFFINITY='granularity=thread,proclist=[10-14,50-54],explicit,verbose' \
taskset -c 10-14,50-54 numactl -m $NUMA_DOMAIN0 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[15-19,55-59],explicit,verbose' \
taskset -c 15-19,55-59 numactl -m $NUMA_DOMAIN0 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[20-24,60-64],explicit,verbose' \
taskset -c 20-24,60-64 numactl -m $NUMA_DOMAIN1 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[25-29,65-69],explicit,verbose' \
taskset -c 25-29,65-69 numactl -m $NUMA_DOMAIN1 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[30-34,70-74],explicit,verbose' \
taskset -c 30-34,70-74 numactl -m $NUMA_DOMAIN1 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
KMP_AFFINITY='granularity=thread,proclist=[35-39,75-79],explicit,verbose' \
taskset -c 35-39,75-79 numactl -m $NUMA_DOMAIN1 \
python $TF_CNN_BMARKS/tf_cnn_benchmarks.py $COMMON_ARGS \
--num_intra_threads=$OMP_NUM_THREADS --eval=True \
--train_dir=$TRAIN_DIR --data_dir=$DATA_DIR &
Version 0.94
Page 14
9/15/2018
16. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
2. USING SINGULARITY
2.1 Installing Singularity
If Singularity is already installed on your cluster/system, then you can skip this step. If not, install Singularity as
root/sudo user using the Script_4 below by replacing the --prefix path accordingly. Latest installation instructions are
also available at https://singularity.lbl.gov/install-linux
Also refer to instructions at http://opensciencegrid.org/docs/worker-node/install-singularity/#enabling-unprivileged-
mode-for-singularity for more instructions and options for installing singularity on your cluster.
2.2 Building Singularity Image
If you already have a Singularity image/container, then you skip this step. If not, build a Singularity image
comprising of CentOS, Intel optimized TensorFlow, Horovod, OpenMPI and TensorFlow benchmarks as root/sudo
user
sudo /path/to/singularity/install/bin/singularity build tf-horovod.simg
tf-horovod.singularity
where tf-horovod.singularity is the Singularity recipe file that defines the configuration of the image. If you want the
image to be writable, you can pass the –writable flag while building and invoking the container. For more
details, refer to the documentation at https://singularity.lbl.gov/docs-build-container#--writable .
Below are two recipe files, either of which can be used based on your use case.
The singularity image can be built with the pre-built optimized tensorflow wheel pulled from a URL or from conda.
A sample recipe file for this is provided in Script_5.
Alternatively, one can also build a singularity image using a tensorflow wheel that was built from source on the
local file system. The sample recipe file for this is Script_6 .
2.3 Running TensorFlow With Singularity
Check the environment of the Singularity Image with singularity run
[user@linux ~]$ singularity run tf-horovod.simg
This is a Singularity image containing Intel optimized TensorFlow
installation with Horovod (OpenMPI)
Operating System: CentOS Linux release 7.5.1804 (Core)
GCC: gcc (GCC) 6.2.0
TensorFlow: 1.9.0
Horovod: 0.13.11
OpenMPI: 1.10.7
TensorFlow Benchmarks: /opt/tensorflow-benchmarks
Run TensorFlow Benchmarks using singularity shell:
For interactive development, shell into the image:
[user@linux ~]$ singularity shell tf-horovod.simg
Singularity: Invoking an interactive shell within container...
Singularity tf-horovod.simg:~ >
To run using singularity shell, start a singularity shell as described above, and launch the mpirun command
(described in section: Running tf_cnn_benchmarks (Example ResNet-50) above). Make a note to use the
right path for the tensorflow benchmarks. This path should point to the location of tf_cnn_benchmarks within
the image where they were cloned while creating the image
(eg.: "/opt/tensorflow-benchmarks/").
Version 0.94
Page 15
9/15/2018
17. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
Run TensorFlow Benchmarks using singularity exec:
To run using singularity exec, the same version of OpenMPI needs to exist within the image, and on the host
system. For example, if OpenMPI 1.10.7 is installed on the host system, make sure the image is built with
the same OpenMPI 1.10.7. Failure to match the MPI versions might cause unexpected failures. The Script_7
(run_singularity_multiworker.sh) included in this document uses the ‘singularity exec’ command to launch
the multi-node tf_cnn_benchmark run with ResNet 50. This script should be launched from the host machine.
Eg:
[user@linux ~]$chmod 755 run_singularity_multiworker.sh
[user@linux ~]$./run_singularity_multiworker.sh
Note that the script uses --bind to bind the directory containing the data on the host to the image file system. This
enables us to specify the data directory and data name while launching the benchmarks.
Use --bind
<path/to/some/host/directory>:<path/to/directory/in/image>
while
launching singularity shell/exec to bind the host directory to the specified directory in the image. Some additional
information about binding directories is available at http://singularity.lbl.gov/docs-mount. This can be used to bind
the directory containing the ImageNet TF_Records to the image and running the benchmarks.
Some HPC clusters (e.g. TACC) do not allow user-specified binding. In case of TACC, some directories (e.g. /home1,
/work, /scratch) are bound directly. A bind point needs to exist for these paths in the image. In that case, while building
the image, these directories can be created in the image, and the data on the host can be stored in one of these to be
accessed from within the image.
To run the benchmarks using dummy data, just exclude the arguments data_dir and data_name from the run command.
Version 0.94
Page 16
9/15/2018
18. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
3. USING NFS AND SLURM
While the previous section showed the job command line, the user may need to change how the jobs are launched
depending on the environment. Large scale systems with thousands of nodes typically use schedulers such as SLURM
to launch and manage jobs. We provide example scripts to run the benchmark in this environment. Additionally, we
also cover alternatives in smaller scale systems that may not have such a scheduler system in place. The scripts include
the run-time optimization Best Practices described above. The data set must be preprocessed as described in the
previous section and copied onto the local SSDs of all nodes.
3.1 Using NFS Mounted File System
If your cluster has an NFS mounted drive, the TensorFlow installation and the benchmarks can reside on the NFS
mounted drive. However, the dataset must preferably reside on local disc storage (preferably on SSDs) on all nodes
as opposed to NFS drives, unless Lustre is being used.
Ensure that you set up all nodes similarly (gcc version, MPI version) as described in the steps above
Install TensorFlow in a Python virtual environment as described in sections 1 and 2 by creating the
virtualenv directory on the NFS mounted drive
Install Horovod in the virtualenv following the steps described above
Clone the tf_cnn_benchmarks in the NFS mounted directory
mkdir <NFS_dir>/tensorflow_benchmarks
cd <NFS_dir>/tensorflow_benchmarks
git clone https://github.com/tensorflow/benchmarks.git
Run the benchmark by adapting the sample command line above to the cluster environment
3.2 Using SLURM Scheduler
We provide example scripts showing how to launch jobs with SLURM. The scripts provided use the Best Practices
from these optimizations listed in the blog above. In the scripts given in Section 8.1, edit the #SBATCH parameters
accordingly to change the number of nodes and ranks per node.
Version 0.94
Page 17
9/15/2018
19. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
4. TENSORFLOW BUILD INSTRUCTIONS
4.1 Building TensorFlow
1.
2.
Clone TensorFlow from: https://github.com/tensorflow/tensorflow, and checkout r1.9.0.
Build Tensorflow 1.9 with instructions from AI Products Group/Intel TensorFlow optimization team described
in the Appendix at the end of this document described below.
3. The process for building with MKL ML or MKL DNN is very similar. No specific steps are needed during
configuration to build Tensorfow using MKL. It is automatically configured when you run configure. You need
to only add --config=mkl to bazel build. In addition, make sure to set $TEST_TMPDIR to a local directory,
instead of an NFS mount directory, or some random errors could happen.
4. When you get to the question about "copt" parameters, just accept the default. You'll specify the flags explicitly
in the bazel build command
5. Use the default for all other settings. Choose as following:
a. No cloud
b. No Hadoop
c. No opencl
d. No rdma/verbs
e. No cuda
f. No mpi
g. Yes xla, xla may give improved performance
6. Building for MKL DNN: Broadwell/Skylake: Please use the following build command:
7. bazel build --config=mkl --copt="-mfma" --copt="-mavx2" --copt="-
march=broadwell" --copt="-O3" -s -c opt
//tensorflow/tools/pip_package:build_pip_package
8.
9.
When everything finally works, this is what you'll see: Success!
a. Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
bazel-bin/tensorflow/tools/pip_package/build_pip_package
Building the Python Wheel
a. Remove any existing wheels
b. rm ~/tmp/*.whl
c. From the root directory of your TensorFlow sources, create the whl file:
./bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tmp
10. Installing and running TensorFlow
a. Remove any existing TensorFlow installations
b.
pip uninstall tensorflow
11. Install the wheel for your user account.
a. pip install ~/tmp/*.whl --user #This will install any missing dependencies into your
user profile
b. pip install ~/tmp/*.whl --user --no-deps #This will skip installing dependencies
Make sure everything is working
12. Try importing the TensorFlow lib. If you don't get any errors, all is well
a. python
b. >>> import tensorflow
c. >>> quit()
Version 0.94
Page 18
9/15/2018
20. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
13. Try a simple kernel test
a. python <tensorflow src
directory>/tensorflow/python/kernel_tests/relu_op_test.py
Build Notes
b.
c.
d.
e.
-mavx512f -march=broadwell and other AVX512 flags need gcc4.9 or higher (Recommended to use
gcc5.4 or higher)
Currently, Tensorflow build with Intel® C/C++ compilers has some issues.
To use an alternate compiler (e.g. gcc 6.3):
change PATH to point to alternate compiler:
f. export PATH=/opt/intel/gcc6.3/bin:$PATH
g. add -L flag to bazel build command line above: --copt="-L/opt/intel/gcc6.3/lib64"
h.
change LD_LIBRARY_PATH when running the TensorFlow programs to point to new GLIBC:
i. export LD_LIBRARY_PATH=/opt/intel/gcc6.3/lib64:$LD_LIBRARY_PATH
For debug builds (no optimization), use
i. bazel build --config=mkl --copt="-DEIGEN_USE_VML" --copt="-g" -s -c dbg
//tensorflow/tools/pip_package:build_pip_package
j. For debug symbols only add --copt="-g" to bazel build command.
4.2 Install TensorFlow using script
The build_tensorflow_script can be used to install tensorflow in a virtual environment. At the prompt, choose “y” to
build from source. The script creates a virtual environment and installs TensorFlow in this environment, as opposed
to doing a system-wide install. The virtual environment is created in <path/to/install/directory>/virtenv/ , and can be
activated as follows:
source <path/to/install/directory>/virtenv/bin/activate
To deactivate the virtual environment, run:
deactivate
Version 0.94
Page 19
9/15/2018
21. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
5. SAMPLE SCRIPTS
5.1 TensorFlow build script
In the script below, change the highlighted parts according to the TensorFlow version that you want to work with.
The script saves the TensorFlow wheel that is installed in the virtual environment (either built from source or
downloaded) in the folder <path/to/install/directory>/tf_whl/.
#!/bin/bash
set -e
set -o pipefail
if [[ $#
then
echo
echo
exit
fi
-lt 1 ]]
"Usage $0 <dir>"
"TensorFlow will be cloned/installed into <dir>"
1
dir=$1
rm -rf $dir
export PYTHONPATH=`which python`
echo "Default python found at $PYTHONPATH"
while true; do
read -p "Do you want to use this python? [y/n] " yn
case $yn in
[Yy]* ) py_ins=1; break;;
[Nn]* ) py_ins=0; break;;
* ) echo "Please answer y/n.";;
esac
done
if [[ $py_ins -eq 0 ]]
then
read -e -p "Enter location of python to be used: " new_path
python_vers=`$new_path --version`
echo "Python at $new_path, of version $python_vers"
export PYTHONPATH=$new_path
fi
while true; do
read -p "Do you want to build tensorflow from source? [y/n] " yn
case $yn in
[Yy]* ) tf_ins=1; break;;
[Nn]* ) tf_ins=0; break;;
* ) echo "Please answer y/n.";;
esac
done
clonedir=`realpath $dir`
mkdir -p $clonedir
cd $clonedir
if [[ $tf_ins -eq 1 ]]
then
echo "Building from source at: $clonedir"
# Check GCC version
minv="5.2.0"
gccv=`gcc --version | grep "(GCC)" | awk '{print $3}'`
if [ "$(printf '%s\n' $minv $gccv | sort -V | head -n 1)" != "$minv" ]; then
echo "Need a newer GCC than $gccv"
exit
fi
Version 0.94
Page 20
9/15/2018
22. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
[[ $(type -P bazel) ]] ||
{ echo "Could not find bazel. Is it in PATH?" 1>&2; exit 1; }
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r1.9
testdir=/tmp/$USER
whldir="$testdir/tf_whl"
mkdir -p $whldir
export TEST_TMPDIR=$testdir
# TF related env var
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_JEMALLOC=1
export TF_NEED_OPENCL=0
export TF_NEED_CUDA=0
export TF_NEED_GDR=0
export TF_CUDA_CLANG=0
export TF_ENABLE_XLA=0
export TF_ENABLE_GDR=0
export TF_NEED_S3=0
export TF_NEED_KAFKA=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_SET_ANDROID_WORKSPACE=0
export CC_OPT_FLAGS="-march=broadwell --copt=\"-mfma\" --copt=\"-mavx2\" --copt=\"-O3\"
"
echo "Configuring tensorflow build..."
./configure
bazel build --config mkl --copt="-mavx2" --copt="-O3" --copt="-mfma" --copt="-
march=broadwell" -s -c opt //tensorflow/tools/pip_package:build_pip_package
cd $clonedir/tensorflow
echo "Building wheel..."
bazel-bin/tensorflow/tools/pip_package/build_pip_package $whldir
else
echo "Downloading Optimized pre-built wheel..."
whldir="$clonedir/tf_whl"
mkdir -p $whldir
cd $whldir
wget https://storage.googleapis.com/intel-optimized-tensorflow/tensorflow-1.9.0-cp27-
cp27mu-linux_x86_64.whl
fi
virtdir="$clonedir/virtenv"
echo "Creating virtualenv directory in $virtdir..."
mkdir -p $virtdir
cd $virtdir
echo "Creating virtualenv..."
$PYTHONPATH -m virtualenv $virtdir
whl=`ls $whldir/*.whl`
echo "$whl"
cd $virtdir
source ./bin/activate
echo "Installing wheel in virtualenv..."
pip install --upgrade pip
pip install $whl
echo "TensorFlow is installed in $virtdir using wheel $whl"
Script 1: build_tf.sh: file to either build tensorflow from source, or download the pre-built Intel optimized wheel and
install in a python virtual environment
Version 0.94
Page 21
9/15/2018
23. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
5.2 SLURM scripts
Here are two sample scripts, skx_2nodes.job (Script 2) and run.sh (Script 3). Script 2 controls the arguments to the
multi-node job, and Script 3 controls the arguments to the python script. The Script 2 runs the script 3.
#!/bin/bash
#SBATCH
#SBATCH
#SBATCH
#SBATCH
#SBATCH
#SBATCH
#SBATCH
-J
-o
-e
-N
-n
-p
-t
skx_resnet50
out_skx_resnet50_%j
err_skx_resnet50_%j
2
8
skx-normal
00:45:00
cur_date=`date +%F-%H-%M-%S`
model="resnet50"
# edit result directory path
result_dir="path_to_res_dir"
mkdir -p ${result_dir}
mkdir -p ${result_dir}/train
curdir=`pwd`
script="${curdir}/run.sh"
echo "Running script $script"
unset OMP_NUM_THREADS
unset KMP_BLOCKTIME
unset KMP_AFFINITY
export
export
export
export
export
export
OMP_NUM_THREADS=24
KMP_BLOCKTIME=1
I_MPI_FABRICS="shm:tmi"
I_MPI_TMI_PROVIDER="psm2"
I_MPI_PIN_DOMAIN="socket"
I_MPI_FALLBACK=0
# optional for debug
export I_MPI_DEBUG="3"
export I_MPI_DEBUG_OUTPUT="${result_dir}/debug_%r.dbg"
thresh=$(( 16 * 1024 * 1024 ))
unset HOROVOD_FUSION_THRESHOLD
export HOROVOD_FUSION_THRESHOLD=$thresh
mpienv="-genv I_MPI_DEBUG $I_MPI_DEBUG -genv LD_LIBRARY_PATH $LD_LIBRARY_PATH -genv
HOROVOD_FUSION_THRESHOLD $HOROVOD_FUSION_THRESHOLD -genv OMP_NUM_THREADS $OMP_NUM_THREADS -
genv KMP_BLOCKTIME $KMP_BLOCKTIME -genv I_MPI_FABRICS $I_MPI_FABRICS -genv
I_MPI_TMI_PROVIDER $I_MPI_TMI_PROVIDER -genv I_MPI_PIN_DOMAIN $I_MPI_PIN_DOMAIN -genv
I_MPI_FALLBACK $I_MPI_FALLBACK"
module load impi
which mpirun
mpirun --np 8 $mpienv -l ${script} ${thresh} ${result_dir} ${model}
mv out_skx_resnet* ${result_dir}
mv err_skx_resnet* ${result_dir}
# For TACC environment, ibrun is recommended
# export IBRUN_TASKS_PER_NODE=4
# ibrun --np 8 ${script} ${thresh} ${result_dir} ${model}
Script 2: Example skx_2nodes.job sample file launches the batch job by defining the SBATCH parameters for 2
nodes of Skylake, and launches run.sh with mpirun. This is a sample script to run on TACC, and uses Intel MPI.
Version 0.94
Page 22
9/15/2018
24. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
#!/bin/bash
if [ $# -lt 1 ]
then
echo "Usage: $0 fusion_threshold"
exit 1
fi
killall -9 python
thresh=$1
resdir=$2
model=$3
train_dir=$resdir/train
total_procs=${SLURM_NPROCS}
num_nodes=$SLURM_NNODES
procs_per_node=`echo ${SLURM_NPROCS}/$SLURM_NNODES | bc`
if [ ${num_nodes} == "" ]
then
echo "SLURM_NNODES returned $SLURM_NNODES"
exit 1
fi
script="/path/to/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
intra_threads=24
inter_threads=2
num_batches=500
data_dir="/tmp/imgnet/"
batch_size=`echo 256/${procs_per_node} | bc`
host=$(hostname)
module load impi
env
lscpu
echo "Num nodes: ${num_nodes} Node: $host total procs: $total_procs batch size: $batch_size
intra threads: $intra_threads"
args="--batch_size=$batch_size --model=resnet50 \
--num_intra_threads $OMP_NUM_THREADS \
--num_inter_threads 2 \
--display_every 5 \
--data_format NCHW \
--optimizer momentum \
--device cpu \
--enable_layout_optimizer=TRUE"
cmd="cd /path/to/virtualenv/virtenv; source ./bin/activate; python -u $script --model
$model $args
--mkl=TRUE \
--variable_update=horovod \
--horovod_device=cpu \
--local_parameter_device=cpu \
--kmp_blocktime=1 \
--enable_layout_optimizer=TRUE; deactivate "
echo $cmd
eval $cmd
Script 3: run.sh Launching Job with SLURM. run.sh controls launching the python script and its arguments
Version 0.94
Page 23
9/15/2018
25. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
5.3 Singularity scripts
5.3.1 Install script
Edit the highlighted portions below to change the version and the path where singularity is to be installed
#!/bin/bash
VERSION=2.5.1
sudo yum –y install libarchive-devel
sudo yum –y install squashfs-tools
wget https://github.com/singularityware/singularity/releases/download/$VERSION/singularity-
$VERSION.tar.gz
tar xvf singularity-$VERSION.tar.gz
cd singularity-$VERSION
./configure --prefix=/path/to/where/singularity/will/be/installed
make
sudo make install
Script 4: Installing Singularity
5.3.2 Recipe file for tensorflow wheel downloaded from a URL
Bootstrap: docker
From: centos:latest
%setup
export SINGULARITY_SHELL=/bin/bash
%post -c /bin/bash
# If your environment requires proxy to reach internet, then export the http, https proxy
variables accordingly
export http_proxy=http://proxy-server-name:port
export https_proxy=http://proxy-server-name:port
yum clean all
rm -rf /var/cache/yum
yum -y install epel-release
yum -y groupinstall "Development Tools"
yum -y install wget
yum -y install git
#installing gcc
yum -y install gmp-devel mpfr-devel libmpc-devel glibc-devel.i686 libgcc.i686 gcc-c++
wget https://ftp.gnu.org/gnu/gcc/gcc-6.2.0/gcc-6.2.0.tar.gz
tar -xzf gcc-6.2.0.tar.gz
mkdir build
cd build
../gcc-6.2.0/configure --prefix=/opt/gcc-6.2.0
make -j 20
make install
#adding GCC to $PATH
export PATH=/opt/gcc-6.2.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/gcc-6.2.0/lib64:/opt/gcc-6.2.0/lib
#using anaconda, conda’s default tensorflow is Intel optimized
wget https://repo.anaconda.com/archive/Anaconda2-5.2.0-Linux-x86_64.sh
bash Anaconda2-5.2.0-Linux-x86_64.sh -b -p /opt/anaconda2
Version 0.94
Page 24
9/15/2018
26. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
export PATH=/opt/anaconda2/bin:$PATH
rm -rf Anaconda2-5.2.0-Linux-x86_64.sh
yum -y install openmpi openmpi-devel
yum -y install environment-modules
source /etc/profile.d/modules.sh
module add mpi/openmpi-x86_64
#In case module throws an error (eg. on TACC), use:
# export PATH=/usr/lib64/openmpi/bin:$PATH
# export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
conda install -y tensorflow
pip install --no-cache-dir Horovod
git clone https://github.com/tensorflow/benchmarks /opt/tensorflow-benchmarks
#uncomment following for running on TACC
#mkdir –p /home1
#mkdir –p /scratch
#mkdir –p /work
%environment
export http_proxy=http://proxy-server-name:port
export https_proxy=http://proxy-server-name:port
export PATH=/opt/gcc-6.2.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/gcc-6.2.0/lib64:/opt/gcc-6.2.0/lib
export PATH=/opt/anaconda2/bin:$PATH
#openmpi
source /etc/profile.d/modules.sh
module add mpi/openmpi-x86_64
#Again, in case module throws an error(eg. on TACC), use:
# export PATH=/usr/lib64/openmpi/bin:$PATH
# export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
%runscript
echo "This is a Singularity image containing Intel optimized TensorFlow installation with
Horovod (OpenMPI)"
echo "Operating System: $(cat /etc/redhat-release)"
echo "GCC: $(gcc --version)"
echo "TensorFlow: $(pip show tensorflow --disable-pip-version-check | grep "^Version:" |
awk '{print $NF}')"
echo "Horovod: $(pip show horovod --disable-pip-version-check | grep "^Version:" | awk
'{print $NF}')"
echo "OpenMPI: $(ompi_info | grep "Open MPI:" | awk '{print $NF}')"
echo "TensorFlow Benchmarks: /opt/tensorflow-benchmarks"
Script 5: tf-horovod.singularity file contents for the case where TensorFlow is installed from conda.
5.3.3 Recipe file for TensorFlow wheel on local file system
Bootstrap: docker
From: centos:latest
%setup
# Adjust the below source path and file name accordingly
cp /path/on/local/filesystem/to/tensorflow.whl ${SINGULARITY_ROOTFS}/tensorflow.whl
export SINGULARITY_SHELL=/bin/bash
%post -c /bin/bash
# If your environment requires proxy to reach internet, then export the http, https proxy
variables accordingly
export http_proxy=http://proxy-server-name:port
export https_proxy=http://proxy-server-name:port
Version 0.94
Page 25
9/15/2018
27. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
yum clean all
rm -rf /var/cache/yum
yum -y install epel-release
yum -y groupinstall "Development Tools"
yum -y install wget
yum -y install git
#installing gcc
yum -y install gmp-devel mpfr-devel libmpc-devel glibc-devel.i686 libgcc.i686 gcc-c++
wget https://ftp.gnu.org/gnu/gcc/gcc-6.2.0/gcc-6.2.0.tar.gz
tar -xzf gcc-6.2.0.tar.gz
mkdir build
cd build
../gcc-6.2.0/configure --prefix=/opt/gcc-6.2.0
make -j 20
make install
#adding GCC to $PATH
export PATH=/opt/gcc-6.2.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/gcc-6.2.0/lib64:/opt/gcc-6.2.0/lib:$LD_LIBRARY_PATH
yum -y install python-pip python-wheel python-devel
yum -y install openmpi openmpi-devel
yum -y install environment-modules
source /etc/profile.d/modules.sh
module add mpi/openmpi-x86_64
#In case module throws an error (eg. on TACC), use:
# export PATH=/usr/lib64/openmpi/bin:$PATH
# export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
# Replace the tensorflow.whl file in the installation below to the actual file name used
in the setup section
pip install /tensorflow.whl
pip install --no-cache-dir Horovod
# Download the TF benchmarks and place them in /opt inside the image
git clone https://github.com/tensorflow/benchmarks /opt/tensorflow-benchmarks
#For running on TACC: uncomment the following lines to bind these directories
# mkdir –p /home1
# mkdir –p /scratch
# mkdir –p /work
%environment
export http_proxy=http://proxy-server-name:port
export https_proxy=http://proxy-server-name:port
#gcc 6.2.0
export PATH=/opt/gcc-6.2.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/gcc-6.2.0/lib64:/opt/gcc-6.2.0/lib:$LD_LIBRARY_PATH
#openmpi
source /etc/profile.d/modules.sh
module add mpi/openmpi-x86_64
#Again, in case module throws an error(eg. on TACC), use:
# export PATH=/usr/lib64/openmpi/bin:$PATH
# export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
%runscript
echo "This is a Singularity image containing Intel optimized TensorFlow installation with
Horovod (OpenMPI)"
echo "Operating System: $(cat /etc/redhat-release)"
echo "GCC: $(gcc --version | grep "(GCC)" | awk '{print}')"
Version 0.94
Page 26
9/15/2018
28. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
echo "TensorFlow: $(pip show tensorflow --disable-pip-version-check | grep "^Version:" |
awk '{print $NF}')"
echo "Horovod: $(pip show horovod --disable-pip-version-check | grep "^Version:" | awk
'{print $NF}')"
echo "OpenMPI: $(ompi_info | grep "Open MPI:" | awk '{print $NF}')"
echo "TensorFlow Benchmarks: /opt/tensorflow-benchmarks"
Script 6: tf-horovod.singularity file contents for the case where TensorFlow is installed from local file system.
Edit highlighted text depending on your environment, paths and TensorFlow wheel.
5.3.4 Singularity run-script
#!/bin/bash
PATH_TO_SINGULARITY="/path/to/singularity/installation/bin/singularity"
PATH_TO_SIMG="/path/to/tf-horovod.simg"
SING_EXEC_CMD="${PATH_TO_SINGULARITY} exec --bind /local/path/to/TF_Records:
/image/path/to/TF_Records ${PATH_TO_SIMG}"
PATH_TO_SIMG_TF_BENCH="/opt/tensorflow-benchmarks/"
OMP_NUM_THREADS=<positive_number>
HOSTNAMES="hostname1,hostname2,and so on"
args=" \
--batch_size=64 \
--model=resnet50 \
--num_intra_threads=$OMP_NUM_THREADS \
--num_inter_threads=2 \
--display_every=5 \
--data_format=NCHW \
--optimizer=momentum \
--device=cpu "
HOROVOD_FUSION_THRESHOLD=134217728 mpirun --np 4 \
--map-by ppr:2:socket:pe=${OMP_NUM_THREADS} \
-H ${HOSTNAMES} \
--report-bindings \
--oversubscribe \
--allow-run-as-root \
-x LD_LIBRARY_PATH \
-x HOROVOD_FUSION_THRESHOLD \
-x OMP_NUM_THREADS=${OMP_NUM_THREADS} \
${SING_EXEC_CMD} \
python ${PATH_TO_SIMG_TF_BENCH}/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py $args \
--mkl=TRUE \
--variable_update=horovod --horovod_device=cpu \
--local_parameter_device=cpu --kmp_blocktime=1 \
--data_dir=/image/path/to/TF_Records --data_name=imagenet
Script 7: Sample run_singularity_multiworker.sh . Edit text highlighted in yellow depending on your
environment and number of TF workers per node. Also, edit “args” settings for the TF benchmark topology.
Version 0.94
Page 27
9/15/2018
29. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
6. TROUBLESHOOTING
6.1 TensorFlow Import Issues
With the most recent pre-built 1.9 wheel, this error might occur on some systems while importing tensorflow:
6.1.1 Importing TensorFlow
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/__init__.py", line 22, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in
swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found (required by
/path/to/virtualenv/virtenv/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
Failed to load the native TensorFlow runtime.
This happens because the wheel requires GLIBC 2.23, but the system has an older version. You can use the
following techniques to debug this issue.
6.1.2 Run ldd to find the dynamically linked libraries
(virt2.7) [user@linux singularity_testing]$ ldd
/path/to/virtualenv/virt2.7/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so
linux-vdso.so.1 => (0x00007fff4df5b000)
libtensorflow_framework.so => /path/to/virtualenv/virt2.7/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so (0x00007f426cefe000)
libiomp5.so => /path/to/virtualenv/virt2.7/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libiomp5.so (0x00007f426cb5a000)
libmklml_intel.so => /path/to/virtualenv/virt2.7/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libmklml_intel.so (0x00007f4263a89000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f4263885000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4263669000)
libm.so.6 => /lib64/libm.so.6 (0x00007f4263367000)
libstdc++.so.6 => /path/to/gcc-7.2.0/lib64/libstdc++.so.6 (0x00007f4262fe6000)
libgcc_s.so.1 => /path/to/gcc-7.2.0/lib64/libgcc_s.so.1 (0x00007f4262dcf000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4262a02000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4273a10000)
Version 0.94
Page 28
9/15/2018
30. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
The libm.so.6 in /lib64 was not finding `GLIBC_2.23'.
6.1.3 Check by running:
(virt2.7) [user@linux singularity_testing]$ strings /lib64/libm.so.6 | grep "GLIBC_"
GLIBC_2.2.5
GLIBC_2.4
GLIBC_2.15
GLIBC_PRIVATE
GLIBC_2.15
__strtold_nan@@GLIBC_PRIVATE
__errno_location@@GLIBC_2.2.5
errno@@GLIBC_PRIVATE
qsort@@GLIBC_2.2.5
__strtod_nan@@GLIBC_PRIVATE
fputs@@GLIBC_2.2.5
memset@@GLIBC_2.2.5
feupdateenv@@GLIBC_2.2.5
GLIBC_2.4
_rtld_global_ro@@GLIBC_PRIVATE
__strtof_nan@@GLIBC_PRIVATE
memmove@@GLIBC_2.2.5
fwrite@@GLIBC_2.2.5
__cxa_finalize@@GLIBC_2.2.5
stderr@@GLIBC_2.2.5
GLIBC_2.2.5
Finally, one fix is to build the required version of GLIBC in a user-specified path, and then copying the libm.so.6 to
a path that is imported in LD_LIBRARY_PATH.
mkdir ~/glibc
cd ~/glibc
wget http://ftp.gnu.org/gnu/glibc/glibc-2.23.tar.gz
tar zxvf glibc-2.23.tar.gz
cd glibc-2.23
mkdir build
cd build
../configure --prefix=/opt/glibc-2.23
make -j4
sudo make install
cp /opt/glibc-2.23/lib/libm.so.6 </some/path>/lib
export LD_LIBRARY_PATH=</some/path>/lib
Please make sure that you do not replace the system’s GLIBC. Adding /opt/glibc-2.23/lib to
LD_LIBRARY_PATH might break other things like ls.
Version 0.94
Page 29
9/15/2018
31. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
6.1.4 Another Common Error when Importing TensorFlow
>>> import tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/virtualenv/test/lib/python2.7/site-packages/tensorflow/__init__.py",
line 22, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in
swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
Failed to load the native TensorFlow runtime.
This happens if you are using an older version of GCC. To fix this, make sure the correct version of gcc (6.2 or
higher) is in PATH and LD_LIBRARY_PATH.
6.1.5 Verify that TensorFlow is Using right the version of gcc
(test) [user@linux OOCpackage_testing]$ ldd
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `CXXABI_1.3.8' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `GLIBCXX_3.4.21' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `GLIBCXX_3.4.20' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libm.so.6: version
`GLIBC_2.23' not found (required by /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so)
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `GLIBCXX_3.4.20' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so)
Version 0.94
Page 30
9/15/2018
32. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `GLIBCXX_3.4.21' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so)
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so: /lib64/libstdc++.so.6:
version `CXXABI_1.3.8' not found (required by
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so)
linux-vdso.so.1 => (0x00007ffc2c326000)
libtensorflow_framework.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so (0x00007fec5a4dd000)
libiomp5.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libiomp5.so (0x00007fec5a139000)
libmklml_intel.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libmklml_intel.so (0x00007fec51068000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fec50e64000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fec50c48000)
libm.so.6 => /lib64/libm.so.6 (0x00007fec50946000)
librt.so.1 => /lib64/librt.so.1 (0x00007fec5073e000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fec50437000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fec50221000)
libc.so.6 => /lib64/libc.so.6 (0x00007fec4fe54000)
/lib64/ld-linux-x86-64.so.2 (0x00007fec615b2000)
6.1.6 Run ldd again after adding the correct version of gcc
(test) [user@linux OOCpackage_testing]$ ldd
/path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/_pywrap_tensorflow_internal.so
linux-vdso.so.1 => (0x00007ffe334f6000)
libtensorflow_framework.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../libtensorflow_framework.so (0x00007efff9f26000)
libiomp5.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libiomp5.so (0x00007efff9b82000)
libmklml_intel.so => /path/to/virtualenv/test/lib/python2.7/site-
packages/tensorflow/python/../../_solib_k8/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexte
rnal_Smkl_Ulinux_Slib/libmklml_intel.so (0x00007efff0ab1000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007efff08ad000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007efff0691000)
libm.so.6 => /lib64/libm.so.6 (0x00007efff038e000)
librt.so.1 => /lib64/librt.so.1 (0x00007efff0186000)
libstdc++.so.6 => /path/to/gcc-7.2.0/lib64/libstdc++.so.6 (0x00007effefe05000)
libgcc_s.so.1 => /path/to/gcc-7.2.0/lib64/libgcc_s.so.1 (0x00007effefbee000)
libc.so.6 => /lib64/libc.so.6 (0x00007effef821000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0000ffb000)
Version 0.94
Page 31
9/15/2018
33. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
6.2 TensorFlow Build Issues
At the time of writing this document, TensorFlow fails to build with the latest Bazel version 0.12.0. The error is shown
below:
[user@linux tensorflow]$ bazel build --config mkl --copt="-mavx2" --copt="-
O3"
--copt="-mfma"
--copt="-march=broadwell"
-s
-c
opt
//tensorflow/tools/pip_package:build_pip_package
WARNING:
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/protobuf_archive/WORKSPACE:1:
Workspace
name
in
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/protobuf_archive/WORKSPACE (@com_google_protobuf) does not match the name
given in the repository's definition (@protobuf_archive); this will cause a
build error in future versions
WARNING:
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/grpc/WORKSPACE:1:
Workspace
name
in
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/grpc/WORKSPACE (@com_github_grpc_grpc) does not match the name given in the
repository's definition (@grpc); this will cause a build error in future
versions
WARNING:
/home/user/tf_test_new/tensorflow/third_party/mkl/BUILD:20:12:
in
srcs attribute of cc_library rule //third_party/mkl:intel_binary_blob: please
do not import '@mkl//:lib/libmklml_intel.so' directly. You should either move
the file to this package or depend on an appropriate rule there
WARNING:
/home/user/tf_test_new/tensorflow/third_party/mkl/BUILD:20:12:
in
srcs attribute of cc_library rule //third_party/mkl:intel_binary_blob: please
do not import '@mkl//:lib/libiomp5.so' directly. You should either move the
file to this package or depend on an appropriate rule there
ERROR:
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in
@jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package'
failed; build aborted:
/home/user/.cache/bazel/_bazel_user/8aaf3223e362cc8d0a1f01ee89fea818/external
/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in
@jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
INFO: Elapsed time: 7.813s
FAILED: Build did NOT complete successfully (135 packages loaded)
currently loading: tensorflow/core/kernels
The workaround is to downgrade to the previous Bazel version 0.11.1
Version 0.94
Page 32
9/15/2018
34. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
6.3 Horovod Install Issues
On trying to import Horovod, if there are errors such as this:
>>> import horovod.tensorflow
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/horovod/tensorflow/__init__.py", line 41, in <module>
from horovod.tensorflow.mpi_ops import allgather
File "/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/horovod/tensorflow/mpi_ops.py", line 56, in <module>
['HorovodAllgather', 'HorovodAllreduce'])
File "/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/horovod/tensorflow/mpi_ops.py", line 43, in _load_library
library = load_library.load_op_library(filename)
File "/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
File "/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError:
/home/user/tf_test_new/virtenv/lib/python2.7/site-
packages/horovod/tensorflow/mpi_lib.so: undefined symbol:
_ZNK10tensorflow8OpKernel4nameEv
Ensure that the same GCC version that was using to build TensorFlow is in PATH when installing Horovod. If
different versions were used, especially GCC versions older than 5.1, then object files get compiled with for
different versions of the ABI (_GLIBCXX_USE_CXX11_ABI), resulting in this error at link time. To fix this,
uninstall Horovod, fix PATH and LD_LIBRARY_PATH to point to the correct GCC, and then reinstall Horovod.
6.4 Verify Intel® Omni-Path Architecture (OPA)
The official documentation contains in-depth information on setup and configuration of Intel® OPA:
https://www.intel.com/content/www/us/en/support/articles/000016242/network-and-i-o/fabric-products.html
6.4.1 Verify that OPA is Up and Running
opaconfig -V
10.3.1.0.22
6.4.2 Verify Install (Example of a good install)
ibstat
CA 'hfi1_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version: 11
Node GUID: 0x00117501017a04c9
System image GUID: 0x00117501017a04c9
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Version 0.94
Page 33
9/15/2018
35. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
Base lid: 19
LMC: 0
SM lid: 1
Capability mask: 0x00410020
Port GUID: 0x00117501017a04c9
Link layer: InfiniBand
opainfo
hfi1_0:1
PortGID:0xfe80000000000000:001175010174447d
PortState:
Active
LinkSpeed
Act: 25Gb
En: 25Gb
LinkWidth
Act: 4
En: 4
LinkWidthDnGrd ActTx: 4 Rx: 4
En: 3,4
LCRC
Act: 14-bit
En: 14-bit,16-bit,48-bit
Mgmt: True
LID: 0x00000031-0x00000031
SM LID: 0x00000001 SL: 0
QSFP: PassiveCu, 3m
Hitachi Metals
P/N IQSFP26C-30
Rev 03
Xmit Data:
10922365 MB Pkts:
1929517624
Recv Data:
10893103 MB Pkts:
1809969601
Link Quality: 5 (Excellent)
6.4.3 Verify OPA Fabric Performance
We recommend sanity checking the OPA network connectivity by using a test benchmark. If the Intel® Omni-Path
Fabric Suite FastFabric Toolset is installed, a number of such test suites are automatically included in the installation.
See section 8 of the Intel® OPA™ FastFabric Toolset documentation here:
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-
products/Intel_OP_FabricSuite_FastFabric_UG_H76469_v8_0.pdf
If not, the link can be checked manually by running benchmarks such as the OSU benchmark, which are easily
obtainable on the official page here:
http://mvapich.cse.ohio-state.edu/benchmarks/
6.4.4 OPA Install Issues
Some things to check if the hfi1 module does not load automatically on reboot: You may have to load the driver
manually:
sudo service rdma start
sudo modprobe hfi1
service irqbalance restart
lsmod | grep hfi1
hfi1
697628 1
rdmavt
63294 1 hfi1
ib_core
210381
13
hfi1,rdma_cm,ib_cm,iw_cm,rpcrdma,ib_ucm,rdmavt,ib_iser,ib_umad,ib_uverbs,rdma
_ucm,ib_ipoib,ib_isert
i2c_algo_bit
13413 3 igb,hfi1,mgag200
Version 0.94
Page 34
9/15/2018
36. Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon®
i2c_core
40756
drm,igb,hfi1,i2c_i801,drm_kms_helper,mgag200,i2c_algo_bit
7
Check if device is present and recognized in the PCIe tree:
[user@linux nompi]$ ls –l /sys/class/infiniband
lrwxrwxrwx
1
root
root
0
Oct
10
12:14
hfi1_0
->
../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/infiniband/hfi1_0
[user@linux nompi]$ ls -l /dev/hfi1*
crw-rw-rw- 1 root root 245,
0 Apr 12 14:33 /dev/hfi1_0
crw------- 1 root root 245, 128 Apr 12 14:33 /dev/hfi1_diagpkt
crw------- 1 root root 245, 200 Apr 12 14:33 /dev/hfi1_diagpkt0
[bduser@node27 ~]$ lspci | grep -i hfi
03:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series
[discrete] (rev 11)
Look at dmesg. Following shows driver loaded correctly:
[
[
[
[
[
[
22.987380]
22.987398]
22.987409]
22.988637]
22.988651]
22.988715]
hfi1
hfi1
hfi1
hfi1
hfi1
hfi1
0000:03:00.0:
0000:03:00.0:
0000:03:00.0:
0000:03:00.0:
0000:03:00.0:
0000:03:00.0:
hfi1_0:
hfi1_0:
hfi1_0:
hfi1_0:
hfi1_0:
hfi1_0:
set_link_state: current INIT, new ARMED
logical state changed to PORT_ARMED (0x3)
send_idle_message: sending idle message 0x103
set_link_state: current ARMED, new ACTIVE
logical state changed to PORT_ACTIVE (0x4)
send_idle_message: sending idle message 0x203
Additional driver parameters are listed in the performance tuning guide:
https://www.intel.com/content/www/us/en/support/articles/000016242/network-and-i-o/fabric-products.html
Current version (subject to update):
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-
products/Intel_OP_Performance_Tuning_UG_H93143_v11_0.pdf
To test basic functionality, run a Verbs or IPoFabric benchmark. See section 7 of performance tuning guide for
complete details. Below is a snippet showing how to run ib_write_bw from the perftest benchmark to test for Verbs
performance.
To unload the driver manually:
sudo rmmod hfi1
sudo rmmod rdmavt
Version 0.94
Page 35
9/15/2018