Speeding up Python* scientific computations
Goal
Use Intel® oneAPI Math Kernel Library (oneMKL) to boost Python* applications that perform heavy mathematical computations.
Solution
Python applications that perform heavy mathematical computations commonly use these packages:
NumPy*: Consists of an N-dimensional array object, a multi-dimensional container of generic data.
SciPy*: Includes modules for linear algebra, statistics, integration, Fourier transforms, ordinary differential equations solvers, and more. Depends on NumPy for fast N-dimensional array manipulation.
To speed up NumPy/SciPy computations, build these packages from source with oneMKL and run an example to measure the performance. For a further performance boost on systems with Intel® Xeon Phi™ coprocessors, enable Automatic Offload.
Building NumPy and SciPy with oneMKL
To benefit from NumPy and SciPy prebuilt with oneMKL, download Intel® Distribution for Python* from https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html.
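Whether an installed NumPy was built against oneMKL can be checked from its build configuration. The following is a minimal sketch, not an official check: the helper names `mentions_mkl` and `numpy_config_text` are hypothetical, and searching the text printed by `np.show_config()` for "mkl" is only a heuristic that assumes the linked libraries are named in that output.

```python
import contextlib
import io


def mentions_mkl(config_text):
    """Return True if a build-configuration dump mentions MKL (case-insensitive)."""
    return "mkl" in config_text.lower()


def numpy_config_text():
    """Capture what numpy.show_config() prints; return '' if NumPy is missing."""
    try:
        import numpy as np
    except ImportError:
        return ""
    buffer = io.StringIO()
    # show_config() prints to standard output, so redirect it into a string.
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    print("NumPy build mentions MKL:", mentions_mkl(numpy_config_text()))
```

A negative result does not prove that oneMKL is absent; it only means the configuration text does not name it.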
These steps assume a Linux* or Windows* operating system, Intel® 64 architecture, and ILP64 interface.
Get the latest NumPy and SciPy packages from http://www.scipy.org/Download and unpack them.
Install the latest versions of oneMKL and the Intel® C++ and Intel® Fortran Compilers.
Set the environment variables for Intel C++ and Fortran compilers:
Linux*:
Execute the command:
$ source <intel tools installation dir>/bin/compilervars.sh intel64
Windows*:
Launch environment setters to specify the Visual Studio* mode for your Intel64 build binaries:
(Windows 8:) Place the mouse pointer in the bottom-left corner of the screen, click the right mouse button, select Search, and click anywhere in the screen white space.
Navigate to the Intel Parallel Studio 2016 section and select Intel64 Visual Studio 20XX mode.
Change directory to <numpy dir>
Make a copy of the existing site.cfg.example and save it as site.cfg
Open site.cfg, uncomment the [mkl] section, and modify it to look as follows:
Linux:
[mkl]
library_dirs = /opt/intel/compilers_and_libraries_2016/linux/mkl/lib
include_dirs = /opt/intel/compilers_and_libraries_2016/linux/mkl/include
mkl_libs = mkl_rt
lapack_libs =
Windows:
[mkl]
library_dirs = C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\mkl\lib\intel64;C:\Program Files (x86)\Intel\Composer XE 2015.x.yyy\compiler\lib\intel64
include_dirs = C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\mkl\include
mkl_libs = mkl_lapack95_lp64,mkl_blas95_lp64,mkl_intel_lp64,mkl_intel_thread,mkl_core,libiomp5md
lapack_libs = mkl_lapack95_lp64,mkl_blas95_lp64,mkl_intel_lp64,mkl_intel_thread,mkl_core,libiomp5md
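Because site.cfg is an INI-style file, the [mkl] section can also be generated and checked with a script instead of edited by hand. The sketch below uses only the Python standard library; the function names `build_mkl_site_cfg` and `read_mkl_section` are hypothetical, and the example assumes the oneMKL root contains `lib` and `include` subfolders as in the Linux settings above.

```python
import configparser
import io


def build_mkl_site_cfg(mkl_root, libs="mkl_rt"):
    """Return the text of a site.cfg containing only an [mkl] section."""
    parser = configparser.ConfigParser()
    parser["mkl"] = {
        "library_dirs": f"{mkl_root}/lib",
        "include_dirs": f"{mkl_root}/include",
        "mkl_libs": libs,
        "lapack_libs": "",  # left empty, as in the Linux example
    }
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


def read_mkl_section(cfg_text):
    """Parse site.cfg text and return the [mkl] options as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return dict(parser["mkl"])


if __name__ == "__main__":
    root = "/opt/intel/compilers_and_libraries_2016/linux/mkl"
    print(build_mkl_site_cfg(root))
```

Reading the file back with `read_mkl_section` is a quick way to catch a misspelled option name before starting a long build.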
Modify intelccompiler.py in <numpy dir>/distutils to pass optimization options to Intel C++ Compiler:
Linux:
self.cc_exe = 'icx -O3 -g -xhost -fPIC -fomit-frame-pointer -openmp -DMKL_ILP64'
Windows:
self.compile_options = [ '/nologo', '/O3', '/MD', '/W3', '/Qstd=c99',
'/QxHost', '/fp:strict', '/Qopenmp']
Modify intel.py in the <numpy dir>/distutils/fcompiler folder to pass optimization options to Intel Fortran Compiler:
Linux:
ifort -xhost -openmp -i8 -fPIC
Windows:
def get_flags(self):
    opt = ['/nologo', '/MD', '/nbs', '/names:lowercase', '/assume:underscore']
Change directory to <numpy dir> and build and install NumPy:
Linux:
$ python setup.py config --compiler=intelem build_clib --compiler=intelem build_ext --compiler=intelem install
Windows:
python setup.py config --compiler=intelemw build_clib --compiler=intelemw build_ext --compiler=intelemw install
Change directory to <scipy dir> and build and install SciPy:
Linux:
$ python setup.py config --compiler=intelem --fcompiler=intelem build_clib --compiler=intelem --fcompiler=intelem build_ext --compiler=intelem --fcompiler=intelem install
Windows:
python setup.py config --compiler=intelemw --fcompiler=intelvem build_clib --compiler=intelemw --fcompiler=intelvem build_ext --compiler=intelemw --fcompiler=intelvem install
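The build commands above repeat the same compiler options for the config, build_clib, and build_ext stages, which makes them easy to mistype. As an illustration only, the following sketch assembles the argument list from the compiler names; `setup_py_command` is a hypothetical helper, not part of NumPy or SciPy, and it does not run the build itself.

```python
def setup_py_command(compiler, fcompiler=None, python="python"):
    """Build the setup.py argument list used in the NumPy and SciPy build steps.

    compiler  -- C compiler name, for example 'intelem' or 'intelemw'
    fcompiler -- Fortran compiler name (needed for SciPy), or None for NumPy
    """
    options = [f"--compiler={compiler}"]
    if fcompiler:
        options.append(f"--fcompiler={fcompiler}")

    command = [python, "setup.py"]
    # Each build stage receives the same compiler options.
    for stage in ("config", "build_clib", "build_ext"):
        command.append(stage)
        command.extend(options)
    command.append("install")
    return command


if __name__ == "__main__":
    # NumPy on Linux, then SciPy on Windows, as in the steps above.
    print(" ".join(setup_py_command("intelem")))
    print(" ".join(setup_py_command("intelemw", fcompiler="intelvem")))
```

The returned list can be passed to `subprocess.run` from the package source directory if you prefer to script the build.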
Code Example
import time

import numpy as np
import scipy.linalg.blas as slb

M = 10000
N = 6000
k_list = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

# Print the build configuration to see which BLAS/LAPACK libraries NumPy uses.
np.show_config()

for K in k_list:
    # Random double-precision input matrices: a is M x N, b is N x K.
    a = np.array(np.random.random((M, N)), dtype=np.double, order='C', copy=False)
    b = np.array(np.random.random((N, K)), dtype=np.double, order='C', copy=False)
    A = np.matrix(a, dtype=np.double, copy=False)
    B = np.matrix(b, dtype=np.double, copy=False)

    start = time.time()
    C = slb.dgemm(1.0, a=A, b=B)
    end = time.time()

    tm = end - start  # elapsed wall-clock time in seconds
    print('{0:4}, {1:9.7}'.format(K, tm))
Source code: see the dgemm_python folder in the samples archive available at https://www.intel.com/content/dam/develop/external/us/en/documents/mkl-cookbook-samples-120115.zip.
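The example reports elapsed time for each value of K. To compare runs across matrix sizes, it can help to convert a timing into a throughput figure. The sketch below uses the conventional estimate of 2*M*N*K floating-point operations for multiplying an M x N matrix by an N x K matrix; `gemm_gflops` is a hypothetical helper and is not part of the sample archive.

```python
def gemm_gflops(m, n, k, seconds):
    """Estimate GFLOP/s for an (m x n) by (n x k) matrix multiplication.

    Uses the usual count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: the K = 64 case from the example taking 0.5 s.
    print("{0:.2f} GFLOP/s".format(gemm_gflops(10000, 6000, 64, 0.5)))
```

In the timing loop of the example, the call would be `gemm_gflops(M, N, K, tm)`.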
Enabling Automatic Offload
If Intel® Xeon Phi™ coprocessors are available on your system, to enable Automatic Offload of computations to coprocessors, set the environment variable MKL_MIC_ENABLE to 1.
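The variable can be exported in the shell, or it can be set for a single run from a launcher script. The following is a sketch under stated assumptions: `with_automatic_offload` is a hypothetical helper, and the script assumes the variable should be present in the environment before the Python process that loads oneMKL starts, which is why it prepares an environment for a child process rather than changing the current one.

```python
import os


def with_automatic_offload(env):
    """Return a copy of an environment mapping with MKL_MIC_ENABLE set to 1."""
    updated = dict(env)
    updated["MKL_MIC_ENABLE"] = "1"
    return updated


if __name__ == "__main__":
    child_env = with_automatic_offload(os.environ)
    print("MKL_MIC_ENABLE =", child_env["MKL_MIC_ENABLE"])
    # To run a benchmark with this environment (the script name is hypothetical):
    #   import subprocess, sys
    #   subprocess.run([sys.executable, "dgemm_benchmark.py"], env=child_env)
```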
Discussion
The build steps install NumPy and SciPy in the default Python path. To install them in your home directory or another specific folder, pass --prefix=$HOME or the folder path to the build commands in steps 9 and 10 (building NumPy and building SciPy). If you install Python into $HOME, after building NumPy and before building SciPy, set the PYTHONPATH environment variable to $HOME/lib/pythonY.Z/site-packages, where Y.Z is the Python version.
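The PYTHONPATH value described above depends on the Python version, so it can be derived rather than typed. A minimal sketch, assuming the Linux layout $HOME/lib/pythonY.Z/site-packages given in the text; `home_site_packages` is a hypothetical helper name.

```python
import os
import posixpath
import sys


def home_site_packages(prefix, version=None):
    """Return <prefix>/lib/pythonY.Z/site-packages for the given (Y, Z) version.

    If version is None, the version of the running interpreter is used.
    """
    major, minor = (version or sys.version_info)[:2]
    return posixpath.join(prefix, "lib", f"python{major}.{minor}", "site-packages")


if __name__ == "__main__":
    home = os.path.expanduser("~")
    # Print a shell line that can be pasted before building SciPy.
    print("export PYTHONPATH=" + home_site_packages(home))
```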
Specific instructions in step 3 for selecting the Visual Studio* mode for your Intel64 build binaries depend on the Windows version. For example:
On Windows 7, go to All Programs -> Intel Parallel Studio XE 20XX -> Command Prompt and select Intel64 Visual Studio 20XX mode, where 20XX is the version of Visual Studio, such as 2013.
The code example calls dgemm, the most common matrix-matrix multiplication routine, through SciPy and uses NumPy arrays to create and initialize the input matrices. If NumPy and SciPy are built with oneMKL, this code calls the oneMKL BLAS dgemm routine.
If Intel® Xeon Phi™ coprocessors are available on your system, some oneMKL routines can take advantage of the coprocessors (for the list of Automatic Offload enabled oneMKL functions, see [AO]). If Automatic Offload is enabled, these routines split the computations between the host CPU(s) and coprocessor(s).