This is a getting started guide for the Python* API (daal4py) of the Intel® oneAPI Data Analytics Library (oneDAL).
What is daal4py?
daal4py, included in Intel® Distribution for Python* as part of the Intel® AI Analytics Toolkit, is an easy-to-use Python* API that provides superior performance for your machine learning algorithms and frameworks. Designed for data scientists, it provides a simple way to utilize powerful oneDAL machine learning algorithms in a flexible and customizable manner. For scaling, daal4py also gives you the option to process and analyze data in batch, streaming, or distributed processing mode, allowing you to choose the mode that best fits your system's needs.
daal4py is best known as a way to accelerate machine learning algorithms from scikit-learn*; however, this guide shows you how to use the daal4py algorithms directly.
Installing daal4py
There are several methods you can use to install the daal4py package.
You can install daal4py as part of Intel® Distribution for Python powered by Intel® oneAPI, which includes numerous accelerated Python* packages and applications. Follow the installation instructions appropriate for your system.
To install daal4py as part of Intel® Distribution for Python directly, use the following Anaconda* command:
conda install -c https://software.repos.intel.com/python/conda/ intelpython3_full python=3.x
Note that the "x" in "python=3.x" indicates the minor version of Python* you would like to install.
For example, to create a new environment named idp with Python* version 3.7:
conda create -n idp -c https://software.repos.intel.com/python/conda/ intelpython3_full python=3.7
You can also install daal4py as part of the Intel® AI Analytics Toolkit, which includes daal4py as part of Intel® Distribution for Python along with other high performance machine learning and deep learning packages.
To install the daal4py package directly, use the following Anaconda* command:
conda install -c https://software.repos.intel.com/python/conda/ daal4py
This will install daal4py along with any necessary dependency packages.
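To verify that the installation succeeded, you can run a simple import check from the command line:
python -c "import daal4py; print('daal4py imported successfully')"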
Working with daal4py
Batch Processing
For small quantities of data, you can provide all input data at once using batch processing mode. Batch processing is daal4py's default processing mode, so no changes need to be made to your daal4py code in order to run it.
daal4py Batch Processing Code Example
In this example, you will use batch processing to create a linear regression model and use it for predicting prices of houses in Boston based on the features of each house.
Start by importing all necessary data and packages. You will also create any directories needed to store your model and results:
##### daal4py linear regression example for shared memory systems #####
import daal4py as d4p
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import pickle
import os
# making all necessary directories
try:
    os.mkdir("./models")
    os.mkdir("./results")
except FileExistsError:
    pass
Now load in the dataset and organize it as necessary to work with your model:
# loading in the data
data = load_boston()
# organizing variables used in the model for prediction
X = data.data # house characteristics
y = data.target[np.newaxis].T # house price
# splitting the data for training and testing, with a 25% test dataset size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1693)
Train the model and use it for prediction:
# training the model for prediction
train_result = d4p.linear_regression_training().compute(X_train, y_train)
# now predicting the target feature(s) using the trained model
y_pred = d4p.linear_regression_prediction().compute(X_test, train_result.model).prediction
Look at the results:
print(y_pred)
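Since the example imports pickle and creates the models and results directories, you can also persist the trained model and predictions for later use. The following is a minimal sketch; the file names are illustrative:
# saving the trained model to disk for later use
with open("./models/linear_regression_batch.pkl", "wb") as f:
    pickle.dump(train_result.model, f)
# saving the predictions alongside it
np.savetxt("./results/linear_regression_batch_predictions.csv", y_pred, delimiter=",")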
Stream Processing
For large quantities of data, it may be infeasible to provide all input data at once, unlike in batch processing. This might be because the data resides in multiple files and merging it into one is too costly. In other cases, the dataset may be too large to load completely into memory. The data being processed may also be an actual stream. daal4py's streaming mode allows you to process these types of data quickly and easily.
Besides supporting these use cases, streaming also allows interleaving I/O operations with computation.
daal4py Stream Processing Code Example
In this example, you will use stream processing to create a linear regression model and use it for prediction. Download the following open-source data set to ensure the example runs properly.
# the files we will be reading in as chunks for training
infiles = [
    "./data/streaming_data/linear_regression_train_1.csv",
    "./data/streaming_data/linear_regression_train_2.csv",
    "./data/streaming_data/linear_regression_train_3.csv",
    "./data/streaming_data/linear_regression_train_4.csv",
    "./data/streaming_data/linear_regression_train_5.csv",
]
# configure the linear regression object for streaming
train_algo = d4p.linear_regression_training(interceptFlag=True, streaming=True)
# now feed chunks
for file in infiles:
    chunk = pd.read_csv(file)
    indep_data = chunk.drop(["target"], axis=1)  # house characteristics
    dep_data = chunk["target"]  # house price
    # feed the chunk to the streaming training algorithm
    train_algo.compute(indep_data, dep_data)
# all chunks are done, now finalize the computation
train_result = train_algo.finalize()
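After finalizing, the streamed-trained model can be used for prediction just as in batch mode. Here is a minimal sketch, assuming a held-out test file at an illustrative path:
# predicting with the model trained via streaming
test_data = pd.read_csv("./data/streaming_data/linear_regression_test.csv").drop(["target"], axis=1)
predict_result = d4p.linear_regression_prediction().compute(test_data, train_result.model)
print(predict_result.prediction)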
Distributed Processing
daal4py operates in Single Program Multiple Data (SPMD) style, which means your program is executed on several processes (e.g. similar to MPI). The use of MPI is not required for daal4py's SPMD mode to work; all necessary communication and synchronization happen through daal4py. However, it is possible to use daal4py and mpi4py in the same program.
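For instance, here is a minimal sketch of mixing the two in one program, assuming both are launched under the same mpirun command:
import daal4py as d4p
from mpi4py import MPI

d4p.daalinit()  # initialize daal4py's distribution engine
# under the same mpirun launch, daal4py's process id corresponds to the MPI rank
print("daal4py procid:", d4p.my_procid(), "MPI rank:", MPI.COMM_WORLD.Get_rank())
d4p.daalfini()  # stop the distribution engine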
Only very minimal changes are needed to your daal4py code to allow daal4py to run on a cluster of workstations.
daal4py Distributed Processing Code Example
In this example, you will use distributed processing to create a linear regression model and use it for prediction. Download the following open-source data set to ensure the example runs properly.
To run the code in distributed mode, download the following code as a .py file and then run the following command (the number 4 means it will run on 4 processes):
mpirun -n 4 python ./linear_regression_spmd.py
Running daal4py in distributed mode is similar to running it in batch mode, with a few modifications.
To load in your data and initialize the distribution engine:
# load the dataset; in distributed mode, each process reads its own uniquely numbered file
d4p.daalinit() # initializes the distribution engine
# organizing variables used in the model for prediction
# each process gets its own data
infile = "./data/distributed_data/linear_regression_train_" + str(d4p.my_procid()+1) + ".csv"
# read data
indep_data = pd.read_csv(infile).drop(["target"], axis=1) # house characteristics
dep_data = pd.read_csv(infile)["target"] # house price
To train your model in distributed mode:
# training the model for prediction
train_result = d4p.linear_regression_training(distributed=True).compute(indep_data, dep_data)
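Once trained, the model can be used for prediction. The following is a minimal sketch that predicts on the root process only, assuming a local test file at an illustrative path:
# predicting on the root process only
if d4p.my_procid() == 0:
    test_data = pd.read_csv("./data/distributed_data/linear_regression_test.csv").drop(["target"], axis=1)
    y_pred = d4p.linear_regression_prediction().compute(test_data, train_result.model).prediction
    print(y_pred)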
At the end of your code, do not forget to turn off the distribution engine:
d4p.daalfini() # stops the distribution engine
To learn more about daal4py, please refer to the daal4py documentation.
For even faster application performance, get started with Intel® oneAPI.