In this video, Intel Cloud Software Engineer Ben Olson demonstrates how to accelerate the end-to-end machine learning workflow by plugging in open-source libraries available independently or in the Intel® AI Analytics Toolkit (AI Kit).
Data scientists add value to machine learning workflows by exploring large datasets, experimenting, and tuning their models. Speeding every aspect of their end-to-end workflow can help maximize productivity. This starts by optimizing the preparation of data through extraction, transformation, and loading (ETL) tasks where data scientists can waste precious time iterating. Next, accelerating the computationally intensive tasks of training the model before running prediction or classification can make experimentation more productive. Finally, eliminating hidden inefficiencies along the way can let data scientists focus on their highest-value tasks.
Some of these inefficiencies arise when data scientists must change tools as they scale from a subset of the data on a local machine to the whole dataset on cloud infrastructure. Scaling up often means installing new software and changing code to use different APIs.
Pandas is a popular Python library used in the ETL stages, but it uses only a single processing core. Data scientists often start with pandas on a subset of the data. As they scale to the entire dataset, they must often rewrite their code to use a distributed processing engine such as Dask* or Ray*.
Modin* is an open-source drop-in replacement for pandas that uses all the processing cores on a system. You can install it in an Anaconda environment, then just change your pandas import statement:
# import pandas as pd
import modin.pandas as pd
By simply changing this import statement, you can accelerate your ETL tasks on your local workstation or move to a cloud platform without changing the rest of your code. Modin automates distributed processing using Dask or Ray, and the Intel® Distribution of Modin* adds support for HEAVY.AI* and for Intel® Optane™ persistent memory.
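As a minimal sketch of what the import swap looks like in a small ETL step, the example below filters out edge cases and adds a feature. The column names and values are hypothetical stand-ins for taxi-style trip data, not the dataset used in the video, and the fallback to stock pandas is an assumption added here so the same script runs whether or not Modin is installed.

```python
# Use Modin when it is installed; otherwise fall back to stock pandas.
# The rest of the script is identical in both cases.
# (Modin's engine can optionally be chosen before import, for example by
# setting the MODIN_ENGINE environment variable to "ray".)
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical, hand-made rows standing in for taxi trip records.
df = pd.DataFrame(
    {
        "trip_distance": [1.2, 0.0, 3.5, 250.0, 7.1],
        "fare_amount": [6.5, 2.5, 14.0, 900.0, 22.0],
        "passenger_count": [1, 1, 2, 1, 3],
    }
)

# Filter out edge cases: zero-length trips and implausibly long ones.
mask = (df["trip_distance"] > 0) & (df["trip_distance"] < 100)
clean = df[mask].copy()

# Add a derived feature.
clean["fare_per_mile"] = clean["fare_amount"] / clean["trip_distance"]

print(clean)
```

Because the DataFrame calls are the same in both libraries, the filter and feature steps do not change when the import does.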
The Demo
In this demo, you’ll learn:
- How Intel-optimized engines can be plugged in to accelerate computationally intensive tasks. All of these can be installed using Python* pip or Anaconda*, and they all use the same APIs as the stock Python engines, minimizing code changes.
- How to utilize the Ray engine, plus the speedup you can achieve in typical ETL tasks such as filtering out edge cases and adding features using the New York City taxi dataset.
- Training and prediction performance for the XGBoost model using XGBoost Optimized for Intel® Architecture versus the standard Python pip-installed xgboost library.
- A similar comparison for training and prediction of Ridge regression using Intel® Extension for Scikit-learn* optimizations.
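The Ridge regression item above can be sketched as follows, assuming the Intel Extension for Scikit-learn is installed as the `sklearnex` package. The synthetic data and coefficients are made up for illustration, and the fallback to stock scikit-learn is an assumption added here so the script still runs without the extension.

```python
# Enable the Intel optimizations if the extension is available.
# patch_sklearn() should be called before importing scikit-learn estimators.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # stock scikit-learn is used unchanged

import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical data, not from the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.01, size=1000)

# Training and prediction use the usual scikit-learn API.
model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(X)

print(model.coef_)
```

The estimator code is the same with or without the patch, which is what keeps the code changes small.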
The overall end-to-end speedup was about 1.5x for the three steps shown. Your actual time savings will vary with how often you iterate on each task and with whether you are working on a subset of the data or the full dataset.
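To see why the end-to-end figure depends on how often each step is repeated, here is a small sketch that computes overall speedup from per-step timings. The helper name `end_to_end_speedup` and all timings are hypothetical placeholders chosen for the arithmetic; they are not measurements from the demo.

```python
def end_to_end_speedup(baseline, optimized, iterations=None):
    """Return total baseline time divided by total optimized time.

    baseline, optimized: dicts mapping step name -> seconds per run.
    iterations: optional dict mapping step name -> number of repetitions
                (steps not listed are counted once).
    """
    iterations = iterations or {}
    total_base = sum(t * iterations.get(step, 1) for step, t in baseline.items())
    total_opt = sum(optimized[step] * iterations.get(step, 1) for step in baseline)
    return total_base / total_opt


# Placeholder timings in seconds. NOT results from the video.
baseline = {"etl": 40.0, "train": 15.0, "predict": 5.0}
optimized = {"etl": 16.0, "train": 10.0, "predict": 4.0}

# One pass through every step: 60 s versus 30 s.
single_pass = end_to_end_speedup(baseline, optimized)

# Iterating on ETL five times: 220 s versus 94 s.
etl_heavy = end_to_end_speedup(baseline, optimized, {"etl": 5})

print(single_pass, etl_heavy)
```

With these placeholder numbers, repeating the step that benefits most raises the overall ratio, which is why savings differ from one workflow to another.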
Because these libraries are available in Anaconda and require little or no code change, it’s easy to set up an Anaconda environment and test them out on your own data and models. You can learn more about these and other engine optimizations available in Intel AI and machine learning development tools and resources.