Build an End-to-End Machine Learning Workflow on Census Data Using Modin* and scikit-learn*


Classical machine learning means learning from data using classical mathematical algorithms such as linear regression, logistic regression, and decision trees to complete a particular task without being explicitly programmed. These algorithms are used for various real-world applications across financial services, retail, manufacturing, healthcare, and several other industries. Compared to deep learning, they are better suited to use cases that do not require large computational power and where the available data is limited and results depend heavily on its quality.

The major steps involved in building an end-to-end machine learning model are:

  1. Data preparation and preprocessing
  2. Model training
  3. Model evaluation

For every stage of a machine learning workflow, Intel offers optimized frameworks and tools such as Intel® Distribution of Modin*, Intel® Extension for Scikit-learn*, gradient-boosting optimizations from Intel, and Intel® Distribution for Python*.

In this article, we present a code sample on how to build and run an end-to-end machine learning workload using Intel Distribution of Modin and Intel Extension for Scikit-learn with US census data from 1970 to 2010.

How Can Intel® Tools Help?

The most crucial challenges in developing machine learning applications are data collection, cleaning, and preprocessing. These steps can be time-consuming and resource-intensive. Intel Distribution of Modin helps address these issues. Modin is a drop-in replacement for pandas that speeds up data preparation and manipulation. The Intel distribution adds optimizations to further accelerate processing on Intel hardware and can process terabytes of data on a single workstation.
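
To make the drop-in claim concrete, here is a minimal sketch on a few hand-made rows. It is written against stock pandas so that it runs without extra setup; assuming Modin is installed, changing the import to `import modin.pandas as pd` should be the only edit needed. The column names mirror the census sample used in this article, but the values and the `clean` helper are invented for illustration.

```python
# Stock pandas is used here so the sketch runs without extra setup.
# Assumption: with Modin installed, replacing this line by
#   import modin.pandas as pd
# leaves the rest of the code unchanged.
import pandas as pd

# Hand-made rows (illustrative values, not real census data)
df = pd.DataFrame({
    "YEAR": [1970, 1980, 1990, 2000],
    "INCTOT": [5000, 9999999, 30000, 40000],  # 9999999 marks an invalid income
    "CPI99": [4.0, 2.0, 1.5, 1.0],            # made-up inflation factors
    "EDUC": [6, 7, -1, 10],                   # -1 marks an invalid education code
})


def clean(frame):
    """Hypothetical helper: drop invalid rows, then adjust income for inflation."""
    frame = frame[frame["INCTOT"] != 9999999]
    frame = frame[frame["EDUC"] != -1].copy()
    frame["INCTOT"] = frame["INCTOT"] * frame["CPI99"]
    return frame


cleaned = clean(df)
print(cleaned[["YEAR", "INCTOT", "EDUC"]])
```

Only the 1970 and 2000 rows survive the filters, and their incomes are scaled by the corresponding `CPI99` value.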

Another significant challenge in building machine learning projects is improving training and inference performance. scikit-learn is one of the most widely used classical machine learning frameworks because of its many algorithms and user-friendly interface. Intel provides Intel Extension for Scikit-learn to speed up scikit-learn workflows and applications on Intel® architectures across single-node and multi-node configurations.

Code Sample

The code sample shows how to perform extract, transform, and load (ETL) operations using Intel Distribution of Modin and how to run the ridge regression algorithm using the Intel Extension for Scikit-learn library. In this sample, Intel Distribution of Modin is used to ingest and process US census data from 1970 to 2010 to build a ridge regression-based model that finds the relationship between education and total income earned in the US.
 

  1. Download the dataset to a local disk using the following command. The dataset contains information about the US population and covers various statistics such as year, sex, education, and income.
    !wget https://storage.googleapis.com/intel-optimized-tensorflow/datasets/ipums_education2income_1970-2010.csv.gz
  2. Import NumPy and Modin, and set heterogeneous data kernels (HDK) as the compute engine. The HDK engine provides a set of components for federating analytic queries to an execution backend based on OmniSciDB, which gives high single-node scalability for a specific set of data frame operations.
    import numpy as np
    import modin.pandas as pd
    import modin.config as cfg
    cfg.StorageFormat.put('hdk')
    
  3. Import Intel Extension for Scikit-learn. It dynamically patches scikit-learn estimators to use Intel® oneAPI Data Analytics Library as the underlying solver, which allows us to get the same solution faster.
    from sklearnex import patch_sklearn
    patch_sklearn()
    from sklearn import config_context
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    import sklearn.linear_model as lm
  4. Load the downloaded dataset as a DataFrame.
    df = pd.read_csv('ipums_education2income_1970-2010.csv.gz')
    
  5. Prepare the dataset. Run ETL operations to make sure the dataset can be easily consumed by the regression algorithm. We only keep the columns that are relevant to our analysis.
    keep_cols = ["YEAR", "DATANUM", "SERIAL", "CBSERIAL", "HHWT", "CPI99", "GQ", "PERNUM", "SEX", "AGE", "INCTOT", "EDUC", "EDUCD", "EDUC_HEAD", "EDUC_POP", "EDUC_MOM", "EDUCD_MOM2", "EDUCD_POP2", "INCTOT_MOM", "INCTOT_POP", "INCTOT_MOM2", "INCTOT_POP2", "INCTOT_HEAD", "SEX_HEAD",]
    df = df[keep_cols]
  6. Clean up the samples with invalid values for income and education.
    df = df[df["INCTOT"] != 9999999]
    df = df[df["EDUC"] != -1]
    df = df[df["EDUCD"] != -1]
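  7. Normalize the income to account for yearly inflation, replace missing values with -1, convert all columns to float64, and split the data into the features X and the target y.
    # Scale income by CPI99 to adjust for inflation
    df["INCTOT"] = df["INCTOT"] * df["CPI99"]
    
    # Fill missing values and use a uniform numeric type
    for column in keep_cols:
        df[column] = df[column].fillna(-1)
        df[column] = df[column].astype("float64")
    
    # EDUC is the target; CPI99 has already been applied to INCTOT
    y = df["EDUC"]
    X = df.drop(columns=["EDUC", "CPI99"])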
  8. Train the model and run the prediction. Start by declaring the regression model. This code sample uses the Ridge model, which solves a regression problem where the loss function is the linear least squares function and regularization is given by the l2-norm.
    clf = lm.Ridge()
  9. Prepare all parameters needed for the training and inference.
    mse_values, cod_values = [], []
    N_RUNS = 50
    TRAIN_SIZE = 0.9
    random_state = 777
    
    X = np.ascontiguousarray(X, dtype=np.float64)
    y = np.ascontiguousarray(y, dtype=np.float64)
  10. Run training and inference. Use cross validation by repeated random splitting (loop the process 50 times, each time with a different train/test split) so that the reported metrics do not depend on one particular split. Averaging over many splits reduces the chance that a single favorable or unfavorable test set skews the result.
    for i in range(N_RUNS):
        # Use a different seed for each split
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=TRAIN_SIZE, random_state=random_state)
        random_state += 777
    
        # training
        with config_context(assume_finite=True):
            model = clf.fit(X_train, y_train)
    
        # inference
        y_pred = model.predict(X_test)
    
        mse_values.append(mean_squared_error(y_test, y_pred))
        cod_values.append(r2_score(y_test, y_pred))
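  11. Check the regression results by computing the mean and sample standard deviation of the mean squared error (MSE) and the coefficient of determination (COD, or R²) across all runs, and print them.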
    mean_mse = sum(mse_values)/len(mse_values)
    mean_cod = sum(cod_values)/len(cod_values)
    mse_dev = pow(sum([(mse_value - mean_mse)**2 for mse_value in mse_values])/(len(mse_values) - 1), 0.5)
    cod_dev = pow(sum([(cod_value - mean_cod)**2 for cod_value in cod_values])/(len(cod_values) - 1), 0.5)
    
    print("mean MSE ± deviation: {:.9f} ± {:.9f}".format(mean_mse, mse_dev))
    print("mean COD ± deviation: {:.9f} ± {:.9f}".format(mean_cod, cod_dev))
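
Steps 8 through 11 can be reproduced in miniature with NumPy alone. The sketch below is not the code path used by scikit-learn or the Intel extension. It assumes synthetic data, solves ridge regression through its closed-form normal equations (without an intercept, for brevity), and uses the hypothetical helpers `ridge_fit` and `random_split`. It is meant only to show what the loop measures: the model is refit on a fresh random 90/10 split in each run, and the spread of MSE and R² across runs is summarized by the sample standard deviation.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def random_split(n_samples, train_size, rng):
    """Return shuffled train and test index arrays."""
    order = rng.permutation(n_samples)
    n_train = int(train_size * n_samples)
    return order[:n_train], order[n_train:]


# Synthetic data: y depends linearly on three features plus noise
rng = np.random.default_rng(777)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=500)

mse_values, cod_values = [], []
for _ in range(20):
    train, test = random_split(len(X), 0.9, rng)
    w = ridge_fit(X[train], y[train])
    y_pred = X[test] @ w

    residual = y[test] - y_pred
    mse = np.mean(residual ** 2)
    # R^2 = 1 - SS_res / SS_tot
    cod = 1.0 - np.sum(residual ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
    mse_values.append(mse)
    cod_values.append(cod)

# ddof=1 gives the sample standard deviation, matching the (n - 1) divisor in step 11
print("mean MSE ± deviation: {:.6f} ± {:.6f}".format(np.mean(mse_values), np.std(mse_values, ddof=1)))
print("mean COD ± deviation: {:.6f} ± {:.6f}".format(np.mean(cod_values), np.std(cod_values, ddof=1)))
```

Because each run draws a different split, a deviation that is small relative to the mean suggests the metrics are not an artifact of one lucky split.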

This code sample shows how to implement an end-to-end workflow for a classical machine learning use case and illustrates how Intel-optimized frameworks help users achieve the best performance on Intel hardware.

What’s Next?

You have worked through a code sample that builds an end-to-end census workload using AI Tools without any external dependencies. You can also watch the webinar to learn more about accelerating data preprocessing, training, and inference with Intel Distribution of Modin and Intel Extension for Scikit-learn.

Access and try the AI Tools for yourself to build additional end-to-end AI applications. We also encourage you to check out Intel’s other AI and machine learning framework optimizations and end-to-end portfolio of tools and to incorporate them into your AI workflow. You can also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of the Intel AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.