Recommendation systems are a crucial part of our daily digital experience, transforming how users interact with digital platforms by surfacing relevant and interesting information based on data. These systems help users discover products, content, and services tailored to their preferences. Examples abound, ranging from Netflix and Amazon to Zillow and Goodreads.
In this article, we will show you how to build an end-to-end job recommendation system using Intel® Extension for TensorFlow*, from initial exploratory data analysis to serving relevant job opportunities. Whether you are a data scientist, an engineer, or a developer, this guide provides step-by-step explanations for creating an efficient recommendation system with easy-to-use tools and optimizations.
TensorFlow* Optimizations from Intel
Intel collaborates with Google* to upstream most optimizations into open source TensorFlow, with the newest optimizations and features released first in Intel® Extension for TensorFlow*. These optimizations can be enabled with a few lines of code and accelerate TensorFlow-based training and inference performance on Intel CPU and GPU hardware.
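As a minimal sketch, enabling the extension can look like the snippet below. The import is guarded so the script still runs on machines without the extension installed; itex.get_backend() is used here only to report which device backend is active.

```python
# Minimal sketch of enabling Intel Extension for TensorFlow.
# The import is guarded so the script also runs without the extension.
try:
    import intel_extension_for_tensorflow as itex
    itex.experimental_ops_override()   # swap in Intel-optimized custom ops
    backend = itex.get_backend()       # e.g. "CPU" or "GPU"
except ImportError:
    backend = "stock TensorFlow"       # extension not installed

print(f"Running on: {backend}")
```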
Code Sample Workflow
This code sample demonstrates the end-to-end workflow to build a job recommendation system. It consists of four parts:
- Data exploration and visualization – provides information on what the dataset looks like, the main features of the data, and the data distribution.
- Data cleaning and pre-processing - removal of duplicates, steps required for text pre-processing.
- Fraudulent job postings removal - identify which job postings are fake using a Long Short-Term Memory deep neural network (LSTM DNN) and filter them out. LSTMs are based on a recurrent neural network architecture and are designed to recognize patterns in sequences of data; memory cells that maintain information over extended time intervals give them the ability to remember long-term dependencies.
- Job recommendation - calculates and provides the top-n job descriptions most similar to the chosen one.
Let’s now review each of the parts in detail.
I. Data exploration and visualization
1. Load the dataset - We will use the Real or Fake: Fake Job Postings dataset, available through the Hugging Face datasets library.
2. Analyze and understand the data - we will transform it into a pandas DataFrame.
3. Remove duplicates in the dataset - use the drop_duplicates method.
4. Visualize the dataset - Text data can be challenging to visualize. The wordcloud library shows common words in the analyzed texts: the bigger a word appears, the more often it occurs in the text. In our example, we will create a word cloud for job titles to get a high-level overview of the job postings we are working with.
5. Show the top-n most common values in a given column, or the distribution of values in that column - for example, the top 10 most common job titles; the same can be done for any column.
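The exploration steps above can be sketched with pandas on a tiny stand-in sample. The records below are fabricated for illustration; the real code would load the dataset through the Hugging Face datasets library instead.

```python
# Sketch of the exploration steps on a fabricated stand-in sample.
# The real sample loads the "Real or Fake: Fake Job Postings" dataset
# via the Hugging Face `datasets` library and converts it to pandas.
import pandas as pd

# Stand-in records with one exact duplicate row.
df = pd.DataFrame({
    "title": ["Data Scientist", "ML Engineer", "Data Scientist", "Data Scientist"],
    "fraudulent": [0, 0, 0, 1],
})

df = df.drop_duplicates()           # step 3: remove exact duplicate rows
print(len(df))                      # 3 rows remain

# Step 4 (word cloud) would be, with the wordcloud library:
#   from wordcloud import WordCloud
#   cloud = WordCloud().generate(" ".join(df["title"]))

# Step 5: top-n most common values in a column.
top_titles = df["title"].value_counts().head(10)
print(top_titles.index[0])          # the most common job title
```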
II. Data cleaning and pre-processing
1. Data cleaning and pre-processing - For text, cleaning and pre-processing usually include the removal of stop words, special characters, numbers, and any additional noise such as hyperlinks. We will first combine all relevant columns into a single new text column.
2. Pre-process the data by removing whitespace characters (newlines, carriage returns, and tabs), URLs, special characters, and digits. At the end, we will transform all the text to lower case and remove stop words.
3. Lemmatization - the process of reducing a word to its root form, called a lemma.
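The cleaning steps above can be sketched with the standard-library re module. The stop-word list here is a small illustrative subset (a real pipeline would typically use a full list such as NLTK's), and lemmatization is shown only as a comment.

```python
import re

# Illustrative subset of English stop words; a real pipeline would use
# a full list (e.g. NLTK's stopwords corpus).
STOP_WORDS = {"a", "an", "the", "is", "in", "for", "and", "to", "of"}

def clean_text(text: str) -> str:
    """Apply the cleaning steps from the section, in order."""
    text = re.sub(r"[\n\r\t]+", " ", text)               # whitespace chars
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)             # special chars, digits
    text = text.lower()                                  # lower case
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_text("Visit https://example.com to apply for 100 jobs!!!"))
# -> "visit apply jobs"

# Lemmatization (step 3) is typically done with NLTK:
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize("positions")  # -> "position"
```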
III. Fraudulent job postings removal
Nowadays, not all job offers posted on popular portals are genuine. Some job postings are created solely to collect personal data, so detecting fake postings is essential. To do so, we will create a bidirectional LSTM model with one-hot encoding.
1. Import all the necessary libraries and make sure you are using TensorFlow version 2.15.0.
2. Now, import Intel® Extension for TensorFlow*. We use the Python API itex.experimental_ops_override(), which automatically replaces some TensorFlow operators with custom operators under the itex.ops namespace while remaining compatible with existing trained parameters.
3. Prepare data for the model we create - assign job_postings to X and fraudulent values to y (expected value).
4. One-hot encoding - a technique for representing categorical variables as numerical values.
5. Create the model - we are creating a deep neural network using a bidirectional LSTM. The architecture has an embedding layer, a bidirectional LSTM layer, a dropout layer, and a dense layer with a sigmoid activation. We use the Adam optimizer with binary cross-entropy loss. If the Intel® Extension for TensorFlow* backend is XPU, tf.keras.layers.LSTM is replaced by itex.ops.ItexLSTM.
6. Split our data into training and testing datasets.
7. Train the model - we will use the standard model.fit() method and provide the training and testing datasets.
8. The values returned by the model are in the range [0, 1], so we need to map them to integer values of 0 or 1.
9. To demonstrate the effectiveness of our model, we present the confusion matrix and classification report available in the scikit-learn library.
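The steps above can be sketched as follows. The hyperparameters are illustrative rather than the sample's actual values, the word-to-index encoding is a simple hashing stand-in for one-hot index encoding, and the two input texts are fabricated; the model is untrained here, so its scores are arbitrary.

```python
# Sketch of the bidirectional-LSTM fraud detector described above.
import numpy as np
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 5000, 40, 32  # illustrative hyperparameters

def encode(text: str) -> list[int]:
    # Hashing stand-in for one-hot index encoding of each word.
    return [hash(w) % (VOCAB_SIZE - 1) + 1 for w in text.lower().split()]

texts = ["data scientist remote position",
         "earn money fast send your bank details"]
X = tf.keras.utils.pad_sequences([encode(t) for t in texts], maxlen=MAX_LEN)

# Embedding -> bidirectional LSTM -> dropout -> dense sigmoid, as described.
# With the XPU backend of Intel Extension for TensorFlow,
# tf.keras.layers.LSTM is transparently replaced by itex.ops.ItexLSTM.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# After training with model.fit(...) on a train/test split, the sigmoid
# scores in [0, 1] are mapped to hard 0/1 labels with a 0.5 threshold:
probs = model.predict(X, verbose=0)
labels = (probs > 0.5).astype(int).ravel()
```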
IV. Job recommendation
For job recommendations, we will use a much simpler approach. To show similar job postings, we will use classical machine learning techniques.
1. Filter fake job postings.
2. Create a common column containing the relevant text parameters. It will be used to compare job postings with one another and make recommendations.
3. Prepare recommendations - we will use sentence similarity based on the prepared text column in our dataset.
4. Prepare functions to show similarities between given sentences using a heat map.
5. Moving back to the dataset: first, we use a sentence-encoding model to be able to calculate similarities.
6. Then, we choose the job posting we want to calculate similarities for. In our case it is the first job posting in the dataset, but you can easily change it to any other by changing the value of the index variable.
7. Based on the calculated similarities, we can show the top 5 most similar job postings by sorting them according to the calculated similarity value.
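Steps 5-7 can be sketched with NumPy. Here, embeddings is a random stand-in for the vectors a pretrained sentence-encoding model would produce (one row per posting); the function name top_n_similar is introduced for this illustration.

```python
import numpy as np

# Stand-in for sentence-encoder output: 8 postings, 16-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))

def top_n_similar(embeddings: np.ndarray, index: int, n: int = 5) -> list[int]:
    """Return indices of the n postings most similar (cosine) to `index`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[index]   # cosine similarity to the chosen posting
    order = np.argsort(-sims)       # most similar first
    return [int(i) for i in order if i != index][:n]

# Step 4's heat map would visualize the full pairwise matrix
# (normed @ normed.T), e.g. with seaborn's heatmap.

recommended = top_n_similar(embeddings, index=0, n=5)
print(recommended)  # indices of the 5 most similar postings
```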
In this code sample, we explored and analyzed the dataset, pre-processed the data, and created a fake job posting detection model. At the end, we used sentence similarity to show the top 5 recommendations - the job descriptions most similar to the chosen one.
What’s Next?
Leverage the most up-to-date Intel software and hardware optimizations for TensorFlow to accelerate AI workload performance on Intel hardware, not just for recommender systems but also for computer vision, natural language processing, and generative AI applications.
We also encourage you to check out Intel's other AI/ML framework optimizations and tools and incorporate them into your AI workflow. Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, which helps you prepare, build, deploy, and scale your AI solutions.