Building a Job Recommendation System with TensorFlow*: A Complete Deep Learning Workflow


Ramya Ravi, AI Software Marketing Engineer, Intel | LinkedIn
Urszula Zofia Gumińska, AI Software Development Engineer, Intel | LinkedIn
Chandan Damannagari, Director, AI Software, Intel | LinkedIn

Recommendation systems are a crucial part of our daily digital experience, transforming how users interact with digital platforms by surfacing relevant and interesting information based on data. These systems help users discover products, content, and services tailored to their preferences. Examples abound, ranging from Netflix and Amazon to Zillow and Goodreads.

In this article, we will show you how to build an end-to-end job recommendation system using Intel® Extension for TensorFlow*, from initial exploratory data analysis to serving relevant job opportunities. Whether you are a data scientist, an engineer, or a developer, this guide provides step-by-step explanations for creating an efficient recommendation system with easy-to-use tools and optimizations.

TensorFlow* Optimizations from Intel

Intel collaborates with Google* to upstream most optimizations into open source TensorFlow, with the newest optimizations and features released first in Intel® Extension for TensorFlow*. These optimizations can be enabled with a few lines of code and accelerate TensorFlow-based training and inference performance on Intel CPU and GPU hardware.
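As a rough sketch (assuming a pip-based Python environment; the exact package extra depends on whether you are targeting Intel CPUs or GPUs), enabling the extension typically looks like this:

# Install the extension first, choosing the extra that matches your hardware, for example:
#   pip install --upgrade intel-extension-for-tensorflow[cpu]
#   pip install --upgrade intel-extension-for-tensorflow[xpu]

import tensorflow as tf
import intel_extension_for_tensorflow as itex

# Replace supported stock TensorFlow operators with Intel custom operators
itex.experimental_ops_override()

print(tf.__version__)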

Code Sample Workflow

This code sample demonstrates the end-to-end workflow to build a job recommendation system. It consists of four parts:

  1. Data exploration and visualization - shows what the dataset looks like, the main features of the data, and the data distribution.
  2. Data cleaning and pre-processing - removal of duplicates and the steps required for text pre-processing.
  3. Fraud job postings removal - detects which job postings are fake using a Long Short-Term Memory deep neural network (LSTM DNN) and filters them out. An LSTM is based on the recurrent neural network architecture and is designed to recognize patterns in sequences of data, with the ability to remember long-term dependencies through memory cells that maintain information over extended time intervals.
  4. Job recommendation - calculates and returns the top-n job descriptions most similar to the chosen one.

Let’s now review each of the parts in detail.

I. Data exploration and visualization


1. Load the dataset - We will use the Real or Fake: Fake Job Posting Prediction dataset, available through the Hugging Face datasets library.

from datasets import load_dataset

dataset = load_dataset("victor/real-or-fake-fake-jobposting-prediction")
dataset = dataset['train']

2. Analyze and understand the data - we will transform it into a pandas DataFrame and inspect it.

import pandas as pd

df = dataset.to_pandas()
df.head()
df.tail()
df.info()

3. Remove duplicates in the dataset - use the drop_duplicates method.

df = df.drop(columns=['job_id'])
df = df.drop_duplicates()
print(df.duplicated().sum())

4. Visualize the dataset - Visualizing text data can be challenging. The wordcloud library shows the most common words in the analyzed texts; the bigger a word appears, the more often it occurs. In our example, we will create a word cloud of job titles to get a high-level overview of the job postings we are working with.

from wordcloud import WordCloud  # module to print word cloud
from matplotlib import pyplot as plt
import seaborn as sns

# On the basis of Job Titles form word cloud
job_titles_text = ' '.join(df['title'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)

# Plotting Word Cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Job Titles')
plt.axis('off')
plt.tight_layout()
plt.show()

5. Show the top-n most common values in a given column or the distribution of the values in that column - for example, the 10 most common job titles. The same can be done for any other column.

job_title_counts = df['title'].value_counts()

# Plotting a bar chart for the top 10 most common job titles
top_job_titles = job_title_counts.head(10)
plt.figure(figsize=(10, 6))
top_job_titles.sort_values().plot(kind='barh')
plt.title('Top 10 Most Common Job Titles')
plt.xlabel('Frequency')
plt.ylabel('Job Titles')
plt.show()

II. Data cleaning and pre-processing


1. Data cleaning and pre-processing - For text, data cleaning and pre-processing usually include removing stop words, special characters, numbers, and other noise such as hyperlinks. We will first combine all relevant columns into a single new column.

# List of columns to concatenate
columns_to_concat = ['title', 'location', 'department', 'salary_range',
                     'company_profile', 'description', 'requirements', 'benefits',
                     'employment_type', 'required_experience', 'required_education',
                     'industry', 'function']

# Concatenate the values of specified columns into a new column 'job_posting'
df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

# Create a new DataFrame with columns 'job_posting' and 'fraudulent'
new_df = df[['job_posting', 'fraudulent']].copy()

2. Pre-process the data by removing whitespace characters (newlines, carriage returns, and tabs), URLs, special characters, and digits. At the end, we will convert all the text to lowercase and remove stop words.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def preprocess_text(text):
    # Remove newlines, carriage returns, and tabs
    text = re.sub('\n', '', text)
    text = re.sub('\r', '', text)
    text = re.sub('\t', '', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove special characters
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in text.split() if word.lower() not in stop_words]
    text = ' '.join(words)
    return text

new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text)

3. Lemmatization - the process of reducing a word to its root form, called a lemma; for example, 'running' and 'ran' both reduce to the lemma 'run'.

import en_core_web_sm  # spaCy English model; install it with: python -m spacy download en_core_web_sm

nlp = en_core_web_sm.load()

def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text)

III. Fraud job postings removal


Nowadays, not all job offers posted on popular portals are genuine. Some job postings are created only to collect personal data, so detecting fake postings is essential. To detect fake job postings, we will create a bidirectional LSTM model with one-hot encoding.

1. Import all the necessary libraries and make sure to use TensorFlow version 2.15.0.

from tensorflow.keras.layers import Embedding, Dense, Bidirectional, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
import tensorflow as tf

tf.__version__

2. Now, import Intel® Extension for TensorFlow*. We are using the Python API itex.experimental_ops_override(), which automatically replaces some TensorFlow operators with custom operators from the itex.ops namespace while remaining compatible with existing trained parameters.

import intel_extension_for_tensorflow as itex

itex.experimental_ops_override()

3. Prepare the data for the model - assign the job_posting column to X and the fraudulent column to y (the expected value).

X = new_df['job_posting']
y = new_df['fraudulent']

4. One-hot encoding - one-hot encoding is a technique for representing categorical variables as numerical values. Each job posting is encoded as a sequence of integer word indices and padded to a fixed length.

voc_size = 5000
onehot_repr = [one_hot(words, voc_size) for words in X]

sent_length = 40
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

5. Create the model - we are creating a deep neural network using a bidirectional LSTM. The architecture has an embedding layer, a bidirectional LSTM layer, a dropout layer, and a dense layer with a sigmoid activation. We use the Adam optimizer with binary cross-entropy loss. If the Intel® Extension for TensorFlow* backend is XPU, tf.keras.layers.LSTM is replaced by itex.ops.ItexLSTM.

embedding_vector_features = 50

model_itex = Sequential()
model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))
model_itex.add(Dropout(0.3))
model_itex.add(Dense(1, activation='sigmoid'))
model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_itex.summary())

6. Split our data into training and testing datasets.

import numpy as np
from sklearn.model_selection import train_test_split

# Convert the padded sequences and labels to NumPy arrays
X_final = np.array(embedded_docs)
y_final = np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)

7. Train the model - we will use the standard model.fit() method, providing the training data and the test set for validation.

model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)

8. The values returned by the model are in the range [0,1]. So, we need to map them to integer values of 0 or 1.

y_pred = (model_itex.predict(X_test) > 0.5).astype("int32")

9. To demonstrate the effectiveness of our model, we present the confusion matrix and classification report available in the scikit-learn library.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(conf_matrix)

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

class_report = classification_report(y_test, y_pred)
print("Classification report:")
print(class_report)

IV. Job recommendation


For job recommendations, we will use a much simpler solution. To show similar job postings, we will encode the postings as sentence embeddings and compare them with a simple similarity measure.

1. Filter fake job postings.

real = df[df['fraudulent'] == 0]

2. Create a combined column containing the relevant text fields. It will be used to compare job postings with each other and make recommendations.

real = real.fillna(value='')

# Join the relevant text fields with spaces so that words from adjacent fields do not run together
real['text'] = (real['description'] + ' ' + real['requirements'] + ' ' +
                real['required_experience'] + ' ' + real['required_education'] + ' ' +
                real['industry'])

3. Prepare recommendations - we will use sentence similarity based on the prepared text column in our dataset.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

4. Prepare functions to show similarities between given sentences using a heat map.

import numpy as np
import seaborn as sns

def plot_similarity(labels, features, rotation):
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
    message_embeddings_ = model.encode(messages_)
    plot_similarity(messages_, message_embeddings_, 90)
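As an illustrative usage of these helper functions (the example job titles below are made up for demonstration), we can visualize how semantically close a few short texts are:

# Hypothetical example sentences to illustrate the similarity heat map
messages = [
    "Senior Software Engineer",
    "Software Developer",
    "Data Scientist",
    "Marketing Intern",
]
run_and_plot(messages)
plt.show()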

5. Moving back to the dataset, we first use the sentence encoding model to compute an embedding for every job posting so that we can calculate similarities.

encodings = []
for text in real['text']:
    encodings.append(model.encode(text))

real['encodings'] = encodings

6. Then, we choose the job posting we want to calculate similarities against. In our case it is the first job posting in the dataset, but you can easily change it to any other posting by changing the value of the index variable.

index = 0
corr = np.inner(encodings[index], encodings)
real['corr_to_first'] = corr

7. Based on the calculated similarities, we can show the top 5 most similar job postings by sorting them according to the calculated similarity value.

real.sort_values(by=['corr_to_first'], ascending=False).head()
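Optionally, to make the output easier to scan, a small follow-up sketch (reusing the title column from the dataset) can print just the titles and similarity scores of the closest matches:

# Show only the title and similarity score of the 5 most similar postings
top_matches = real.sort_values(by=['corr_to_first'], ascending=False).head()
print(top_matches[['title', 'corr_to_first']])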

In this code sample, we explored and analyzed the dataset, pre-processed the data, and created a fake job posting detection model. At the end, we used sentence similarities to show the top 5 recommendations - the job descriptions most similar to the chosen one.

What’s Next?

Leverage the most up-to-date Intel software and hardware optimizations for TensorFlow to accelerate AI workload performance on Intel hardware, not just for recommender systems but also for computer vision, natural language processing, and generative AI applications.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions.

Useful resources