Introduction
Recently in my project, I faced an interesting problem. My team was asked to recommend courses to our users. In most cases, we would build a model based off historical data and attributes of the users and courses. The interesting part is that these courses were brand new, so we had no historical data for courses taken by our users, so collaborative filtering and content based models would not be useful. We decided to come up with a method to match the courses offered to information that the users and organizations submit as their goals and interests. A fuzzy or approximate search would be needed as some of the data is manually typed in by the users and may have spelling errors and typos. We also needed to determine what attribute of the courses we could use for the matching process. The owners of the courses suggested that we use tags they would supply. Unfortunately, these tags were not very discerning between the courses and were heavily applied. The next best course of action was to develop our own tags for the courses utilizing machine learning. This post will explain the methods used and walk you through an example of the entire process.
Example
Let’s step through an example to see and understand how this process works. First we will be performing some text preprocessing on the course titles, user interests, and goals. We will perform stemming which reduces words down to their word stem. This will take words like “businesses” and “business” and reduce them down to “busi”. Another example is “mathematics” will be reduced to “mathemat”. Another preprocessing applied is removing stop words such as “a”, “about”, “the”, “along”, and “of”. A list of common stop words can be found here. We will also remove any non-alphabetic characters, and make all characters lowercase. This would make the course title of “Introduction to Statistics: How Mathematics Can Make Decisions” reduce to “introduct statist mathemat can make decis ".
Suppose we have three courses and two users. Through the process described we will end up with:
Course Titles | Preprocessed Titles |
Introduction to Statistics: How Mathematics Can Make Decisions | introduct statist mathemat can make decis |
Business Calculus: The Mathematics Behind Business | busi calculus mathemat behind busi |
Introduction to Algebra: The Study of Functions and Equations | introduct algebra studi function equat |
User | Interests | Goals |
User1 | New Businesses, Data, Analysis | I want to make better decisions based on statistical data. |
User2 | Functions, Equations, Mathematics | I want a refresher course in arithmetic and algebra. |
Now we will calculate the tf-idf for each word in the course titles. For this process we decided to use the mean number of words in the titles to limit the maximum number of tags generated. With the preprocessed titles above, we have a mean of five words. This results in the following tags.
Course Titles | Tags |
Introduction to Statistics: How Mathematics Can Make Decisions | statist, can, make, decis, introduct |
Business Calculus: The Mathematics Behind Business | busi, calculus, mathemat, behind |
Introduction to Algebra: The Study of Functions and Equations | algebra, studi, function, equat, introduct |
You can see how only using a small amount of text to form tags leads us to using all words in the titles or nearly all words in the titles as tags. This is one reason why using course descriptions could lead to better predictions, but there were no course descriptions readily available.
After generating the tags, we can start matching user data to the course tags. We will preprocess the user interests and goals the same way we did for the course titles. Next, we will perform a fuzzy search of the user interests and goals for each course tag and compute how many matches we find. Below is a matrix with our findings.
Introduction to Statistics | Business Calculus | Introduction to Algebra | |
User1 | 3 | 1 | 0 |
User2 | 0 | 1 | 3 |
Normalizing the scores (all scores are mapped to a range between 0 and 1) gives us:
Introduction to Statistics | Business Calculus | Introduction to Algebra | |
User1 | 1 | .33 | 0 |
User2 | 0 | .33 | 1 |
With this model, we would recommend Introduction to Statistics and Business Calculus to User1, and we would recommend Business Calculus and Introduction to Algebra to User2. This is a great starting point for a cold start recommendation model, and by adding more attributes of the users and attributes of the courses to generate better tags, the results can be improved. Finally, when enough data is gathered you can then switch the model over to a collaborative filtering or content based model. Now let’s look at the details of the fuzzy search and tf-idf process.
Fuzzy Search
A fuzzy search is an algorithm that performs approximate string matching. The algorithm matches strings that are approximately, rather than exactly equivalent. For example, the strings “coat” and “cost” are different, but a single substitution of a letter would make the strings exactly equivalent. Therefore “coat” and “cost” are approximately the same. One method of approximate string matching looks at different operations to transform one string to another: insertions, deletions, and substitutions. A threshold for the number of operations is set, and any words that require less than that number of transformations to become an exact match are identified as approximate matches.
We decided to code our model in R; therefore, we used the agrep function in the base package to perform our fuzzy search. The agrep function performs an approximate pattern matching using the generalized Levenshtein edit distance. The generalized Levenshtein edit distance is defined as “the minimal possibly weighted number of insertions, deletions, and substitutions needed to transform one string to another.” (https://stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html)
A mathematical interpretation of the generalized Levenshtein edit distance between two strings x,y with lengths |x| an |y| is defined as where
where is the indicator function equal to 0 when x_i = y_j and equal to 1 otherwise, and 〖lev〗_(x,y) (i,j) is the distance between the first i characters of x and the first j characters of y.
We did not weight the insertions, deletions, and substitutions in our algorithm. The maximum distance that we search for is .1 of the pattern size. This means that a ten-letter phrase could have 1 insertion, deletion, or substitution and still be considered a match.
The agrep function returns which strings were identified to be matches. We then aggregate how many tags were identified as matches on all the fields of the organizational and user data. This sum is then normalized per user, and the courses with the top scores per user are recommended to that user.
Developing Tags
The tags provided for the courses that we wished to recommend were not discerning enough to fit our needs. To remedy this situation, we decided to create our own tags based off the course information. The only course information that we could use for this was the course titles. We wanted tags that were important to that course and are unique to that course; therefore, we are looking for tags that are frequently in the title of that course but are not frequently in the titles of all the courses. This describes a metric known as tf-idf.
Tf-idf stands for “term frequency-inverse document frequency.” Tf-idf provides a numerical value to determine how important a word is to a document by how often it happens in that document versus how often it happens in the set of all documents or corpus. In our case, the documents are the course titles, and all the course titles combined compose the corpus. To determine how many tags each course should have, we use the median length of the course titles. Using this as our maximum number of tags, we select the tags with the highest, nonzero scores. There are numerous different methods for calculating tf-idf. For our purposes, we used a weighted, nonnormalized tf-idf.
Mathematically weighted, nonnormalized tf-idf is defined as let freq(t,d) be the number of occurrences of keyword t in document d.
Where |D| denotes the total number of documents and |{d|t ∈ d}| is the number of documents where the term t appears.
This provides us with tags that are important to specific courses that we can use in our mapping process.
Thank you for taking the time to read this, and I hope it helps you in your endeavors.