
Note that I am doing everything in R.

The problem goes as follows:

Basically, I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is to classify them into different job sectors based on the text of their CVs. I am particularly interested in the cases where a candidate has no experience / is a student, and I want to predict which job sector this candidate will most likely belong to after graduation.

Question 1: I know machine learning algorithms, but I have never done NLP before. I came across Latent Dirichlet Allocation on the internet; however, I am not sure whether it is the best approach to tackle my problem.

My original idea: make this a supervised learning problem. Suppose we already have a large amount of labelled data, meaning that the job sectors are correctly labelled for a list of candidates. We train a model using ML algorithms (e.g. nearest neighbour, ...), then feed in the unlabelled data, i.e. the candidates who have no work experience / are students, and try to predict which job sector they will belong to.

Update Question 2: Would it be a good idea to extract everything in a resume into a text file, so that each resume is associated with a text file containing unstructured strings, and then apply text-mining techniques to these text files to make the data structured, or even to create a frequency matrix of the terms used? For example, a text file may look something like this:

I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...

This is what I meant by 'unstructured', i.e. collapsing everything into a single-line string.

Is this approach wrong? Please correct me if you think it is.

Question 3: The tricky part is: how do I identify and extract the keywords? Using the tm package in R? What algorithm is the tm package based on? Should I use NLP algorithms? If yes, which algorithms should I look at? Please point me to some good resources to look at as well.

Any ideas would be great.

user1769197

4 Answers


Check out this link.

Here, they take you through loading unstructured text and creating a word cloud. You can adapt that strategy: instead of creating a word cloud, create a frequency matrix of the terms used. The idea is to take the unstructured text and structure it somehow. You convert everything to lowercase (or uppercase), remove stop words, and find the frequent terms for each job function via document-term matrices. You also have the option of stemming the words. If you stem, you will be able to detect different forms of a word as the same word; for example, 'programmed' and 'programming' could both be stemmed to 'program'. You can then add the occurrence of these frequent terms as weighted features in your ML model training.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.

Example:

1) Load libraries and build the example data

library(tm)
library(SnowballC)

doc1 = "I am highly skilled in Java Programming.  I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements.  Designed and executed test procedures and worked with relational databases.  I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers.  Perform database design and debugging of current releases."
job3 = "Software Engineer"
jobInfo = data.frame("text" = c(doc1, doc2, doc3),
                     "job" = c(job1, job2, job3),
                     stringsAsFactors = FALSE)  # keep the text as character, not factor

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

# Remove stop words (note that setdiff also drops duplicate words)
jobInfo$text = sapply(jobInfo$text, function(x){
  paste(setdiff(strsplit(x," ")[[1]],stopwords()),collapse=" ")
})

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x)  {
  paste(setdiff(wordStem(strsplit(x," ")[[1]]),""),collapse=" ")
})

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix jobFreq, a (3 by x) matrix: 3 documents (resumes) by x distinct terms.

Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and use, for each job function, the percentage of resumes that contain each word; say, 'java' might have 80% occurrence in 'software engineer' resumes and only 50% occurrence in 'quality assurance' ones.
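For the percentage idea, here is a minimal sketch that builds on the jobFreq and jobInfo objects from above (with only three toy resumes the numbers aren't meaningful, but the mechanics are the same):

# Fraction of resumes within each job function that contain each term
jobPresence = (jobFreq > 0) * 1                                   # 1 if a term appears in a resume
byJob = rowsum(jobPresence, group = jobInfo$job)                  # per-job counts of resumes containing each term
jobPct = byJob / as.vector(table(jobInfo$job)[rownames(byJob)])   # divide by the number of resumes per job

jobPct[, "java"]   # share of 'Software Engineer' vs 'Quality Assurance' resumes mentioning "java"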

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.

nfmcclure

Just extract keywords and train a classifier on them. That's all, really.

Most of the text in CVs is not actually related to skills. E.g. consider the sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name; the rest is just noise that will drag your classification accuracy down.

Most CVs are not really structured, or are structured too freely, or use unusual names for sections, or come in file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but not a skill (or university, or anything else) list, not even close.

So just tokenize (and possibly stem) your CVs, select only the words from a predefined list (you can use LinkedIn or something similar to grab such a list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes); a rough sketch follows below.

(Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even naive implementation will work well.)
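A rough R sketch of that pipeline might look like this. Here cvs is a hypothetical data frame with a text column and a sector label, and the skill list is just a placeholder you would replace with a real one:

library(SnowballC)
library(e1071)

skills = c("java", "python", "sql", "hadoop", "excel")   # placeholder; build a real list (e.g. from LinkedIn)

# Tokenize, lowercase, stem, and keep only the predefined skill words as binary features
featurize = function(txt) {
  tokens = wordStem(strsplit(tolower(txt), "[^a-z+#]+")[[1]])
  as.numeric(skills %in% tokens)
}

X = as.data.frame(t(sapply(cvs$text, featurize)))
names(X) = skills
rownames(X) = NULL
y = as.factor(cvs$sector)

nb_fit  = naiveBayes(X, y)                 # Naive Bayes on the skill indicators
svm_fit = svm(X, y, kernel = "linear")     # linear SVM on the same features
predict(nb_fit, X)                         # in practice, evaluate on held-out resumes

With 0/1 indicators you may prefer to convert the columns to factors, so naiveBayes treats them as categorical rather than Gaussian features.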

ffriend

This is a tricky problem, and there are many ways to handle it. I guess resumes can be treated as semi-structured documents; sometimes it's beneficial to have some minimal structure in the documents. I believe that in resumes you will see some tabular data, and you might want to treat these as attribute-value pairs. For example, you would get a list of terms for the attribute "Skill set".

The key idea is to manually configure a list of key phrases such as "skill", "education", "publication", etc. The next step is to extract the terms that pertain to these key phrases, either by exploiting the structure in some way (such as tables) or by utilizing the proximity of terms around these key phrases; e.g. the fact that the word "Java" occurs in close proximity to the term "skill" might indicate that the person is skilled in Java.
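To make the proximity part concrete, a rough sketch (just an illustration of the idea, not a full extractor) could grab the tokens that appear within a small window after a key phrase and then match them against a known skill list:

# Tokens appearing within k tokens after a key phrase such as "skill"
near_keyphrase = function(txt, key = "skill", k = 10) {
  tokens = strsplit(tolower(txt), "[^a-z+#]+")[[1]]
  hits = grep(paste0("^", key), tokens)        # matches "skill", "skills", "skilled", ...
  hits = hits[hits < length(tokens)]           # ignore a match at the very end of the text
  unique(unlist(lapply(hits, function(i) tokens[(i + 1):min(i + k, length(tokens))])))
}

near_keyphrase("Skill set: Java, C, Python. Education: BSc Physics.", "skill", 5)
# returns the tokens following "skill"; intersect these with your skill list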

After you extract this information, the next step could be to build a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one for each key phrase). For example, consider the following two resumes represented with two fields, namely project and education.

Doc1: {project: (java, 3) (c, 4)}, {education: (computer, 2), (physics, 1)}

Doc2: {project: (java, 3) (python, 2)}, {education: (maths, 3), (computer, 2)}

In the above example, I show each term with its frequency. Of course, while extracting the terms you need to stem and remove stop words. It is clear from the examples that the person behind Doc1 is more skilled in C than the person behind Doc2. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.

Now the next step is to retrieve a ranked list of resumes given a job specification. In fact, that's fairly straightforward if you represent the queries (job specs) as field vectors as well. You just need to retrieve a ranked list of candidates (resumes) using Lucene from a collection of indexed resumes.
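I don't have a Lucene snippet handy, but since the question is in R, here is a minimal plain-R sketch of the same idea: each resume and the job spec are per-field term-frequency vectors, and resumes are ranked by a field-weighted cosine score (the field weights here are made up for illustration):

doc1  = list(project = c(java = 3, c = 4),      education = c(computer = 2, physics = 1))
doc2  = list(project = c(java = 3, python = 2), education = c(maths = 3, computer = 2))
query = list(project = c(java = 2, c = 2),      education = c(computer = 1))   # the job spec

# Cosine similarity between two named term-frequency vectors
cosine = function(a, b) {
  terms = union(names(a), names(b))
  va = a[terms]; va[is.na(va)] = 0
  vb = b[terms]; vb[is.na(vb)] = 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# Weighted sum of per-field similarities (the weights are arbitrary here)
score = function(doc, query, w = c(project = 0.7, education = 0.3)) {
  sum(sapply(names(query), function(f) w[f] * cosine(doc[[f]], query[[f]])))
}

sort(c(Doc1 = score(doc1, query), Doc2 = score(doc2, query)), decreasing = TRUE)

Lucene does this at scale with an inverted index and its own scoring, but the ranking idea is the same.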

Debasis

I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title if they are a student and it is known), along with the skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.

In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix for each resume as a standard bag-of-words model, extracting just the person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying kNN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. using RBF kernels). If that gives you reasonable results, I would then play with extracting features using a natural language parser/chunker, and also some custom-built phrases matched by regexes.
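A hedged sketch of that tf-idf + linear SVM baseline in R, assuming a hypothetical data frame resumes with text and sector columns (tm provides the weightTfIdf weighting and e1071 the SVM):

library(tm)
library(e1071)

corpus = Corpus(VectorSource(resumes$text))
dtm = DocumentTermMatrix(corpus,
                         control = list(weighting = weightTfIdf,
                                        tolower = TRUE,
                                        removePunctuation = TRUE,
                                        stopwords = TRUE))

X = as.matrix(dtm)                    # dense matrix; fine for a sketch, use sparse methods at scale
y = as.factor(resumes$sector)

fit = svm(X, y, kernel = "linear")    # linear kernels tend to do well on high-dimensional text
predict(fit, X)                       # in practice, evaluate on a held-out set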

Simon