

In this tutorial, we will be predicting housing prices based on their descriptions. We will be using natural language processing, NLP, to build a machine learning model. We will be using a bag of words with column vectors of ones and zeros.

A lot of data is generated each day, and a lot of it is text data. If you want to reproduce the analysis or check out the code, you can find it on my GitHub.

Today, I wanted to analyze some real estate descriptions that I previously scraped from the web. I used a random forest regression model with the typical predictors: the number of baths, beds, square feet, etc. It was fun, and I wanted to try something different. Therefore, I decided to analyze the housing descriptions with natural language processing, a bag of words, and a lasso regression model.

Preparing the data for 10-fold cross-validation

First, we will be reading in the data and then cleaning it up a bit. We will be removing outliers and potentially false data points.

```r
library(tidyverse)

# keep the price (response) and the description, and strip all digits
# (object names and the regex are reconstructed from the surrounding text)
housing_df <- housing_raw %>%
  dplyr::select(price, description, website) %>%
  dplyr::mutate(description = stringr::str_remove_all(description, "[0-9]+"))

# $ description <chr> "prestigious concrete air-conditioned boutique building i…
```

In the end, we have 2,244 rows and 2 columns: one column with the price, our response variable, and the other with the housing descriptions. We also removed all numbers in the description, as we only want to focus on the words. I transformed the response variable to make it look a bit more normally distributed.

Data Preprocessing For NLP and Tidymodels

textrecipes

For the preprocessing steps, we will be using the textrecipes package from tidymodels. We will be transforming the response variable, tokenizing the description, stemming the words by stripping their affixes and suffixes, removing all the stop words, keeping only the top 2,500 words, and then translating the output into ones and zeros. I am keeping a lot of predictors because I will be using the lasso model, which will take care of the variance and shrink a lot of the predictors to zero. If I were to use another model, I would go with around 500 words.

The steps that we described above are coded below:

```r
# remove_words is a character vector of custom stop words defined earlier
housing_rec <- recipes::recipe(price ~ description, data = housing_df) %>%
  textrecipes::step_tokenize(description) %>%
  textrecipes::step_stem(description) %>%
  textrecipes::step_stopwords(description,
                              custom_stopword_source = remove_words) %>%
  textrecipes::step_tokenfilter(description, max_tokens = 2500) %>%
  textrecipes::step_tf(description, weight_scheme = "binary")
```
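As a small aside, the binary bag-of-words encoding described above, a one where a word occurs in a description and a zero where it does not, can be illustrated in base R; the two toy descriptions here are made up for the example:

```r
docs <- c("bright sunny loft", "sunny garden view")  # toy descriptions, not real data

# vocabulary: every unique word across all documents
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# binary bag of words: 1 if the word appears in the document, 0 otherwise
bow <- t(sapply(tokens, function(w) as.integer(vocab %in% w)))
colnames(bow) <- vocab
bow
```

Each row is a document and each column a vocabulary word, which is exactly the shape that `step_tf(weight_scheme = "binary")` produces, one `tf_` column per kept token.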
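The fold-creation code itself is not shown in this excerpt; a minimal sketch of the 10-fold cross-validation setup with rsample might look like the following, where `housing_df` stands in for the cleaned data frame and the seed is arbitrary:

```r
library(tidymodels)

set.seed(1234)  # arbitrary seed, only for reproducible folds
# housing_df is assumed to hold the cleaned price/description data
housing_folds <- rsample::vfold_cv(housing_df, v = 10)
```

These resamples can then be passed to the tuning functions when fitting the model.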
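The lasso model specification is also not shown here; a common way to express it in tidymodels, assuming the glmnet engine and a tuned penalty, would be:

```r
library(tidymodels)

# mixture = 1 tells glmnet to fit a pure lasso (L1) penalty;
# the penalty strength is left to be tuned over the resamples
lasso_spec <- parsnip::linear_reg(penalty = tune::tune(), mixture = 1) %>%
  parsnip::set_engine("glmnet") %>%
  parsnip::set_mode("regression")
```

With `mixture = 1`, large penalty values shrink many of the word coefficients exactly to zero, which is why keeping 2,500 token predictors is workable here.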
