Let me give a simple example: I have a set of houses with different features (# rooms, perimeter, # neighbours, etc.), about 15 in total, and a price for each house. The features are also quite correlated (e.g., perimeter is often correlated with # rooms). I want to establish which features (or non-linear combinations of them) mainly determine the price.

In the linear case, for instance, I could fit a Lasso regression and read the importance of each feature off the coefficients. In my case, every feature (or combination of features) has a non-linear impact. For example, the # of neighbours can have a quadratic effect (increasing the price when # neighbours < 10 and decreasing it when # neighbours > 10).
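
For concreteness, here is the linear baseline I have in mind (a minimal sketch; the data frame houses, its price column and the file name are made up):

    library(glmnet)

    houses <- read.csv("houses.csv")   # made-up file and column names

    # Design matrix without the intercept column, plus the response
    X <- model.matrix(price ~ ., data = houses)[, -1]
    y <- houses$price

    # Lasso: the features with non-zero coefficients are the important
    # ones, but only as far as linear effects go
    cvfit <- cv.glmnet(X, y, alpha = 1)
    coef(cvfit, s = "lambda.min")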

I want to identify the main relationships between the features and the price. I don't need a predictor. For example, in the end I might discover that the price depends principally on # rooms/perimeter and # neighbours^2.

I was thinking of applying kernel methods, in combination with regression or PCA, but I don't know much about kernel methods.

Thank you in advance.

3 Answers

As far as I know, kernel methods cannot deal with categorical variables (I don't know whether that matters in your case). In addition, you would have to use indirect methods to evaluate variable importance. This could work, although I haven't tested it yet:

Giam, X., Olden, J.D., 2015. A new R2-based metric to shed greater insight on variable importance in artificial neural networks. Ecol. Modell. 313, 307–313. http://dx.doi.org/10.1016/j.ecolmodel.2015.06.034

I would definitely go for a tree-based approach. Since you already know that there are correlated variables, I would advocate conditional random forests (which address many drawbacks of the standard random forest implementation). Check:

Strobl, C., Hothorn, T., Zeileis, A., 2009. Party on! R J. 1 (2), 14–17.

And references therein. At least in R there are complementary packages (https://cran.r-project.org/web/packages/pdp/index.html) that allow you to plot the impact of each predictor on the target variable (house price). That complements the variable importance ranking pretty well.
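
A minimal sketch of this workflow, untested here (the data frame houses, its price column and the predictor name neighbours are made up):

    library(party)   # conditional inference forests (cforest)
    library(pdp)     # partial dependence plots

    houses <- read.csv("houses.csv")   # made-up file and column names

    # Conditional random forest; cforest_unbiased() gives unbiased
    # variable selection in the presence of correlated predictors
    cf <- cforest(price ~ ., data = houses,
                  controls = cforest_unbiased(ntree = 500, mtry = 5))

    # Conditional permutation importance (Strobl et al.), which accounts
    # for the correlation among predictors
    vi <- varimp(cf, conditional = TRUE)
    sort(vi, decreasing = TRUE)

    # Partial dependence of the price on one predictor; a quadratic
    # (rise-then-fall) effect of # neighbours would show up here
    pd <- partial(cf, pred.var = "neighbours", train = houses,
                  type = "regression",
                  pred.fun = function(object, newdata)
                      mean(as.numeric(predict(object, newdata = newdata))))
    plotPartial(pd)

The conditional importance is slower to compute than the default, but it is the variant meant for correlated predictors.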

Good luck.

I want to identify the main relationships between the features and the price. I don't need a predictor. For example, in the end I might discover that the price depends principally on # rooms/perimeter and # neighbours^2.

If the price depends principally on # neighbours^2, then it depends on # neighbours to the same extent. The same holds for the other combinations.

But if you want to clearly identify a dependency on # neighbours^2 rather than on # neighbours, or on # rooms/perimeter rather than simply # rooms, this is no different from building a predictor: you construct the candidate non-linear features explicitly and then select among them.
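
To make that concrete, here is a minimal sketch (it reuses the Lasso mentioned in the question on hand-made candidate transformations; the column names rooms, perimeter, neighbours and the file name are made up):

    library(glmnet)

    houses <- read.csv("houses.csv")   # made-up file and column names

    # Construct the candidate non-linear features explicitly
    X <- model.matrix(price ~ . + I(neighbours^2) + I(rooms / perimeter),
                      data = houses)[, -1]
    y <- houses$price

    # Let the Lasso select among them: if the price really depends on
    # # neighbours^2, that term should survive with a large coefficient
    cvfit <- cv.glmnet(X, y, alpha = 1)
    coef(cvfit, s = "lambda.min")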

Weka has a rich toolkit for ranking and selecting features by their importance; see this blog post for a tutorial.

I'm not familiar with many methods for feature importance, but you could try random forests, explained in:

Breiman, L., 2001. Random Forests. Mach. Learn. 45 (1), 5–32. doi:10.1023/A:1010933404324
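
A minimal sketch with the randomForest package (the data frame houses, its price column and the predictor name neighbours are made up):

    library(randomForest)

    houses <- read.csv("houses.csv")   # made-up file and column names

    # Random forest (Breiman 2001) with permutation-based importance
    rf <- randomForest(price ~ ., data = houses,
                       ntree = 500, importance = TRUE)

    # %IncMSE: increase in out-of-bag error when a predictor is permuted;
    # larger values indicate more important features
    importance(rf, type = 1)
    varImpPlot(rf)

    # Shape of the effect of a single predictor on the price
    partialPlot(rf, pred.data = houses, x.var = "neighbours")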