Questions tagged [feature-selection]

Methods and principles of selecting a subset of attributes for use in further modelling

Feature selection, also called attribute selection or feature reduction, refers to techniques for identifying a subset of features of a data set that are relevant to a given problem. By removing irrelevant and redundant features, successful feature selection can avoid the curse of dimensionality and improve the performance, speed, and interpretability of subsequent models.

Feature selection includes manual methods (such as those based on domain knowledge) and automatic methods. Automatic methods are often categorized into filter, wrapper, and embedded approaches.

Filter approaches perform feature selection as a separate preprocessing step before the learning algorithm, and thus look only at intrinsic properties of the data. Filter methods include Wilcoxon rank-sum tests and correlation-based tests.
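As an illustration, here is a minimal sketch of a univariate filter in Python, using scipy's rank-sum test on a scikit-learn demo data set; the 0.01 cutoff is an arbitrary choice for the example:

    import numpy as np
    from scipy.stats import ranksums
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    # rank-sum test per feature: does its distribution differ between classes?
    pvals = np.array([ranksums(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])

    keep = pvals < 0.01   # arbitrary threshold; no learning algorithm is consulted
    X_filtered = X[:, keep]
    print(keep.sum(), "of", X.shape[1], "features kept")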

Wrapper approaches use the performance of a learning algorithm to select features. A search algorithm is “wrapped” around the learning algorithm so that the space of feature subsets is adequately searched. As such, wrapper methods can be seen as conducting the model hypothesis search within the feature subset search. Examples of wrapper approaches are simulated annealing and beam search.
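To make the "search wrapped around the learner" idea concrete, here is a sketch of a greedy forward search (a much simpler search than simulated annealing or beam search) that scores each candidate subset by cross-validating the wrapped model:

    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_wine(return_X_y=True)
    est = LogisticRegression(max_iter=5000)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    while remaining:
        # try adding each remaining feature; keep the best-scoring addition
        scores = {f: cross_val_score(est, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break   # no single addition improves the wrapped model
        selected.append(f)
        remaining.remove(f)
        best_score = score

    print("selected features:", selected, "CV accuracy:", round(best_score, 3))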

Embedded approaches incorporate variable selection as a part of the training process, with feature relevance obtained analytically from the objective of the learning model. Embedded methods can be seen as a search in the combined space of feature subsets and hypotheses. Examples of embedded approaches are boosting and recursive ridge regression.
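For instance, a boosted ensemble produces feature importances as a by-product of training, and scikit-learn's SelectFromModel can threshold them; a minimal sketch (the median threshold is an arbitrary choice):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_breast_cancer(return_X_y=True)

    # the boosted model's training objective yields importances directly;
    # SelectFromModel keeps the features above the chosen threshold
    selector = SelectFromModel(GradientBoostingClassifier(random_state=0),
                               threshold="median").fit(X, y)
    X_embedded = selector.transform(X)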

959 questions
72
votes
11 answers

What is dimensionality reduction? What is the difference between feature selection and extraction?

From Wikipedia: dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. What is the difference between feature…
alvas
  • 2,510
  • 7
  • 28
  • 40
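A minimal Python sketch of the distinction the question asks about, on scikit-learn's iris demo data: selection keeps a subset of the original columns unchanged, while extraction builds new columns as combinations of all of them.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # feature selection: keep 2 of the original 4 columns as-is
    X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

    # feature extraction: build 2 new columns mixing all 4 originals
    X_ext = PCA(n_components=2).fit_transform(X)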
63
votes
10 answers

Machine learning - features engineering from date/time data

What are the common/best practices for handling time data in machine learning applications? For example, if a data set has a column with an event timestamp, such as "2014-05-05", how can you extract useful features from this column, if any? Thanks…
Igor Bobriakov
  • 1,071
  • 2
  • 9
  • 11
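A common starting point, sketched with pandas on made-up timestamps; the cyclic encoding at the end is one optional refinement:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"event_time": ["2014-05-05", "2014-12-24", "2015-01-01"]})
    ts = pd.to_datetime(df["event_time"])

    # calendar components are the usual first features
    df["year"] = ts.dt.year
    df["month"] = ts.dt.month
    df["dayofweek"] = ts.dt.dayofweek
    df["is_weekend"] = ts.dt.dayofweek >= 5

    # cyclic encoding keeps December adjacent to January
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)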
61
votes
6 answers

Does XGBoost handle multicollinearity by itself?

I'm currently using XGBoost on a data set with 21 features (selected from a list of some 150 features), then one-hot encoded them to obtain ~98 features. A few of these 98 features are somewhat redundant, for example: a variable (feature) $A$ also…
neural-nut
  • 1,803
  • 3
  • 18
  • 28
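XGBoost's greedy tree splits are fairly robust to collinearity for prediction, but importance scores get split across correlated copies, so a correlation pre-filter is a common workaround. A sketch on synthetic data (the 0.95 cutoff is arbitrary):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=[f"f{i}" for i in range(5)])
    X["f5"] = X["f0"] * 0.99 + rng.normal(scale=0.05, size=200)  # near-duplicate

    # for each pair of features, look at the upper triangle of |corr|
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
    X_reduced = X.drop(columns=to_drop)
    print("dropped:", to_drop)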
60
votes
8 answers

Does scikit-learn have a forward selection/stepwise regression algorithm?

I am working on a problem with too many features, and training my models takes way too long. I implemented a forward selection algorithm to choose features. However, I was wondering: does scikit-learn have a forward selection/stepwise regression…
Maksud
  • 725
  • 1
  • 7
  • 6
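scikit-learn has since added SequentialFeatureSelector (in version 0.24), which implements exactly this; a minimal sketch:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # greedily add features one at a time, scoring each step by CV
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=5000),
        n_features_to_select=5,
        direction="forward",   # "backward" gives backward elimination
    )
    X_small = sfs.fit_transform(X, y)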
34
votes
6 answers

Are there any tools for feature engineering?

Specifically, what I am looking for are tools with functionality specific to feature engineering. I would like to be able to easily smooth, visualize, fill gaps, etc. Something similar to MS Excel, but with R as the underlying…
John
  • 441
  • 1
  • 5
  • 4
29
votes
3 answers

How to combine categorical and continuous input features for neural network training

Suppose we have two kinds of input features, categorical and continuous. The categorical data may be represented as a one-hot vector A, while the continuous data is just a vector B in N-dimensional space. It seems that simply using concat(A, B) is not a…
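One common pattern is to pass the categorical input through a learned embedding before concatenating it with the continuous vector; a hypothetical PyTorch sketch (all sizes are made up for illustration):

    import torch
    import torch.nn as nn

    class MixedInputNet(nn.Module):
        # hypothetical sizes: 10 category levels, 5 continuous features
        def __init__(self, n_categories=10, emb_dim=4, n_continuous=5, hidden=16):
            super().__init__()
            self.emb = nn.Embedding(n_categories, emb_dim)
            self.net = nn.Sequential(
                nn.Linear(emb_dim + n_continuous, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, cat_idx, cont):
            # embed the categorical index, then concatenate with continuous inputs
            x = torch.cat([self.emb(cat_idx), cont], dim=1)
            return self.net(x)

    model = MixedInputNet()
    cat_idx = torch.randint(0, 10, (8,))   # batch of 8 category indices
    cont = torch.randn(8, 5)               # batch of 8 continuous vectors
    out = model(cat_idx, cont)             # shape (8, 1)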
29
votes
4 answers

Any "rules of thumb" on number of features versus number of instances? (small data sets)

I am wondering if there are any heuristics on the number of features versus the number of observations. Obviously, if the number of features is equal to the number of observations, the model will overfit. By using sparse methods (LASSO, elastic net) we can…
Arnold Klein
  • 513
  • 2
  • 5
  • 13
26
votes
2 answers

Text categorization: combining different kind of features

The problem I am tackling is categorizing short texts into multiple classes. My current approach is to use tf-idf weighted term frequencies and learn a simple linear classifier (logistic regression). This works reasonably well (around 90% macro F-1…
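The usual mechanics: horizontally stack the sparse tf-idf block with the extra features and train one linear model on the result. A sketch (the token-count column is a made-up extra feature; scaling the dense block often matters):

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["cheap meds now", "meeting at noon tomorrow", "win a free prize now"]
    y = [1, 0, 1]
    extra = np.array([[3], [4], [5]])   # hypothetical extra feature: token count

    tfidf = TfidfVectorizer().fit_transform(texts)   # sparse tf-idf block
    X = hstack([tfidf, csr_matrix(extra)]).tocsr()   # concatenate feature blocks
    clf = LogisticRegression().fit(X, y)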
22
votes
2 answers

How to choose the features for a neural network?

I know that there is no clear answer to this question, but let's suppose that I have a huge neural network with a lot of data, and I want to add a new feature as input. The "best" way would be to test the network with the new feature and see the…
22
votes
3 answers

How to perform feature engineering on unknown features?

I am participating in a Kaggle competition. The dataset has around 100 features and all are unknown (in terms of what they actually represent). Basically they are just numbers. People are performing a lot of feature engineering on these features. I…
21
votes
6 answers

What does embedding mean in machine learning?

I just came across the term "embedding" in a paper on deep learning. The context is "multi-modal embedding". My guess: an embedding of something extracts some features of it to form a vector. I couldn't get the explicit meaning for this…
21
votes
5 answers

Feature selection vs Feature extraction. Which to use when?

Feature extraction and feature selection essentially reduce the dimensionality of the data, but feature extraction also makes the data more separable, if I am right. Which technique would be preferred over the other and when? I was thinking,…
20
votes
1 answer

What is difference between one hot encoding and leave one out encoding?

I am reading a presentation and it recommends not using leave one out encoding, but it is okay with one hot encoding. I thought they both were the same. Can anyone describe what the differences between them are?
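A small pandas sketch of both encodings side by side, on toy data: one-hot creates indicator columns, while leave-one-out replaces each category with the target mean over the other rows of that category (note a singleton category would divide by zero here and needs a fallback):

    import pandas as pd

    df = pd.DataFrame({
        "color": ["red", "red", "blue", "blue", "blue"],
        "y":     [1,     0,     1,      1,      0],
    })

    # one-hot encoding: one indicator column per category level
    one_hot = pd.get_dummies(df["color"], prefix="color")

    # leave-one-out encoding: per-row mean of the target over all
    # *other* rows sharing the same category value
    grp = df.groupby("color")["y"]
    sums, counts = grp.transform("sum"), grp.transform("count")
    df["color_loo"] = (sums - df["y"]) / (counts - 1)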
18
votes
3 answers

How to determine feature importance in a neural network?

I have a neural network to solve a time series forecasting problem. It is a sequence-to-sequence neural network and currently it is trained on samples each with ten features. The performance of the model is average and I would like to investigate…
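One model-agnostic option is permutation importance; a sketch with scikit-learn's MLPRegressor standing in for the sequence-to-sequence net, on synthetic data with ten features as in the question:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.inspection import permutation_importance
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=500, n_features=10, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000,
                         random_state=0).fit(X, y)

    # shuffle one feature at a time and measure the drop in score;
    # a large drop suggests the model relies on that feature
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    for i in np.argsort(result.importances_mean)[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f}")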
18
votes
3 answers

When should I use StandardScaler and when MinMaxScaler?

I have a feature vector with one-hot-encoded features and with continuous features. How can I decide which data I should scale with StandardScaler and which with MinMaxScaler? I think I do not have to scale the one-hot-encoded features anyway…
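A rough heuristic: StandardScaler for roughly unbounded continuous features (especially with distance- or gradient-based models), MinMaxScaler when a bounded range is required, and one-hot columns usually left alone. A sketch, on made-up columns, applying scaling only to the continuous ones:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "age": [23, 45, 31],              # continuous
        "income": [40_000, 85_000, 62_000],
        "is_red": [1, 0, 0],              # one-hot indicators, left untouched
        "is_blue": [0, 1, 1],
    })

    pre = ColumnTransformer(
        [("scale", StandardScaler(), ["age", "income"])],
        remainder="passthrough",          # one-hot columns pass through unscaled
    )
    X = pre.fit_transform(df)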