When designing regression models, the features chosen as inputs may hold relevant information in a form that gives the model less predictive power than it would have if those inputs were first passed through a preprocessing stage. The features also often contain information that is irrelevant to the response variable, which acts as an unwanted source of noise. This noise impedes training by preventing the model from fully adapting to the relevant information components, and therefore leads to larger prediction errors.
Feature engineering refers to a sequence of preprocessing operations that transform the original input variables into an ancillary set of feature variables, which expose the underlying relevant information more accurately to the predictive model during training. When used in combination with the original variables, these engineered features can improve prediction accuracy when the trained model processes previously unseen input data. Many techniques are available for feature engineering, and AlgoTactica has developed an object-oriented library that implements many of them.
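As one concrete illustration of deriving engineered features and using them alongside the originals, the sketch below applies non-negative matrix factorization, one of the strategies used in the forecasting case that follows. It is a minimal pure-Python sketch and not AlgoTactica's library: the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the synthetic data, the choice of 2 components, and the use of Lee-Seung multiplicative updates are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual V - W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m).

    Uses Lee-Seung multiplicative updates (an assumed choice of algorithm).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical data: 30 samples of 5 non-negative original features.
data_rng = random.Random(42)
original = [[data_rng.uniform(0.0, 10.0) for _ in range(5)] for _ in range(30)]

# Each row of W holds the engineered features for one sample.
w, h = nmf(original, k=2)

# Combine original and engineered features into one wider input set.
combined = [orig_row + eng_row for orig_row, eng_row in zip(original, w)]
print(len(combined[0]))  # 5 original + 2 engineered columns
```

In a real workflow the factorization would be fitted on training data only and then applied to validation data; this sketch skips that step for brevity.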
The importance of the approach is illustrated here by a time series forecasting case, in which non-negative matrix factorization and independent component analysis were used to engineer new features. An original set of 5 input features was preprocessed to derive 12 additional engineered features. Two feature sets were then assembled, one with the 5 original features and one with all 17 (original + engineered), in order to compare the predictive accuracy of models trained on each. Each trained model was tested on previously unseen validation data: predicted values of the response variable were subtracted from the true known response, and the resulting errors were bootstrap-resampled to produce sampling distributions for the RMS error.
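The bootstrap step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`rms`, `bootstrap_rms`), the number of resamples, and the small `y_true` / `y_pred` lists are hypothetical.

```python
import math
import random


def rms(errors):
    """Root-mean-square of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rms(errors, n_resamples=1000, seed=0):
    """Resample the errors with replacement and return one RMS value per resample."""
    rng = random.Random(seed)
    n = len(errors)
    return [rms(rng.choices(errors, k=n)) for _ in range(n_resamples)]


# Hypothetical validation data (not taken from the study).
y_true = [520.0, 480.0, 610.0, 455.0, 700.0, 390.0, 565.0, 630.0]
y_pred = [500.0, 510.0, 580.0, 470.0, 660.0, 420.0, 540.0, 655.0]

# Errors are the true response minus the predicted response.
errors = [t - p for t, p in zip(y_true, y_pred)]

# Sampling distribution of the RMS error, and its average.
distribution = bootstrap_rms(errors)
mean_rms = sum(distribution) / len(distribution)
print(f"average bootstrap RMS error: {mean_rms:.1f}")
```

Plotting `distribution` as a histogram for each model gives the kind of side-by-side error comparison the graphs in this article refer to.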
Two feedforward neural networks were designed, one for the 5-feature original set and one for the 17-feature set containing engineered components. After a grid search, the best-performing network for the original set had 25 nodes in the hidden layer, while the best-performing feature-engineered network had 55 nodes. The average validation RMS error was 411 for the 5-feature network and 386 for the network with engineered features, so feature engineering reduced the RMS error by 6.1% relative to the 5-feature model. The Feedforward Neural Network Regression graph compares the RMS error distributions of the two models, and confirms that the network with feature engineering generally produced smaller errors.
A similar test was conducted by building two random forest regression models, each using an ensemble of 100 decision trees. For the model using the original 5 features, the average RMS validation error was 470, whereas for the feature-engineered model, the value was 448; therefore, feature engineering reduced the error by 4.7%. The Random Forest Regression error distribution plots also confirm that the validation errors were typically smaller for the model that used feature engineering.
Feature engineering had the largest impact on a stepwise linear regression model, as the Stepwise Linear Regression error distribution plots show. For the original 5-feature set, the average RMS validation error was 713, whereas for the engineered 17-feature model it was 526, so feature engineering reduced the validation error by 26.2%.
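The percentage reductions quoted for the three model types follow directly from the reported average RMS errors. The short sketch below reproduces that arithmetic; the function name `relative_reduction` and the dictionary layout are illustrative assumptions, while the error values are the ones reported in this article.

```python
def relative_reduction(baseline, engineered):
    """Percentage by which the engineered-feature error is below the baseline."""
    return 100.0 * (baseline - engineered) / baseline


# Average validation RMS errors reported above: (5 features, 17 features).
reported = {
    "feedforward neural network": (411, 386),
    "random forest": (470, 448),
    "stepwise linear regression": (713, 526),
}

reductions = {
    name: relative_reduction(baseline, engineered)
    for name, (baseline, engineered) in reported.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower RMS error")
```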