High-Speed Feature Selection for Design of Big Data Models

Feature selection refers to the process of selecting the set of independent variables that contains the most useful information available for predicting a dependent variable. Automated selection algorithms are typically used to identify and remove irrelevant variables and redundant, correlated variables from a dataset, leaving behind only the features with the greatest predictive power. The result is a predictive model with higher accuracy that needs fewer variables and less data.
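As a simple illustration of the redundancy-removal idea (not AlgoTactica's method), the sketch below drops any candidate variable that is nearly collinear with one already kept, using a pairwise correlation threshold; the function name and threshold are illustrative choices.

```python
import numpy as np

def drop_redundant(X, threshold=0.95):
    """Keep each column of X only if its absolute correlation with
    every previously kept column is at or below `threshold`.

    Returns the indices of the retained (non-redundant) columns.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # A column survives only if it is not strongly correlated
        # with any column we have already decided to keep.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep
```

A filter like this removes only pairwise-redundant variables; it does not judge relevance to the dependent variable, which is what a selection algorithm such as FROLS or LASSO adds.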

However, achieving accurate feature selection with minimum processing time is always a challenge. In fact, due to the lengthy run times imposed by algorithmic complexity, it is often impossible to perform an exhaustive search of the entire variable space unless the original feature set is very small.


Using efficient matrix-analytic techniques, AlgoTactica has designed a feature selection software algorithm that is capable of searching the entire variable space, with a turnaround time much shorter than is normally observed for this type of processing. In comparison to traditional methods, the resulting final set of selected features is also much smaller, yet still produces a model of equivalent accuracy.

Our software embodies a highly optimized implementation of the Forward Regression Orthogonal Least Squares (FROLS) feature selection technique. The design process has focused on a unique sequencing of operations that eliminates redundant computation and keeps the instruction count to a minimum.
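To make the technique concrete, here is a minimal textbook-style sketch of FROLS in NumPy (not our optimized implementation). At each step it picks the candidate column with the highest error-reduction ratio (ERR), then Gram-Schmidt-orthogonalizes the remaining candidates against the chosen direction; the function name and stopping rule are illustrative.

```python
import numpy as np

def frols(X, y, n_select, tol=1e-10):
    """Forward Regression Orthogonal Least Squares (illustrative sketch).

    Greedily selects `n_select` columns of X by error-reduction ratio,
    orthogonalising the remaining candidates after each pick.
    Returns the selected column indices and the accumulated ERR
    (the fraction of the variance of y explained so far).
    """
    P = X.astype(float).copy()            # working, orthogonalised candidates
    yy = float(y @ y)
    selected, err_total = [], 0.0
    for _ in range(n_select):
        norms = np.einsum('ij,ij->j', P, P)   # squared column norms
        dead = norms < tol                    # numerically exhausted columns
        norms[dead] = 1.0
        g = (P.T @ y) / norms                 # least-squares coefficients
        err = g**2 * norms / yy               # error-reduction ratio per column
        err[dead] = 0.0
        err[selected] = -1.0                  # never re-pick a chosen column
        k = int(np.argmax(err))
        selected.append(k)
        err_total += float(err[k])
        w = P[:, k].copy()
        # Gram-Schmidt: remove the chosen direction from every candidate
        P -= np.outer(w, (w @ P) / (w @ w))
        P[:, k] = w                           # keep the chosen column itself
    return selected, err_total
```

Because each new candidate is scored in an orthogonal basis, every column's ERR can be evaluated with inner products alone, which is what makes a full search of the variable space tractable.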


Performance of our software was tested by comparison with the well-known LASSO feature selection method. In the test discussed here, 100 training and validation data sets, each with 200 variables, were produced by sampling from a parent population. For each training set, a regression model was designed using each of the FROLS and LASSO feature selection methods, and the models were then tested using the accompanying validation set.
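A benchmark of this shape can be sketched as follows. The snippet below builds one synthetic 200-variable train/validation pair and times a LASSO fit on it; scikit-learn's `LassoCV` stands in for a generic LASSO implementation, and the sample counts, noise level, and number of informative variables are illustrative assumptions, not the parameters of our actual test.

```python
import time
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

def make_dataset(n=500, p=200, n_informative=10, noise=0.5):
    """One train/validation pair sampled from a common parent population."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[rng.choice(p, n_informative, replace=False)] = rng.uniform(1, 3, n_informative)
    y = X @ beta + noise * rng.standard_normal(n)
    half = n // 2
    return X[:half], y[:half], X[half:], y[half:]

X_tr, y_tr, X_va, y_va = make_dataset()

t0 = time.perf_counter()
lasso = LassoCV(cv=5).fit(X_tr, y_tr)      # cross-validated LASSO fit
elapsed = time.perf_counter() - t0

n_kept = int(np.sum(lasso.coef_ != 0))     # size of the selected feature set
rmse = float(np.sqrt(np.mean((lasso.predict(X_va) - y_va) ** 2)))
```

Repeating this loop over 100 sampled pairs, and fitting FROLS on the same training sets, yields the paired timings, feature counts, and validation errors summarized in the plots below.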

During training, our FROLS implementation consistently converged to its final selected feature set in only 0.02 seconds on average, with very little variance, whereas the LASSO time was highly variable and typically about 13 seconds, as shown in the Search Time boxplot. The Time Ratio Distribution plot also reveals that our version of FROLS was typically several hundred times faster than LASSO across the 100 processing tasks. Furthermore, FROLS consistently selected fewer than 17 features, whereas LASSO consistently selected more than 33, as shown in the Feature Selection Count boxplot.

When compared by data set, the Paired RMS Error plot shows that even though our implementation of FROLS uses fewer variables and builds a model much faster, there is no loss of accuracy relative to LASSO: the RMS error on each data set is very similar for both methods. In fact, the slope of the fitted line is 1.07, indicating that the simpler FROLS regression produced slightly more accurate estimates of the dependent variable than the more complex LASSO model.

Our software offers a significant processing advantage for Big Data applications in which hundreds of thousands of variables must be screened, and for real-time data streaming situations where the optimum feature set evolves dynamically.
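The slope statistic in the Paired RMS Error plot can be computed as a least-squares fit through the origin of the paired errors. The helper below shows the calculation; the function name and the toy numbers in the test are illustrative, not our benchmark data.

```python
import numpy as np

def paired_error_slope(rmse_frols, rmse_lasso):
    """Least-squares slope through the origin of LASSO RMS error
    versus FROLS RMS error over the same data sets.

    A slope above 1 means LASSO's errors are, on average, larger
    than FROLS's on the paired data sets.
    """
    a = np.asarray(rmse_frols, float)
    b = np.asarray(rmse_lasso, float)
    return float((a @ b) / (a @ a))
```

Fitting through the origin is the natural choice here, since two methods with identical accuracy would produce points on the line b = a, i.e. a slope of exactly 1 with zero intercept.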
