AI in Marketing

A Framework for Better Feature Engineering

Addressing the Feature Discovery Problem for Data Scientists and Predictive Modelers

Artificial intelligence (AI) and machine learning (ML) are fields full of open debates among data scientists and practitioners. One question on which consensus seems to have emerged is the data vs. algorithm question in predictive modeling: which contributes more to the prediction outcome, the technique or algorithm used to build the model, or the data used to train it? There is little doubt that in the vast majority of business use cases for AI, data trumps algorithm: a simple algorithm with good data will almost always beat a sophisticated algorithm trained on insufficient data.

While this simple fact has been taught in undergraduate and graduate AI/ML classes in computer science departments forever (full disclosure: the author is ex-faculty at Georgia Tech), industry has finally caught up. This reality is widely acknowledged in practitioner forums, and nuanced discussions of it are available in our Resource Library.

Not only does data contribute more to model accuracy; preparing data for consumption by the algorithm also turns out to be the most time-consuming and least-liked part of the model-building process. 76% of data scientists say data preparation is the least enjoyable part of their work, and they spend 80% of their time on it.1 Data preparation involves three broad steps:

  1. Ingesting data from a variety of both internal and external sources
  2. Cleaning and de-duplicating ingested data sets and creating a coherent “base data” set
  3. Creating features out of the base data for consumption by the algorithm and constructing the training set
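The three steps above can be sketched as a minimal pipeline. This is illustrative only; the record layout, field names, and bucketing rule are all hypothetical:

```python
# Minimal sketch of the three data-preparation steps (hypothetical sources and fields).

def ingest():
    # Step 1: ingest raw records from internal and external sources.
    internal = [{"id": "A1", "age": 34}, {"id": "A2", "age": 51}, {"id": "A1", "age": 34}]
    external = [{"id": "A3", "age": 29}]
    return internal + external

def clean(records):
    # Step 2: de-duplicate by id to form a coherent "base data" set.
    seen, base = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            base.append(r)
    return base

def featurize(base):
    # Step 3: derive features for the algorithm, e.g. a simple age bucket.
    return [{"id": r["id"], "age_bucket": "under_40" if r["age"] < 40 else "40_plus"}
            for r in base]

training_set = featurize(clean(ingest()))
```

In practice, steps 1 and 2 are handled by ETL tooling; it is the third step, shown here as a one-line bucketing rule, that expands into the open-ended feature-engineering problem discussed below.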

Steps 1 and 2, while tedious and time-consuming, can leverage a host of ETL software tools to enable substantial automation. Step 3, on the other hand, is completely human-driven and the least amenable to automation. Often referred to as feature engineering, this is generally regarded as the most ad-hoc step in the entire model building process, almost entirely driven by the “feel”, experience and expertise of the modeler. 


At the same time, feature engineering is the basis for creating the training data set that the ML algorithm will use to build the model. Since training data is the single most impactful factor in model quality, feature engineering is one of the most critical components of the model-creation workflow. In our experience, human misjudgment during feature engineering, a relatively common occurrence, is the single biggest cause of under-performing models.

Related Reading: How Data Enrichment Improves Predictive Modeling

Motivated by the above, this post will try to demystify feature engineering. Specifically, we discuss why feature discovery is “more art than science”, identify the root causes that make it ad hoc, and propose a framework to render it less so. We believe all data scientists can benefit from this framework, regardless of experience.

For the less experienced, our framework should significantly reduce the risk of misjudgments, shorten the feature-discovery timeframe, and increase the quality of the training set. For veteran modelers, who often arrive at their chosen features through an extensive trial-and-error process, our framework should add to their “tips and tricks” arsenal and reduce the number of trials. So, let’s jump in.

What Are Features, and Why is Feature Engineering So Hard?

Data, in the form captured by enterprise systems, is not directly ingested into AI algorithms. Rather, it is massaged into artifacts called features, which are then fed into the predictive model. This transformation process is known as feature engineering (or feature discovery).

The relationship between base data and features is best captured by the following analogy: crude petroleum extracted from oil fields is not directly usable as automobile fuel; it must go through a series of transformations to yield gasoline, which cars can consume. Base enterprise data is analogous to crude oil, while refined gasoline represents features.

However, this comparison only goes so far. The crude oil-to-gasoline refining process is scientific, systematic, and completely “algorithmized”. “Uniform” gasoline is produced efficiently at thousands of refineries around the globe, allowing any gasoline-powered vehicle to pull up to any pump and fuel up. In contrast, the data-to-feature transformation is unique to every modeling exercise, which prevents it from being automated. Imagine if different cars needed different types of fuel, each produced on demand; that is effectively the situation with feature engineering. Let’s make this clear with an example.

Take a telecommunications company seeking to predict customer churn using a model that assigns a churn likelihood score (CLS) to each subscriber. At a high level, the model looks like this: PV1, PV2, PV3, … => CLS, read as predictor variables PV1, PV2, PV3, etc. predict CLS (the churn likelihood score). These predictor variables are known as features, and the generic structure of a predictive model may be recast as: F1, F2, F3, … => Outcome, read as features F1, F2, F3, etc. predict the outcome being modeled.
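This generic structure can be sketched in a few lines. The features, weights, and logistic form below are hypothetical illustrations, not an actual churn model:

```python
import math

# Illustrative only: a generic model mapping features F1, F2, ... to an outcome score.
# The feature values and weights below are hypothetical, not a real churn model.

def churn_likelihood(features, weights, bias=0.0):
    # Logistic combination of feature values -> score in (0, 1).
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# F1 = average device price in the household (normalized),
# F2 = subscriber age (normalized) -- hypothetical inputs.
score = churn_likelihood(features=[0.2, 0.7], weights=[-1.5, 0.8])
```

Whatever the algorithm behind the scoring function, its inputs are always features; the modeler's hard problem is deciding which features to feed it.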

Related Reading: The Myth and Reality of Consumer Data Owned by Telcos

For our churn modeling exercise, let’s assume that the average price of all mobile devices in a subscriber’s household and the subscriber’s age turn out to be the features most predictive of churn. Feature engineering refers to the process the modeler went through to identify those features (and, very likely, others that were eliminated along the way).

One can see why this would be complex: consider the many data items the company holds about its subscribers, ranging from the personal data in account information to the service-consumption data in usage logs. Each such base element may be considered a candidate feature. Now add the wide variety of “derived” information that can be created from this base data, and the reader can see how large the potential feature space becomes.

While proven statistical techniques can prune this space down, any modeler will recognize that they are eventually left with a large, unwieldy, and often confounding array of possible predictor variables. Given this massive pool of candidate features, and the unique circumstances of each predictive modeling effort, it is difficult for a data scientist to arrive at a small, manageable, predictive feature set without embarking on a lengthy, ad-hoc trial-and-error exercise, which often results in sub-optimal feature selection. This is why feature engineering is one of the most dreaded, and undoubtedly the least explainable, components of the machine learning workflow.
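One common pruning technique of the kind alluded to above is a simple correlation filter: rank candidate features by their absolute correlation with the outcome and keep the top few. A minimal sketch on synthetic data (feature names and values are hypothetical):

```python
# Correlation-based pruning of a candidate feature space (synthetic data).

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length value lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def prune(candidates, outcome, k):
    # candidates: {feature_name: values}; keep the k features most correlated
    # (in absolute value) with the outcome labels.
    ranked = sorted(candidates,
                    key=lambda f: abs(pearson(candidates[f], outcome)),
                    reverse=True)
    return ranked[:k]

outcome = [0, 0, 1, 1, 1, 0]            # hypothetical binary labels
candidates = {
    "F1": [1, 2, 9, 8, 9, 1],           # tracks the outcome closely
    "F2": [5, 5, 5, 5, 5, 4],           # nearly constant
    "F3": [2, 1, 3, 2, 4, 8],           # weakly related
}
kept = prune(candidates, outcome, k=1)
```

Filters like this shrink the space, but, as the paragraph above notes, they still leave the modeler with a large residue of plausible predictors to sort through by judgment.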


Making Feature Discovery Systematic

Motivated by the inefficiencies discussed above, we have been exploring ways to make feature discovery less painful and more systematic. Having conducted hundreds of modeling exercises across a spectrum of industry segments, seeking to predict a multitude of outcomes, we have traced a majority of feature-engineering issues to two root causes:

  • The Cold Start Problem: The initial choice of features in model building is hugely important. We always recommend a pruning approach to feature selection: start with a wide set of features and progressively arrive at a smaller, effective set. Generally, if the right feature characteristics are not included in the initial selection, it is unlikely that they will find their way in later.
  • Choosing Specific Features: Modelers typically don’t start by identifying specific features. Rather, they identify “feature classes” that they believe will be predictive and then test individual features from those classes. For instance, a modeler looking to predict buyers of a luxury automobile first identifies broad, potentially predictive feature classes like “affluence” and “international leisure traveler”, rather than individual quantifiable features (e.g., “annual income > $1M”). However, to test the model, individual features must be chosen to represent the selected classes, and it turns out that the specific feature chosen can have a significant impact on the model outcome. For instance, consider two alternate features that indicate “affluence”: (a) “annual income > $1M”, and (b) “persistently observed at high-end malls”. Even though affluence (as a feature class) is strongly predictive of the propensity to buy a luxury automobile, we have seen that (b) works much better in a model than (a): frequent visits to high-end shopping venues indicate a propensity to spend money, which simply having a high income does not.
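The affluence example can be made concrete by scoring each candidate feature instance on how well it separates buyers from non-buyers. All flags, labels, and the lift metric below are hypothetical, synthetic illustrations:

```python
# Comparing two candidate feature instances from the same class ("affluence")
# on synthetic labeled data. Flags and purchase labels are hypothetical.

def lift(feature_flags, bought):
    # Purchase rate among records with the feature present minus the rate
    # among records with it absent: a crude measure of predictive value.
    pos = [b for f, b in zip(feature_flags, bought) if f]
    neg = [b for f, b in zip(feature_flags, bought) if not f]
    def rate(xs):
        return sum(xs) / len(xs) if xs else 0.0
    return rate(pos) - rate(neg)

bought          = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = bought the luxury automobile
income_over_1m  = [1, 0, 1, 0, 1, 0, 1, 0]   # candidate (a): annual income > $1M
seen_at_highend = [1, 1, 0, 1, 0, 0, 1, 1]   # candidate (b): seen at high-end malls

scores = {"income_over_1m": lift(income_over_1m, bought),
          "seen_at_highend": lift(seen_at_highend, bought)}
best = max(scores, key=scores.get)
```

In this toy data, candidate (b) separates buyers from non-buyers far better than candidate (a), mirroring the observation above that two instances of the same feature class can perform very differently.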

To summarize, the two most important elements of good feature discovery are selecting the right feature classes initially (the cold-start problem) and then selecting the appropriate feature instances from those classes. For most modelers, both are hard problems.

The Mobilewalla Feature Engineering Solution

To address this for our business, we have devised a taxonomy of feature classes and populated each class with a number of generally available feature instances. The value of this hierarchical taxonomy derives from its general applicability: we have found that it helps solve modeling problems that predict a wide range of outcomes.

Creating this set of features and making them readily available gives us the opportunity to optimize feature selection, and the agility to quickly test different attributes to determine which will be highly predictive. We combine our external data with our customers’ internal data sets to determine the feature set that best meets their needs, and can get them to production quickly with a highly optimized model.

As a result, Mobilewalla stands ready to provide vetted features that not only optimize the predictive modeling process, but also drive value for our customers. Ultimately, these predictive models provide actionable consumer intelligence insights.

Make Predictive Modeling Work for Your Business

Contact Mobilewalla to learn more about our data enrichment, data schema, and other tailored solutions for data scientists.



Anindya Datta, Ph.D.

Anindya Datta, the CEO and Chairman of Mobilewalla, is widely regarded as a front-running technologist, leader and innovator, with core contributions to the state of the art in large-scale data management and Internet technologies. Mobilewalla has pioneered audience measurement in mobile apps by applying ground-breaking data science techniques on the industry’s largest volumetric database of mobile app data. Prior to Mobilewalla, Anindya founded and ran Chutney Technologies, where he was backed by Kleiner Perkins and evolved into one of the earliest entrants in the application virtualization area. The company was acquired by Cisco Systems in 2005. Anindya has also been on the faculties of major research universities and institutes in the United States and abroad, including the Georgia Institute of Technology, the University of Arizona, the National University of Singapore and Bell Laboratories. Anindya obtained his undergraduate degree from the Indian Institute of Technology (IIT) Kharagpur, and his MS and Ph.D. degrees from the University of Maryland, College Park.