October 15th, 2018
With the increasing importance of analytics in nonprofit direct marketing, you may be asking yourself: what is the secret to a good predictive model?
Hint: it’s not what you may think.
Once you determine whether the item you are trying to predict is a continuous or categorical variable, choosing the best modeling technique is the key, right? Deciding on one of the following is the most critical aspect of building a model, correct?
- Types of regression—multiple linear, logistic, ridge, lasso or elastic net
- Neural Networks and Deep Learning
- Predictive Discriminant Analysis—linear or quadratic
- Decision Tree Algorithms—Random Forests, CART or CHAID
- Support Vector Machines—classification or regression
- The Supercalifragilisticexpialidocious Modeling Technique
Then it must be combining multiple modeling techniques into an ensemble model, thereby reducing bias and variance. Could that be it?
Nope again. Is it feature (variable) engineering? Or, maybe because we are in the era of “Big Data,” it should be dimension reduction that reduces thousands of potential variables to build a dataset with a more reasonable subset of variables (perhaps less than 100). Although having a manageable dataset can be extremely important, none of the items mentioned above are the ultimate key to building a solid predictive model.
The answer/secret/magic is actually quite simple—clean data. That’s not the flashy, cool, or hip answer, but access to clean data is the most important variable. Without accurate information, neither the modeling technique nor any fancy manipulation of the variables will provide a consistently predictable outcome. Andy Johnson, Vice President of Analytics, touched on this concept of quality data in his blog coming out of the DMA MAC conference—advanced statistical transformations and models are worthless without clean data.
Sound easy? Tell that to any database administrator or a nonprofit that has recently been through a CRM database conversion. Easy is probably not the four-letter word that you’ll hear!
In addition to setting up strict procedures for gathering and entering information, maintaining clean data takes ongoing diligence that goes much deeper than simple NCOA and deceased-record cleansing.
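A minimal sketch of the kind of ongoing hygiene meant here, beyond address and deceased suppression: normalizing fields and collapsing duplicate donor records. The field names and matching rule are hypothetical stand-ins for whatever a real CRM would use.

```python
# Hypothetical example: normalize donor fields, then collapse duplicates.
def normalize(record):
    """Standardize the fields we match on."""
    return {
        "email": record.get("email", "").strip().lower(),
        "name": " ".join(record.get("name", "").split()).title(),
        "zip": record.get("zip", "")[:5],  # ZIP+4 collapses to 5-digit ZIP
    }

def dedupe(records):
    """Keep one record per (email, zip) key; first occurrence wins."""
    seen = {}
    for r in map(normalize, records):
        seen.setdefault((r["email"], r["zip"]), r)
    return list(seen.values())

raw = [
    {"name": "pat  smith", "email": "Pat@Example.org ", "zip": "55401-2105"},
    {"name": "Pat Smith",  "email": "pat@example.org",  "zip": "55401"},
]
print(len(dedupe(raw)))  # -> 1 (the two rows are the same donor)
```

Production matching is far fuzzier than an exact key, but even this toy version shows why hygiene has to happen before modeling: both rows above would otherwise look like two different donors.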
Need channel-specific models? Does your data have definitive ways to identify gifts and promotions by channel, such as email, display ads, direct mail, organic web, or telemarketing? Looking to model contact cadence without being able to identify all of the promotions a person has been exposed to? Want to tailor messaging to different segments of your file? Does the response and promotion data have detail on campaign elements like cultivation vs. fundraising, scientific vs. emotional messaging, or themes?
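The questions above can be turned into a simple data audit. This sketch (with hypothetical channel codes and field names) measures what share of gifts carry a usable channel tag; a low score means channel-specific modeling is not yet feasible.

```python
# Hypothetical example: audit whether gift data can support channel models.
VALID_CHANNELS = {"email", "display", "direct_mail", "organic_web", "telemarketing"}

def audit_channels(gifts):
    """Return the fraction of gifts with a recognized channel code."""
    usable = [g for g in gifts if g.get("channel") in VALID_CHANNELS]
    return len(usable) / len(gifts)

gifts = [
    {"id": 1, "channel": "email"},
    {"id": 2, "channel": "direct_mail"},
    {"id": 3, "channel": ""},       # untagged -- unusable for channel models
    {"id": 4, "channel": "web??"},  # junk code -- unusable
]
print(audit_channels(gifts))  # -> 0.5
```

Running an audit like this per channel and per campaign element is one concrete way to find out whether the data can answer the questions before any model is built.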
All of this is not meant to dismiss modeling techniques and variable manipulation, because they can be very valuable, but clean data must come first. Everyone with a stake in the success of any model must prioritize this critical goal. Assembling the dataset for model builds is often the most time-consuming part of the process, and it is frequently underappreciated.
Once you are confident about the data, you can move on to applying the second most important aspect of building models which is not even mentioned in this blog. “Secret #2” will be discussed in my next blog: Keep Calm and Model On!
By Jeff Huberty | Executive Vice President of Analytics and Partner