Often when we think of a data science assignment, the main thing that comes to mind is the algorithm that needs to be applied. While that is crucially important, there are many other steps in a typical data science assignment that require equal attention.
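A typical data science assignment can have the following stages:
1. Defining the business objective
2. Data processing and analysis
3. Modeling and evaluation
4. Building a data prototype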
Let me explain using a simple case study:
An online retailer is running a shopping festival in the month of November, just before the holiday season. It has a catalogue of a million products and a database of 100 million customers who have bought from it in the past.
The retailer wants to do promotional email campaigns to its customer base. The objective is to run a series of “successful email campaigns”.
Let’s now understand the different life stages of this particular assignment:
1. Defining business objective:
This is an extremely crucial stage, given that a wrong interpretation of the business problem or objective can lead to a faulty solution and undesirable results. The role of data science, if you really think about it, is to use data and the insights drawn from it to solve real-world problems. From that perspective, accurately identifying the problem and defining the objective is essential for a successful outcome. In this example, the marketer wants to send customized emails to each of its customers, showing a list of product offers curated according to the customer’s preferences and tastes.
In this case, to define the business objective, we have to ask a couple of questions:
A) Do we send emails to the entire list of 100 MM customers or to a select group of customers? The retailer is organizing a shopping festival, so it might make sense to send emails to all 100 MM customers, but certain points still need to be considered:
Bombarding all customers with emails could leave some of them unhappy, e.g. the ones who do not actively shop with the retailer.
Since we want to show each customer a curated list of products based on individual preferences, taking all 100 MM customers into account may leave us with a set of customers who do not show a strong preference for any product (possibly because they don’t shop enough with the retailer, so the retailer doesn’t have enough information to know their preferences).
Data processing and storage costs can also be a consideration. Processing 100 MM customers and their characteristics, and running machine-learning algorithms on them, can be quite time and resource intensive. While the infrastructure to handle that may be available, taken together with the first two considerations it may make sense to exclude some customers, especially to speed up time to market.
B) How do we define and quantify the success metric? This is an extremely important decision and is directly linked to the business goals. In the above case, we can have a few possible success metrics:
1. Purchase rate of the campaign (# of purchases / # of emails sent): This metric shows how effective the campaign has been at persuading customers to spend. So, if the retailer is just concerned with how many sales the entire campaign drove, then this is the metric to go for!
2. Email open rate of the campaign (# of emails opened / # of emails sent): This could be important if the retailer wants to understand other factors, such as how effective the email campaign content has been, specifically, in this case, how “catchy” the email subject has been. Similarly, the email click-through rate (after opening the email, the customer clicks on a web link in it to land on the retailer’s website) shows how effective the email content has been.
3. Profitability of the campaign: Sometimes, instead of just getting more customers to respond (i.e. driving a higher response rate), the retailer could be interested in driving higher spend per customer. Think of it this way: a campaign focused on driving more and more customers to spend could end up attracting customers who buy many low-value products, while missing customers who buy less often but buy high-value products.
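To make the three candidate metrics concrete, here is a minimal Python sketch that computes them from raw campaign counts. The `CampaignStats` class, its field names and the numbers are all hypothetical and only for illustration. Average spend per purchase is used as a rough stand-in for profitability, which is an assumption; a real profitability figure would also need cost data.

```python
from dataclasses import dataclass


@dataclass
class CampaignStats:
    """Raw counts for one email campaign (field names are illustrative)."""
    emails_sent: int
    emails_opened: int
    emails_clicked: int
    purchases: int
    revenue: float


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def campaign_metrics(stats: CampaignStats) -> dict:
    """Compute the candidate success metrics as fractions, not percentages."""
    return {
        "purchase_rate": safe_ratio(stats.purchases, stats.emails_sent),
        "open_rate": safe_ratio(stats.emails_opened, stats.emails_sent),
        # Click-through is measured among customers who opened the email.
        "click_through_rate": safe_ratio(stats.emails_clicked, stats.emails_opened),
        # Assumption: average spend per purchase stands in for profitability.
        "spend_per_purchase": safe_ratio(stats.revenue, stats.purchases),
    }


# Made-up numbers, for illustration only.
stats = CampaignStats(emails_sent=1000, emails_opened=250, emails_clicked=50,
                      purchases=20, revenue=1500.0)
metrics = campaign_metrics(stats)
print(metrics)
```

Two campaigns can rank differently depending on which of these numbers you look at, which is why the metric has to be agreed with the business before any modeling starts.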
2. Data processing and analysis:
This is another very important stage, in which we understand in detail the data available to us and how we can use it to accurately solve the problem at hand.
Broadly, there can be the following steps in this stage:
Missing value treatment
Outlier treatment
Segmentation of the data
Feature engineering
Let us go through them one by one to get some intuition on why each step is required. In the above case example, let’s say you have data like below, from past promotional email campaigns:
The data above is a snapshot of three customers (out of the 100 MM customers that the online retailer has) and some of their information.
One can see that the gender of the 2nd customer is not known. Gender can be powerful information, so if a large percentage of the customers have “unknown” or “missing” gender, we lose a very important piece of information. There are several ways to impute gender (for example, from the salutation or the name), and these can be used for missing value treatment. Similarly, if reported annual income is missing (this information is provided only by the customer, who may not be willing to share it), we can use the last 12 months’ spend to impute or predict the annual income.
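Both imputations can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the record layout, the salutation lookup, and the rule of scaling spend by the median income-to-spend ratio of customers whose income is known. A regression model would be the natural next step for the income part. The three records are made up and are not the snapshot referred to above.

```python
from statistics import median

# Hypothetical lookup; ambiguous titles such as "Dr" are deliberately left out.
SALUTATION_TO_GENDER = {"mr": "M", "mrs": "F", "ms": "F", "miss": "F"}


def impute_gender(customer):
    """Keep a known gender, else infer it from the salutation, else 'unknown'."""
    if customer.get("gender"):
        return customer["gender"]
    title = (customer.get("salutation") or "").strip().rstrip(".").lower()
    return SALUTATION_TO_GENDER.get(title, "unknown")


def impute_income(customers):
    """Fill missing income as spend times the median income-to-spend ratio."""
    ratios = [c["annual_income"] / c["spend_12m"] for c in customers
              if c.get("annual_income") is not None and c.get("spend_12m")]
    if not ratios:
        return [dict(c) for c in customers]  # nothing to learn the ratio from
    ratio = median(ratios)
    filled = []
    for c in customers:
        c = dict(c)  # copy so the input records are not modified
        if c.get("annual_income") is None and c.get("spend_12m") is not None:
            c["annual_income"] = ratio * c["spend_12m"]
            c["income_imputed"] = True  # flag so imputed values stay visible
        filled.append(c)
    return filled


# Made-up records, for illustration only.
customers = [
    {"id": 1, "salutation": "Mr", "gender": "M",
     "annual_income": 80000.0, "spend_12m": 4000.0},
    {"id": 2, "salutation": "Ms.", "gender": None,
     "annual_income": None, "spend_12m": 1500.0},
    {"id": 3, "salutation": "Dr", "gender": "F",
     "annual_income": 60000.0, "spend_12m": 2000.0},
]

cleaned = impute_income(customers)
for c in cleaned:
    c["gender"] = impute_gender(c)
print(cleaned[1])
```

Keeping an `income_imputed` flag is a design choice worth considering: the model can then learn whether imputed values behave differently from reported ones.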
Outlier treatment is also important. For example, we could see some very high values of “last 12 month spend” or “annual income”. In the case of spend, this could be due to one-off high-dollar spending by certain customers, which may not persist and which can bias the entire data set. Capping the spend values at some threshold (e.g. the 99th or 95th percentile of “last 12 month spend”) can help reduce such bias.
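Capping at a percentile can be sketched as follows. This version uses the nearest-rank definition of a percentile, which is one of several common definitions and is chosen here only to keep the example dependency-free. The function names and the spend values are made up.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest value with at least pct percent of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cap_at_percentile(values, pct=99):
    """Replace every value above the pct-th percentile with that percentile."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values]


# Made-up spend values: 98 typical customers and two unusually large spenders.
spend = [100.0] * 98 + [500.0, 50000.0]
capped = cap_at_percentile(spend, pct=99)
print(max(spend), max(capped))
```

Here the single 50,000 spend is pulled down to the 99th percentile value, so it no longer dominates averages computed on the column.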
Sometimes, we may see that there are distinct segments of customers within the data that behave very differently. For example, recent customers (those who became members of the online retailer in the last 6 months) are likely to behave very differently from the rest of the customers (they may be very inquisitive, so email open rates could be very high, but purchase rates could be low). Mixing these customers with the rest can either bias the data on certain parameters, or these customers’ features may get overshadowed by the rest of the customers, reducing their representation in any prediction algorithm that is built. In such cases, it may make sense to build separate algorithms for these two “data segments” (new customers and remaining customers).
Feature engineering: Features or variables are really what give predictive power to the algorithms. So, having the right set of features is key to building a robust algorithm, hence the focus on feature engineering. Types of feature engineering:
Feature selection: Selecting the subset of features that are most useful to the problem. There are many feature selection approaches, such as scoring methods based on correlation, information value or other measures of feature importance. With more and more computing power and machine learning techniques, however, feature selection is increasingly being handled within the algorithm.
Feature construction: The manual construction of new features from raw data. For example, in the above case study, we have a feature “last spend date”, which in itself might not provide any predictive power. However, we can create a feature “days since last spend”, which can be very powerful (a customer who has spent recently could have higher intent to spend again and therefore could be more responsive to an email offer).
Feature extraction: Some data, like image, voice and text, can have many features, so, through feature extraction, we can automatically reduce the dimensionality of these types of data and also extract hidden features from them. For example, in image recognition, like the Pokémon image below, each image can have hundreds of features (pixels). So, any image recognition algorithm has to deal with a huge number of features from multiple images. Hence, the algorithm has to be able to automatically extract and reduce these large numbers of features to a smaller set of meaningful features.
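The feature construction example, turning “last spend date” into “days since last spend”, is simple enough to sketch directly. The function name, the customer ids, the dates and the scoring date below are all hypothetical.

```python
from datetime import date


def days_since_last_spend(last_spend_date: date, as_of: date) -> int:
    """Number of whole days between the last purchase and the scoring date."""
    delta = (as_of - last_spend_date).days
    if delta < 0:
        raise ValueError("last_spend_date is after the as_of date")
    return delta


# Hypothetical scoring date and purchase dates.
as_of = date(2023, 11, 1)
last_spend = {"cust_1": date(2023, 10, 2), "cust_2": date(2022, 11, 1)}
recency = {cid: days_since_last_spend(d, as_of) for cid, d in last_spend.items()}
print(recency)
```

The raw date only becomes comparable across customers once it is measured against a common reference date, which is why the constructed feature tends to carry more signal than the original column.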
3. Modeling and Evaluation:
This is the step where we have to select the “right algorithm” to get the “right set of solutions” for our business problem. This, as you can see, is an extremely important step, and the key is to find the most suitable algorithm for the given business objective. In the case above, without going into a lot of detail, we have two objectives: (1) finding the most responsive set of customers out of the 100 MM, let’s say x of them, and (2) for each customer in this list of x customers, showing the offers that are most relevant to his or her preferences. For the first objective, we need a response prediction algorithm (e.g. regression techniques) that gives a response likelihood score or probability for each customer, which can then be used to rank-order the customers and select the most responsive ones for the campaign. For objective (2), finding each customer’s offer preferences, we need algorithms that can help select the product offers most likely to be preferred by a customer (e.g. recommender algorithms or classification techniques).
Once we have built the algorithms, their evaluation is also based on how well they meet the objectives at hand. Let’s understand this using the case study above. Assume we have built a response prediction algorithm that rank-orders the 100 MM customers based on their probability of buying a product after seeing the email offers.
Now, we bucket these 100 MM customers into 10 equal buckets, rank-ordered in descending order from the highest probability of response to the lowest. For each of these buckets, we look at the customers’ actual response rate to a previously sent email offer campaign, which was sent to all 100 MM customers:
Please note: the response here is product purchase after seeing the email offer
So, to meet objective 1, we just have to decide up to which bucket we want to send the email offer.
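The bucketing and the cutoff decision can be sketched as below. This is a minimal illustration, not the retailer’s actual procedure: the input is assumed to be a list of (predicted probability, responded) pairs, the function names are made up, and `min_actual_rate` is a hypothetical business threshold for the lowest response rate worth mailing. The tiny data set uses 5 buckets instead of 10 to stay readable.

```python
def bucket_table(scored, n_buckets=10):
    """Rank by predicted probability (descending) and summarise equal buckets.

    scored: list of (predicted_probability, responded) pairs, responded in {0, 1}.
    Bucket 1 holds the customers with the highest predicted probability.
    """
    ranked = sorted(scored, key=lambda row: row[0], reverse=True)
    n = len(ranked)
    rows = []
    for b in range(n_buckets):
        chunk = ranked[b * n // n_buckets:(b + 1) * n // n_buckets]
        if not chunk:
            continue  # fewer customers than buckets
        rows.append({
            "bucket": b + 1,
            "customers": len(chunk),
            "avg_predicted": sum(p for p, _ in chunk) / len(chunk),
            "actual_rate": sum(y for _, y in chunk) / len(chunk),
        })
    return rows


def last_bucket_to_mail(rows, min_actual_rate):
    """Deepest bucket reached before the actual rate drops below the threshold."""
    cutoff = 0
    for row in rows:
        if row["actual_rate"] < min_actual_rate:
            break
        cutoff = row["bucket"]
    return cutoff


# Made-up scores and responses from a past campaign, for illustration only.
scored = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1),
          (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
rows = bucket_table(scored, n_buckets=5)
cutoff = last_bucket_to_mail(rows, min_actual_rate=0.5)
for row in rows:
    print(row)
print("mail buckets 1 to", cutoff)
```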
Now, in the above table, you can see that there are discrepancies between the values of “average probability of response” and “actual response rate” for some of the buckets, e.g. buckets 3 and 4. So, the predictions are not very “accurate” when compared to the actual values. However, since the objective here is to select a set of customers with a high likelihood of response, we are more concerned with how well the model rank-orders the customers in terms of response. Looking at the actual rates, it seems to be doing a pretty good job (the actual response rates from past campaigns are also ordered, for the most part, in descending order).
So, here, the evaluation of the model results is more about how well the model rank-orders the customers by their response probability than about the accuracy of the predictions.
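The difference between the two kinds of evaluation can be shown with two small checks: one counts how often a lower-ranked bucket out-responds the bucket above it, the other measures the average gap between predicted and actual rates. Both function names and all bucket values are made up for illustration; they are not the numbers from the table above.

```python
def rank_order_breaks(actual_rates):
    """Count adjacent bucket pairs where the lower-ranked bucket responded more."""
    return sum(1 for upper, lower in zip(actual_rates, actual_rates[1:])
               if lower > upper)


def mean_abs_gap(predicted, actual):
    """Average absolute difference between predicted and actual bucket rates."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up bucket-level values, ordered from bucket 1 (highest score) downwards.
predicted = [0.30, 0.22, 0.15, 0.10, 0.05]
actual = [0.28, 0.21, 0.09, 0.05, 0.04]

breaks = rank_order_breaks(actual)
gap = mean_abs_gap(predicted, actual)
print("rank-order breaks:", breaks)
print("mean absolute gap:", round(gap, 4))
```

In this made-up case the middle buckets are noticeably over-predicted, yet there are no rank-order breaks, so the model would still pick the right customers for the campaign.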
However, when we evaluate the results of the second model, which gives a preference score for every product offer for every customer, prediction accuracy can be more important. Let’s say, in the above case, there are 10 product offers. So, we build a model that gives a preference score for each of the customers for each of the 10 product offers:
Here, customer 1 has a higher preference for offers 1, 2 and 4, in that order. For offers 3 and 5, since the preference score is very low, we can assume that the customer doesn’t have any preference for these offers. Similarly, we can say that customer 2 is not showing a preference for any particular offer. We can create a threshold score: if a customer’s score for an offer is higher than that threshold, we consider it a preference, otherwise not.
So, you can see here that we are making such assessments based on the value of the score, and therefore it is important that we have accurate scores that reflect the true preferences of the customer. Hence, in this model evaluation, prediction accuracy is very important.
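The threshold rule can be sketched as follows. The scores and the threshold of 0.3 are hypothetical, chosen only to mirror the pattern described above (customer 1 prefers offers 1, 2 and 4; customer 2 shows no clear preference); they are not the article’s actual numbers.

```python
def preferred_offers(scores, threshold):
    """Return offers scoring strictly above the threshold, highest score first."""
    kept = [(offer, s) for offer, s in scores.items() if s > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [offer for offer, _ in kept]


# Hypothetical preference scores per customer and offer.
preference_scores = {
    "customer_1": {"offer_1": 0.82, "offer_2": 0.64, "offer_3": 0.05,
                   "offer_4": 0.41, "offer_5": 0.03},
    "customer_2": {"offer_1": 0.12, "offer_2": 0.10, "offer_3": 0.11,
                   "offer_4": 0.09, "offer_5": 0.10},
}

THRESHOLD = 0.3  # assumed cut-off; in practice this is a business decision
selected = {cust: preferred_offers(scores, THRESHOLD)
            for cust, scores in preference_scores.items()}
print(selected)
```

Because the decision depends on where each score falls relative to a fixed cut-off, a model whose scores are systematically too high or too low would change which offers get shown, even if its ranking of offers were perfect.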
4. Building a data prototype:
By building a data prototype, what we mean is creating the necessary infrastructure to implement the solution in a production environment. Given that implementation is a time and resource intensive process, it needs appropriate consideration. In the above case, some of the considerations could be:
Is this email campaign a one-off marketing initiative or a more regular one? If regular, then it makes sense to create a production platform to execute such campaigns.
For such a platform, how will all the data feeds from different sources be put together? Assessments need to be made in terms of the effort and cost involved in cleaning the source data, its update frequency, internal data hygiene checks and balances etc.
How will all this data be stored and processed? This involves decisions like the need for parallel processing (if data volume is huge) or real-time processing as well as storage infrastructure.
How will the emails be delivered? Again, the decisions required here include the need for a third-party email delivery vendor, customer data privacy checks and balances, and speed to market, including the need for real-time processing.
These are some of the considerations, but depending on the scale and complexity of the assignment, there can be many other things that need to be assessed and evaluated.
So, as you can see, a data science assignment is the sum total of many stages that require domain expertise and a detailed understanding of the business objectives along with technical expertise. One cannot do without the other!