Value derived from data science: The tussle between quality & quantity of models
Kaggle is one of the top platforms for data scientists to test their skills against the best.
Ranking among the top in its competitions is one of the hardest tasks to accomplish. However, the gap between the top performers is so small that even a minute improvement in score moves one considerably closer to the top ranks. For example, in the Kaggle competition Rossmann Store Sales, the top ranker had an error rate of 10%, while the 700th ranker out of 3,500 participants had an error rate of 12%.
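As a concrete reference point, Rossmann Store Sales was scored on Root Mean Square Percentage Error (RMSPE) — a minimal sketch of that metric (the sample values below are illustrative, not from the competition):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Square Percentage Error, the metric used in the
    Rossmann Store Sales competition (zero-sales rows are excluded,
    since a percentage error is undefined there)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return np.sqrt(np.mean(pct_err ** 2))

# Two forecasts, each off by 10% of the true value -> RMSPE of 0.10
print(rmspe([100, 200], [110, 180]))
```

An "error rate" of 10% vs 12% in the text corresponds to RMSPE values of roughly 0.10 vs 0.12 on this scale.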
Moving from 12% error rate to 10% error rate would typically involve:
Very high technical rigor
Exponentially higher time invested
Reaching the 10% error rate demands far more of both than reaching 12% does.
This reminds me of a quote from Jack Ma:
"I told my son: you don’t need to be in the top three in your class, being in the middle is fine, so long as your grades aren’t too bad. Only this kind of person [a middle-of-the-road student] has enough free time to learn other skills."
Translated to data science, the quote might read:
"you don’t need to have the best in class model, a decently accurate model is fine, so long as it improves the existing process. Only then, a data scientist has enough free time to work on solving more problems."
For example, consider a typical data scientist working on a store sales forecasting problem. They might have multiple problems that could be solved:
Store sales forecast
Optimal product range by store
Optimal shelf stockings
Minimizing supply chain costs
Improving promotional effectiveness
The big question now is: should one invest time in improving the accuracy of an existing model, or in building more models that solve different business problems?
In this case, if the business value of improving the model from a 12% to a 10% error rate is $x, the question becomes whether investing that same time in the remaining business problems would yield more than $x.
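The opportunity-cost comparison above can be sketched with entirely hypothetical numbers (every figure below is an assumption for illustration, not data from the article):

```python
# Hypothetical: value of squeezing the model from 12% to 10% error ($x),
# and the time that last improvement would take.
value_accuracy_gain = 100_000   # dollars
weeks_for_gain = 12

# Hypothetical: estimated value of a "decent" v1 model for each of the
# other business problems, at roughly 4 weeks per v1 model.
other_problem_values = [60_000, 45_000, 30_000, 25_000]
weeks_per_v1_model = 4

# In the same 12 weeks, how many v1 models could be shipped instead?
n_doable = weeks_for_gain // weeks_per_v1_model
value_breadth = sum(sorted(other_problem_values, reverse=True)[:n_doable])

print(value_breadth, value_breadth > value_accuracy_gain)
```

Under these made-up numbers, three v1 models ($135,000) beat the accuracy push ($100,000); with different estimates the conclusion flips, which is exactly the judgment call the article describes.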
I have seen companies investing in one over the other depending on the maturity of the product & tenure of data science teams.
If the product is not yet mature (say, a typical e-commerce company in an emerging market), there are multiple problems to be solved, so building a version 1 data science model — one that gives decent, if not the best, accuracy — works well.
However, if the product is mature or is built squarely on data science, companies have tended to invest in more data science professionals, each focused on improving the model for their assigned part.
Apart from all the above, there is one other dimension to weigh before investing time in improving a model: implementation feasibility.
The outcome of the $1 million Netflix Prize competition is a classic example. In an official statement, the company said:
"We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment."
With all the above, the decision boils down to:
Feasibility of implementation
Worth of implementation
Opportunity lost due to time invested in working on another problem