HOW A PLATFORM SHOULD SUPPORT DATA SCIENCE
A modern data science platform should not have the focus of enabling everyone to build machine learning models. Instead, the focus should be on structuring the deployment process, enabling more transparent and governed models, usable from enterprise-wide applications.
BJÖRN PREUSS | SENIOR DATA SCIENTIST | MARCH | 2019
Data science is often equated with model development: the process of finding the best-performing and most efficient model for a given problem. Kaggle competitions rest on exactly this view. Companies submit their problems so that the world’s best data scientists can solve them.
Doing so, you may end up with a best-in-class model that solves the problem. But what comes next?
In larger companies, the successful use of data science depends as much on a structured approach to model deployment and control as on a good machine learning model.
This all leads us to a question: what might solve this problem?
Gartner’s Magic Quadrant for data science platforms reveals that there are software vendors trying to help companies get better at data science. Yes, getting better at data science. Most of these vendors, however, actually try to make the process of model development leaner and more efficient. Their aim is to make every business intelligence (BI) person capable of building machine learning (ML) models. But is this really a problem that needs to be solved? Is this what companies need?
Probably not. Companies want best-in-class models, built by experts, to solve their problems. In the following paragraphs, we elaborate on what makes data science problematic in a large enterprise and what organizations need to overcome to attain the business value that lies in machine learning and data science models.
ATTAINING THE VALUE OF MACHINE LEARNING AND DATA SCIENCE MODELS
Collecting a lot of data was the motto when big data was the trend. The question of what should be done with that data has still not been answered completely. Data scientists, who hold “the sexiest job of the 21st century” according to HBR, aim to fill this gap with highly complex models written in programming languages, using open source components such as TensorFlow, Spark and Hadoop. But even with a well-working model, the organization is still left behind when it comes to generating value from it.
The model built by the data scientist, preferably one with Dr. and Prof. titles, resides on a server in the basement, and nobody else can make sense of the code or the math formulas in the comprehensive documentation. This dilemma often leads to a “spiral of model death” in which no model gets productionized and used. Furthermore, the data scientist might leave for a new job, which would leave the organization with only minor value.
Building on this, we have recently seen frameworks and software products emerge that aim to solve the problem. They are lightweight, easy to “play” with, and shiny. They were meant to solve the problem, but what they usually focused on was a leaner way to build models, and in doing so they gave up the flexibility of open source and the advanced features that programming languages offer. Models that support a critical business function need to be built by experts.
The real problem that software should solve is the set of lengthy processes around model development and deployment.
Supporting processes should be standardized for all the models. This covers processes like model deployment, data access points, model management, monitoring of models, deployment of APIs and management of those. Let’s walk through a process of model development and look at the parts that should be supported by software.
HOW CAN SOFTWARE TAKE OVER THE PROCESS OF MODEL DEVELOPMENT
Before even starting a data science project, the data scientist often needs to think about the tools and components he or she wants to use. That can be the language to write the model in (e.g. Python) or the way data should be stored (e.g. Hadoop). He or she often also has to set up the environment.
This task of configuration and installation can be taken away from the data scientist and handled by a software product that integrates all the standard components needed to be fluent in data science. All the data scientist needs to do is use the components at his or her fingertips. Tech companies like Amazon and Uber have already done this, and their data scientists have all the important components ready to use in one managed platform; see, for example, Uber’s Michelangelo platform. After that, the modeling can start.
The start is about accessing data and cleaning as well as processing it. Here we have some steps which can be solved by software and some that will always be manual.
Let’s start with the manual part. Finding the data that is needed and cleaning it will, to some extent, always be manual, because a data scientist always needs to understand the data. Accessing data and structuring the process, by contrast, can be supported by software and the frameworks it delivers. This can include standard connectors to data sources as well as code snippets for preprocessing (e.g. hashing of variables).
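As an illustration of the kind of preprocessing snippet a platform could ship, here is a minimal sketch of hashing categorical variables into a fixed number of buckets. The function names, the bucket count and the salting scheme are assumptions made for this example, not part of any specific product.

```python
import hashlib


def hash_feature(value: str, n_buckets: int = 1024, salt: str = "") -> int:
    """Map a categorical value to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_row(row: dict, columns: list, n_buckets: int = 1024) -> dict:
    """Return a copy of the row with the selected columns replaced by bucket indices."""
    hashed = dict(row)
    for col in columns:
        # Salting with the column name hashes each column independently,
        # so the same raw value in two columns does not share a bucket by design.
        hashed[col] = hash_feature(str(row[col]), n_buckets, salt=col + "=")
    return hashed


# Hypothetical input row; only the "country" column is hashed.
example = hash_row({"country": "DK", "age": 41}, ["country"])
```

Because the hash is deterministic, the same value always lands in the same bucket, which is what lets such a snippet be reused unchanged between training and production.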
THE HEART OF DATA SCIENCE
After accessing and transforming the data, we approach the core part of the process: modeling. The cool stuff! Here the data scientist uses all the provided tools he or she is familiar with. This part of the job need not and should not be done by software, since this is where a good data scientist has the biggest impact on model performance.
This is where Kaggle competitors want to be and what they are good at. Software can still help here by providing guidance on how to structure models and by offering example code as a benchmark. This could be especially helpful for a less experienced data scientist.
Having said that, it is important to mention that such examples are by no means a standard model; they are more like Lego bricks that help the coder get started. The “raw” bricked model then needs to be tweaked to reach its full potential. This also shows that drag-and-drop environments might offer too little flexibility if models are to become very good. They can serve as a baseline but nothing more.
WHERE SOFTWARE SHINES
Once we leave the heart of data science, the model development, the model needs to be taken into production. This is, as stated before, the key. Every model that is not in production or cannot reach this stage is a failed model. Without the model running in production, management will never see long-term value coming out of data science, and we will end up in the aforementioned spiral of death. Putting the focus on model production makes it possible to standardize this process as much as possible. This is where software shines.
Having a standard deployment process that ensures security and scalability will differentiate a successful model implementation from an unsuccessful one.
But the lifecycle of a model does not end here. Having a model deployed is just the start, and roughly 90-99% of the model’s lifetime is still ahead.
This links to the implementation phase. A deployed model must be usable. The predictions need to be brought from the data scientist to management and operations, and not in the form of slides. An easy way to do this might be to show them in business intelligence dashboards or to connect an existing ERP/CRM system to the model so that it consumes the prediction results. As in the deployment step, software should take over most of the workload by exposing a standard API and making it possible to query model results in a structured way.
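To make the idea of a standard API more concrete, the following is a minimal sketch of a request handler that answers every prediction request in the same JSON shape, regardless of which model sits behind it. The registry, the field names ("model", "version", "features") and the toy churn model are hypothetical and chosen only for illustration; a real platform would put an HTTP layer, authentication and logging around such a function.

```python
import json
from typing import Callable, Dict

# Hypothetical in-memory registry: "name:version" -> prediction function.
MODELS: Dict[str, Callable[[dict], float]] = {}


def register(name: str, version: str, predict_fn: Callable[[dict], float]) -> None:
    """Make a model available under a name and version."""
    MODELS[f"{name}:{version}"] = predict_fn


def handle_predict(request_body: str) -> str:
    """Answer a JSON prediction request with a JSON response of fixed shape."""
    try:
        request = json.loads(request_body)
        key = f"{request['model']}:{request['version']}"
        features = request["features"]
    except (ValueError, KeyError, TypeError) as exc:
        return json.dumps({"status": "error", "detail": f"bad request: {exc!r}"})
    if key not in MODELS:
        return json.dumps({"status": "error", "detail": f"unknown model {key}"})
    prediction = MODELS[key](features)
    return json.dumps({"status": "ok", "model": key, "prediction": prediction})


# Toy stand-in for a real model, used only to exercise the handler.
register("churn", "1", lambda f: 0.8 if f.get("months_inactive", 0) > 3 else 0.1)
response = handle_predict(
    json.dumps({"model": "churn", "version": "1", "features": {"months_inactive": 5}})
)
```

A dashboard or an ERP/CRM integration then only needs to understand this one response format, no matter how many models are deployed.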
This standardization of the processes around model deployment and API exposure also enables lean and structured model management, which is key to supporting models over their lifecycle. Because every model is deployed the same way, the data points it creates are centrally available in a standard form that can be used to manage models in a structured way. The ability to bring transparency into the models and to know why a model predicts a result opens more opportunities than just transparency and reliability.
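One way to picture data points that are "centrally available in a standard form" is a shared record type for every prediction, on top of which simple monitoring can be built. The sketch below assumes a hypothetical record layout and a deliberately naive check (mean prediction compared with a stored baseline); it is not a full drift-detection method.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class PredictionRecord:
    """One logged prediction, in the same shape for every deployed model."""
    model: str
    version: str
    timestamp: str  # ISO 8601 string
    prediction: float


def mean_prediction_by_model(records: List[PredictionRecord]) -> Dict[Tuple[str, str], float]:
    """Average the logged predictions per (model, version)."""
    grouped: Dict[Tuple[str, str], List[float]] = {}
    for record in records:
        grouped.setdefault((record.model, record.version), []).append(record.prediction)
    return {key: mean(values) for key, values in grouped.items()}


def drifted(records, baseline, tolerance=0.1):
    """List models whose mean prediction moved more than `tolerance` from the baseline."""
    current = mean_prediction_by_model(records)
    return sorted(
        key for key, value in current.items()
        if key in baseline and abs(value - baseline[key]) > tolerance
    )
```

Because all models log the same record type, one monitoring routine covers every model instead of each data scientist writing their own.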
It is also possible to run simulations on the model results and thereby turn model insights into business actions. The usability of model insights and the information provided by standard deployments ensure reliability and governance. In fact, at the enterprise level, security, governance, and transparency are by far the most important areas where software must support data science. Without those key functions, machine learning will turn from being the solution for all into the nightmare of all.
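A simulation on model results can be as simple as re-scoring the same input while varying one feature. The following sketch assumes a hypothetical `what_if` helper and a toy scoring function standing in for a deployed model; both are illustrations, not a description of a particular platform feature.

```python
from typing import Callable, Iterable, List, Tuple


def what_if(
    predict_fn: Callable[[dict], float],
    baseline: dict,
    feature: str,
    candidates: Iterable,
) -> List[Tuple[object, float]]:
    """Re-score the baseline input with one feature varied over the candidate values."""
    results = []
    for value in candidates:
        scenario = {**baseline, feature: value}  # copy, so the baseline stays untouched
        results.append((value, predict_fn(scenario)))
    return results


def toy_churn_risk(features: dict) -> float:
    """Toy stand-in for a deployed model: risk grows with months of inactivity."""
    return min(1.0, features["months_inactive"] / 10)


customer = {"months_inactive": 6, "plan": "basic"}
scenarios = what_if(toy_churn_risk, customer, "months_inactive", [6, 3, 0])
```

Comparing the scenarios shows how much a business action, here re-engaging an inactive customer, would be expected to change the prediction under the model.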
To conclude, a modern data science platform should not have the focus of enabling everyone to build machine learning models. But instead, it should give more structure around the deployment process. With this, data science models will be more transparent and governed and hence usable from enterprise-wide applications.