AI Insights, NOVEMBER 2020
How a Platform should support Data Science
Björn Preuß
Chief Data Scientist, 2021.AI
A modern data science platform’s focus should not be to enable everyone to build machine learning models. Instead, the focus should be on structuring the deployment process, allowing for more transparent and governed models that are usable on all applications across an enterprise.
Data Science is often about model development and the process of developing the best working and most efficient model for a given problem. Kaggle competitions share this exact view and suggest that companies submit their challenges so that the world’s best data scientists can develop models to solve them.
When working with data science in this way, you might end up with the best model in class, and the problem gets solved, but then what? What comes next?
In larger-scale companies, the use of data science depends as much on a structured approach to model deployment and control as on developing a good machine learning model.
This all leads us to a question: what might solve this problem?
Gartner’s Magic Quadrant for data science platforms reveals that software providers try to help companies improve their data science capabilities. You read that correctly; they will help you get better at data science. Looking at most of the vendors, they try to make the process of model development leaner and more efficient. They aim to make every business intelligence (BI) person capable of building machine learning (ML) models. But is this really a problem that needs to be solved? Is this what companies need?
Probably not. Companies want to have the best models in class built by experts to solve their problems. In the following paragraphs, we will elaborate on what makes data science in a large enterprise problematic and what organizations need to overcome when they wish to unlock the business value that lies in machine learning and data science models.
Gaining value from ML and data science models
Collecting as much data as possible – that was the key process back when big data was the big trend. But the question of what should be done with all that data is still open. Data scientists bridge this gap by pairing data with highly complex models written in programming languages, using open source components such as TensorFlow, Spark, Hadoop, etc. Even with a good working model, the organization is still left behind when it comes to generating value.
The data scientist’s model, preferably made by one with Dr. or Prof. titles, resides on a server in the basement. No other person can make sense of the code or math formulas in the comprehensive documentation. This dilemma often leads to a “spiral of model death,” meaning no model gets productionized or used. Furthermore, the data scientist might leave for a new job, leaving the organization with only nominal value.
Building on this, we’ve recently seen emerging frameworks and software products that aim to provide a solution. Lightweight, easy to “play” with, and shiny. Those should solve the problem, but they usually focus on a leaner way to build models. With this, they harvest the flexibility of open source and the advanced features that coding languages have. Many “drag and drop” and AutoML platforms emerged, but these do not leave enough room for tuning and optimizing models – and “real” data scientists do not want to use these “dumbed down” platforms. For real models, real platforms are needed. Models need to be built by experts if they are to support a critical business function.
The real problem that software should solve is the set of lengthy processes surrounding model development and deployment.
Supporting processes should be standardized for all the models. This covers processes like model deployment, data access points, model management, monitoring of models, deployment of APIs and their management. Let’s walk through a process of model development and look at the parts that should be supported by software.
How software can take over model development
Before starting a data science project, the data scientist needs to consider the tools and components they want to use. These tools include the language to write the model in (e.g., Python) or the way they should store the data (e.g., Hadoop). He or she often sets up the environment.
This configuration and installation task can be taken away from the data scientist and done by a software product that integrates all the standard components needed to be fluent in data science. All the data scientist needs to do is use the components available at their fingertips. Tech unicorns like Amazon and Uber have already done this. Uber’s Michelangelo platform, for example, provides data scientists with all the important components ready to be used in one managed platform.
The process starts with accessing, cleaning and processing the data. Here we have some steps which can be solved by software and some that will always be manual.
Let’s start with the manual part. Looking for and then cleaning data is a process that, to some extent, will always be manual, because a data scientist always needs to understand the data. Accessing data and structuring the process, on the other hand, can be supported by software and the frameworks delivered by it. This can include standard connectors to data sources, as well as code snippets for preprocessing (e.g., hashing of variables, etc.).
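As an illustration of the kind of preprocessing snippet a platform could ship, here is a minimal sketch. The article does not say what "hashing of variables" covers, so two common readings are assumed: mapping a categorical value to a fixed number of buckets, and replacing an identifier with a salted hash. The function names, the bucket count, and the salt are all hypothetical.

```python
import hashlib


def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    """Map a categorical value to a stable bucket index.

    Assumption: 'hashing of variables' means feature hashing. The same
    input always lands in the same bucket, on any machine.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted hash so the raw value is not stored.

    Assumption: 'hashing of variables' means masking sensitive fields.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


# Hypothetical usage on one record
record = {"country": "DK", "customer_id": "C-10042"}
features = {
    "country_bucket": hash_bucket(record["country"]),
    "customer_key": pseudonymize(record["customer_id"], salt="project-salt"),
}
```

Because both helpers are deterministic, every model on the platform that uses them sees the same encoding for the same raw value.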
The heart of data science
After accessing and transforming your data, we approach the core part of the process – modeling. The cool stuff! Here the data scientist uses all the provided tools he or she is familiar with. This part of the job does not need to be, and should not be, done by software. This is where a good data scientist will have the most significant impact on model performance.
This part is where Kaggle competitors want to be and are good at. However, software can still help here by providing guidance on how to structure models and by offering example code pieces as a benchmark. This could be especially helpful for a less experienced data scientist.
It is important to mention that this is by no means a standard model but more like Lego bricks that help get the coder started. The “raw” bricked model then needs to be tweaked to get it to its full potential. This also shows that drag and drop environments might give too little flexibility if models are to become advanced. They serve only as a baseline, but nothing more.
Where software shines
Leaving the heart of data science, we reach model deployment. Now, it’s time for the model to go into production. This is, as stated before, the key. Every model that is not in production or cannot reach this stage is a failed model. Without having the model running in production, management will never see long-term value coming out of data science, and we will end up in the aforementioned spiral of death. Putting the focus on model production makes it possible to standardize this process as much as possible. This is where software shines.
Having a standard deployment process that ensures security and scalability will differentiate a successful model implementation from an unsuccessful one.
But the lifecycle of a model does not end here. Deploying a model is just the start, and still, 90-99% of the model’s lifetime lies ahead.
This links to the implementation phase. A deployed model must be usable. The predictions need to be brought from the data scientist to management and operations, and not in slides. An easy way to do this might be to show them in business intelligence dashboards or to connect an existing ERP/CRM system to the model so that it consumes the prediction results. As in the deployment phase, software should take over most of the workload by exposing a standard API and making it possible to query model results in a structured way.
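To make the idea of a standard API more concrete, here is a minimal sketch of one request and response format wrapped around an arbitrary model. It is only an illustration under stated assumptions: the field names (`features`, `prediction`, `model`, `version`), the `make_endpoint` helper, and the toy `churn_score` model are hypothetical and not taken from any particular platform.

```python
import json
from datetime import datetime, timezone
from typing import Callable, List


def make_endpoint(
    model_name: str,
    version: str,
    predict_fn: Callable[[List[float]], float],
) -> Callable[[str], str]:
    """Wrap any model's predict function in one shared JSON format."""

    def handle(request_body: str) -> str:
        payload = json.loads(request_body)
        response = {
            "model": model_name,
            "version": version,
            "prediction": predict_fn(payload["features"]),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return json.dumps(response)

    return handle


def churn_score(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model built by a data scientist."""
    return min(1.0, sum(features) / 10.0)


# A dashboard or ERP/CRM connector would send the same JSON to every model.
endpoint = make_endpoint("churn", "1.0.0", churn_score)
reply = json.loads(endpoint('{"features": [1.0, 2.0, 3.0]}'))
```

Because every model answers in the same shape, a consumer only has to be integrated once, and the model name and version in each reply make results traceable.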
This standardization of the processes around model deployment and API exposure ensures lean and structured model management. This is key when it comes to supporting a model over its lifetime. The data points a model creates are centrally available in a standard form and can be used to manage models in a structured way. Having the ability to bring transparency into the models and knowing why a model predicts a result opens more opportunities than just transparency and reliability.
It is also possible to do simulations on the model’s results and turn model insights into business actions. The usability of model insights and the information provided by standard deployments ensure reliability and governance. In fact, security, governance, and transparency are, at the enterprise level, by far the most critical areas where software must support data science. Without those vital functions, machine learning will turn from being the solution for all to the nightmare of all.
What’s new?
Now we have a system that supports the data scientist in building models and helps IT deploy the models in production. This leaves enough freedom to develop the latest models while framing them with the standardized processes and structure that production systems need. But since late 2019, more has been needed when it comes to an enterprise-wide AI system. That "more" is labeled governance. With governance, or specifically AI governance, one needs a system that controls the development process, the AI/ML model, and the model life cycle once it is in production. With governance, the compliance manager has an overview of what the data scientist is doing and how IT runs the system.
Governance is a cornerstone for an enterprise AI system. Compliance must be met with the highest efficiency possible.
A governance framework includes:
- Monitoring capabilities for users and models as well as the system.
- Assessments that can be used to document the model and how it was constructed.
- Certain security and IT admin functionality that guides the user and restricts access.
- Other features that give the compliance manager an easy overview to check for compliance.
Oftentimes platforms only fulfill some of these requirements, delivering an incomplete picture for the user. This results in additional documentation and monitoring work done manually or with third-party tools.
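Two items from the list above, restricting access and giving the compliance manager an overview, can be sketched together as a role-based permission check that records every request. This is only an illustration: the class names, roles, and actions below are hypothetical, and a real governance framework would cover far more than this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class AuditEvent:
    user: str
    action: str
    model: str
    allowed: bool
    timestamp: str


@dataclass
class GovernanceLog:
    # Hypothetical mapping: role name -> actions that role may perform
    permissions: Dict[str, Set[str]]
    events: List[AuditEvent] = field(default_factory=list)

    def request(self, user: str, role: str, action: str, model: str) -> bool:
        """Check whether the role may perform the action and record the attempt."""
        allowed = action in self.permissions.get(role, set())
        stamp = datetime.now(timezone.utc).isoformat()
        self.events.append(AuditEvent(user, action, model, allowed, stamp))
        return allowed

    def denied(self) -> List[AuditEvent]:
        """Overview for the compliance manager: every blocked attempt."""
        return [event for event in self.events if not event.allowed]


log = GovernanceLog(
    permissions={"data_scientist": {"train", "document"}, "it_admin": {"deploy"}}
)
can_train = log.request("alice", "data_scientist", "train", "churn")
can_deploy = log.request("alice", "data_scientist", "deploy", "churn")
```

Here the data scientist may train but not deploy, and both attempts end up in the same log, so the overview does not depend on anyone writing documentation by hand.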
Summary
To conclude, a modern data science platform should not focus on enabling everyone to build machine learning models. Instead, it should give more structure to the deployment process. With structure, data science models will be more transparent, better governed, and overall more usable from enterprise-wide applications. In other words, what is needed is a data science environment designed for the enterprise: one that maintains flexibility for the data scientist while providing structure and control for the organization, all in one.