April 2021

How data scientists are visualizing their work faster than ever: Closing the gap in applied AI

AI Compliance
MLOps
AI Governance

As any data scientist knows, the most successful projects require communication with business partners early and often. Data scientists who want to rapidly share their work with business partners and clients often find themselves running code and taking screenshots for PowerPoint presentations.

The problem is that these tactics are time-intensive: they usually involve hands-on participation in meetings to walk stakeholders through work that will soon be updated, at which point the whole process of translating into PowerPoint starts again. That's a productivity problem waiting for a better solution.

So why not just use BI?

While BI tools like Power BI, Tableau and Qlik have come a long way, and can usually be found embedded in your enterprise systems today, they still don't solve the crux of the problem. What they're great at is displaying data, tracking against KPIs available in enterprise systems and applying formulae to existing data.

But they were never really designed to discover new aspects of your data using ML/AI. More importantly, they can’t help you discover new data sources. BI deals with known unknowns, while Data Science deals with unknown unknowns.

What’s more, BI tools typically require:

  • Some pre-existing tooling knowledge
  • Data that comes from approved and acknowledged data sources within the company
  • Licenses that can be prohibitively expensive and are usually purchased at an enterprise level, limiting easy access at the place and moment of need
  • An interface for use by non-technical people, limiting the utility and options for data science teams in their attempts to visualize and interact with data
  • An enterprise audience for maximum effect, rather than individual users or stakeholders in a project
  • Use cases that don’t involve prototyping or heavy use of open-source tooling
[Image: a typical BI tool dashboard]

Over the last four years, a number of options have emerged that are designed for the rapid visualization of data.

Let’s take a closer look at one of these, Dash.

Dash, provided by Plotly, is a set of open-source Python libraries for quickly building web-based visualization applications, and it is widely regarded as one of the best ways for data scientists to visualize data. Here’s what’s great about it:

  • It is open source and completely free
  • Runs in Python, no HTML experience required
  • Fantastic visuals
  • Interactive user experiences
  • Easy to use
  • Tried and tested

While Dash was probably the first such tool on the market, newer alternatives have since emerged, such as:

  • Streamlit, for even more interactive prototyping
  • Shiny, for R models
  • Voilà and Panel, for turning Jupyter notebooks into applications

But the core problem persists. What do you do if you are a data scientist who just wants to quickly display your work to a colleague or client?

Reframe the problem

The first step is to reframe the problem and visualize it in a different context.
Here are the three real reasons why these frameworks have not really caught on in the broader data science community.

Security. Data science teams in most organizations deal with datasets and models that are sensitive in nature. They often work with classified or privacy-protected datasets that cannot leave the organization’s technology environment. Building a Dash application and exposing it to any stakeholder without a security or compliance process is a huge risk, and in some cases can even lead to fines, jail time or company bankruptcy.

Operations. Data science teams are set up to work with other data scientists, not to run production services. When you build a Dash application, you have to make sure there is a Service Level Agreement and that the APIs serving the Dash front end are operational 24/7/365.

Scalability. Building a Dash application and exposing a URL to one stakeholder might not be an issue. However, imagine 100 stakeholders, or an entire company and its customers using the Dash app intensively. Quite likely, it would require more CPU/GPU/Memory or even multiple instances running at the same time in a cluster with a load-balancer on top to distribute the load.

One way to fix this is for your data science team to set up a virtual machine in your organization, deploy Python and run the Dash application there. But that means you also need a solution for handling security, operations and scalability on that virtual machine. This task typically falls to IT, who may not have the skills and infrastructure to do it right.

Another way to fix this is to deploy your Dash applications on a PaaS built for the purpose. Plotly (https://plotly.com/dash/), for example, offers enterprise hosting of Dash apps. It’s a great way to host your applications, but it does come with a price tag. Moreover, your data will need to leave your technology environment, which is undesirable and in many cases not acceptable. You would also need to worry about integrating an external environment with your other backend services: yet another security concern, and one likely to keep your head of IT up at night.

A better approach tackles security, operations and scalability head-on:

Step 1. Deploy and operate Dash applications in a secure environment: TLS 1.2+, OAuth, vulnerability scans, penetration testing.

Step 2. Deploy and operate Dash applications with compliance and governance functionality, making sure that applicable requirements (GDPR, NDAs, DPIAs, FDA, FSA, ISO and IEEE standards, among others) are met.

Step 3. Deploy and operate Dash applications with the tools and services needed for 24/7/365 continuous monitoring and surveillance.

Step 4. Deploy and operate Dash applications on a flexible, horizontally scalable Kubernetes cluster that can grow almost indefinitely with the load.

Data visualization techniques have come a long way but the way forward is to equip your data scientists with the security, operational environments and scalability to shorten time to value.
