Why do companies need a data science stack in the first place?
Companies around the world are generating more data every day and want to gain valuable insights from it. The demand for suitable solutions has therefore grown enormously in recent years. Today, data science and data analytics are more important than ever for business success. Far too often, however, the two are still lumped together. While data analytics aims to understand past and current developments, data science is intended to enable predictions based on patterns and trends. Meeting this requirement sometimes calls for different, specialized tools. For example, data scientists often start with raw data (instead of pre-processed data), mainly use languages like Python (instead of mostly SQL) and work in ML tools (instead of BI applications).
It is therefore obvious that an ideal data science stack is designed differently than an ideal data analytics stack. It is high time to address the question:
"What do my data scientists need?"
As is so often the case with data topics, this question cannot be answered in a simple, one-size-fits-all manner. Organizations have diverse data, underlying systems, budgets, teams and data cultures. However, basic principles and components for a modern data science stack have emerged and become established in recent years. We would like to present these to you here.
What do you want to achieve with a modern data science stack?
At the beginning of all the following considerations, the goal to be achieved should be clarified unambiguously, together with all departments involved.
- Do data science projects often fail in early project phases?
- Do projects take an enormously long time or are there often delays?
- Are your data scientists leaving?
- Do problems with models occur in real operation or during roll-out?
An understanding of the core challenges of data science is necessary for a well-functioning architecture.
Typical technical and organizational challenges
The beginning of a data science project is usually "nebulous". It is not clear whether suitable data is available, whether it can be used, whether a model for the problem will deliver meaningful, high-performance results, or whether the findings can ultimately be implemented.
Data preparation is undoubtedly the most time-consuming part of a data science project. It is therefore important to make this phase as efficient as possible with tools, e.g. by supporting automation. Just as the subsequent work requires the flexible use of different tools, the optimal solutions for data preparation can also be very heterogeneous.
Cooperation must work not only within data teams, but also with the business teams. Code, definitions and knowledge in general must be easy to share. In collaborative work, the same environment must be available to everyone; only then can results be reproduced reliably.
Documentation and organization are factors that make many things easier and whose absence, above all, quickly becomes inconvenient. One aspect where this becomes very clear is the search for problems and errors in models in operational use, for example suddenly occurring performance problems. Without proper documentation, this is like searching for a needle in a haystack. Another example: many assets such as images, prepared data sets or special libraries can be reused by colleagues without having to be developed or found again.
And last but not least: data quality.
Further, individual challenges should always be determined together with the data scientists (the actual users of the stack). With an understanding of individual and general challenges, an ideal data science stack can now be built.
The composition of the stack - cherry picking allowed
All phases must be optimally supported
Let's start with a brief overview of the typical phases or steps of a project.
However, this procedure is by no means set in stone. Basically, it can be "go back to start" at every step. The duration and workload of each phase also varies greatly from project to project.
A modern data architecture is made up of various components that all work together as seamlessly as possible. It all starts with data-generating systems, continues with transporting, storing, analyzing and organizing systems, and ends with data presentation. Such a permeable architecture minimizes errors, supports efficient projects and optimizes the quality of the results.
But let's start from the beginning.
Basically, data sources can be divided into two groups: primary sources (the level of data generation or direct storage) and secondary sources (systems with pre-processed data).
Primary systems include ERP systems, CRM systems, database systems for production data, databases for external data, and so on. Well-known representatives here are SAP, Salesforce, Postgres databases and Microsoft SQL Server (to name just a few examples). The main representative of a secondary source is the data warehouse; here, the major platform operators dominate the market. A special form of secondary source is the data lake, with well-known providers such as AWS, GCP, Azure, Databricks and Cloudera. Other examples of secondary sources are real-time in-memory databases for specific applications. As an intermediate form between data lakes and data warehouses, so-called lakehouses have developed, such as those offered by Snowflake and Databricks.
Ideally, data scientists have direct access to the data source. The advantage is that they can work with current, unaltered data sets at any time. This ensures that no patterns have been masked by previous processing.
However, this is often not possible for
- technical reasons or
- compliance reasons or
- other reasons (resources, capacities, guidelines, ...)
In these cases, data scientists usually receive a data export in the form of CSV files, which must then be stored in intermediate storage. A simple, well-functioning setup here is the use of buckets such as AWS S3.
Direct data access is great - but it needs to be regulated. Don't neglect data governance at this point. Establish a standard process for making files available that still offers sufficient flexibility. Individual solutions like Excel exports lead to uncontrolled shadow systems.
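Such a standard process can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; a local staging directory stands in for an object store such as an S3 bucket, and the function name and metadata fields are hypothetical:

```python
import csv
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def stage_csv_export(source: Path, staging_dir: Path, owner: str) -> Path:
    """Copy a CSV export into a staging area and record basic metadata.

    A real setup would upload to an object store (e.g. an S3 bucket);
    a local directory stands in for it here.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    target = staging_dir / source.name
    shutil.copy(source, target)

    # Record provenance so the export does not become an untracked shadow copy.
    meta = {
        "file": source.name,
        "owner": owner,
        "staged_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "rows": sum(1 for _ in csv.reader(target.open())) - 1,  # minus header
    }
    (staging_dir / f"{source.name}.meta.json").write_text(json.dumps(meta, indent=2))
    return target
```

The metadata file is what keeps the export governable: who provided it, when, and in what state, without forcing anyone into a rigid upload workflow.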
Data preparation includes experimental work on the data, ETL processes (extract transform load) and cleansing. Different types of tools are used, depending on the data origin, type and issues. Therefore, one sees a wide range of vendors with different specializations in the market.
As discussed at the beginning, the starting point for data science projects is much more vague than for analytics projects. The ability to take a quick look at the data at the beginning of the process, check its quality and test initial hypotheses is therefore very valuable. If these initial results prove to be promising, the detailed preparation can continue afterwards.
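A first look of this kind often takes only a few lines, for example with pandas (assuming it is available in the environment; the data here is invented for illustration):

```python
import pandas as pd

# Hypothetical raw export; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "revenue": [120.0, None, 95.5, 300.2, None],
    "segment": ["A", "B", "B", None, "A"],
})

# First look: shape, dtypes and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Quick plausibility check of the numeric columns.
print(df["revenue"].describe())

# A first hypothesis: does revenue differ between segments?
print(df.groupby("segment")["revenue"].mean())
```

If the missing-value counts or the value ranges already look hopeless, the project can be stopped or redirected before any expensive preparation work begins.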
ETL is the abbreviation for extract, transform, load and comprises a multitude of steps to obtain data from a source, adapt it and then load it into a data warehouse.(*) Most companies interested in data science are already engaged in data analytics and therefore already have various ETL tools. Connectors (such as Fivetran or Stitch), workflow managers (e.g. Airflow), automated query engines (e.g. Hive) or open source tools such as Spark platforms are used for extraction. In this area in particular, there are excellent open source tools with which a company can upgrade its data stack very well. An architecture that relies entirely on open source tools is also possible. With the appropriate expertise, a powerful environment can be set up within a few hours. Existing Python libraries alone offer numerous functions to perform almost all necessary steps.
Which approach is optimal varies from data source to data source.
The same principles apply to data modeling, transformation and cleansing. In addition to the components of the large cloud platforms, open source tools and public libraries can also be used for this.
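As a sketch of how far the standard library alone can carry such a pipeline, here is a deliberately minimal ETL example; the data and table names are invented, and SQLite stands in for a real data warehouse:

```python
import csv
import io
import sqlite3

# Extract: read raw data (an in-memory CSV standing in for a source system).
raw = io.StringIO("order_id,amount,currency\n1, 19.99 ,EUR\n2,5.50,eur\n3,,EUR\n")
rows = list(csv.DictReader(raw))

# Transform: strip whitespace, normalize currency codes, drop rows without an amount.
cleaned = [
    {"order_id": int(r["order_id"]),
     "amount": float(r["amount"].strip()),
     "currency": r["currency"].strip().upper()}
    for r in rows
    if r["amount"].strip()
]

# Load: write the cleaned rows into a (local) warehouse table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
con.executemany("INSERT INTO orders VALUES (:order_id, :amount, :currency)", cleaned)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```

A dedicated tool earns its place when volume, scheduling or monitoring requirements outgrow a script like this, not before.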
The storage (loading) of the processed data must meet three basic criteria in order to do justice to a modern stack:
- Flexibility: you can put data there from different systems and access it with different systems.
- Scalability: it must be possible to adjust the capacity according to the requirements.
- Transparency: it is a documented, accessible storage location, so that the movement of data remains traceable.
Data warehouses, data lakes (e.g. from Databricks, Delta Lake, ...) or bucket solutions (e.g. AWS S3) are therefore mostly used here.
(*) In recent years, the variant ELT, meaning extract load transform, has also developed. Here, data is first loaded into a data warehouse in a state similar to raw data and then transformed.
The ideal data science stack at these stages is not characterized by the one true technology, but by the fact that different technologies can coexist and be combined. For users, the greatest added value is created when the best tool for the job can be accessed at any time. To make this happen, two things are needed: a transparent overview of available tools and open interfaces so that they can be combined. A micro service-based architecture supports this flexibility and reusability of APIs enormously.
The development, training and testing of the model now takes place in the data science environment. An ideal environment can be compared to a buffet in a hotel. In a well-arranged buffet, plates are of course placed at the beginning of the setup and not at the other end of the room. Side dishes are arranged in such a way that you can directly choose from all of them. Translated to a data science environment, this means:
- How hungry am I? Do I therefore choose a big plate or a small one? → Which tool do I choose?
- Do I choose fish, meat or something vegetarian? What sauces can I round off my dish with? → Which images are available to me? Which libraries?
- I don't want to wait in line at every station, so I go straight to the finished casserole. → I need a fast, straightforward solution for an MVP and choose a tool with a high degree of automation.
The realization of this environment can be designed quite differently. Platforms such as Databricks, Sagemaker, Anaconda and Alteryx provide users with a centralized location for their process management, notebooks and libraries, and in some cases offer low-code features.
Alternatively, an organization can set up a custom platform in a relatively straightforward manner. In that case, a place to store images is needed, GitHub can be used to share code, and container solutions can be used to harmonize the environment. Notebooks, as well as all associated assets, are documented transparently in a data asset catalog.
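One small building block of such a harmonized, reproducible environment is recording exactly which interpreter and library versions a result was produced with. A minimal sketch, assuming the snapshot is stored alongside the notebook; the function name is hypothetical:

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages: list) -> dict:
    """Record interpreter and package versions so an environment can be rebuilt.

    In practice this snapshot would live next to the notebook, e.g. in the
    data asset catalog or as a pinned requirements file.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }

print(json.dumps(environment_snapshot(["pip"]), indent=2))
```

Container images solve the same problem more thoroughly; a snapshot like this is the lightweight complement that travels with the notebook itself.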
At its core, the mission of a modern data stack is to ensure that all data scientists have access to the components they need. For efficiency, collaboration and long-term stability, it is crucial to prevent silos with a central repository.
In addition to deployment, the following section also deals with the provision of insights through visualization. In both cases, the focus should be on the end user or the end application.
An example of a straightforward deployment is when a score calculated by the model is always fed into a database at a fixed point in time. It becomes more complex, for example, if different applications (e.g. website and CRM system) are to be served ad hoc by a recommendation engine. Core aspects for the design of the deployment should be derived from the requirements:
- How frequently are results queried?
- Are the queries batches or streams?
- Which applications are to be served?
- What latency is acceptable in the deployment?
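For the simple batch case described above, a scheduled job can look roughly like this sketch; the linear `score` function stands in for a real trained model, and SQLite stands in for the target database:

```python
import sqlite3
from datetime import datetime, timezone

def score(features: dict) -> float:
    """Stand-in for a trained model; a real job would load a serialized model."""
    return 0.7 * features["recency"] + 0.3 * features["frequency"]

def run_batch_scoring(con: sqlite3.Connection) -> int:
    """Score all customers and write results back, as a scheduled batch job would."""
    customers = con.execute("SELECT id, recency, frequency FROM customers").fetchall()
    now = datetime.now(timezone.utc).isoformat()
    con.executemany(
        "INSERT INTO scores (customer_id, score, scored_at) VALUES (?, ?, ?)",
        [(cid, score({"recency": r, "frequency": f}), now) for cid, r, f in customers],
    )
    return len(customers)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, recency REAL, frequency REAL)")
con.execute("CREATE TABLE scores (customer_id INTEGER, score REAL, scored_at TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", [(1, 0.2, 0.9), (2, 0.8, 0.1)])
print(run_batch_scoring(con))
```

The timestamp column matters: when a performance problem surfaces months later, it lets you reconstruct which scores were produced by which run.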
In addition to the challenge of meeting these criteria, there are a number of other common hurdles.
- Moving to production may require a change in programming language.
- End applications may not only be multiple, but may also change over time.
- Data sources and systems may change.
- Transitioning from training and test data to real data can be problematic due to the volume or velocity of incoming data.
- Adjustments must be implementable, documentable, and monitorable.
In summary, scalability, flexibility and sustainability are the key criteria that the deployment strategy must meet. Our recommendations for a modern data stack are therefore:
- Cloud deployment is always preferable (if possible) → Scalability, performance and transfer options are significantly better and easier to regulate than in legacy systems.
- Container solutions have proven themselves → They provide enormous support for challenges such as incompatibility and complex troubleshooting.
- A microservices architecture has proven its worth → In particular, transfers regarding end applications and sources become much more efficient as a result.
- Repositories that support collaboration, such as GitHub, are strongly recommended → Experience has shown that such platforms make communication with the responsible dev teams much easier. The history remains traceable and, if necessary, changes can be easily reversed.
The current trend towards platformization holds great potential here. A platform, whether as a platform-as-a-service or as an open source variant, creates transparency about available options and simplifies the reuse of already created APIs.
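To make the microservice idea concrete, here is a deliberately small prediction service built only with the standard library; a real service would load a trained model artifact and run behind proper infrastructure, and the `predict` logic here is invented:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(payload: dict) -> dict:
    """Stand-in model logic; a real service would load a trained model."""
    return {"score": 0.5 * payload.get("x", 0.0) + 0.1}

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, return JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictionHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"serving on port {server.server_port}")
```

Any end application, whether a website or a CRM system, can now request scores by POSTing JSON to this endpoint; swapping the model or adding consumers does not change the interface.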
Making results available via visualizations is becoming increasingly important.
This includes not only the final results, but also insights gained during data preparation and model building. So if more and more users are to work with data science results, the visualization tools used must support this accordingly.
Currently, one of these two structures is often seen in companies:
- a) everyone has to use the tool procured across the enterprise
- b) each data scientist uses their preferred tools and has built custom solutions
The range of available visualization tools is rich: Tableau, PowerBI, Looker, Redash, Superset, Python and R themselves, and more. The trend is definitely towards b), towards greater tool diversity within an organization. For certain use cases, one special tool is better suited than another. The point of an ideal data stack is to make suitable tools available to everyone without creating silos. This can be achieved by 1) paying attention to open interfaces overall and 2) having central, efficient documentation. If these two criteria are met, results can be made accessible and comprehensible to all.
Cross-functional cooperation is important for successful deployment. The clear definition of requirements together with departments and cooperation with software engineers form the basis. A modern, cloud-based architecture and open interfaces promote efficient deployment and ensure long-term use.
Documentation has already been emphasized frequently in the previous paragraphs, but how is it realized in an ideal data science stack?
The representation of the overall process takes place in a data catalog. For clean data governance, but also efficiency, transparency and successful communication, the following aspects must be mappable there:
- Data sets used
- Performed transformations
- Processed data sets
- Data science libraries, images, platforms, tools, ...
- Milestones of the model development
- Implementation location of the model incl. APIs, dashboards, etc.
The point is not to duplicate every statistical analysis in the data catalog. The ideal data catalog for data science is characterized by the fact that each completed phase can be recorded really quickly and easily. A clear reference to the notebook in which the statistical analyses can be found is completely sufficient for traceability. At the latest when revisions are made or problems arise, every data scientist is grateful for structured preliminary work from colleagues.
Another task of documentation, besides improving the working environment, is to create trust. There are still reservations about the black boxes that ML models may represent. Users do not understand how predictions are made, for example, and distrust them. Besides change management, documentation is an essential step towards an improved data culture.
In the modern data science stack, documentation takes place on two levels. The first level is formed by tools, such as notebooks, in which work should always be comprehensible. The second level is that of the data asset catalog, which brings all components of a project into context and guides the viewer like a travel guide from the data source to the result.
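What such a catalog entry can look like at the second level is quickly sketched; the fields and references below are illustrative, and the point is that the catalog stores pointers to assets rather than copies of them:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CatalogEntry:
    """One project phase in the data asset catalog: references, not copies."""
    project: str
    phase: str                                    # e.g. "model training"
    datasets: list = field(default_factory=list)  # pointers to source/processed data
    artifacts: list = field(default_factory=list) # notebooks, images, APIs, dashboards
    notes: str = ""

entry = CatalogEntry(
    project="churn-prediction",
    phase="model training",
    datasets=["s3://staging/customers_2023.csv"],
    artifacts=["notebooks/03_training.ipynb", "registry/churn-model:v4"],
    notes="Statistical analyses are documented in the referenced notebook.",
)
print(json.dumps(asdict(entry), indent=2))
```

Because each entry is just a handful of references, depositing a completed phase takes a minute, which is exactly what makes the catalog get used at all.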
Let's look again at the questions we asked at the beginning: "What problem do we want to solve?" and "What do my data scientists need?".
The best way to develop measures is to understand the components of the data stack as parts of an overall process.
The future of the data stack lies in a distributed architecture, as realized by concepts such as the data fabric and the data mesh. Responsibilities are divided; centralization is replaced by central transparency. The development of automation and the trend towards no- and low-code will progress. Our data culture will continue to evolve, leading more and more companies to tap into data-based opportunities. Accordingly, a modern, sustainable data stack must be able to grow and scale with them.
Data democratization will no longer be primarily about employees within an organization, but about the general public's access to data technologies. The open source movement will continue to grow in importance.
Of course, technologies will also continue to evolve. The recurring cycles of innovation from invention to establishment will continue. In the future, components of the data architecture will be much more flexibly exchanged and replaced by new, more suitable modules. Software that relies on lock-in effects and does not provide open interfaces (looking at you, old SAP systems) will decline.
Therefore, prepare your data science stack for the future today by providing more flexibility and growth opportunities.
This article was also published on towardsdatascience.