Practical DataOps
The Book in 3 Sentences
This book explains DataOps as a collaborative data management practice that combines principles from agile development, Lean manufacturing, and DevOps to improve data science outcomes. Unlike traditional software development, DataOps focuses on both code AND data at every step, emphasizing the need for automation, continuous improvement, and waste reduction in data workflows.
The book addresses why many organizations struggle with data science ROI (only 22% see significant returns) and provides practical frameworks for implementing DataOps principles, from data lifecycle management to team organization and technical infrastructure.
This book was more about culture and processes such as Scrum than practical advice. Buyer beware.
Impressions
A thorough and practical guide that goes beyond the typical “data science is all about ML models” narrative. The author does an excellent job connecting manufacturing principles (like Theory of Constraints and Just-in-Time) to data operations, while maintaining a pragmatic focus on organizational challenges and solutions. The extensive references to real tools and frameworks make this particularly valuable for practitioners.
My Top Quotes
- 
Data science cannot exist on its own and is part of an ecosystem of skills that includes data engineering and the broader field of data analytics.
 - 
According to Forrester Research, only 22% of companies are currently seeing a significant return from data science expenditures.
 - 
Gartner narrowly defines DataOps as a data management practice: …a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and consumers across an organization. The goal of DataOps is to create predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate data delivery with the appropriate levels of security, quality and metadata to improve the use and value of data in a dynamic environment.
 - 
DataOps borrows best practices from agile software development, Lean manufacturing, and DevOps, but does not copy them. A fundamental difference is that, in software development, the focus is on the application code deployed at every stage. In data science and analytics, the focus is on the code AND the data at every step.
 - 
In data science, there is an overemphasis on machine learning or deep learning, and especially among junior data scientists, the belief that working in solitary isolation to maximize model accuracy score(s) on a test dataset is the definition of success.
 - 
The absence of advanced data literacy is why democratizing data has limits. While more information is better than less, moving data from an Excel spreadsheet to a visual Business Intelligence (BI) tool will not automatically lead to better decisions, just as showing me CT scan images and blood test results will not help me cure myself.
 - 
A common expectation is to charge data scientists with sole responsibility for educating the business to make it data-driven. However, data scientists tend to be relatively inexperienced in business processes compared to the people they work with and can therefore struggle to change the culture.
 - 
A more widely cited definition by Malcolm Chisholm defines 7 phases of a data lifecycle: data capture, data maintenance, data synthesis, data usage, data publication, data archival, and data purging.
 - 
A version of the data lifecycle with the following stages is needed:
    - Capture. The initial step of data materialization within our organization. Data can be acquired from external sources, entered by humans into devices, or generated by machines.
    - Store. Once captured, data is persisted in files, databases, or occasionally memory, across different data structures, data models, and data formats.
    - Process. Data is processed as it moves through its lifecycle. It will go through ETL, cleaning, and enrichment processes to make it usable in the next stage.
    - Share. Stored data needs to be shared or integrated between systems to use it effectively. Sharing covers more than the transactional data generated by operational systems; it also includes intermediate analytical outputs and one-off extracts.
    - Use. The value from data comes from refining it and fulfilling a use case. The use can be simple data sharing or the outputs of descriptive, diagnostic, predictive, and prescriptive analytics. Consumers of the outputs can be internal or external to the organization.
    - End-of-life. Whether it is due to regulatory requirements, cost, or declining value, eventually data must be retired. The process begins with data archival and ends with data purging (the permanent deletion of data).
 - 
Caution: It is crucial during situational analysis not to come up with solutions and objectives. The situational analysis aims to make plan development more robust, not to be the plan.
 - 
The world is not a 2D Boston Consulting Group (BCG) Matrix.
 - 
It is necessary to know:
    - What are the organization’s mission, values, and vision statements?
    - What are the organization’s strategic objectives and strategic scope?
    - What are the KPIs the organization cares about, and what are the trends?
    - What are the organization’s strengths, weaknesses, and distinct competitive advantages in terms of product ease of use and range, price, distribution, marketing, service, or processes?
    - Who are the leading competitors, and how do they compete against you?
    - What is the product portfolio performance (by product/market segment), how is it evolving, and what is the growth strategy?
    - Using gap analysis, where business unit results differ from stated objectives, what are the root causes? Is it lack of skills, wrong structure, wrong systems, or something else?
    - Who are the customers for the products, and how do they differ from ideal customers?
    - What are customer needs, and how are they satisfied? What are future customer needs, and how can we identify them?
    - How do customers find us, and what are their pain points in dealing with us?
    - How do we continually improve our relationships with customers? Are they loyal, are they happy, and how do they see us?
    - What political, economic, social, technological, and environmental opportunities and threats do we need to consider?
 - 
An understanding of how prominent business decision-makers use data analytics is required:
    - Who are the key leaders, and at what stage of the data-driven organization buy-in process are they?
    - What are significant stakeholders saying about data analytics and data teams in the organization?
    - Which internal leaders are more likely to use data analytics and make data-driven decisions?
    - Who should the ideal customers for data analytics be?
    - Why do internal consumers use data, what are their needs, are those requirements fulfilled, and what are their future needs?
 - 
Automating the prescriptive analytics process can have a more significant impact than automating other forms of analytics as it makes possible action in real-time and at scale. Automated prescriptive analytics can also create a considerable benefit by removing gut instinct from decision-making.
 - 
Development of the data strategy centers on the end-to-end data lifecycle and requires sufficient executive sponsorship and buy-in to deliver change. Comprehensive situational awareness is crucial to ensuring that the data strategy leads to successful outcomes aligned to the organization’s mission, vision, objectives, strengths, weaknesses, and the external environment in which it operates.
 - 
Just-in-time (JIT) involves making just what is needed, when required, eliminating the need to carry inventory, while being far more capable of handling the complexity that building a variety of products requires. JIT was the antithesis of contemporary beliefs about how an efficient manufacturing process should operate. For instance, Henry Ford leveraged the conveyor belt system to mass-produce as many cars as possible to bring down unit costs and selling prices without due consideration for market demand. He was famous for saying “… customer can have … any color [car] so long as it is black.” This adage is definitely not JIT.
 - 
The seven wastes are as follows:
    - Overproduction. The most significant waste is producing too much or too early. It results from making large batch sizes with long lead times and leads to an irregular flow of materials. It is also a primary cause of other forms of waste.
    - Waiting. Materials or components waiting for processing are a waste of time.
    - Transportation. Moving material from one location to another for processing adds no value to the product.
    - Overprocessing. Doing more than is required by a customer, or using tools that are more expensive or complex than needed, results in higher costs and unnecessary asset utilization.
    - Excess motion. Making people and equipment move more than necessary wastes time and energy.
    - Inventory. Work in progress (WIP) and finished products in storage result from waiting and overproduction. Excess inventory ties up capital, has to be tracked or gets lost, hides defective processes, requires storage, becomes superseded, and may need offloading.
    - Defects. Inspecting production, rework, or scrapping all consume resources, introduce delays, and impact the bottom line.
 - 
TPS also identifies two other forms of wasteful practice to be eliminated. Mura is irregularity or nonuniformity that creates unevenness in workflow, causing workers to rush and then wait. Muri is unreasonableness that requires workers and machines to work at an unsustainable pace to meet deadlines or targets.
 - 
Data analytics and data science have the characteristics of both a production system and a product development system.
 - 
In data science and analytics, there is much waste when you look closely.
 - 
Extra features are the counterpart to overproduction in manufacturing and are considered the worst waste if they do not help data consumers make decisions.
 - 
Extra processes result in unnecessary effort that does not create value. This category of waste is very varied. Extra processes include duplication of data and transformations in multiple data stores across the organization, using a complex algorithm when a simpler one would have worked as well, or relearning a task because knowledge is not captured and reused.
 - 
Since waste shows up as a time delay, the best way to quantify performance is to measure the average end-to-end cycle time.
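
Not from the book: as a minimal illustration of measuring average end-to-end cycle time, a short pandas sketch using a hypothetical log of work items with assumed `requested_at` and `delivered_at` columns.

```python
import pandas as pd

# Hypothetical log of completed work items; the column names are assumptions.
items = pd.DataFrame({
    "requested_at": pd.to_datetime(["2025-01-02", "2025-01-05", "2025-01-10"]),
    "delivered_at": pd.to_datetime(["2025-01-09", "2025-01-20", "2025-01-13"]),
})

# End-to-end cycle time per item, then the average across items.
items["cycle_time_days"] = (items["delivered_at"] - items["requested_at"]).dt.days
print("Average cycle time (days):", items["cycle_time_days"].mean())
```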
 - 
Although data analytics is not software development, the concepts of releases and iterations are useful to adopt. Releases result in shipping a consumable product to customers. Iterations are smaller units of work between releases that are sometimes, but not always, released to customers.
 - 
The solution to highly variable work size is to make release cycles of complete development as short as you can handle, determine how much work can be done in the cycle, and never take on more than the cycle can handle. Instead of delaying a release, leave work out for future iterations, accept a less accurate machine learning model, use fewer data items in a pipeline, use a more straightforward dashboard, or produce only headline data insight.
 - 
Managers often ask teams to squeeze in one more task without asking for anything else to be dropped, believing the way to get more done is to pile on more work. Queueing theory shows that average cycle times interact with utilization rates in surprising ways. Road traffic does not go from usual speed to an immediate standstill when road utilization goes from 99.9% to 100% of capacity. It starts to slow down long before that point as more and more vehicles join the highway.
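
Not from the book: the road-traffic point can be made concrete with the standard M/M/1 queueing result, where average time in the system is 1/(μ − λ). The sketch below assumes a hypothetical capacity of 10 tasks per week and shows cycle time blowing up well before 100% utilization.

```python
# M/M/1 queue: service rate mu (tasks/week), arrival rate lam = utilization * mu.
# Standard result: average time in the system W = 1 / (mu - lam).
mu = 10.0  # hypothetical capacity: 10 tasks per week
for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = utilization * mu
    w = 1.0 / (mu - lam)  # weeks from arrival to completion
    print(f"utilization {utilization:.0%}: average cycle time {w:.2f} weeks")
```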
 - 
Data pipelines can also be push or pull based. The popular distributed streaming platform Apache Kafka is a good example of pull-based design benefits. Kafka uses a publish-subscribe pattern to read and write streams of data between applications. Producers of data do not directly send it to consumers of data (known as subscribers). Instead, the application producing data sends a stream of records to a Kafka topic (a category of data) hosted in a Kafka broker (a publishing server). A topic can have one, many, or no applications that subscribe to consume its records.
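
Not from the book: a minimal sketch of this publish-subscribe pattern using the third-party kafka-python client. The broker address and the `ride-events` topic are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# The producer writes records to a topic; it never addresses consumers directly.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ride-events", value=b'{"trip_id": 123, "fare": 9.5}')
producer.flush()

# Any number of consumers can subscribe to the topic and pull records at their own pace.
consumer = KafkaConsumer(
    "ride-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # read a single record for the sake of the example
```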
 - 
The theory of constraints (TOC), introduced by Eliyahu M. Goldratt in his 1984 book The Goal8, views the performance of a system as being limited by its most significant constraint. Just as a chain is only as strong as its weakest link, the biggest bottleneck limits the output of any system.
 - 
Constraints fall into four categories: physical, policy, other nonphysical, and people. Physical constraints are typically equipment related. Policy constraints are rules and measurements that prevent the system from achieving its goal.
 - 
There are two different techniques you can use to find a root cause and answer the “What to change?” question correctly: the current reality tree (CRT) from the theory of constraints and the 5 whys from Lean thinking. The 5 whys technique is suited to finding a relatively simple root cause of symptoms with few interactions with other root causes. The CRT is more structured and designed to uncover relationships between issues.
 - 
To construct a CRT, a small group with knowledge and experience of the organization and its systems agree on a list of no more than five to ten problems (undesirable effects (UDEs)) to analyze. The UDEs must have a measurable negative impact, be clearly described, and should not be a list of missing solutions. An example might be lack of training in data science methodologies. Other examples of UDEs may include a high turnover of data engineers, managers making decisions based on gut instinct, data scientists spending excessive time cleaning data, or taking a long time for data science products to reach production.
 - 
The ten-minute build practice aims to run an automated build for the whole codebase and run tests in under 10 minutes. The time limit is chosen to encourage teams to run tests and get feedback as often as possible.
 - 
The other practice is continuous integration, where code changes are integrated into the larger code base and integration tested at least every two hours. Finding problems before making more significant changes makes them easier to fix.
 - 
To address the differences between software development and data analytics, Christopher Bergh, Gil Benghiat, and Eran Strod published the DataOps manifesto.10
 - 
According to W. Edwards Deming, “every system is perfectly designed to get the results it gets” and “the system that people work in and the interaction with people may account for 90 or 95 percent of performance.”
 - 
W. Edwards Deming’s Plan-Do-Study-Act (PDSA) loop is a common approach for teams implementing continuous improvement. The first stage is to identify a potential improvement with clear objectives and create a plan to deliver it. The do phase involves running an experiment with the improvement change and collecting data. Next, study the results and compare them to expected outcomes. Finally, if the experiment is successful, act to implement the improvement and create a higher baseline of performance. With every iteration of the PDSA loop, there are opportunities for continuous improvements in process or gains in knowledge.
 - 
But fitness for purpose is also needed. Fitness for purpose is the ability to change in the right direction and stay relevant to customers. To ensure the data analytics system is delivering the right thing, in the right way, and moving in the right direction, feedback on performance is required. Feedback is an essential element of systems thinking.
 - 
The organizational coach Matt Philips proposes that knowledge work consists of two dimensions where internal and external viewpoints are on one dimension and product and service delivery are on the other.2 Philips uses the restaurant metaphor to describe the elements. When eating out, customers care about how the food and drink are delivered (service delivery) as much as the product itself (the food and drink). The staff also care about the service and product. But, from an internal viewpoint, they want the food to be consistently good, ingredients stored correctly, and everyone to work well together as a team.
 - 
A service delivery review meeting should be a regular assembly for the analytics team to discuss with internal customers how well they are meeting their service delivery expectations.
 - 
A passive strategy to solve the problem of concept drift is to periodically retrain models using a window of recent data. However, this is not an option in some circumstances due to negative feedback loops. Imagine a recommender system where specific customers see recommendations for product X based on previous purchasing relationships.
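
Not from the book: a minimal sketch of this passive, sliding-window retraining strategy using scikit-learn. The column names and the 90-day window are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def retrain_on_recent_window(df: pd.DataFrame, window_days: int = 90) -> LogisticRegression:
    """Refit the model on only the last `window_days` of data to track drift."""
    cutoff = df["event_date"].max() - pd.Timedelta(days=window_days)
    recent = df[df["event_date"] >= cutoff]
    model = LogisticRegression(max_iter=1000)
    model.fit(recent[["feature_1", "feature_2"]], recent["label"])
    return model

# A scheduled (e.g., weekly) job would call: model = retrain_on_recent_window(training_data)
```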
 - 
Many organizations don’t measure the benefit of their initiatives beyond simple correlations such as “we performed an action and revenue went up at the same time, so it must be due to our action.” There can be many reasons why revenue increased that have nothing to do with the action. As anyone who has seen charts from Tyler Vigen’s spurious correlations website knows, correlation doesn’t imply causation.7 Without a counterfactual (a measure of what would have happened if the action had not occurred), it is difficult to determine cause and effect.
 - 
Counterfactuals are the reason A/B testing is the gold standard for measurement. However, in certain circumstances for regulatory, ethical, or trust reasons, it is not possible to treat customers differently by assigning them to random treatment groups.
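
Not from the book: a minimal sketch of checking an A/B test for statistical significance with a two-proportion z-test from statsmodels. The conversion counts are invented.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for variants A and B.
conversions = [420, 465]
samples = [10000, 10000]

# Null hypothesis: both variants have the same conversion rate.
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # a small p-value suggests the difference is unlikely to be chance
```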
 - 
Some organizations are reluctant to measure benefit because they operate in a culture that rewards outputs and not outcomes. The substantial process change from JFDI (Just Freakin’ Do It (polite form)) execute-and-move-on mode to running experiments and waiting for results before iterating is a significant barrier to benefit measurement. Many managers also lack statistical knowledge or fear statistics and do not want their decisions guided by something they do not understand.
 - 
Differences-in-differences (DID) is another commonly used econometric technique. DID relies on a comparison of pre and post outcomes between the treated group and a control group that is not identical but displays parallel trends. Imagine our taxi company uses the ML model in Washington but not yet in Baltimore, where it also operates and shows very similar patterns. Without the model, the same trend in gross bookings is expected in Washington and Baltimore. Hence, a comparison of the change in the delta of gross bookings between cities, before and after implementation of the model, provides an estimate of the model’s effectiveness. The drawback of DID is that if something other than the treatment changes trends during the post period in one group but not the other, it violates the parallel trends assumption.
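
Not from the book: a worked toy version of the DID estimate described above, with invented gross-bookings numbers for Washington (treated) and Baltimore (control).

```python
# Hypothetical weekly gross bookings (in $k) before and after the model launch.
washington_pre, washington_post = 1000.0, 1150.0  # treated city
baltimore_pre, baltimore_post = 800.0, 880.0      # control city, parallel trend assumed

# DID = (change in treated group) - (change in control group)
did_estimate = (washington_post - washington_pre) - (baltimore_post - baltimore_pre)
print(f"Estimated effect of the model: +${did_estimate:.0f}k per week")  # +$70k
```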
 - 
Multiarmed bandit algorithms are an example of an approach that some people claim is superior to A/B testing. In standard A/B testing, you usually split users 50:50 between two versions of your data product. One of the versions will perform worse than the other, so there is an opportunity cost of exposing users to this variant. With multiarmed bandit algorithms, you also start with an equal split during an exploration phase (typically 10% of the total time the experiment is expected to last). However, in the following exploitation phase, users are split based on the relative performance of the variants, with more users exposed to the better-performing variant. This split reduces the opportunity cost of testing, but because the lower-performing variant is exposed to fewer users, it is harder to tell whether it is worse or the difference is due to chance. It takes much longer to reach statistically significant results than if users are split equally between variants. If the difference in performance between variants is small, even the opportunity-cost benefit over A/B testing is negligible.
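
Not from the book: a minimal epsilon-greedy sketch of the explore/exploit split, one common way to implement a multiarmed bandit. The variant names and exploration rate are assumptions.

```python
import random

counts = {"A": 0, "B": 0}     # times each variant was shown
successes = {"A": 0, "B": 0}  # conversions observed for each variant

def choose_variant(epsilon: float = 0.1) -> str:
    """Explore a random variant with probability epsilon, otherwise exploit the best so far."""
    untried = [v for v in counts if counts[v] == 0]
    if untried:
        return random.choice(untried)       # make sure every variant is seen at least once
    if random.random() < epsilon:
        return random.choice(list(counts))  # exploration
    return max(counts, key=lambda v: successes[v] / counts[v])  # exploitation

def record_outcome(variant: str, converted: bool) -> None:
    counts[variant] += 1
    successes[variant] += int(converted)

# Usage: variant = choose_variant(); show it to the user; then record_outcome(variant, user_converted)
```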
 - 
Cars have brakes not to stop, but to travel fast. If cars had no brakes, people would still drive but just slowly enough so they could use a nearby tree or lamp post to stop safely in an emergency.
 - 
Unfortunately, many IT departments believe the route to reducing risk is not to introduce safety features but to metaphorically remove the tires from the vehicle and make it as hard as possible to drive. It is indeed one strategy to prevent accidents, but it also guarantees you will not go anywhere quickly.
 - 
Google suggests that 70% of tests should be unit tests, 20% integration tests, and only 10% end-to-end tests.
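
Not from the book: a small pytest-style sketch of the kind of fast unit test that should dominate the suite, exercising a hypothetical pandas transformation. Integration and end-to-end tests would live in separate, slower suites making up the remaining 20% and 10%.

```python
import pandas as pd
import pytest

def add_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    """Example transformation under test: derive tip percentage from fare and tip columns."""
    out = df.copy()
    out["tip_pct"] = out["tip"] / out["fare"] * 100
    return out

def test_add_tip_percentage_unit():
    # Unit test: pure in-memory check that runs in milliseconds (the bulk of the pyramid).
    df = pd.DataFrame({"fare": [10.0, 20.0], "tip": [1.0, 5.0]})
    result = add_tip_percentage(df)
    assert result["tip_pct"].tolist() == pytest.approx([10.0, 25.0])
```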
 - 
Conway observed that any organization that designs a system produces a design structure that is a copy of the organization’s communication structure.
 - 
There are two main models for team member alignment – functional and domain. Functionally orientated teams organize around technical expertise, and domain-orientated teams organize around a market, value stream, customer, service, or product.
 - 
Long-lived and stable teams are more efficient than ad hoc or project-based teams. Stable teams avoid the painful and time-consuming forming, storming, and norming phases associated with transient teams.
 - 
Google believes the best way to make communication easy is to put team members within a few feet of each other.
 - 
Humans are hard-wired for face-to-face social interactions and perform less well in their absence. In her book The Village Effect, psychologist Susan Pinker cites an experimental study of 25,000 call center agents. Half were asked to take breaks alone, while the others took breaks with coworkers. Those who socialized with coworkers showed a 20% performance increase.
 - 
Applying the spine model by Trethewey and Roux helps make sense of which tools are best to use.2
 - 
One of the principles is to improve cycle times of turning data into a useful data product. Requesting data or access to infrastructure and waiting for permission and provisioning is a significant bottleneck and massive source of waste in organizations.
 - 
Apache Airflow is the most popular open-source software for creating, scheduling, and monitoring DAGs. Alternatives include Luigi, Apache Oozie and Azkaban for Hadoop, and Google’s managed Airflow service, Cloud Composer.
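
Not from the book: a minimal Airflow DAG sketch with two dependent tasks. The DAG id, schedule, and task logic are placeholders, and the import paths assume Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and enrich the extracted data")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # transform runs only after extract succeeds
```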
 - 
Open-source package Great Expectations automates pipeline tests, the equivalent of unit tests for datasets, at batch run time. Commercial tools such as iCEDQ, RightData, and QuerySurge test data and validate continuous data flows in production in addition to providing automated testing during development.
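
Not from the book: a minimal sketch of dataset checks using Great Expectations’ classic pandas API (the API has changed across versions, so treat this as illustrative only); the columns and thresholds are invented.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical batch produced by an upstream pipeline step.
batch = ge.from_pandas(pd.DataFrame({
    "trip_id": [1, 2, 3],
    "fare": [9.5, 12.0, 7.25],
}))

# Expectations act like unit tests for the data itself, evaluated at batch run time.
batch.expect_column_values_to_not_be_null("trip_id")
batch.expect_column_values_to_be_between("fare", min_value=0, max_value=500)

# Validate the batch against the expectations above and stop the pipeline if any check failed.
results = batch.validate()
assert results.success, "Data quality checks failed - stop the pipeline"
```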
 - 
Packages such as Prometheus can be set up to monitor data pipeline KPIs for timeliness, end-to-end latency, and coverage and then send the data to a time-series analytics dashboard such as Grafana for visualization and alerting.
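
Not from the book: a minimal sketch using the prometheus_client Python library to expose pipeline KPIs for Prometheus to scrape and Grafana to chart. The metric names and values are assumptions.

```python
import time
from prometheus_client import Gauge, start_http_server

# KPIs for a pipeline run; Prometheus scrapes them from the /metrics endpoint.
last_success = Gauge("pipeline_last_success_timestamp", "Unix time of the last successful run")
latency_seconds = Gauge("pipeline_end_to_end_latency_seconds", "Seconds from ingest to publish")
rows_processed = Gauge("pipeline_rows_processed", "Rows handled by the last run")

start_http_server(8000)  # exposes metrics at http://localhost:8000/metrics

def report_run(duration: float, rows: int) -> None:
    last_success.set_to_current_time()
    latency_seconds.set(duration)
    rows_processed.set(rows)

report_run(duration=412.0, rows=1_250_000)
time.sleep(60)  # keep the process alive long enough for a scrape in this toy example
```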
 - 
Most cloud providers offer hosted Kubernetes platforms such as Google Kubernetes Engine (GKE) , Amazon Elastic Container Service for Kubernetes (EKS), and Azure Kubernetes Service (AKS).
 - 
Data lakes have not made data warehouses redundant but are complementary. Data warehouses operate on the principle of schema on write, which transforms data to a fixed structure on writing to the database for optimized consumption. However, data warehouses are time-consuming to develop. Data lakes store data in a basic raw form and work on the basis of schema on read. A schema to transform data into a more useful form is applied on extraction. Schema on read puts the onus on the consumer to understand the data and transform it correctly, but the trade-off is that data lakes offer access to much more data than data warehouses alone.
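
Not from the book: a small sketch of schema on read, where raw JSON records from a lake only get a schema and types applied when a consumer reads them; the fields are invented.

```python
import json
import pandas as pd

# "Schema on read": the lake stores raw, loosely structured records as they arrived.
raw_records = [
    '{"trip_id": "1", "fare": "9.50", "city": "Washington"}',
    '{"trip_id": "2", "fare": "12.00", "city": "Baltimore", "note": "airport"}',
]

# The consumer decides, at read time, which fields matter and what types they have.
df = pd.DataFrame([json.loads(r) for r in raw_records])
df = df.astype({"trip_id": "int64", "fare": "float64"})[["trip_id", "fare", "city"]]
print(df.dtypes)
```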
 - 
The journey of DataOps adoption begins with a big-picture view and the creation of a data strategy. A 2017 McKinsey & Company survey found that companies with the most successful analytics programs are 2.5 times more likely to report having a clear data strategy than their peers.