Recovering Data Quality When Big Data Goes Wrong
In today’s data-driven world, having data to make decisions gives you a significant advantage… unless the data quality is bad. See how Datafold can assist you.
In today’s society, everything is based on data.
Although John Mashey coined the term over two decades ago, Big Data has risen to the forefront of technology in the last ten years. In pursuit of it, companies have formed teams that use mathematical analysis and inductive statistics to uncover relationships and dependencies. The purpose of these Big Data engineers is to use data to forecast events and behaviors, giving the company a competitive advantage.
To use data in this way, the data must be sound and dependable in the first place. Indeed, making decisions based on faulty data can be worse than deciding with no data at all.
“Good business decisions cannot be made with bad data.” – Uber Engineering
In this article, I reflect on a lesson I learned at a past employer that tried to leverage data that turned out to be bad. Building on that lesson, we’ll then fast-forward to modern engineering practices that preserve data quality as part of the development lifecycle.
Taking a Look Back at the Real Estate Industry
Prior to Big Data, data warehouse (DW) and business intelligence (BI) tools were used to gain insight into the state of a company’s operations. Even before then, information technologists were frequently reinventing the wheel (in silos) in the hopes of gaining a competitive advantage through custom programming.
At this time, I was fortunate enough to be working with a real estate industry leader. Despite being the market leader in their industry category, keeping a safe distance from competitors became a struggle.
The time required to define, justify, and defend the rent charged to tenants became one of the company’s focus areas. Rather than charging a flat cost per square foot, the company let additional data factors influence the rent, a price agreed upon by both parties.
Consider the following five data points:
- The condition of the property on which the space is located.
- The space’s location within the property
- The property’s proximity to other occupants
- Tenants’ previous interactions with the real estate firm
- Tenant stability while considering a new lease
Each of these issues was studied and answered by the leasing team, who used several systems to do so.
Providing the Most Appropriate Rental Option
To address this issue, the IT division launched a self-funded effort. The idea was to launch an app (let’s call it Ideal Rent) that would prompt the user for a series of inputs, such as these:
- The intended space’s property and location(s)
- The proposed lease’s start and expiration dates
- Name of the tenant and details about their usage
Using this data, the system would calculate and predict a rate that was justified by the factors that gave the property and the tenant equal value. The Ideal Rent solution used the following design at a high level:
Because data integration products were still in the Technology Trigger phase of the Gartner hype cycle, the work required to build the logic behind the scenes was substantial.
Introducing the Ideal Rent Solution
When the leasing team first looked at the program, they were skeptical that such a basic input form could produce a result that previously required a lot of human analysis. They soon spotted parts of the resulting recommendations that rested on invalid assumptions. Essentially, the IT team had believed that it understood the leasing process better than the people who owned it.
The system never became the single source for finding the best option for a given lease. In fact, this experience taught me two important lessons:
- Because the leasing team was not completely engaged in the project, the data was misunderstood.
- The feature teams were unaware of the data changes that were occurring upstream. This had an impact on the data quality and the downstream results of the Ideal Rent application’s recommendations.
Quality Data Is Required for Data-Driven Decision-Making
The main takeaway from the leasing business scenario is something I’ve explored in previous DZone.com articles. One of my favorites is the publication I wrote in 2017 called “The Secret to a Superior Product Owner.” It’s about a guy named Michael Kinnaird, who is still the best product owner I’ve ever worked with in my 30-plus years in IT.
The second lesson we gained in the Ideal Rent case is summarized in the Uber Engineering quotation from earlier.
Quality control measures for data are just as vital as those put in place to test and validate code before it reaches end users. In the case above, the team using the data for its application did not know about changes to the design of that data, and the results suffered.
At the time, I remember being astonished by this conclusion because I thought the data was sound. The irony was not lost on me, as I had spent my whole career dealing with “change” as the main motivation for product design and development.
How Should Data Quality Be Achieved?
I realized something as I reflected on the timing of the sample use case. The result would have been devastating if the Ideal Rent application had been released before those significant data changes came to light. I can only imagine how non-ideal rent values would have affected this company’s future Wall Street valuations.
We would have discovered our data issues far sooner if we had been able to apply data observability and data quality practices the way we can now. This would have averted embarrassment, headaches, and frustration, as well as the chance of massive risk exposure.
I recently came across Datafold, a data reliability platform that helps businesses prevent data quality incidents. Its Data Diff feature focuses on detecting differences in the source data that applications and processes rely on. The software can even handle billions (not thousands or even millions) of records.
To demonstrate the value of detecting data quality issues, let’s look at three simple examples from the real estate business that would otherwise be difficult to spot:
- Adoption of a SIC (standard industrial classification) code system that is unique to the company
- Modifications to the property tier structure
- Changes to the space quality ranking system
In each case, if the data consumers were unaware of the change, data quality would suffer.
Adoption of a SIC (standard industrial classification) code system that is unique to the company
The standard industrial classification (SIC) code system was created to assign a four-digit code to each industry. For instance, if you were to create a bicycle shop, you’d use the 3751 SIC code.
To simplify the example use case, consider a situation where the SIC codes were too broad to reflect the actual use of the spaces being occupied. In other words, stores that specialize in various forms of entertainment (such as video stores, music stores, and musical instrument stores) all shared the same SIC code.
Let’s pretend the real estate corporation took the time to add more SIC codes to solve this deficiency. More information on the underlying firm that utilized space at the sites would be available as a result of this.
The team working to provide an optimized rent recommendation, however, was unaware of the change. As a result, where the new custom SIC code could not be resolved, the computation fell back to an unknown state and produced a sub-par result. Worse, when a custom code reused the value of a standard SIC code, the projected rent was simply wrong: if a custom code meant for a jeweler was read as the standard code for a tire store, the monthly rent would come out substantially lower than expected.
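The failure mode described above can be sketched in a few lines of Python. This is a minimal illustration, not the original system: the multiplier values, the custom code "J001", and the function names are all hypothetical. The lenient lookup silently falls back to an "unknown" value when it meets a code it does not recognize, while the strict variant raises an error and so exposes the upstream change.

```python
# Hypothetical rent multipliers keyed by standard SIC code.
# The numeric values are invented for illustration only.
STANDARD_SIC_MULTIPLIERS = {
    "5531": 0.90,  # auto and home supply stores (e.g., a tire store)
    "5944": 1.40,  # jewelry stores
}
UNKNOWN_MULTIPLIER = 1.00  # silent fallback for unrecognized codes


def rent_multiplier(sic_code: str) -> float:
    """Lenient lookup: unknown codes quietly fall back to a default."""
    return STANDARD_SIC_MULTIPLIERS.get(sic_code, UNKNOWN_MULTIPLIER)


def rent_multiplier_strict(sic_code: str) -> float:
    """Strict lookup: unknown codes raise, exposing the upstream change."""
    if sic_code not in STANDARD_SIC_MULTIPLIERS:
        raise KeyError(f"Unrecognized SIC code: {sic_code!r}")
    return STANDARD_SIC_MULTIPLIERS[sic_code]


# "J001" stands in for a hypothetical custom code assigned to a jeweler.
lenient_value = rent_multiplier("J001")  # falls back without any warning
```

With the lenient lookup, the custom jeweler is priced with the fallback value rather than the jeweler's multiplier, and nothing signals that anything went wrong. The strict lookup fails loudly instead, which is usually the safer choice when upstream reference data can change.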
Modifications to the property tier structure
To better identify the quality of its properties, the real estate company used a tiered structure. Tier 1 was essentially reserved for the properties regarded as the best. Based on corporate-wide appraisals, a property ranked lower as its tier number increased.
Even though the Tier 3 and Tier 4 properties were on the lower end of the scale, they were nonetheless profitable. The desired rent for those places, however, was less than that of Tier 1 or Tier 2 property.
Another surprise for the IT team could have occurred when evaluation metadata was added at the Tier 1 level. Let’s say sub-tiers were added to answer the question, “Why is this property one of our best?” Location and proximity, tenant quality, and revenue generated are all possible answers.
When location and proximity were factors in determining the tier, the sub-tier would have a varied impact on the optimal rent recommendation. Tier 2 or Tier 3 was usually the tier level in that situation.
Changes to the space quality ranking system
Changes to the business principles that govern space quality could have an impact on how ideal rent is calculated. Consider if the original design for space quality evaluation was based on a scale of 1 to 5, with 5 indicating the best of the best. The design was then altered to reflect a four-point scale, with four being the highest value.
The feature team would be unaware that the definition had been refactored unless they were aware of the choice or were closely watching production data. This would mean that the space quality component of the computation would be inaccurate by at least 20%, negatively impacting the estimated ideal rent.
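To make the size of that error concrete, here is a small Python sketch. It assumes, purely for illustration, that the rent calculation normalizes the quality score by dividing it by the top of the scale; the function name and the normalization are hypothetical and not taken from the original system.

```python
def normalize_quality(score: int, scale_max: int) -> float:
    """Map a space quality score onto the range 0..1 (assumed normalization)."""
    if not 1 <= score <= scale_max:
        raise ValueError(f"score {score} is outside the 1..{scale_max} scale")
    return score / scale_max


# Upstream now ranks the best space as 4 on a four-point scale.
best_score = 4

stale = normalize_quality(best_score, 5)    # consumer still assumes a 1-5 scale
correct = normalize_quality(best_score, 4)  # consumer aware of the new scale

# Fraction by which the stale calculation understates the quality factor.
relative_error = (correct - stale) / correct
```

Under this assumption, the best possible space is scored as 0.8 instead of 1.0. Because every score is divided by 5 instead of 4, the quality factor is understated by the same 20% across the board, and no individual value looks obviously wrong.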
Including Data Diff in the Development Process
The Ideal Rent application relied on extract, transform, and load (ETL) services. In other words, it gathered the required information from the source systems and converted it into a format that the program recommending the best rent could understand. Changes to the underlying data went unnoticed at this level, with detrimental consequences for the decisions based on that data.
Adding Data Diff to the workflow simply adds another step to the continuous integration (CI) process. After you configure the data sources linked to your integration and add Datafold to your dbt setup, the results of a Data Diff test appear as part of your pull request review.
As a result, everyone who is involved in the PR process has access to Datafold’s data quality analysis.
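The idea behind a data diff can be illustrated with a short Python sketch. This is a conceptual example only and does not use or represent Datafold's API: the function name, the table layout, and the sample rows are all hypothetical. It compares two snapshots of a table by primary key and reports which rows were added, removed, or changed.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(before: list[Row], after: list[Row], key: str) -> dict[str, list]:
    """Compare two snapshots of a table, matching rows on a primary key."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    changed = []
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        # Collect the columns whose values differ between the snapshots.
        columns = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if columns:
            changed.append((k, columns))

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of a "spaces" table before and after an upstream change.
before = [
    {"space_id": 1, "quality": 5, "sic_code": "5944"},
    {"space_id": 2, "quality": 3, "sic_code": "5531"},
]
after = [
    {"space_id": 1, "quality": 4, "sic_code": "5944"},
    {"space_id": 2, "quality": 3, "sic_code": "J001"},
    {"space_id": 3, "quality": 2, "sic_code": "5531"},
]

report = diff_rows(before, after, key="space_id")
```

Here the report flags space 3 as added, and spaces 1 and 2 as changed in the `quality` and `sic_code` columns respectively. Surfacing that kind of summary in a pull request is what gives reviewers a chance to ask whether the change was intended.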
But Wait, There’s More
You might be thinking at this point that there’s still a gap here. Data quality checks can’t be confined to the CI/CD pipeline, which only runs when there’s a code change and a pull request. What happens if the code for the Ideal Rent application hasn’t changed, but the source data’s rules have?
Datafold’s column-level lineage function comes in handy here. When the engineering or data teams are only exploring data rule modifications, they could wonder, “How would the data used in our final calculations be altered if our query included values from that table’s field as well?” Using column-level lineage, the team can see how data travels through the cascade of queries and transformations. Make a modification here and observe how it affects your downstream data.
Whether it was the data team or the engineering team, the team would use Datafold’s UI to view and understand how upstream changes to data rules affect downstream data. This analysis is performed independently of the CI/CD pipeline and code changes.
Remember that you must be able to detect data quality issues without requiring a code modification. After all, the Ideal Rent development environments might not have all of the changes that the source systems have, so there needs to be a safeguard to protect production users making data-driven decisions.
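Conceptually, column-level lineage is a directed graph in which each column points to the columns derived from it. The Python sketch below walks such a graph to list everything downstream of a given column. It is an illustration of the idea only, not Datafold's implementation or API, and every table and column name in it is hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each column maps to the columns derived from it.
LINEAGE = {
    "leasing.spaces.quality_rank": ["staging.space_scores.quality_norm"],
    "leasing.tenants.sic_code": ["staging.tenant_profile.industry_group"],
    "staging.space_scores.quality_norm": ["ideal_rent.inputs.space_factor"],
    "staging.tenant_profile.industry_group": ["ideal_rent.inputs.tenant_factor"],
    "ideal_rent.inputs.space_factor": ["ideal_rent.output.recommended_rent"],
    "ideal_rent.inputs.tenant_factor": ["ideal_rent.output.recommended_rent"],
}


def downstream_columns(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every column derived from `column`."""
    seen = set()
    order = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # visit each column once, even with shared paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_columns("leasing.spaces.quality_rank", LINEAGE)
```

In this made-up graph, a change to the space quality ranking reaches the recommended rent through two intermediate columns. Seeing that path before the change ships is what lets a team assess the impact without waiting for a code change to trigger the CI pipeline.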
This is why data quality is so important for every application that relies on data to make decisions. Data lineage technologies, such as Datafold’s column-level lineage analysis, can aid with this.
I’ve been striving to live by the following mission statement since 2021, which I believe can be applied to any IT professional:
“Focus your time on delivering features/functionality which extends the value of your intellectual property.
Leverage frameworks, products, and services for everything else.”
– J. Vester
The importance of data quality is shown in this post by an experience I had early in my career. A lack of data quality will always have a disastrous effect on data-driven decision-making systems.
Corporations that rely on data to make key choices should consider data quality tooling, which should be part of the software development lifecycle.
Have a fantastic day!