Clearing the Polluted Data Streams – Flying Blind
Whether organisations are designing and building automobiles, healthcare systems, marketing campaigns or any of the other products, services, systems and infrastructure we all use, most, if not all, share a common data life cycle. At a high level, that cycle's stages are conceptualisation, definition, design, build and operation. In practice, the cycle may iterate between phases many times as detected problems are corrected and other modifications are made. Since Boehm published ‘Software Engineering Economics’ in 1981, numerous studies (by IBM, NASA and others) have examined the cost of fixing, at later stages of the life cycle, problems introduced at earlier stages. The escalation is exponential, demonstrating that early detection of errors is critical to the success of a product or service. As the public media occasionally reminds us, detection of mission-critical problems by the user or customer can threaten the existence of the product, or even of the producing company itself.
For an organisation working on early detection of data-related errors, two major aspects need to be addressed.
The first question to answer is: what is important about the data, i.e. what are its critical attributes?
For example, many information systems use the concept of Person. In some systems Person may be Employee or Customer or even both. Some of the attributes of Employee and Customer may be the same such as legal name and residence address. Other critical attributes will be different such as ‘employment status’ versus ‘products purchased’. Whether documented in business glossaries, concept diagrams, interviews or elsewhere, complete and accurate identification and definition of the business terms, business rules, constraints, usage and interrelationships are the prime focus of the conceptualisation phase. Evaluating the completeness and accuracy of these attributes is a fundamental technique for early detection of potential errors prior to the design phase.
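The idea of evaluating completeness before design can be sketched in code. The sketch below is illustrative only: the attribute names and the choice of required glossary fields (definition, business rule, usage) are assumptions, not a prescribed standard.

```python
from dataclasses import dataclass

# Hypothetical glossary entry: each critical attribute carries a definition,
# a business rule and usage notes; a blank field flags an incomplete
# conceptualisation-phase deliverable before design begins.
@dataclass
class AttributeSpec:
    name: str
    definition: str = ""
    business_rule: str = ""
    usage: str = ""

    def gaps(self):
        # Return the glossary fields still missing for this attribute.
        return [f for f in ("definition", "business_rule", "usage")
                if not getattr(self, f).strip()]

# Shared and role-specific attributes of Employee, as in the example above.
employee = [
    AttributeSpec("legal_name", "Full name as it appears on government ID",
                  "Must match payroll records", "HR, payroll"),
    AttributeSpec("employment_status",
                  "Current standing: active, on leave, terminated",
                  "One value from a controlled vocabulary"),  # usage left blank
]

for spec in employee:
    if spec.gaps():
        print(f"{spec.name}: missing {', '.join(spec.gaps())}")
# → employment_status: missing usage
```

Running a check like this over the whole glossary surfaces incomplete definitions while they are still cheap to fix, before they flow into the design phase.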
Besides receiving incomplete or inaccurate information from the conceptualisation phase, a principal source of errors in the design phase is the absence of data design standards (including design methods), inconsistent standards, or failure to adhere to existing validated standards. A corollary is that qualified data design professionals, trained in tools such as model-based design tools, must also be available to implement these standards. Typical design-level errors include redundant data elements with missing or conflicting meaning and usage, failures to accurately implement the business’ intentions, and conflicting or inaccurate datatypes and invalid constraints, among other problems. Similarly, in the build phase, where the task is to map the data design onto the physical architecture, standards relating to the chosen RDBMS are required for successful implementation.
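One of the design-level error types named above, the same data element declared with conflicting datatypes in different schemas, can be detected mechanically. The following is a minimal sketch, assuming hypothetical draft schemas and column names; it is not a substitute for a full model-based design tool.

```python
# Hypothetical design-time lint: scan draft column definitions across
# schemas and report any element name declared with more than one datatype.
from collections import defaultdict

schemas = {
    "employee": {"legal_name": "VARCHAR(200)",
                 "residence_address": "VARCHAR(500)",
                 "employment_status": "CHAR(1)"},
    "customer": {"legal_name": "VARCHAR(100)",   # conflicts with employee
                 "residence_address": "VARCHAR(500)"},
}

types_by_column = defaultdict(set)
for table, columns in schemas.items():
    for col, dtype in columns.items():
        types_by_column[col].add(dtype)

# Any column with more than one declared type is a datatype conflict.
conflicts = {col: sorted(ts) for col, ts in types_by_column.items()
             if len(ts) > 1}
print(conflicts)  # → {'legal_name': ['VARCHAR(100)', 'VARCHAR(200)']}
```

The same pass can be extended to flag near-duplicate element names, a common symptom of redundant data elements with conflicting meaning.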
The second question to answer is: how does an organisation know it is looking for, and finding, errors throughout the life cycle?
The underlying principle is accountability. In less mature organisations, or for very simple systems, accountability often rests with the developers during the design and build phases; they may also be called upon to validate the information from the conceptualisation phase. As the complexity of the data systems and/or the maturity of the organisation’s data management practices increases, the concepts of ownership and stewardship, often implemented independently, become institutionalised. With this in place, detection and measurement of error types and rates for the relevant attributes can be made at convenient points in the data life cycle. Analysis of this information leads to understanding and correcting the root causes of these errors.
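Measuring error types and rates at life-cycle checkpoints can be as simple as a tally. The sketch below uses invented phases, error types and inspection counts purely for illustration; real figures would come from the organisation's own defect logs.

```python
# Hypothetical checkpoint measurement: count detected errors by phase and
# type, then compute a rate per attribute inspected so trends can feed
# root-cause analysis. All data here is illustrative only.
from collections import Counter

detected = [  # (phase, error_type) for each defect logged at a checkpoint
    ("design", "conflicting_datatype"),
    ("design", "redundant_element"),
    ("design", "conflicting_datatype"),
    ("build", "invalid_constraint"),
]
inspected = {"design": 120, "build": 80}  # attributes reviewed per phase

counts = Counter(detected)
for (phase, etype), n in sorted(counts.items()):
    rate = n / inspected[phase]
    print(f"{phase:6} {etype:22} {n} errors, rate {rate:.4f} per attribute")
```

A recurring high rate for one error type in one phase points at a systemic root cause (for example, a missing or ignored design standard) rather than individual mistakes.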
Some organisations may perceive that standards and stewardship strategies negatively impact time to market. However, once they weigh the cost of repairing errors, and the damage error-prone products do to customer acceptance and to their brand, they may soon realise they need to step up their game.
Organisations that fail to institute methods and means to detect errors through the data life cycle are ‘flying blind’. That is no longer affordable.