The business case for big data platforms has, to date, been based on generating incremental business value from the insights gained by analyzing vast data sets available across the enterprise and beyond. The promise of delivering this value by simply dumping all of an organization’s data into a “data lake” with a set of analytical tools on top has failed to materialize. As a result, Gartner predicts that through 2017, over 60% of big data implementations will fail.
The reason that organizations have failed to realize the value isn’t because having the data in one place isn’t valuable; it’s that the data in the lake is often sitting in its raw form. While identifying how data will be used and fully defining it upfront could help to address this problem, it limits the speed at which value can be delivered and hinders the ability to develop new insights where the value is not yet known. Organizations must rethink data governance in this world of unknowns: providing guidance that maximizes the value that can be derived from big data while minimizing risk to the organization from that data being used incorrectly.
Rethinking Data Governance for Big Data
All data starts off in a raw and messy form, often from multiple sources and of unknown quality. Before analysis, most data gathering methods are indiscriminate, netting irrelevant and bad data along with reputable information. Trying to extract insight from that jumbled mess without first cleaning it up is about as straightforward as trudging through a bog – progress is possible but not a foregone conclusion.
Historically, the only people handling these big data platforms were experienced data scientists with the time and passion to validate the quality of data assets. Self-service analytics has significantly expanded who has access to this raw data, and for many organizations that means more end users spending their time querying, cleansing, and preparing data sets in search of value.
With more end users involved, each with discrete needs, data governance processes must evolve to enable the curation of data sets that are fit for a variety of purposes. While some users need truly raw data – fraud detection is one such case – most need data that is more refined.
Consider this revenue-driving healthcare scenario. The average healthcare payer stores petabytes of data ranging from member/patient claims and medical history to member engagement and general workforce management. The ability to refine trusted sources of member data into a Member 360 improves the customer experience, and when supplemented with external sources of clinical data, it helps analysts and quality stakeholders pinpoint opportunities to raise Medicare Star Ratings and HEDIS scores. An incremental movement of just half a point can mean millions in potential reimbursement and incentive dollars.
Accelerating Big Data Analytic Value through Data Governance
Imposing order on raw data requires intention and a firm understanding of data governance principles. Data-handling procedures, focused analytical scopes, and a user-centric mindset all need to be built on top of that unprocessed foundation. Organizations that are new to data governance, or looking to evolve their program in the face of big data, often find the following strategies enhance their analytical abilities the most.
Formalizing data stewardship – One of the biggest challenges in working with big data is that most end users don’t have the time or expertise to fully comprehend raw data. The business rules that have traditionally gone into processing these data sets often reside in the heads of SMEs who have full-time day jobs. Making this information relevant and discernable to other departments that need it can quickly become a burden on SMEs as more users reach out to understand the data.
Formalizing the role of the data steward enables SMEs to focus part of their time on documenting those business rules once, so that they are available to the broader organization. These data governance responsibilities become part of each steward’s goals and objectives for the year, so that they get credit for the time spent supporting governance. As organizations mature their data governance programs, metrics can be established to ensure the organization is realizing value from the data lake and that the lake isn’t becoming a swamp of poor-quality data with little to no analytical value.
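One such lake-health metric could be the share of registered data sets that have both a named steward and documented business rules. The sketch below is purely illustrative – the field names and sample data sets are hypothetical, not drawn from any specific governance tool.

```python
# Hypothetical lake-health metric: fraction of data sets that are
# fully stewarded (named steward + at least one documented rule).

def documentation_coverage(datasets):
    """Return the fraction of data sets with a steward and documented rules."""
    if not datasets:
        return 0.0
    documented = sum(
        1 for d in datasets
        if d.get("steward") and d.get("business_rules")
    )
    return documented / len(datasets)

# Illustrative inventory: one stewarded data set, one orphaned one.
datasets = [
    {"name": "member_claims", "steward": "J. Doe",
     "business_rules": ["exclude voided claims"]},
    {"name": "web_clickstream", "steward": None, "business_rules": []},
]

print(documentation_coverage(datasets))  # 0.5
```

Tracking a number like this over time gives the governance program an early warning that the lake is drifting toward a swamp.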
An important note: while the role is formalized, it is organizationally critical that these stewards remain connected to their operational areas. Business rules evolve over time, and stewards can quickly become disconnected if pulled out of these day-to-day processes.
Developing the Data Catalog – It is important to equip the data stewards with the proper tooling to document the data sets within the lake. This allows them to collaborate with the data users so that the knowledge is retained, cataloged, and easily searchable in the future. The data catalog is most effective when it is integrated into the environment in which analysts are already working with the data.
This is where traditional metadata management solutions have fallen short: they have primarily served a small number of IT staff who maintain ETL code. The modern data catalog enables the organization to establish a central business glossary, register all data coming into the organization, link data back to the glossary, and begin to capture the business rules applied to each data set along with the rationale behind them. The more advanced solutions even allow users to add trusted data sets to a virtual shopping cart so that they can check out the data sets they need.
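The register-link-search workflow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API; the entry fields, data-set name, and glossary term are all assumed for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset_name: str
    glossary_term: str              # link back to the central business glossary
    steward: str                    # SME accountable for this data set
    business_rules: list = field(default_factory=list)  # rules plus rationale

catalog = {}

def register(entry):
    """Register a data set as it enters the organization."""
    catalog[entry.dataset_name] = entry

def search(keyword):
    """Keyword search across data-set names and glossary terms."""
    kw = keyword.lower()
    return [e for e in catalog.values()
            if kw in e.dataset_name.lower() or kw in e.glossary_term.lower()]

register(CatalogEntry(
    dataset_name="member_claims_2024",
    glossary_term="Member Claim",
    steward="claims-operations",
    business_rules=["exclude voided claims: they distort utilization metrics"],
))
print([e.dataset_name for e in search("claim")])  # ['member_claims_2024']
```

Even at this toy scale, the design choice matters: because every entry carries its glossary term and steward, an analyst who finds a data set also finds who to ask about it and what rules have already been applied.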
Prioritizing Data Quality – In data governance, cleansing data for quality is a reactive fix to a known challenge. Feeding in sources that are incorrect, shallow, or missing metadata forces a future cleanup, and in the meantime bad data sneaks into the analysis process. Strong data governance ensures team involvement to isolate potential contaminants before they infect the data lake.
Data quarantine works effectively as a governance strategy. By landing all incoming data in quarantined storage first, data scientists are able to explore and juxtapose the data against business rules, standards, and verified data sets, ensuring that only the best structured and unstructured data reach end users.
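A quarantine gate can be as simple as a set of rule predicates that every record must pass before promotion to the lake. The sketch below assumes hypothetical field names (`member_id`, `claim_amount`) and sample rules for illustration only.

```python
# Promote a record from quarantine to the lake only if every rule passes.
# Field names and rules are illustrative, not from any real pipeline.

def passes(record, rules):
    return all(rule(record) for rule in rules)

rules = [
    lambda r: r.get("member_id") is not None,                   # completeness
    lambda r: isinstance(r.get("claim_amount"), (int, float)),  # type check
    lambda r: r.get("claim_amount", 0) >= 0,                    # plausibility
]

quarantine = [
    {"member_id": "M100", "claim_amount": 250.0},
    {"member_id": None, "claim_amount": 90.0},     # fails completeness
    {"member_id": "M101", "claim_amount": "n/a"},  # fails type check
]

lake = [r for r in quarantine if passes(r, rules)]
rejected = [r for r in quarantine if not passes(r, rules)]
print(len(lake), len(rejected))  # 1 2
```

The rejected records stay in quarantine where stewards and data scientists can inspect them, refine the rules, and resubmit – rather than silently contaminating downstream analysis.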
Narrow the Initial Focus – The potential of big data analysis encourages some organizations to launch with a big-bang approach. However, data governance cannot establish relevant context at such a comprehensive scope. Running a precise use case in a test department, led by a single champion, defines that context. From there, data scientists screen out everything but the information relevant to the case.
This more measured approach provides organizations with a substantial ROI. Because there are fewer test cases running simultaneously, there is less redundancy as new parameters are explored. And by determining data governance rules on a smaller scale, less time is wasted fumbling for data standards that apply across divisions. This smaller scope then guides the overall process.
Why Data Governance Is One Piece of Smart Big Data Strategies
An organization’s big data platform is a sophisticated system that requires careful delivery across several layers. Data governance is a central part of unifying the overall analytical strategy, but it can still fall short if certain obstacles to implementation are not addressed carefully.
Want to learn more about the most common threats to big data implementation? Download our whitepaper, “5 Avoidable Big Data Platform Mistakes that Companies Still Make.” Our team of industry experts explores where companies struggle and how to effectively unleash a strong big data strategy.