Data Governance Tutorial
In the 15 years I’ve been working in data-related roles, I’ve seen the growing focus on data governance.
In this data governance tutorial, we’ll look at what data governance is, why it’s important, and the key details that set good data governance apart from poor (or nonexistent) governance.
As the amount of available data has grown and the regulations around said data have also started to appear, many organizations have started to think more about data governance.
What is Data Governance?
Data governance is the set of rules, processes, and accountability around data.
Organizations need to plan how they use data so that it’s handled consistently throughout the business.
Successful data governance considers the who, what, how, when, where and why of the data it’s governing.
The goal of data governance is to ensure security and compliance while also creating value from the data that’s collected and stored by the business.
What is the Difference Between Data Governance and Data Management?
While data governance creates a framework, data management is more focused on the execution of the rules and processes.
If you want to know more about data management, check out this video.
Why does Data Governance matter?
The goal of data governance isn’t just to put a lot of rules in place.
Ideally, it ensures that consistent, quality data is accessible to the right people in an efficient way across the organization.
It should make the business more effective at meeting its goals and maintaining compliance with applicable regulations.
How to Get Started with Data Governance
1. Define Data Governance Roles
A preliminary step in setting up a data governance framework is figuring out who is going to be involved.
There are normally a few roles related to data governance, some of which may be performed by the same person, depending on the size of the organization.
Data Owners or Sponsors: These people are able to make and enforce decisions about the data.
There are typically many data owners or data sponsors who handle different sets of data - perhaps customer records, product records, and employee records.
Data owners are ultimately responsible for the data.
Data Stewards / Data Champions / Subject Matter Experts: Data Stewards are the folks that really understand the data.
They’re often subject matter experts on a particular type of data and should be consulted on how to take care of the data.
They help make sure that data policies and standards are adhered to in the business.
Data Governance Committee: There are normally many different data owners, data stewards, and data users within an organization.
Because of this, there can easily be different approaches or standards throughout a company.
A data governance committee approves data policies and standards to maintain uniformity. They also handle escalated issues or conflicts.
In addition to establishing who is involved, a very early step in data governance is determining what your goals are.
2. Set a Scope
What’s the scope of data you’re going to be working on?
It’s tempting to say - we want to govern all the data!
But realistically, trying to document and set up rules and processes for everything at the same time is usually a recipe for disaster or a delivery years in the future.
Instead, define a scope for the project.
This means determining what you are focusing on - and sometimes of equal importance - what you aren’t.
This can take many different forms.
If your organization needs to maintain regulatory compliance in how you collect, maintain or process some of your data, this is often a great place to start.
For example, maybe your initial scope of data is all customer data.
Setting the scope and determining the members are often intertwined. If you establish the participants first, this will usually drive the topics. If you set the topics first, this can often change the participants.
Once you have the Who and What in place, it’s time to move on to more details.
3. Document the Data
Document what data you have available.
These are your data sources.
Include at least the following information:
- Where does it come from?
- Who owns it?
- Who is the expert on it?
- How often is it updated?
- Who has access to the data and how do they use it?
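As a rough illustration, the questions above could be captured in a simple inventory record. This is only a sketch: the `DataSource` class, its field names, the `missing_answers` helper, and the sample values are all made up for this example and are not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One entry in a data source inventory (field names are illustrative)."""
    name: str
    origin: str            # where the data comes from
    owner: str             # who owns it
    expert: str            # who is the expert on it
    update_frequency: str  # how often it is updated
    # who has access, mapped to what they actually use the data for
    users: dict[str, str] = field(default_factory=dict)


def missing_answers(source: DataSource) -> list[str]:
    """Return the inventory questions that are still unanswered for a source."""
    gaps = [
        name for name in ("origin", "owner", "expert", "update_frequency")
        if not getattr(source, name).strip()
    ]
    if not source.users:
        gaps.append("users")
    return gaps


# Hypothetical entry with the owner and the users not yet documented.
warranty_claims = DataSource(
    name="warranty_claims",
    origin="dealer claim forms",
    owner="",
    expert="J. Doe",
    update_frequency="daily",
)
print(missing_answers(warranty_claims))
```

A check like this makes it easy to see which sources still need follow-up before any rules are written.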
Before you start thinking about what rules you want to put in place, it’s key to know how data is currently used.
Especially in larger organizations it can be common to make assumptions about how you think people SHOULD use the data, but that doesn’t mean this is the case.
For anyone that has access to the data, check what they are using it for.
4. Check for Multiple Data Sources
You may find as you start exploring what data is available that you have multiple sources for the same information.
This is a great time to start asking questions.
What’s the quality of data from each source?
For instance, suppose I were working for an automotive company and had multiple sources for vehicle mileage, including self-reported mileage on warranty claims and mileage read remotely off of the vehicle’s control units. The mileages read automatically by the system are likely to be more correct.
This leads to questions about whether you should continue to collect the data in multiple ways and which source becomes the authority to be used within the organization.
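One way to start answering the quality question is to compare the sources directly. The sketch below assumes both sources can be keyed by a vehicle identifier. The function name, the tolerance value, and the sample readings are invented for illustration.

```python
def mileage_disagreements(reported, remote, tolerance=500):
    """Return vehicles whose two mileage readings differ by more than `tolerance`.

    Only vehicles present in both sources are compared.
    """
    flagged = {}
    for vehicle_id in reported.keys() & remote.keys():
        difference = abs(reported[vehicle_id] - remote[vehicle_id])
        if difference > tolerance:
            flagged[vehicle_id] = difference
    return flagged


# Hypothetical readings: self-reported on warranty claims vs. read from the control unit.
reported = {"VIN001": 42000, "VIN002": 18500, "VIN003": 60000}
remote = {"VIN001": 42150, "VIN002": 23500, "VIN004": 100}

print(mileage_disagreements(reported, remote))
```

How often the sources disagree, and by how much, is useful evidence when deciding which one becomes the authority.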
As you look into the available data, you’ll usually find that the data related to a particular business topic isn’t just coming from one source.
Rather, there are normally multiple different sources of data that are combined together to give a complete picture.
Let’s take an example of sales data.
You may think that this can easily be maintained within one source.
In practical terms though, sales data is often made up of a few different sources.
You have product information - what do you sell? What variations do you offer?
You have a customer list - who have you sold to? Who have you given quotes to that might purchase?
And then you have actual order information - when was the order placed? How many items were ordered?
5. Map Your Data
To combine all of these, we get into data mapping.
Data mapping tells us how the information from one source aligns - or maps to - data in another source.
In the sales example, the product information probably maps to the order information based on a part number or item number.
The customer list maps to the order based on a customer name. The customer list and the product information don’t directly map to each other at all. They only experience overlap when there’s a third set of data that relates to them both.
Sometimes there’s more than one step in between relating 2 sources of data.
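To make the sales example concrete, here is a minimal sketch of that mapping. It assumes products are keyed by item number and customers by customer name, as described above. The table contents and the `map_orders` function are hypothetical.

```python
# Hypothetical source tables.
products = {"P-100": {"description": "T-shirt", "price": 12.50}}
customers = {"Acme Ltd": {"city": "Leeds"}}
orders = [
    {"order_id": 1, "item_number": "P-100", "customer_name": "Acme Ltd", "quantity": 4},
    {"order_id": 2, "item_number": "P-999", "customer_name": "Acme Ltd", "quantity": 1},
]


def map_orders(orders, products, customers):
    """Join each order to its product and customer; report orders that fail to map."""
    mapped, unmapped = [], []
    for order in orders:
        product = products.get(order["item_number"])
        customer = customers.get(order["customer_name"])
        if product is None or customer is None:
            unmapped.append(order["order_id"])
            continue
        mapped.append({
            "order_id": order["order_id"],
            "description": product["description"],
            "city": customer["city"],
            "total": order["quantity"] * product["price"],
        })
    return mapped, unmapped


mapped, unmapped = map_orders(orders, products, customers)
print(mapped)
print(unmapped)
```

Note that the products and customers only meet through the orders. The list of unmapped orders is worth keeping, since it points at gaps in one of the sources.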
6. Create and Update Metadata
Metadata helps inform everyone more about the data.
Think of it as an interpretation tool.
Metadata establishes specific details about each type of data usually in a few fields.
It tells us what format of data the field contains and usually the general contents.
For instance, in our order information table, the order date will have data in a specific date format. A second field in the metadata tells us this is the date the customer placed the order.
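A metadata entry like this can be as simple as a small lookup table. The sketch below is an assumption about how it might look: the `order_metadata` structure, the chosen date format, and the `matches_format` helper are illustrative only.

```python
from datetime import datetime

# Hypothetical metadata for one field of the order information table.
order_metadata = {
    "order_date": {
        "format": "%Y-%m-%d",  # assumed format for this example
        "description": "Date the customer placed the order",
    },
}


def matches_format(field_name, value, metadata):
    """Check whether a value follows the date format recorded in the metadata."""
    expected = metadata[field_name]["format"]
    try:
        datetime.strptime(value, expected)
    except ValueError:
        return False
    return True


print(matches_format("order_date", "2024-03-15", order_metadata))
print(matches_format("order_date", "15/03/2024", order_metadata))
```

The description tells people what the field means, and the format gives a program something it can check.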
Ideally different tables of data have a clear, direct mapping of how they relate to each other with fields that overlap from table to table.
This isn’t always possible or feasible though. In those cases, data scraping can come in handy.
7. Find Missing Data
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program.
For instance, if you require a customer to specify the color of shirt they want in a text box when they order, you might data scrape to automatically identify and pull that color out into a separate field.
Often data scraping becomes much more complex in an attempt to add a routine structure to data that’s not all formatted the same way - or not formatted in a useful way.
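For the shirt color example, a very simple version might use a regular expression. This is a sketch under the assumption that there is a short, known list of colors. The color list and the `extract_color` function are invented for illustration, and real order notes are usually messier.

```python
import re

# Assumed list of colors the business actually sells.
KNOWN_COLORS = ("red", "blue", "green", "black", "white")
COLOR_PATTERN = re.compile(r"\b(" + "|".join(KNOWN_COLORS) + r")\b", re.IGNORECASE)


def extract_color(note):
    """Return the first known color mentioned in a free-text note, or None."""
    match = COLOR_PATTERN.search(note)
    return match.group(1).lower() if match else None


print(extract_color("Please send the BLUE one, size M"))
print(extract_color("size L please"))
```

Orders where nothing is found can be routed to a person for review rather than guessed at.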
8. Maintain Data Integrity
I mentioned data quality as an important aspect to be aware of. This consists of many subtopics which I won’t fully cover today.
One important subtopic related to data quality though, is data integrity.
Think of data integrity as how well the accuracy, validity, and consistency of the data are maintained across its lifecycle.
Just because you have a data source that you know is useful and accurate doesn’t mean you can assume it’s always good to use.
There are many things that can go wrong.
Problems can arise when a change is introduced to how information is collected - including WHO or WHAT systems are collecting the information. They can also arise after the data is collected but before it is presented to the end user or users.
There could be changes in the way the data is processed or interpreted or even labeled within the system which can lead people to inaccurate results.
Having processes like data validation and error checking in place helps ensure consistently high data integrity and flags problems quickly so they can be resolved.
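As a minimal sketch of what validation and error checking could look like for the order data, the function below applies a few example rules. The rules, the field names, and the `validate_order` function are assumptions made for this illustration. Your own rules would come out of the governance work itself.

```python
from datetime import datetime


def validate_order(order):
    """Return a list of problems found in one order record (empty if none)."""
    errors = []
    for required in ("order_id", "order_date", "quantity"):
        if order.get(required) in (None, ""):
            errors.append(f"missing {required}")

    order_date = order.get("order_date")
    if isinstance(order_date, str) and order_date:
        try:
            datetime.strptime(order_date, "%Y-%m-%d")
        except ValueError:
            errors.append("order_date is not YYYY-MM-DD")

    quantity = order.get("quantity")
    if isinstance(quantity, int) and quantity <= 0:
        errors.append("quantity must be positive")
    return errors


print(validate_order({"order_id": 1, "order_date": "2024-03-15", "quantity": 2}))
print(validate_order({"order_id": 2, "order_date": "15/03/2024", "quantity": 0}))
```

Running checks like these whenever data is loaded means a change in how data is collected shows up as a flagged record instead of a wrong report.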
9. Finalize a Data Governance Framework
Ultimately all the work of data governance should lead to a set of rules, processes and policies to ensure good data accessed by the right people with accountability in place.
This work also should inform business policies as well as data management.
Data governance isn’t a one-off activity.
There’s usually a high level of work when first setting up governance simply because there’s a lot to learn, a lot of information to collect, and often changes to be made.
Once this initial work is complete, there’s still a need to revisit the policies that are put in place and to track how well they’re being followed.
Over time, new data sources will be established, old sources may become legacy sources or disappear, and the people who need access to the data may change.