Purpose
Provide guidance to Information Managers and Architects on
design concepts, development and implementation approaches, opportunities for
improvement, and future direction of Analytical Data Warehouses and Data
Marts.
Background and History
The most fundamental Analytical system is perhaps the Analytical Data Warehouse. No other
informational system is designed for such a singular analytical purpose as an
Analytical Data Warehouse. Before we
provide a definition of a Data Warehouse, let’s first look at how Data
Warehouses came into existence. Knowing
the history of a system methodology is a great way to put its definition into
context.
There is general consensus that Data Warehousing first
appeared in the 1980s. The components of
Data Warehousing, however, were being created as early as the 1960s. There are
many different individuals credited with inventing Data Warehousing. Wikipedia has a great chronological timeline
that provides a lot of context to the evolution of data warehousing
methodologies.
We have added the timeline to this blog. Go here for the complete Wikipedia article: http://en.wikipedia.org/wiki/Data_warehouse
- 1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[2]
- 1970s — ACNielsen and IRI provide dimensional data marts for retail sales.[2]
- 1970s — Bill Inmon begins to define and discuss the term Data Warehouse.
- 1975 — Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary Enterprise Data Warehousing platforms).
- 1983 — Teradata introduces a database management system specifically designed for decision support.
- 1983 — Martyn Richard Jones of Sperry Corporation defines the Sperry Information Center approach, which, whilst not being a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. First used at the TSB England & Wales.
- 1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
- 1988 — Barry Devlin and Paul Murphy publish the article An architecture for a business and information system in IBM Systems Journal where they introduce the term "business Data Warehouse".
- 1990 — Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
- 1991 — Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a Data Warehouse.
- 1992 — Bill Inmon publishes the book Building the Data Warehouse.[3]
- 1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
- 1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.[4]
- 2000 — Daniel Linstedt releases the Data Vault, enabling real-time, auditable Data Warehouses.
If you have any familiarity with Data Warehousing, you will
see some interesting items in this timeline.
The first item that should jump out at you is that the concepts of
Dimensions and Facts, which are fundamental elements of Data Warehousing, were
created prior to the Data Warehouse.
Another item that you should notice is the repeated and competing
contributions of two key Data Warehousing individuals, Bill Inmon and Ralph
Kimball. Bill Inmon and Ralph Kimball
are considered by many to be the founders of Data Warehousing, though IBM
surely disputes this fact. The reason
these two distinguished IT leaders are considered the founders of Data
Warehousing is that each developed a unique, practical, and somewhat dogmatic
methodology for implementing Data Warehouses.
We will discuss both Inmon’s and Kimball’s approaches to Data
Warehousing later in this blog.
Another very interesting aspect of this timeline is that it paints the picture that no innovation in Data Warehousing has occurred since the introduction of the Data Vault in 2000. We will describe the Data Vault concept later. While there has been no major data modeling or methodology innovation in
Data Warehousing since 2000, there have been huge technical innovations that
affect the Data Warehouse. These technical innovations can be summarized into
three categories:
- Improving Data Warehousing
- Includes items such as Massively Parallel Processing (MPP) and columnar database technologies that have enabled huge gains in the performance and scalability of Data Warehousing.
- Augmenting Data Warehousing
- Includes items such as Big Data appliances that specialize in capturing and presenting data elements that have traditionally been too costly or impractical to place into a Data Warehouse, such as unstructured data (i.e. data not stored in a database) or clickstream data (i.e. website behavioral data). These items are allowing the traditional Data Warehouse to focus on what it does best while providing end users access to other forms of information.
- Replacing Data Warehousing
- Includes items such as NoSQL databases and in-memory appliances that aim to replace the need for Data Warehousing by providing the same functionality of a Data Warehouse without all of the effort associated with traditional Data Warehouses.
As you look at the background and history of Data
Warehousing you will notice a theme of technology-enabled advancement in this
field. The components of an Analytical Data Warehouse were invented first. These
components were born out of the academic world and were initially theoretical
concepts because technology couldn’t support their realization. In the 1980’s,
technology finally began to “catch-up” to the theory of Data Warehousing and
individual concepts were merged together to form a practical methodology, actually two: Inmon's and Kimball's versions of Data Warehousing. During the last
part of the 20th century Analytical Data Warehouses were focused on
storing structured, electronic data in a relational database format to provide
end users fast access to aggregated information for decision making purposes.
The introduction of the Data Vault methodology at the onset of the 21st century signaled that data centric technologies were becoming sophisticated enough to evolve Analytical Data Warehouses past the point of being seen as a specialized decision support system, to a point where they were viewed as a hybrid system supporting both analytical and operational requirements.
We will now define the terms related to Analytical Data
Warehouses keeping in mind the context and knowledge of their history and
background.
Definitions
The following terms are common in Data Warehousing and need
to be clearly defined/understood.
Analytical Data Warehouse
- Specialized data storage system that combines disparate data sources into a clean, business centric, historically accurate, user friendly, cohesive view of an enterprise's business information. Analytical Data Warehouses turn raw data into business friendly information that allows business users to easily perform enterprise analytical functions, such as performance management or business optimization. Analytical Data Warehouses contain an enterprise wide view of information that combines many different views of business unit information.
Analytical Data Mart
- Specialized data storage system that combines disparate data sources into a clean, business centric, historically accurate, user friendly, cohesive view of a business unit's business information. Analytical data marts turn raw data into business friendly information that allows business users to easily perform business unit analytical functions, such as marketing segmentation or sales forecasting. Analytical data marts contain a business process specific view of information that focuses on a select few specific business processes.
Information Factory
- Specialized data management architecture designed to manage the data created by producers, i.e. operational systems, and transferred to consumers, i.e. analytical systems. This methodology is much broader than a logical or conceptual database design methodology as it incorporates larger, non-database data management elements such as integration, meta-data management, etc. Data is stored in a central, normalized Enterprise Data Warehouse and distributed to business department aligned Data Marts.
Data Bus Architecture
- Specialized data management architecture designed to support institutional decision making leveraging a coordinated collection of related business process aligned Data Marts. One of the central themes to the Data Bus Architecture is to align Data Marts to business processes rather than business departments. This means having a sales transaction Data Mart that is leveraged across Marketing and Stores instead of having independent Marketing and Stores Data Marts with isolated sales transaction data within them. A Data Warehouse is created by combining the business process oriented Data Marts through the use of common (conformed) dimensions.
Dimensional Model
- Specialized data management design methodology that is used in analytical data systems to provide superior query performance for end users. Dimensional Models are de-normalized, meaning they do not adhere to the normalization rules that many transaction systems use. Instead, data is physically stored in Fact and Dimension tables. Dimensional Models leverage a star schema approach where Fact tables are placed at the center of subject specific Dimensional Models and Dimension tables join to each Fact table (see the sketch after the Fact and Dimension definitions below).
Fact
- Specialized data attribute in analytical systems that typically represents a quantity, measurement, or metric. Facts are typically numeric. Examples of Facts are sales amount or order quantity.
Dimension
- Specialized data attribute in analytical systems that provides descriptive context for Facts. Dimensions are typically strings or dates. An example of a Dimension is date or employee.
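To tie the Dimensional Model, Fact, and Dimension definitions together, here is a minimal star schema sketch in Python using SQLite. All table and column names are illustrative assumptions for a retail sales subject area, not prescribed by any methodology.

```python
import sqlite3

# Minimal star schema: a Fact table at the center, Dimension tables around it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, upc TEXT, product_name TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

-- The Fact table holds numeric measures plus one foreign key per Dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL,     -- Fact: a numeric measurement
    order_qty    INTEGER   -- Fact: a numeric measurement
);
""")
```

A typical end-user query joins the Fact table to whichever Dimensions it needs and aggregates the measures.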
Atomic Data
- Important Data Warehousing concept that describes the logical database storage structure designed to capture and retain data in its most granular form. Typically atomic data in a Data Warehouse mimics the original, pre-aggregated state of data. For example, atomic data in a sales subject area of a Data Warehouse would represent data at the sales transaction level.
Summary Data
- Important Data Warehousing concept that describes the logical database storage structure designed to capture and retain data in an aggregated form to enable faster end user query performance. In addition to faster performance, summary data in a Data Warehouse is created to give end users access to more meaningful information compared to atomic data. Summary data is built by aggregating and summarizing atomic data.
Data Vault Model
- Specialized database design methodology that is designed to capture and store historically accurate data from multiple operational sources. The goal of the Data Vault is to store "all the data, all the time". Data is preserved in its original form. There is no cleansing of data on entry. Therefore "wrong" data is deliberately stored in a Data Vault. Data is stored in one of three table types: Hub, Link, or Satellite. Though Data Vaults can play a role as an Analytical Data Warehouse, they typically act as Operational Data Warehouses. They are included in this blog entry because one may encounter this methodology in a "real world" Analytical Data Warehousing situation.
Hub
- Data Vault table type that stores natural business key data. Hubs were created based on the premise that natural business keys only change when the business physically changes. Therefore a best practice data model methodology should abstract descriptive elements from business keys and should place the business keys at the center of the data model.
Link
- Data Vault table type that models relations, associations and transactions between Hubs. Links could be considered an associative or many-to-many table in other data modeling methodologies. Links exist to track the historically accurate relationship between Hubs. The structure of a Data Vault comes from Hubs and Links.
Satellite
- Data Vault table type that contains the descriptive elements for Hubs and Links. Satellite tables store the meta-data that describes the business process that is modeled in the Hub/Link structure.
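For contrast with the star schema sketch above, here is a minimal sketch of the three Data Vault table types. Key generation, hashing, and load metadata are simplified, and all names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: natural business keys only, plus load metadata.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- surrogate/hash of the business key
    customer_id   TEXT NOT NULL,      -- natural business key
    load_date     TEXT,
    record_source TEXT
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    upc           TEXT NOT NULL,
    load_date     TEXT,
    record_source TEXT
);

-- Link: a relationship/transaction between Hubs (many-to-many).
CREATE TABLE link_purchase (
    purchase_hk   TEXT PRIMARY KEY,
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    product_hk    TEXT REFERENCES hub_product(product_hk),
    load_date     TEXT,
    record_source TEXT
);

-- Satellite: descriptive attributes, historized by load date; "wrong" data
-- is stored as-is, since a Data Vault does not cleanse on entry.
CREATE TABLE sat_customer (
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    load_date   TEXT,
    name        TEXT,
    address     TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")
```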
Survival
Tip: Learn all the Analytical Data Warehouse data methodologies and
apply the techniques that will work best in your organization. In the Data Warehousing market, an argument
has existed for many years over which methodology was supreme, Kimball’s
methodology built on business process aligned Dimensional Modeling or Inmon's Information Factory methodology. The argument should never have been about Kimball vs. Inmon or Dimensional Models vs. Information Factories. Rather, the argument should be about how much you should leverage from both methodologies to be successful. Here's a great article, though it's written by Kimball, on
the differences between the two methodologies:
http://www.informationweek.com/news/17800088
Please note that the terms Data Warehouse and Data Mart are
used interchangeably in the “real world”. Though these terms both represent
similar structures and concepts, Data Warehouses have a much broader scope of
data and end users compared to Data Marts. Data Warehouses typically contain
multiple business unit contexts of data to provide an enterprise wide view of
information. Data Marts typically
contain a single business unit or process view of information. The breadth of
data contained in the data structure, not the actual size in bytes of the
structure, is what drives the difference in classification between a Data Warehouse
and a Data Mart.
Role of an Analytical Data Warehouse
The role of an Analytical Data Warehouse or Analytical Data
Mart is to support decision making. The methodology used to model the Data
Warehouse does not change the ultimate end goal, that of decision making.
To enable decision making, data is sourced from multiple
operational systems, such as HR, CRM, Sales, and ERP systems. The operational data is
combined and transformed into a more business friendly format to support
decision making. An example of the transformation that occurs in an Analytical
Data Warehouse to turn raw data into information is taking reference codes,
such as UPCs, and transforming the codes into descriptive information, such as
a product name.
Both people and other software programs act as the decision
makers who leverage Analytical Data Warehouse information. The information in
the Analytical Data Warehouse is typically delivered through specialized
software tools that provide a friendly user interface, through the use of
consumer-level web technologies, data visualizations, and drag and drop
interfaces. These tools are typically referred to as Business Intelligence or
Decision Support tools.
To reiterate, the role of an Analytical Data Warehouse is to
collect and consolidate multiple operational sources of data, transform that data
into business-friendly information, and make the information available for
users or other software systems to consume. All this is done to enable
fact-based decision making, either by end users or automated systems.
Survival
Tip: Use Analytical Data Warehouses/Marts wisely. Their purpose is to
enable decision making. They should not be placed in a mission critical,
transactional, i.e. "operational", business process as a real-time data provider.
They should be used to present information to end users to enable effective, fact-based
decision making. Be wary of those that
want to “hit the Data Warehouse” as a part of an operational business process
and make it sound like it’s a good idea.
Architecture Reference
Detecting an Analytical Data Warehouse from an architectural
perspective is pretty straightforward. Look for a data persistence layer that has multiple sources of data flowing into it and aggregated information flowing out of it. The consumers of the persistence layer's information should also help you identify an Analytical Data Warehouse. If the consumers of the information are making fact-based decisions based on the information being served up from the persistence layer, then chances are you are looking at an Analytical Data
Warehouse.
The following diagram, introduced in the Analytical
and Operational Systems blog post, shows a typical Analytical Data
Warehousing environment.
Note the multiple sources of data, the centralized
persistence layer (Analytical Database), and the presentation layer that enables fact-based decision making.
The definitions section of this blog described two main
types of Analytical Data Warehouses: the Information Factory and the Data Bus
Architecture. It is important to be able to detect what Analytical Data
Warehousing environment you are working in as each has a unique way of
persisting, presenting, consuming, processing, and managing the data and
information that flows within the environment.
Data Bus Architecture
The Data Bus Architecture creates Data Warehouses by combining business process oriented Dimensional Model Data Marts. The Information Factory also leverages Dimensional Modeling in Data Marts. However, there is a big difference in how Dimensional Modeling is applied in Information Factories compared to how Dimensional Modeling is leveraged in a Data Bus Architecture.
In a Data Bus Architecture
environment the Data Marts are:
- Always aligned to business processes not business departments
- Always related through the use of a common set of conformed Dimensions
What makes the Data Bus
Architecture unique is that the Data Marts are coordinated around the conformed
dimensions so that a larger, enterprise view is derived from the individual,
business process aligned Data Marts. Conformed dimensions provide a mechanism
for an Enterprise Data Warehouse while eliminating the need to create a
specialized Enterprise Data Warehouse. In other words, data is stored in Data
Marts, and Data Marts are joined together logically to create a Data Warehouse.
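A hypothetical sketch of what "joined together logically" can look like in practice: two business process fact tables are each aggregated on their own, then combined through a conformed product dimension (a drill-across query). All table names and data are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product    (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales     (product_key INTEGER, sales_amount REAL);    -- sales process Data Mart
CREATE TABLE fact_inventory (product_key INTEGER, on_hand_qty INTEGER);  -- inventory process Data Mart
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0);
INSERT INTO fact_inventory VALUES (1, 40);
""")

# Drill-across: aggregate each Data Mart separately, then join the result
# sets on the conformed dimension to produce an enterprise-level view.
query = """
SELECT p.product_name, s.total_sales, i.total_on_hand
FROM dim_product p
JOIN (SELECT product_key, SUM(sales_amount) AS total_sales
      FROM fact_sales GROUP BY product_key) s USING (product_key)
JOIN (SELECT product_key, SUM(on_hand_qty) AS total_on_hand
      FROM fact_inventory GROUP BY product_key) i USING (product_key);
"""
print(conn.execute(query).fetchall())   # [('Widget', 150.0, 40)]
```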
Understanding business process
oriented Dimensional Modeling is a big key to understanding the Data Bus
Architecture.
The following image shows a
typical Dimensional Model. This image is sourced from The Kimball Group's website, from one of Ralph Kimball's original articles introducing the concept of the
Data Bus Architecture and Dimensional Modeling.
Note: The numbers in this
diagram refer to the steps to “gracefully modify” the model, i.e. extend it by
adding another dimension table to it.
Like any Dimensional Model, there
is a “star” like structure with a fact table placed in the center of the star
and dimension tables placed around it. The fact table is focused on a singular
business process. This Dimensional Model looks like a sales transaction
Dimensional Model.
The example above is a single
subject area view of a dimensional model. What is missing from this view are additional fact tables and conformed dimensions.
The Data Bus Architecture
typically leverages an Atomic and a Summary Dimensional Model layer where
atomic level data, i.e. data at the lowest grain like a sales transaction, is
stored in the Atomic layer of the Dimensional Model and aggregated views of
this atomic data are precalculated and stored in summary fact tables in the
Summary layer to help with query performance. The following diagram shows how
data flows into a typical Dimensional Model based Data Warehousing environment.
The source/staging area in this
model is a landing area used as an intake mechanism for data to flow into the Dimensional
Model. We will discuss specifics around ETL/Integration in a separate blog
entry.
As discussed, the Atomic layer
contains data at its lowest level, i.e. the lowest grain. This layer is
typically the easiest to produce from an ETL perspective as the data in this
layer aligns very closely with the source data coming into the Data Warehouse.
The value of this layer is that it is very flexible since it is easy to
summarize data up from the lowest level.
It is much more difficult to “back into” atomic data from summarized
data.
Survival
Tip: If you are going to use a Data Bus Architecture approach for a Data
Mart or Data Warehouse, create an Atomic layer. Learn the value of this layer
and relentlessly promote it to your team. Take every opportunity to promote the
usage of this layer to the end users/tools. Make the Atomic layer the “go to”
source for data for your Data Mart or Data Warehouse. Remember it is easier to
summarize data “up” from the lowest level vs. trying to add “lower level” data
to an existing environment.
As discussed, the Summary layer is
a logical abstraction/layer. The physical representation of this layer is sometimes
persisted in tables and sometimes it exists logically in views. The purpose of
this layer is to enhance the end user experience, either by improving query
response time or by presenting data in a more user friendly manner compared to
the Atomic layer. Data is summarized into a higher grain from the Atomic
layer. For example, if data in the Atomic
layer is captured at a time grain of day, data may be summarized in the Summary
level to a month grain. In this example,
the advantage of creating a Summary layer is that end user tools won’t have to create
a “complex” query to summarize the day grain data to return results at the
month level. Instead the end user tools can pull data directly as it exists in
the summarized table, at the month level.
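Continuing that day-to-month example, here is a minimal sketch with illustrative names: the Summary table is pre-computed once during the load so end user tools can read month-grain rows directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales_daily (date_key INTEGER, product_key INTEGER, sales_amount REAL);

-- Summary layer: aggregate the Atomic (day-grain) data up to month grain
-- during the load, trading ETL time for faster end-user queries.
CREATE TABLE fact_sales_monthly AS
SELECT d.year, d.month, f.product_key, SUM(f.sales_amount) AS sales_amount
FROM fact_sales_daily f
JOIN dim_date d USING (date_key)
GROUP BY d.year, d.month, f.product_key;
""")
```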
There is a tradeoff with placing
a Summary layer into your Data Warehousing environment. The tradeoff has to do
with processing time. Essentially you
are shifting the processing time it takes to run end-user queries to derive a
summarized data result from the end user tool and pushing it back into the data
flow, i.e. ETL, process. The tradeoff can be calculated, which enables a logical decision-making process when determining whether to create a Summary layer. The decision can be made by comparing the end users' experience/processing
time with the amount of processing time/effort it will take to load a summary
layer table as part of an ETL routine.
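A back-of-the-envelope version of that calculation, with purely assumed numbers:

```python
# Rough break-even sketch for a Summary layer: compare total end-user time
# saved per day against the extra ETL time needed to load the summary table.
queries_per_day        = 500      # assumed daily queries on this subject area
secs_saved_per_query   = 8        # assumed speed-up from reading the summary table
extra_etl_secs_per_day = 1200     # assumed added ETL time to build the summary

user_secs_saved = queries_per_day * secs_saved_per_query   # 4000 seconds/day
if user_secs_saved > extra_etl_secs_per_day:
    print("Summary layer pays for itself:",
          user_secs_saved - extra_etl_secs_per_day, "seconds/day")
else:
    print("Summary layer costs more than it saves")
```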
The orange blocks in the
Dimensional Model layers represent Conformed Dimensions. Note how they are
shared among the different facts in both the atomic layer and the summary
layer. There are only 3 conformed dimensions in this example and they are
logically shared between the atomic layer and the summary layer.
Survival
Tip: Be conscious of how many persistence points, i.e. physical data
storage points, exist in your Data Warehousing environment. Each physical
persistence step in a Data Warehouse requires time, management, code, etc to
move the data from one spot to another. Be paranoid about persisting data and
always understand why someone is recommending persisting data as a step of data
flow. If a persistence step provides no business value, then eliminate it.
The Data Bus Architecture
organizes Dimensional Models around unique business processes. Data is presented to end users across the
business process oriented Data Marts. A special planning and management tool
called a Data Bus Matrix is used to understand what Data Marts/Fact tables
each department within an organization will use as part of their Analytical
decision making toolset.
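As a toy illustration, a Data Bus Matrix can be thought of as a mapping from business process fact tables to the conformed dimensions they share. Every name here is an assumption for illustration only.

```python
# Rows are business processes (fact tables); the values are the conformed
# dimensions each process uses. Real bus matrices are richer than this.
bus_matrix = {
    "sales_transactions":  {"date", "product", "store", "customer"},
    "inventory_snapshots": {"date", "product", "store"},
    "purchase_orders":     {"date", "product", "supplier"},
}

# Which conformed dimensions let us drill across sales and inventory?
shared = bus_matrix["sales_transactions"] & bus_matrix["inventory_snapshots"]
print(sorted(shared))   # ['date', 'product', 'store']
```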
There are several excellent
books, web sites, articles, etc written about the Data Bus Architecture. This
architecture reference/overview only presents some of the high points. Please
research this architectural methodology if you think it will help your
organization meet its goals.
Information Factory Model
From an architecture perspective, Information Factories are larger than just database models. They are all encompassing Information Management ecosystems. Dimensional Modeling techniques are used in the Data Mart layer of Information Factories, but the main persistence layer of the Information Factory is the Enterprise Data Warehouse (EDW).
The following diagram is taken from Inmon's website (http://www.inmoncif.com/library/cif/) and
depicts a Corporate Information Factory (CIF) in a web site environment. The
diagram is a good look at the CIF as you can see several identifying
characteristics of a CIF: the ODS, the EDW, and the business area aligned Data Marts. Take a look at the diagram and see if you can spot any of your
organization’s components within it.
The next diagram of an Information Factory is much more
conceptual and shows the major components more clearly. This diagram is sourced
from a very good Information Management article written by Claudia Imhoff: http://www.information-management.com/issues/19991201/1667-1.html
The article provides a good overview of the components
within the Information Factory, including the ODS, Data Warehouse, Data Marts,
Integration Layer, Metadata Management layer, etc.
There are two surefire ways to identify whether you are working within an Information Factory environment:
- The Enterprise Data Warehouse is in 3rd normal form (3NF)
- Go here for the definition of 3rd normal form: http://en.wikipedia.org/wiki/Third_normal_form
- Data Marts are departmentally aligned
- As opposed to business process aligned
If either of these statements is true, then you
are working within an Information Factory.
Survival
Tip: Introducing an Information Factory into an enterprise requires a
lot of deliberate and strong executive vision and support. The fundamental
changes of Information Management that are introduced in an Information Factory
will cause friction within organizations. The only way to successfully navigate
the cultural changes within organizations is through strong executive
leadership.
Design Concepts Applied
Now that we’ve helped to classify the two main categories of Analytical Data Warehouses we will walk through the design concepts associated with them. These design concept questions are provided to help you when reviewing prebuilt systems or when designing new systems.Go here for an overview of the definition of the different design concepts
Design Concept: End User Access Points
Application: Most Analytical Data Warehouse systems have tools sitting in between end users and the system. There is very little direct end user access to the data held within the Data Warehouse. However, we will provide a common set of questions to help determine how end users should and will access an Analytical Data Warehouse.
Common Design Considerations:
- What data, tables, schemas, etc will end users access directly?
- What tools will the end users use to directly access the Data Warehouse system?
- What type of queries will the end users run against the Data Warehouse?
Survival
Tip: Use caution when granting direct access to an Analytical Data
Warehouse. It’s true that the Data Warehouse exists to provide end users access
to information, but direct access to the data contained in the Data Warehouse
poses all sorts of risks. A few examples of these risks are: negative
performance impacts from expensive queries, data security access risks, data
integrity access risks. Be sure you are very deliberate about who, how, when,
and why you grant direct access to an Analytical Data Warehouse for an end
user.
Design Concept: Integration Points
Application: Integration Points for Analytical Data Warehouses fall into two main categories, what data is coming into the Data Warehouse and what data is flowing out of the Data Warehouse. A well-built Data Warehouse will have a good understanding of the data flowing into and out of it.
Common Design
Considerations:
- What data is flowing into the Data Warehouse?
- Is the data flowing into the Data Warehouse structured, semi-structured, or unstructured?
- How much data is flowing into and out of the Data Warehouse?
- Has a network capacity study been completed to ensure adequate capacity?
- Have SLAs been put in place to provide clarity on information exchange requirements?
Survival
Tip: Beware of the Data Warehousing Trap. The Data Warehousing Trap
occurs when the overwhelming majority of attention is myopically focused on
getting information into a Data Warehouse. At the same time, very little
attention is focused on getting information out of the Data Warehouse. The
purpose of a Data Warehouse is to combine data and share it with end users. Sharing data means making meaningful information easily accessible from a Data Warehouse. There seems to be a correlation between the maturity of a Data Warehouse and the likelihood that a Data Warehouse Trap gets created. Remember,
information flowing out of the Data Warehouse requires just as much attention
as the information flowing into a Data Warehouse.
Design Concept: Information Persistence
Application: Information Persistence in a Data Warehousing environment refers to when, where, and how data is physically stored, i.e. persisted. Data is usually persisted on disk but could also be persisted in memory, tape drives, etc. We will discuss ETL persistence in a later blog.
Big picture, the
best way to persist data in a Data Warehouse is in a format that optimizes two
things:
- Writing large amounts of data in large singular transactions or “batches”
- Reading large amounts of data in aggregating queries that return higher level information
A lot of
attention should be given to how data is physically stored in a Data Warehouse.
This is true no matter the technology stack that you are using.
Common Design
Considerations:
- What physical storage technologies will be used for the Data Warehouse?
- What design techniques will be used to ensure fast query response time?
- Will summary data be created and persisted in the Data Warehouse?
- How many times is a piece of data persisted in the Data Warehouse, either on disk or in memory?
- When will data be purged from the Data Warehouse?
Design Concept: Business Logic
Application: Business Logic in Data Warehousing is represented in the data model. The ETL process that transforms and populates the data model is where the business logic is implemented, but the data model itself reflects the rules, taxonomies, and end state of business logic.
Dimension
hierarchies are a great example of a representation of business logic in a Data
Warehouse. A hierarchy in a financial dimension, like a chart of accounts
dimension, will represent the physical structures, names, and aggregation logic
associated with the accounting rules used by finance to manage the General
Ledger or other finance structures.
Table granularity
is another example of a representation of business logic in a Data Warehouse.
If a business rule states that a POS transaction must capture sales tax per
receipt item, then the grain of the POS table, Dimensional or 3NF, should
represent this by storing data at a receipt line level including a tax metric
attribute.
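A minimal sketch of that rule expressed as grain, with illustrative names: one row per receipt line item, carrying the required tax measure.

```python
import sqlite3

# Grain follows the business rule: one row per receipt line item,
# including the sales-tax measure the rule requires.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_pos_receipt_line (
    receipt_id    TEXT,
    line_number   INTEGER,
    product_key   INTEGER,
    sales_amount  REAL,
    sales_tax     REAL,     -- tax captured per receipt item, as the rule states
    PRIMARY KEY (receipt_id, line_number)
)
""")
```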
Another area of
business logic that will affect the Data Warehouse is the aggregation rules
that govern metrics. I wrote a blog about aggregation logic here: http://jalsolutions.blogspot.com/2011/08/bit-of-clear-headed-thinking-grain-and.html
Having a clear
understanding of the business logic used to manage and operate a business is
imperative to a good Data Warehouse design. Take the time and work with
business analysts and business architects to ensure you clearly understand your
organization’s business logic.
Common Design
Considerations:
- How are your organization’s key business process and logic represented in the data model(s) of the Data Warehouse?
- What are the key business hierarchies that should be modeled in the Data Warehouse?
- What is the fiscal calendar structure for the organization?
- What is the lowest level of granularity needed to calculate metrics?
- What metrics are classified as purely additive, semi-additive, or non-additive?
Survival
Tip: Finance and Operational business logic determines grain in Data
Warehousing. Be sure to understand how a company calculates the key metrics it
uses to measure growth and calculate compensation. Take these calculations and
use them as inputs to your data modeling. If you are working in a publicly traded
company you will be able to find copies of your company’s financial earnings
statement reports. These are filled with business logic definitions and
descriptions that can help you understand how data should be modeled in the
Data Warehouse. Typically these reports
explain how key metrics, such as sales, products produced, or customer adds, are
calculated.
Design Concept: Security
Application: Security in a Data Warehouse deals with defending the data stored in the Data Warehouse. It can be classified into four categories:
- Authentication
- How will the identity of an end user be determined and passed into the Data Warehouse?
- Authentication typically requires expertise outside of the Data Warehousing “space”. Most of the time you will work with domain or network administrators to determine how Authentication will be constructed.
- All other security topics are reliant on successful Authentication.
- Access
- Who is allowed to access the contents of the Data Warehouse?
- An end user should not be able to access a Data Warehouse without being explicitly enabled to do so.
- Access determines how desired users will access the Data Warehouse. It also determines how undesired users will be prevented from accessing the Data Warehouse.
- This concept is the “first line of defense” for a Data Warehouse.
- Authorization
- What contents are end users allowed to view?
- An end user should not be able to view data unless they have been explicitly allowed to do so. Authorization determines what data end users can view.
- It also determines what data end users are not allowed to view.
- This concept is the “second line of defense” for a Data Warehouse.
- Encryption
- How will sensitive data be securely persisted in the Data Warehouse?
- How will sensitive data be securely transmitted in a result set from the Data Warehouse?
- Encryption is a key defense mechanism of the data itself. This concept is independent of the other three topics (the 3 A's of Security). In fact, good Encryption design should assume that all three have failed.
- This concept is the “last line of defense” for a Data Warehouse.
Security should
be explicitly implemented in a Data Warehouse. This means that no one,
including administrative users, should be able to access or view data without being
granted explicit permissions to do so.
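A toy sketch of this "explicit permissions only", deny-by-default posture; the grant store and names are purely illustrative, and a real Data Warehouse would enforce this in the database platform's own security layer.

```python
# Deny-by-default authorization sketch: no user, including administrators,
# sees data without an explicit grant. Grants here are illustrative.
GRANTS = {
    "analyst_a": {"fact_sales", "dim_product"},
    "dba_admin": {"dim_date"},   # even admins get only what is granted
}

def can_read(user: str, table: str) -> bool:
    """Return True only when the user holds an explicit grant on the table."""
    return table in GRANTS.get(user, set())

assert can_read("analyst_a", "fact_sales")
assert not can_read("dba_admin", "fact_sales")   # no grant, no access
```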
Common Design
Considerations:
- What are the security policies governing the data in the Data Warehouse? How compliant is the Data Warehouse to these policies?
- How will end user access be managed: via groups, individually, application roles, etc?
- What level of access will administrative and developer resources have in the Data Warehouse?
- How will end user authorization be managed: via roles, individual grants, etc?
- What level of data authorization will administrative and developer resources have in the Data Warehouse?
- What are the sensitive data elements maintained in the Data Warehouse?
- How is sensitive data managed and stored in the Data Warehouse? How is sensitive data exposed to end users/applications?
Survival
Tip: Traditionally most Data Warehouses have resided behind corporate
firewalls. The effect of this is that a lot of Data Warehouse managers are complacent
with Security concerns, thinking that the Data Warehouse is secured because
it’s behind the firewall. To put it plainly, security for Data Warehousing
professionals is often an afterthought. As the responsibilities of Data
Warehouses broaden outside of the corporate firewalls, the need to leverage
security experts to audit and document security vulnerabilities of the Data
Warehouse is increasing. It is recommended that you hire or contract a security
expert to assess your Data Warehouse's security vulnerabilities before you
experience a security breach. Your chances of survival in the information world
will increase drastically if you can proactively resolve a security issue
before an undesired user exposes it.
Design Concept: Disaster and Recovery
Application: Fortunately, disaster recovery in Data Warehousing is a time-tested concept with lots of proven solutions and techniques available for reference. From a high level, disaster recovery in Data Warehousing is focused on the ability to quickly restore the structure and content of the Data Warehouse to any point in time.
Forming a
disaster recovery strategy early in the Data Warehouse's lifecycle is recommended. So is a deliberate and periodic audit of the strategy to ensure it is still a feasible plan in case of necessity. The main elements of a Data Warehouse's
disaster recovery plan are:
- Disaster Scenario Plan
- Use cases of common disasters to use as inputs for the recovery plan
- Service Level Agreement Requirements
- Stated thresholds for down times, recovery times, cost impact, etc to use as inputs for the recovery plan
- Recovery Plan
- Single document outlining who, what, where, when, and how recovery should be executed during a disaster
- Some Key Components Include
- Database Backup Strategy
- what, when, and how much data will be backed-up
- Backup Persistence Specifications
- where and how long data will be stored for retrieval
- moving backups offsite is recommended
- Emergency Contact List
- list of key resources with contact information by role to be leveraged during recovery
- Logistics Plan
- details of locations, conference lines, etc to be leveraged in case of disaster
- etc
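As one small, concrete fragment of the Database Backup Strategy component above, here is a minimal sketch using SQLite's online backup API. The file names are illustrative, and a production warehouse would use its own platform's backup tooling.

```python
import sqlite3

# Copy the live database to a backup file while it remains online;
# the Backup Persistence Specifications decide where this copy then lives
# (ideally offsite, per the recovery plan).
source = sqlite3.connect("warehouse.db")          # illustrative file name
target = sqlite3.connect("warehouse_backup.db")   # illustrative backup target
source.backup(target)
source.close()
target.close()
```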
Note, ETL
disaster recovery will be covered during the ETL portion of this blog.
Common Design
Considerations:
- What are the recovery and retention policies governing the data in the Data Warehouse? How compliant is the Data Warehouse to these policies?
- What are the SLA requirements for recovery of the Data Warehouse? How prepared is the Data Warehouse to meet the SLA requirements for recovery?
- What tools will be used to store backup data? Will the data be stored offsite in case of a catastrophic event?
- Who are the key resources that are required for the execution of the recovery plan?
- How often is the recovery plan reviewed?
- How will all data elements be restored in case of a disaster?
Survival
Tip: Similar to Data Warehouse security, disaster recovery is a key
element in Data Warehousing that doesn’t receive the attention required. Too
many Data Warehouse systems consider themselves to be downstream systems that can therefore "rebuild" themselves from the source system in case of a
catastrophic event. It’s true that most Data Warehouses leverage database
backups for some of their data, but it's rare that all the required data is
recoverable leveraging the database backups. Just like Data Warehouse security,
leverage a disaster recovery expert and assess the vulnerabilities of your Data
Warehouse before you need to recover from a catastrophic event. Your chances of
survival in the information world will increase drastically if you can
proactively resolve a recovery issue before a disaster strikes.
Survival
Tip: At the risk of embarrassing myself, I want to highlight how
unpredictably disasters can strike. When I was a Greenhorn in this field, I was
asked by a large and mature client to install a reporting portal tool in the
Production Data Warehouse environment. I had done this a few times before in
non-Production environments and so we all felt this was a low risk task.
Unfortunately, during the installation I selected the Production Data Warehouse schema as the installation location and accidentally dropped and recreated the schema, causing all data, sessions, etc to be instantaneously lost.
Fortunately, the client had very good recovery capabilities and the Production
Data Warehouse was restored. The incident left an unmistakable impression on
me. The thought that you never know when a disaster will strike or how it will
occur has been ingrained in me ever since this incident. It’s true that the
client left themselves vulnerable by granting me the permissions to drop the
entire schema, but this incident goes to show that disaster and recovery
planning, with periodic audit points, is an absolute necessity.
Design Concept: Transparency
Application: Transparency in Data Warehousing is typically centered on the ETL process that loads data into the Data Warehouse. We will discuss ETL transparency in a later blog post.
For this post,
Transparency refers to understanding how, what, and when data is accessed from
the Data Warehouse. It is recommended that you create transparency into your
Data Warehouse by ensuring that administrative users and the management team can
see:
- Usage patterns of data elements
- Historical trends of usage
- Historical trends of performance statistics, such as query response time, etc
- Ranked lists of users (or applications) by usage
- Ranked lists of queries by performance
- Alerting on long running, dangerous, or out of control queries
- Storage availability
- List of performance bottlenecks in the Data Warehouse
- etc
Transparency for
Data Warehousing should provide insight into the operations of the Data
Warehouse and should help the owners and operators of the Data Warehouse
proactively detect, analyze, and resolve potential operations issues.
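A toy sketch of the kind of reporting this implies, over an assumed in-memory query log; a real Data Warehouse would pull these statistics from its platform's system tables or query logs.

```python
from collections import Counter

# Assumed query log of (user, seconds) records, for illustration only.
query_log = [
    ("analyst_a", 2.1), ("analyst_b", 45.0), ("analyst_a", 1.4),
    ("etl_service", 310.0), ("analyst_b", 52.3),
]

usage = Counter(user for user, _ in query_log)
print("Queries by user:", usage.most_common())        # ranked usage list

slowest = sorted(query_log, key=lambda r: r[1], reverse=True)[:3]
print("Slowest queries:", slowest)                    # ranked performance list

LONG_RUNNING_SECS = 60                                # assumed alert threshold
alerts = [r for r in query_log if r[1] > LONG_RUNNING_SECS]
print("Alerts:", alerts)                              # long-running query alerts
```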
Common Design
Considerations:
- What metrics should be used to manage the operations of the Data Warehouse?
- Is there visibility into the actual values for stated SLAs?
- How can the operations and management teams detect issues before they occur?
- Can the Data Warehouse team proactively determine the potential effects of adding new/planned functionality requirements to the Data Warehouse?
Survival
Tip: Building transparency into any technology
system is a best practice. This is also true for Data Warehousing. Be sure to
include money and resources in the planning phases of building the Data Warehouse
to provide Transparency. Not building adequate transparency into your Data
Warehouse will cause a lot of pain as the Data Warehouse grows in terms of
users and complexity of requests.
Design Concept: Capacity and Scalability
Application: If Transparency has been included in the Data Warehouse design, then chances are that you are ahead of the game with regards to Capacity and Scalability. In a Data Warehouse, Capacity and Scalability refer to ensuring there are enough machine resources (CPU, RAM, Storage, Network Bandwidth) to satisfy the user requests to the Data Warehouse.
Be sure to focus
deliberate attention on the movement of data within and out of the Data
Warehouse, commonly referred to as I/O. Also, be sure to focus deliberate
attention on the persistence layer used to store data, logs, temporary files,
internal database files, etc. Historically the biggest Capacity and Scalability
issues in a Data Warehouse have been disk space and I/O.
The Capacity and
Scalability efforts for Data Warehouse environments should focus on the
following, in this order:
- Disk Space
- Amount of physical storage available to the Data Warehouse.
- Incorporate backup requirements and other offline functional disk space requirements in the disk space capacity plan.
- Disk I/O Capacity
- Data moving to the physical storage layer of the Data Warehouse
- Data Warehouse design will determine most of the I/O capacity issues
- Network I/O
- Data moving across the network to answer end user/application queries
- Memory (RAM)
- Memory is used to “cache” data in a faster persistence layer compared to disk.
- Some Data Warehouses only use memory for persistence
- CPU
- Amount of processing capabilities to analyze queries, execute data retrieval commands, aggregate data, execute result commands, etc
It is recommended
that every Data Warehouse has a Scalability plan that explicitly states
performance thresholds for each of the categories stated above. The Scalability
plan should describe the actions that should be taken when the thresholds are
reached over a sustained period of time.
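A minimal sketch of what such a Scalability plan can look like once made explicit; every threshold, reading, and action here is an assumed example.

```python
# Each resource gets an explicit utilization threshold and a stated action;
# sustained breaches should trigger the action in the Scalability plan.
THRESHOLDS = {
    "disk_space": (0.80, "order additional storage"),
    "disk_io":    (0.75, "review partitioning and summary tables"),
    "network_io": (0.70, "review result-set sizes and SLAs"),
    "memory":     (0.85, "expand cache or add nodes"),
    "cpu":        (0.90, "add processing capacity"),
}

current = {"disk_space": 0.86, "disk_io": 0.60,
           "network_io": 0.40, "memory": 0.88, "cpu": 0.55}   # assumed readings

for resource, (limit, action) in THRESHOLDS.items():
    if current[resource] >= limit:
        print(f"{resource} at {current[resource]:.0%} >= {limit:.0%}: {action}")
```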
Common Design
Considerations:
- Is there enough capacity to support the load placed onto the movement of data to the physical storage layer?
- Is the network able to handle the amount of data flowing over it when users access data from the Data Warehouse?
- Is there enough disk space to handle the 5 year projected growth of the Data Warehouse?
- What are the average, minimum, and peak usage patterns of the Data Warehouse?
Design Concept: Performance
Application: If you’ve followed the guidance in the Capacity and Scalability section as well as the Transparency section of this blog, then performance should not be an issue for your Data Warehouse. As mentioned in the Capacity and Scalability section above, there are two main areas where performance can suffer in a Data Warehouse environment:
- Performance of loading data into the Data Warehouse
- We will cover this topic in depth during the ETL blog.
- Performance of retrieving data from the Data Warehouse
Oddly enough, the
strategies to mitigate these two main types of performance issues are very
similar. The first step in ensuring performance
in a Data Warehouse is to ensure you have visibility into the actual performance
of the Data Warehouse, i.e. Transparency. The next step is to create a Capacity
and Scalability plan by analyzing the statistics captured from the Transparency
initiative. Performance problems will exist in your Data Warehouse. Having
visibility and a deliberate process/plan to mitigate Performance issues is a
necessity for Data Warehousing.
In addition to
the Capacity and Scalability plan, here are some practical performance tips and
tricks for Data Warehousing:
- Ensure good design
- The best hardware in the world cannot improve performance if the fundamental design of a Data Warehouse is flawed.
- Refer to the tips mentioned in the Integration and Information Persistence sections of this blog for tips on proper Data Warehouse design.
- A good design should encompass:
- Minimal data movement
- Excellent Indexing
- Proper partitioning and other database level performance enhancing techniques
- Hire a great Tool Administrator
- Hire an expert who knows the most intimate details of the tool that your Data Warehouse is using for the physical storage area.
- Enforce defensive policies
- Be smart with your user policies and ensure that you have controls to “kill” harmful queries.
- Choose the right tool for the job
- Be sure to align the Data Warehouse tool suite to the use case. A "real-time" Data Warehouse shouldn't rely on a tool built for offline processing.
- Set expectations
- Be sure to properly set expectations through SLAs and other means so end users do not perceive a performance issue if one doesn’t exist.
Common Design
Considerations:
- What are the 10 highest performance risks to the Data Warehouse?
- What are the 10 worst performance bottlenecks in the Data Warehouse?
- How can the Data Warehouse “protect itself” from potentially harmful end user processes?
- What are the performance SLA terms for the Data Warehouse?
Survival
Tip: Outside of Trust in the data, one of the most common reasons that
Data Warehouses are labeled as failures is end user performance. All Data Warehouses are conceived with good performance
in mind. However, somewhere along the
implementation, tactical decisions are made, timelines are squeezed, and the
original performance strategies/assumptions are compromised. Ensuring good performance in Data Warehousing
requires the full team to drive towards this goal. Every decision during implementation should
include the question: What will this do
to performance?
Design Concept: Maintenance, Bugs, and Upgrades
Application: One of the nicest things about Analytical Data Warehouses is that they are designed and built around batch-oriented, latency-tolerant data processing techniques. End users of Data Warehouse systems expect large windows of "refresh" time when the Data Warehouse is either unavailable or unreliable. The inherent (and, if you're doing your job, explicitly stated) daily load windows provide excellent opportunities to perform maintenance on the system.
Be sure though
that you have an SLA in place with an explicitly stated strategy on handling:
- Routine Maintenance
- Stated allotment of time dedicated to weekly or monthly Data Warehouse maintenance activities
- A few examples of routine maintenance activities are: Backups, Purge Processes, Index Maintenance, etc
- Large Software Upgrades
- Stated time based strategy explaining when major “dot” releases will be available from the Data Warehouse “stack” vendors.
- This part of the SLA should set expectations with Data Warehouse end users, owners, and operators as to when major releases will be available from the software vendors.
- The purpose of setting expectations is to ensure deliberate planning occurs to minimize the impact of an upgrade on the business.
- A Bug/Defect Resolution Strategy
- Stated strategy with “turn-around” time expectations based on the severity of bugs.
Common Design Considerations:
- What are the business expectations for “up time” for the Data Warehouse? What is the business impact if these expectations cannot be satisfied?
- What is the required time to fix an issue by issue severity?
- What are the daily, weekly, monthly, and annual Data Warehouse maintenance tasks? How much time is required for each task? Who are the resources involved in performing each task?
Development and Implementation Concepts Applied
There are countless articles, books, blogs, white papers, etc on best practices for implementing an Analytical Data Warehouse. Most processes are implementation approaches based on either the Information Factory or Data Bus approaches. This is why it is important for Information Managers and Architects to understand these two fundamental Data Warehousing methodologies and to explicitly identify the methods from each approach to be used in the Data Warehousing ecosystem.
There are a couple of key items that must be deliberately
managed and addressed to help ensure the success of a Data Warehouse project. These
items include:
- Starting with a Project Charter
- This may seem like Project Management 101, but in a Data Warehouse, clarity of vision and ownership is key to ensuring success.
- A lack of vision, something I like to call the purpose vacuum, will cause immense harm to the tactical, day-to-day implementation process of implementing a Data Warehouse.
- The vision shouldn’t be too detailed at this point, but it should be clear and include:
- The goals of the Data Warehouse
- Owner and roles for the team
- Inception level statements of requirements prioritized by business value impact
- Conceptual Diagram of Data Contents
- The risks of not starting with a clear vision are inflated costs, missed expectations, and the most detrimental issue in Data Warehousing, a lack of TRUST in the end result.
- Choosing a Flexible Development Methodology and Framework
- Data Warehousing projects are expensive due to the amount of effort it takes to secure, analyze, import, transform, and load data into the Data Warehouse.
- The effort required to meet a request from end users and system owners typically causes shock to the end users and system owners, which damages the reputation and value of the Data Warehouse.
- Selecting a development methodology and framework purposely built to support rapid deployment of changes/new functionality specifically for Data Warehouses will greatly increase the chances of a successful Data Warehouse.
- Several methodologies exist for incorporating agility, not necessarily Agile, into the Data Warehouse process. Google "Data Warehousing Agility" for more information on this topic.
- Several tools/frameworks exist that help Data Warehouse operators react to change. Wherescape is a very interesting tool concept to help alleviate some of the pain that comes with managing a Data Warehousing environment in a fluctuating business environment. Note, I have not seen any of Wherescape's products used in a production deployment as yet. http://en.wikipedia.org/wiki/Wherescape
- Rapid, Iterative, and Continuously-Communicated Requirements Gathering Process
- The purpose of Requirements gathering in a Data Warehousing initiative is to ensure that the right data is captured at the right level, i.e. grain, and presented to end users at the right time.
- To optimize the cost of development for Data Warehousing, these three questions need to be answered as efficiently as possible.
- Note that the answers to these questions are going to change, always and rapidly.
- Following a traditional IT oriented process where requirements are gathered, turned into massively large Technical Specifications and then handed over to a development team is not the most efficient method to use when capturing Data Warehousing requirements.
- Requirements gathering for a Data Warehousing process should employ iterative techniques, visualizations, and when appropriate use cases to rapidly capture requirements and provide feedback to end users. It’s important to leverage a requirements process that allows for continual feedback.
- It is also very important that the Data Warehouse deployment/implementation/development methodology can empower an iterative, continuous feedback loop approach to requirements gathering.
- Start Managing the Content Early
- Data/Information Management is not a sexy topic but it is incredibly valuable.
- Organizations that have a strong Information Management organization in place are more likely to produce successful Data Warehouses. This is because Information Management artifacts can be leveraged as solid inputs into the Data Warehouse.
- Examples of Information Management artifacts that positively impact Data Warehousing are:
- Conceptual Diagrams
- Well Defined Stewardship Community
- Reference and Master Data Management Capabilities
- Information Architecture Standards
- Data Dictionary (Business Semantics and Technical Standards)
- Know the Data
- Understanding the data is key to building a successful Data Warehouse
- Prove assumptions empirically through Data Profiling prior to placing items into Production.
- Prove the Data Model will satisfy the business needs as well as what is dictated by the incoming data sources.
Survival
Tip: One of the oldest “tricks of the trade” for Data Warehousing is
to always have high level executive buy in before beginning a Data Warehouse
initiative. This is as true today as it was in the 90s when Data Warehousing
first gained wide adoption. If you are asked to participate in a Data
Warehousing initiative that is lacking a good, capable executive sponsor,
then be prepared for a long and bumpy ride that will probably end in failure.
Oh, the same advice holds true if there is no quantified business value
associated with the Data Warehousing initiative.
Opportunities for Improvement and Future Direction
The goal of Analytical Data Warehouses and Data Marts is to
enhance organizational decision making by combining data from disparate
Operational systems and transforming that data into information. The value of
Data Warehouses has been immense and correlates to the benefits of better
organizational decision making. However, the cost of delivering Data Warehouse
functionality has remained very high for years. Also, Data Warehousing systems
have been notoriously rigid in terms of the ability to react to changes in
business requirements. Both the high cost and rigidity of Data Warehouses have
eroded some of the positive financial gains that were promised from these
systems. In addition to these factors, the sheer volume of data, combined with the complexity of statistically oriented data requirements, is exposing some fundamental scalability flaws in traditional Analytical Data Warehouse methodologies.
The Analytical Data Warehouse market has identified several opportunities for improvement that are targeted at alleviating the challenges discussed above. These opportunities for improvement include:
- Reduce the reliance on specialized hardware and software (RDBMS) to minimize the cost associated with Analytical Data Warehouses.
- Enhance rapid scalability in terms of both data storage and end user processing.
- Eliminate the need to physically extract data from Operational Systems and move it into Analytical Systems.
- As discussed in this blog, this data extraction is done to shift the processing burden of analytical queries away from Operational Systems to specialized Analytical Data Warehouses with the goal of protecting the performance capabilities of the Operational Systems.
Technical innovations tagged as “Big Data”, “NoSQL”, etc are
emerging as “battle tested” solutions that are targeted at delivering on these
opportunities for improvement. In fact some newer technologies are promising to
enable hosting both large scale Operational systems and Analytical systems on a
single technical stack.
These technical innovations have introduced a high potential
for the Data Warehousing market. It is very
likely that a business problem that currently requires an Analytical Data
Warehouse solution will, in the near future, not require an Analytical Data
Warehouse as part of the solution. Think of the potential of being able to
provide the rewards of Analytical Data Warehousing without having to “deal”
with the pain of Analytical Data Warehousing. That is, not having to create
specialized data models, manage specialized infrastructures, code challenging
ETL processes, etc while enabling the same fact based decision making
capabilities.
The opportunities for improvement in Data Warehousing are
huge and are becoming more obtainable every day.
Survival
Tip: To increase your chances of survival in the information world,
pay attention to the Big Data movement and start playing with some of the new
technologies, especially Hadoop. The effort it will take to “catch-up” to
technology in the coming years is going to increase as technology advances and diverges from the technologies of today. It won't take you more
than 4 – 8 hours a month to read or play with a new technology. Ideally, try to get yourself involved in
these new technologies today within your existing role either through a POC or
other prototyping effort.
Summary
In this post we took a look at the
major approaches, methodologies, architectures, design considerations,
development approaches, and opportunities for improvement of Analytical Data
Warehouses. The purpose of this blog post is to provide deep guidance and
references for the professionals who work with or consume information from
Analytical Data Warehouses and Data Marts.
The next blog post in this series will
be much shorter and will explore the Operational equivalent of an Analytical
Data Warehouse, the Operational Data Store.