Thursday, May 31, 2012

Analytical Data Warehouses and Data Marts


Purpose

Provide guidance to Information Managers and Architects on design concepts, development and implementation approaches, opportunities for improvement, and future direction of Analytical Data Warehouses and Data Marts.

Background and History

The most fundamental Analytical system may well be the Analytical Data Warehouse. No other informational system is designed for such a singular analytical purpose. Before we provide a definition of a Data Warehouse, let's first look at how Data Warehouses came into existence. Knowing the history of a system methodology is a great way to put its definition into context.

There is general consensus that Data Warehousing first appeared in the 1980s. The components of Data Warehousing, however, were being created as early as the 1960s. Many different individuals are credited with inventing Data Warehousing. Wikipedia has a great chronological timeline that provides a lot of context for the evolution of data warehousing methodologies.

We have added the timeline to this blog.  Go here for the complete Wikipedia article: http://en.wikipedia.org/wiki/Data_warehouse
  • 1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[2]
  • 1970s — ACNielsen and IRI provide dimensional data marts for retail sales.[2]
  • 1970s — Bill Inmon begins to define and discuss the term: Data Warehouse
  • 1975 — Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary Enterprise Data Warehousing platforms).
  • 1983 — Teradata introduces a database management system specifically designed for decision support.
  • 1983 — Martyn Richard Jones of Sperry Corporation defines the Sperry Information Center approach, which, whilst not being a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales.
  • 1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
  • 1988 — Barry Devlin and Paul Murphy publish the article An architecture for a business and information system in IBM Systems Journal where they introduce the term "business Data Warehouse".
  • 1990 — Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
  • 1991 — Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a Data Warehouse.
  • 1992 — Bill Inmon publishes the book Building the Data Warehouse.[3]
  • 1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
  • 1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.[4]
  • 2000 — Daniel Linstedt releases the Data Vault, enabling real-time, auditable Data Warehouses.
If you have any familiarity with Data Warehousing, you will see some interesting items in this timeline. The first item that should jump out at you is that the concepts of Dimensions and Facts, which are fundamental elements of Data Warehousing, were created prior to the Data Warehouse itself. Another item that you should notice is the repeated and competing contributions of two key Data Warehousing individuals, Bill Inmon and Ralph Kimball. Bill Inmon and Ralph Kimball are considered by many to be the founders of Data Warehousing, though IBM surely disputes this claim. The reason these two distinguished IT leaders are considered the founders of Data Warehousing is that each developed a unique, practical, and somewhat dogmatic methodology for implementing Data Warehouses. We will discuss both Inmon's and Kimball's approaches to Data Warehousing later in this blog.

Another very interesting aspect of this timeline is that it paints the picture that no innovation in Data Warehousing has occurred since the introduction of the Data Vault in 2000. We will describe the Data Vault concept later. While there has been no major data modeling or methodology innovation in Data Warehousing since 2000, there have been huge technical innovations that affect the Data Warehouse. These technical innovations can be summarized into three categories:
  1. Improving Data Warehousing
    1. Includes items such as Massively Parallel Processing (MPP) and columnar database technologies, which have enabled huge gains in the performance and scalability of Data Warehousing.
  2. Augmenting Data Warehousing
    1. Includes items such as Big Data appliances that specialize in capturing and presenting data elements that have traditionally been too costly or impractical to place into a Data Warehouse, such as unstructured data (i.e. data not stored in a database) or clickstream data (i.e. web site behavioral data). These items allow the traditional Data Warehouse to focus on what it does best while providing end users access to other forms of information.
  3. Replacing Data Warehousing
    1. Includes items such as NoSQL databases and in-memory appliances that aim to replace the need for Data Warehousing by providing the same functionality as a Data Warehouse without all of the effort associated with traditional Data Warehouses.
As you look at the background and history of Data Warehousing you will notice a theme of technology-enabled advancement in this field. The components of an Analytical Data Warehouse were invented first. These components were born out of the academic world and were initially theoretical concepts because technology couldn't support their realization. In the 1980s, technology finally began to "catch up" to the theory of Data Warehousing, and individual concepts were merged together to form a practical methodology (actually two: Inmon's and Kimball's versions of Data Warehousing). During the last part of the 20th century, Analytical Data Warehouses were focused on storing structured, electronic data in a relational database format to provide end users fast access to aggregated information for decision making purposes. The introduction of the Data Vault methodology at the onset of the 21st century signaled that data centric technologies were becoming sophisticated enough to evolve Analytical Data Warehouses past the point of being seen as a specialized decision support system to a point where they were viewed as a hybrid system supporting both analytical and operational requirements.

We will now define the terms related to Analytical Data Warehouses keeping in mind the context and knowledge of their history and background.

Definitions

The following terms are common in Data Warehousing and need to be clearly defined/understood.  

Analytical Data Warehouse
  • Specialized data storage system that combines disparate data sources into a clean, business centric, historically accurate, user friendly, cohesive view of an enterprise's business information. Analytical Data Warehouses turn raw data into business friendly information that allows business users to easily perform enterprise analytical functions, such as performance management or business optimization. Analytical Data Warehouses contain an enterprise wide view of information that combines many different views of business unit information.
Analytical Data Mart
  • Specialized data storage system that combines disparate data sources into a clean, business centric, historically accurate, user friendly, cohesive view of a business unit's business information. Analytical data marts turn raw data into business friendly information that allows business users to easily perform business unit analytical functions, such as marketing segmentation or sales forecasting. Analytical data marts contain a business process specific view of information that focuses on a select few specific business processes.
Information Factory
  • Specialized data management architecture designed to manage data as it is created by producers (i.e. operational systems) and transferred to consumers (i.e. analytical systems). This methodology is much broader than a logical or conceptual database design methodology as it incorporates larger, non-database data management elements such as integration, metadata management, etc. Data is stored in a central, normalized Enterprise Data Warehouse and distributed to business department aligned Data Marts.

Data Bus Architecture
  • Specialized data management architecture designed to support institutional decision making leveraging a coordinated collection of related business process aligned Data Marts.  One of the central themes to the Data Bus Architecture is to align Data Marts to business processes rather than business departments.  This means having a sales transaction Data Mart that is leveraged across Marketing and Stores instead of having independent Marketing and Stores Data Marts with isolated sales transaction data within them.  A Data Warehouse is created by combining the business process oriented Data Marts through the use of common (conformed) dimensions. 
Dimensional Model
  • Specialized data management design methodology that is used in analytical data systems to provide superior query performance for end users. Dimensional Models are de-normalized, meaning they do not adhere to the normalization rules that many transaction systems use. Instead, data is physically stored in Fact and Dimension tables. Dimensional Models leverage a star schema approach where Fact tables are placed at the center of subject specific Dimensional Models and Dimension tables join to each Fact table.
Fact
  • Specialized data attribute in analytical systems that typically represents a quantity, measurement, or metric. Facts are typically numeric. Examples of Facts are sales amount and order quantity.
Dimension
  • Specialized data attribute in analytical systems that provides descriptive context for Facts. Dimensions are typically strings or dates. Examples of Dimensions are date and employee.
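
To tie the Dimensional Model, Fact, and Dimension definitions together, here is a minimal star schema sketch using Python's built-in sqlite3 module. All table and column names (fact_sales, dim_date, dim_product) are illustrative assumptions, not taken from any particular methodology text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Dimension tables: descriptive context (strings, dates)
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- Fact table: numeric measures at the center of the star,
-- joined to each dimension by a surrogate key
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL,
    order_qty    INTEGER
);
""")

conn.execute("INSERT INTO dim_date VALUES (20120531, '2012-05-31', '2012-05', 2012)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20120531, 1, 19.99, 2)")

# A typical analytical query: aggregate the facts, grouped and
# described by dimension attributes
for row in conn.execute("""
    SELECT d.month, p.category, SUM(f.sales_amount), SUM(f.order_qty)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
"""):
    print(row)
```

Note how the query reads naturally: the measures come from the Fact table, and the Dimensions supply the descriptive context to group and filter by.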
Atomic Data
  • Important Data Warehousing concept that describes the logical database storage structure designed to capture and retain data in its most granular form. Typically atomic data in a Data Warehouse mimics the original, pre-aggregated state of data. For example, atomic data in a sales subject area of a Data Warehouse would represent data at the sales transaction level.
Summary Data
  • Important Data Warehousing concept that describes the logical database storage structure designed to capture and retain data in an aggregated form to enable faster end user query performance. In addition to faster performance, summary data in a Data Warehouse is created to give end users access to more meaningful information compared to atomic data. Summary data is built by aggregating and summarizing atomic data.
Data Vault Model
  • Specialized database design methodology designed to capture and store historically accurate data from multiple operational sources. The goal of the Data Vault is to store "all the data, all the time". Data is preserved in its original form. There is no cleansing of data on entry; therefore "wrong" data is deliberately stored in a Data Vault. Data is stored in one of three table types: Hub, Link, or Satellite. Though Data Vaults can play a role as an Analytical Data Warehouse, they typically act as Operational Data Warehouses. They are included in this blog entry because one may encounter this methodology in a "real world" Analytical Data Warehousing situation.
Hub
  • Data Vault table type that stores natural business key data. Hubs were created based on the premise that natural business keys only change when the business physically changes. Therefore a best practice data model methodology should abstract descriptive elements from business keys and should place the business keys at the center of the data model.
Link
  • Data Vault table type that models relations, associations and transactions between Hubs.  Links could be considered an associative or many-to-many table in other data modeling methodologies. Links exist to track the historically accurate relationship between Hubs. The structure of a Data Vault comes from Hubs and Links.
Satellite
  • Data Vault table type that contains the descriptive elements for Hubs and Links. Satellite tables store the metadata that describes the business process that is modeled in the Hub/Link structure.
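
To tie the Hub, Link, and Satellite definitions together, here is a minimal Data Vault table sketch (Python with sqlite3). The audit columns (load_dts, record_source) follow common Data Vault conventions, but the specific names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: the natural business keys, plus load audit columns
CREATE TABLE hub_customer (
    customer_hkey INTEGER PRIMARY KEY,
    customer_bkey TEXT UNIQUE,        -- natural business key
    load_dts      TEXT,
    record_source TEXT
);
CREATE TABLE hub_order (
    order_hkey    INTEGER PRIMARY KEY,
    order_bkey    TEXT UNIQUE,
    load_dts      TEXT,
    record_source TEXT
);

-- Link: the many-to-many association between Hubs (here, a transaction)
CREATE TABLE link_customer_order (
    link_hkey     INTEGER PRIMARY KEY,
    customer_hkey INTEGER REFERENCES hub_customer(customer_hkey),
    order_hkey    INTEGER REFERENCES hub_order(order_hkey),
    load_dts      TEXT,
    record_source TEXT
);

-- Satellite: descriptive attributes, historized by load timestamp;
-- data is stored as received, "wrong" or not
CREATE TABLE sat_customer (
    customer_hkey INTEGER REFERENCES hub_customer(customer_hkey),
    load_dts      TEXT,
    customer_name TEXT,
    customer_city TEXT,
    PRIMARY KEY (customer_hkey, load_dts)
);
""")
```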
Survival Tip:  Learn all the Analytical Data Warehouse data methodologies and apply the techniques that will work best in your organization. In the Data Warehousing market, an argument has existed for many years over which methodology is supreme: Kimball's methodology built on business process aligned Dimensional Modeling or Inmon's Information Factory methodology. The argument should never have been about Kimball vs. Inmon or Dimensional Models vs. Information Factories. Rather, the argument should be about how much to leverage from both methodologies to be successful. Here's a great article, though it's written by Kimball, on the differences between the two methodologies:  http://www.informationweek.com/news/17800088

Please note that the terms Data Warehouse and Data Mart are used interchangeably in the “real world”. Though these terms both represent similar structures and concepts, Data Warehouses have a much broader scope of data and end users compared to Data Marts. Data Warehouses typically contain multiple business unit contexts of data to provide an enterprise wide view of information.  Data Marts typically contain a single business unit or process view of information. The breadth of data contained in the data structure, not the actual size in bytes of the structure, is what drives the difference in classification between a Data Warehouse and a Data Mart. 

Role of an Analytical Data Warehouse

The role of an Analytical Data Warehouse or Analytical Data Mart is to support decision making. The methodology used to model the Data Warehouse does not change the ultimate end goal, that of decision making.  

To enable decision making, data is sourced from multiple operational systems, like HR, CRM, Sales, ERP systems. The operational data is combined and transformed into a more business friendly format to support decision making. An example of the transformation that occurs in an Analytical Data Warehouse to turn raw data into information is taking reference codes, such as UPCs, and transforming the codes into descriptive information, such as a product name.  

Both people and other software programs act as the decision makers who leverage Analytical Data Warehouse information. The information in the Analytical Data Warehouse is typically delivered through specialized software tools that provide a friendly user interface, through the use of consumer-level web technologies, data visualizations, and drag and drop interfaces. These tools are typically referred to as Business Intelligence or Decision Support tools. 

To reiterate, the role of an Analytical Data Warehouse is to collect and consolidate multiple operational sources of data, transform that data into business-friendly information, and make the information available for users or other software systems to consume. All this is done to enable fact-based decision making, either by end users or automated systems.

Survival Tip:  Use Analytical Data Warehouses/Marts wisely. Their purpose is to enable decision making. They should not be placed in a mission critical, transactional (i.e. "operational") business process as a real-time data provider. They should be used to present information to end users to enable effective, fact-based decision making. Be wary of those that want to "hit the Data Warehouse" as part of an operational business process and make it sound like it's a good idea.

Architecture Reference

Detecting an Analytical Data Warehouse from an architectural perspective is pretty straightforward. Look for a data persistence layer that has multiple sources of data flowing into it and aggregated information flowing out of it. The consumers of the persistence layer's information should also help you identify an Analytical Data Warehouse. If the consumers are making fact-based decisions based on the information being served up from the persistence layer, then chances are you are looking at an Analytical Data Warehouse.

The following diagram, introduced in the Analytical and Operational Systems blog post, shows a typical Analytical Data Warehousing environment.


Note the multiple sources of data, the centralized persistence layer (Analytical Database), and fact-based enabling presentation layer. 

The definitions section of this blog described two main types of Analytical Data Warehouses: the Information Factory and the Data Bus Architecture. It is important to be able to detect which Analytical Data Warehousing environment you are working in, as each has a unique way of persisting, presenting, consuming, processing, and managing the data and information that flows within the environment.

Data Bus Architecture


The Data Bus Architecture creates Data Warehouses by combining business process oriented Dimensional Model Data Marts. The Information Factory also leverages Dimensional Modeling in Data Marts. However, there is a big difference in how Dimensional Modeling is applied in Information Factories compared to how Dimensional Modeling is leveraged in a Data Bus Architecture.  

In a Data Bus Architecture environment the Data Marts are:

  1. Always aligned to business processes, not business departments
  2. Always related through the use of a common set of conformed Dimensions
What makes the Data Bus Architecture unique is that the Data Marts are coordinated around the conformed dimensions so that a larger, enterprise view is derived from the individual, business process aligned Data Marts. Conformed dimensions provide a mechanism for an Enterprise Data Warehouse while eliminating the need to create a specialized Enterprise Data Warehouse. In other words, data is stored in Data Marts, and Data Marts are joined together logically to create a Data Warehouse.
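
Here is a hedged sketch of what that logical joining looks like in practice, a pattern often called "drilling across": each business process fact table is aggregated to the grain of the conformed dimension, and the aggregates are then joined on the conformed attribute. The table names below are illustrative assumptions (Python with sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);

-- Two business-process Data Marts sharing the SAME (conformed) date dimension
CREATE TABLE fact_sales     (date_key INTEGER, sales_amount REAL);
CREATE TABLE fact_inventory (date_key INTEGER, on_hand_qty INTEGER);

INSERT INTO dim_date VALUES (20120501, '2012-05'), (20120601, '2012-06');
INSERT INTO fact_sales     VALUES (20120501, 100.0), (20120601, 150.0);
INSERT INTO fact_inventory VALUES (20120501, 40),    (20120601, 35);
""")

# Drill-across: aggregate each fact separately to the conformed grain,
# then join the results on the conformed dimension attribute.
query = """
SELECT s.month, s.total_sales, i.total_on_hand
FROM (SELECT d.month, SUM(f.sales_amount) AS total_sales
      FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
      GROUP BY d.month) s
JOIN (SELECT d.month, SUM(f.on_hand_qty) AS total_on_hand
      FROM fact_inventory f JOIN dim_date d ON f.date_key = d.date_key
      GROUP BY d.month) i
  ON s.month = i.month
"""
for row in conn.execute(query):
    print(row)
```

Because both marts share the date dimension, the enterprise view comes for free: no specialized Enterprise Data Warehouse table is needed.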

Understanding business process oriented Dimensional Modeling is a big key to understanding the Data Bus Architecture. 

The following image shows a typical Dimensional Model. This image is sourced from the Kimball Group's website, from one of Ralph Kimball's original articles introducing the concepts of the Data Bus Architecture and Dimensional Modeling.

 Note:  The numbers in this diagram refer to the steps to “gracefully modify” the model, i.e. extend it by adding another dimension table to it.  

Like any Dimensional Model, there is a "star" like structure with a fact table placed at the center of the star and dimension tables placed around it. The fact table is focused on a singular business process. This particular Dimensional Model looks like a sales transaction Dimensional Model.

The example above is a single subject area view of a dimensional model. What is missing from this view are additional fact tables and conformed dimensions.

The Data Bus Architecture typically leverages an Atomic and a Summary Dimensional Model layer. Atomic level data, i.e. data at the lowest grain such as a sales transaction, is stored in the Atomic layer, and aggregated views of this atomic data are precalculated and stored in summary fact tables in the Summary layer to help with query performance. The following diagram shows how data flows into a typical Dimensional Model based Data Warehousing environment.


The source/staging area in this model is a landing area used as an intake mechanism for data to flow into the Dimensional Model. We will discuss specifics around ETL/Integration in a separate blog entry.
As discussed, the Atomic layer contains data at its lowest level, i.e. the lowest grain. This layer is typically the easiest to produce from an ETL perspective as the data in this layer aligns very closely with the source data coming into the Data Warehouse. The value of this layer is that it is very flexible since it is easy to summarize data up from the lowest level.  It is much more difficult to “back into” atomic data from summarized data.  

Survival Tip:  If you are going to use a Data Bus Architecture approach for a Data Mart or Data Warehouse, create an Atomic layer. Learn the value of this layer and relentlessly promote it to your team. Take every opportunity to promote the usage of this layer to the end users/tools. Make the Atomic layer the “go to” source for data for your Data Mart or Data Warehouse. Remember it is easier to summarize data “up” from the lowest level vs. trying to add “lower level” data to an existing environment. 

As discussed, the Summary layer is a logical abstraction/layer. The physical representation of this layer is sometimes persisted in tables and sometimes it exists logically in views. The purpose of this layer is to enhance the end user experience, either by improving query response time or by presenting data in a more user friendly manner compared to the Atomic layer. Data is summarized into a higher grain from the Atomic layer.  For example, if data in the Atomic layer is captured at a time grain of day, data may be summarized in the Summary level to a month grain.  In this example, the advantage of creating a Summary layer is that end user tools won’t have to create a “complex” query to summarize the day grain data to return results at the month level. Instead the end user tools can pull data directly as it exists in the summarized table, at the month level.
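
As a minimal sketch of how a Summary layer table is built from the Atomic layer during the ETL window, assuming an illustrative day-grain sales fact (Python with sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Atomic layer: day grain
CREATE TABLE fact_sales_day (sale_date TEXT, sales_amount REAL);
INSERT INTO fact_sales_day VALUES
    ('2012-05-30', 10.0), ('2012-05-31', 20.0), ('2012-06-01', 5.0);

-- Summary layer: month grain, precalculated during the ETL window so
-- end-user tools can read it directly instead of aggregating day rows
CREATE TABLE fact_sales_month (sale_month TEXT PRIMARY KEY, sales_amount REAL);
INSERT INTO fact_sales_month
SELECT substr(sale_date, 1, 7), SUM(sales_amount)
FROM fact_sales_day
GROUP BY substr(sale_date, 1, 7);
""")

print(conn.execute("SELECT * FROM fact_sales_month").fetchall())
# [('2012-05', 30.0), ('2012-06', 5.0)]
```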

There is a tradeoff with placing a Summary layer into your Data Warehousing environment. The tradeoff has to do with processing time. Essentially, you are shifting the processing time it takes to derive a summarized result out of the end-user query and pushing it back into the data flow, i.e. ETL, process. The tradeoff can be quantified, which enables a logical decision making process when determining whether to create a Summary layer. The decision can be calculated by comparing the end users' experience/processing time with the amount of processing time/effort it will take to load a summary layer table as part of an ETL routine.
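
That comparison can be reduced to a back-of-the-envelope calculation. Every number below is an illustrative assumption; plug in measurements from your own environment:

```python
# Illustrative assumptions, not measurements: plug in your own numbers.
queries_per_day         = 500    # end-user queries that hit the summary grain
seconds_saved_per_query = 8.0    # response time saved vs. aggregating atomic data
etl_build_seconds       = 1200   # nightly cost of loading the summary table

daily_query_time_saved = queries_per_day * seconds_saved_per_query  # 4000 s
worth_building = daily_query_time_saved > etl_build_seconds

print(f"Saved {daily_query_time_saved:.0f}s/day of query time for "
      f"{etl_build_seconds}s of ETL: "
      f"{'build the summary layer' if worth_building else 'skip it'}")
```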

The orange blocks in the Dimensional Model layers represent Conformed Dimensions. Note how they are shared among the different facts in both the atomic layer and the summary layer. There are only 3 conformed dimensions in this example and they are logically shared between the atomic layer and the summary layer.

Survival Tip:  Be conscious of how many persistence points, i.e. physical data storage points, exist in your Data Warehousing environment. Each physical persistence step in a Data Warehouse requires time, management, code, etc to move the data from one spot to another. Be paranoid about persisting data and always understand why someone is recommending persisting data as a step of data flow. If a persistence step provides no business value, then eliminate it. 

The Data Bus Architecture organizes Dimensional Models around unique business processes. Data is presented to end users across the business process oriented Data Marts. A special planning and management tool called a Data Bus Matrix is used to understand which Data Marts/Fact tables each department within an organization will use as part of their analytical decision making toolset.

There are several excellent books, web sites, articles, etc written about the Data Bus Architecture. This architecture reference/overview only presents some of the high points. Please research this architectural methodology if you think it will help your organization meet its goals.

Information Factory Model


From an architecture perspective, Information Factories are larger than just database models. They are all encompassing Information Management ecosystems. Dimensional Modeling techniques are used in the Data Mart layer of Information Factories, but the main persistence layer of the Information Factory is the Enterprise Data Warehouse (EDW).  

The following diagram is taken from Inmon's website (http://www.inmoncif.com/library/cif/) and depicts a Corporate Information Factory (CIF) in a web site environment. The diagram is a good look at the CIF, as you can see several identifying characteristics of a CIF: the ODS, the EDW, and the business area aligned Data Marts. Take a look at the diagram and see if you can spot any of your organization's components within it.


The next diagram of an Information Factory is much more conceptual and shows the major components more clearly. This diagram is sourced from a very good Information Management article written by Claudia Imhoff: http://www.information-management.com/issues/19991201/1667-1.html

The article provides a good overview of the components within the Information Factory, including the ODS, Data Warehouse, Data Marts, Integration Layer, Metadata Management layer, etc.
There are two sure-fire ways to identify whether you are working within an Information Factory environment.

  1. The Enterprise Data Warehouse is in 3rd normal form (3NF)
    1. Go here for the definition of 3rd normal form: http://en.wikipedia.org/wiki/Third_normal_form
  2. The Data Marts are departmentally aligned
    1. As opposed to business process aligned
If either of these statements is true, then you are working within an Information Factory.
Survival Tip:  Introducing an Information Factory into an enterprise requires deliberate and strong executive vision and support. The fundamental Information Management changes introduced by an Information Factory will cause friction within organizations. The only way to successfully navigate these cultural changes is through strong executive leadership.


Design Concepts Applied

Now that we’ve helped to classify the two main categories of Analytical Data Warehouses we will walk through the design concepts associated with them. These design concept questions are provided to help you when reviewing prebuilt systems or when designing new systems.

Go here for an overview of the definition of the different design concepts

Design Concept:  End User Access Points


Application: Most Analytical Data Warehouse systems have tools sitting in between end users and the system. There is very little direct end user access of the data held within the Data Warehouses. However, we will provide a common set of questions to help determine how end users should and will access an Analytical Data Warehouse.
Common Design Considerations:
  • What data, tables, schemas, etc will end users access directly?
  • What tools will the end users use to directly access the Data Warehouse system?
  • What type of queries will the end users run against the Data Warehouse?

Survival Tip:  Use caution when granting direct access to an Analytical Data Warehouse. It's true that the Data Warehouse exists to provide end users access to information, but direct access to the data contained in the Data Warehouse poses all sorts of risks. A few examples of these risks are: negative performance impacts from expensive queries, data security access risks, and data integrity access risks. Be sure you are very deliberate about who, how, when, and why you grant direct access to an Analytical Data Warehouse for an end user.

 Design Concept:  Integration Points


Application: Integration Points for Analytical Data Warehouses fall into two main categories: what data is coming into the Data Warehouse and what data is flowing out of the Data Warehouse. In a well-built Data Warehouse, the data flowing into and out of it is well understood.
Common Design Considerations:
  • What data is flowing into the Data Warehouse?
  • Is the data flowing into the Data Warehouse structured, semi-structured, or unstructured?
  • How much data is flowing into and out of the Data Warehouse?
  • Has a network capacity study been completed to ensure adequate capacity?
  • Have SLAs been put in place to provide clarity on information exchange requirements?

Survival Tip:  Beware of the Data Warehousing Trap. The Data Warehousing Trap occurs when the overwhelming majority of attention is myopically focused on getting information into a Data Warehouse while very little attention is focused on getting information out of it. The purpose of a Data Warehouse is to combine data and share it with end users. Sharing data means making meaningful information easily accessible from a Data Warehouse. There seems to be a correlation between the maturity of a Data Warehouse and the likelihood that a Data Warehouse Trap gets created. Remember, information flowing out of the Data Warehouse requires just as much attention as the information flowing into it.

 Design Concept:  Information Persistence


Application:  Information Persistence in a Data Warehousing environment refers to when, where, and how data is physically stored, i.e. persisted. Data is usually persisted on disk but could also be persisted in memory, tape drives, etc. We will discuss ETL persistence in a later blog. 
Big picture, the best way to persist data in a Data Warehouse is in a format that optimizes two things:

  • Writing large amounts of data in large singular transactions or “batches”
  • Reading large amounts of data in aggregating queries that return higher level information
A lot of attention should be given to how data is physically stored in a Data Warehouse. This is true no matter the technology stack that you are using.

Common Design Considerations:
  • What physical storage technologies will be used for the Data Warehouse?
  • What design techniques will be used to ensure fast query response time?
  • Will summary data be created and persisted in the Data Warehouse?
  • How many times is a piece of data persisted in the Data Warehouse, either on disk or in memory?
  • When will data be purged from the Data Warehouse?

Design Concept:  Business Logic


Application: Business Logic in Data Warehousing is represented in the data model. The ETL process that transforms and populates the data model is where the business logic is implemented, but the data model itself reflects the rules, taxonomies, and end state of business logic.
Dimension hierarchies are a great example of a representation of business logic in a Data Warehouse. A hierarchy in a financial dimension, like a chart of accounts dimension, will represent the physical structures, names, and aggregation logic associated with the accounting rules used by finance to manage the General Ledger or other finance structures. 
Table granularity is another example of a representation of business logic in a Data Warehouse. If a business rule states that a POS transaction must capture sales tax per receipt item, then the grain of the POS table, Dimensional or 3NF, should represent this by storing data at a receipt line level including a tax metric attribute.
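As a hedged sketch of that POS example, here is what a receipt-line-grain fact table might look like; the table and column names are illustrative assumptions (Python with sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Grain: one row per receipt line. The sales-tax business rule is
# representable only because tax lives at this grain.
conn.execute("""
CREATE TABLE fact_pos_receipt_line (
    receipt_id   TEXT,
    line_number  INTEGER,
    product_key  INTEGER,
    sales_amount REAL,
    sales_tax    REAL,   -- required per receipt item by the business rule
    PRIMARY KEY (receipt_id, line_number)
)
""")
```

Had the table been declared at receipt grain instead, the per-item tax required by the business rule could never be recovered, which is exactly why business logic drives grain.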
Another area of business logic that will affect the Data Warehouse is the aggregation rules that govern metrics. I wrote a blog about aggregation logic here:  http://jalsolutions.blogspot.com/2011/08/bit-of-clear-headed-thinking-grain-and.html
Having a clear understanding of the business logic used to manage and operate a business is imperative to a good Data Warehouse design. Take the time to work with business analysts and business architects to ensure you clearly understand your organization's business logic.
Common Design Considerations:
  • How are your organization's key business processes and logic represented in the data model(s) of the Data Warehouse?
  • What are the key business hierarchies that should be modeled in the Data Warehouse?
  • What is the fiscal calendar structure for the organization?
  • What is the lowest level of granularity needed to calculate metrics?
  • What metrics are classified as purely additive, semi-additive, or non-additive?

Survival Tip:  Finance and Operational business logic determines grain in Data Warehousing. Be sure to understand how a company calculates the key metrics it uses to measure growth and calculate compensation. Take these calculations and use them as inputs to your data modeling. If you are working in a publicly traded company, you will be able to find copies of your company's financial earnings statement reports. These are filled with business logic definitions and descriptions that can help you understand how data should be modeled in the Data Warehouse. Typically these reports explain how key metrics, such as sales, products produced, or customer adds, are calculated.

 

 Design Concept:  Security


Application:  Security in a Data Warehouse deals with defending the data stored in the Data Warehouse.  It can be classified into four categories: 

  1.  Authentication
    1. How will the identity of an end user be determined and passed into the Data Warehouse?
    2. Authentication typically requires expertise outside of the Data Warehousing “space”.  Most of the time you will work with domain or network administrators to determine how Authentication will be constructed.
    3. All other security topics are reliant on successful Authentication.
  2. Access
    1. Who is allowed to access the contents of the Data Warehouse?
    2. An end user should not be able to access a Data Warehouse without being explicitly enabled to do so.
    3. Access determines how desired users will access the Data Warehouse. It also determines how undesired users will be prevented from accessing the Data Warehouse.
    4. This concept is the “first line of defense” for a Data Warehouse.
  3. Authorization
    1. What contents are end users allowed to view?
    2. An end user should not be able to view data unless they have been explicitly allowed to do so. Authorization determines what data end users can view. 
    3. It also determines what data end users are not allowed to view.
    4. This concept is the “second line of defense” for a Data Warehouse.
  4. Encryption
    1. How will sensitive data be securely persisted in the Data Warehouse?
    2. How will sensitive data be securely transmitted in a result set from the Data Warehouse?
    3. Encryption is a key defense mechanism of the data itself. This concept is independent of topics 1, 2, and 3 (the 3 A's of Security). In fact, good Encryption design should assume that topics 1, 2, and 3 have all failed.
    4. This concept is the “last line of defense” for a Data Warehouse.
Security should be explicitly implemented in a Data Warehouse. This means that no one, including administrative users, should be able to access or view data without being granted explicit permissions to do so.  
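
One classic Authorization technique is to expose sensitive tables only through column-restricted views and grant end users access to the views alone. The sketch below uses illustrative table and role names; the GRANT statements appear as comments in generic ANSI-style SQL, since sqlite itself has no GRANT (Python with sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key INTEGER PRIMARY KEY,
    full_name    TEXT,
    department   TEXT,
    salary       REAL     -- sensitive: most end users must never see this
);

-- Authorization by abstraction: expose a view without the sensitive column
CREATE VIEW v_employee_public AS
SELECT employee_key, full_name, department FROM dim_employee;
""")

# In an RDBMS with roles, the explicit (never implicit) grants would look
# something like this generic ANSI-style SQL:
#   REVOKE ALL   ON dim_employee      FROM analyst_role;
#   GRANT SELECT ON v_employee_public TO   analyst_role;
```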
Common Design Considerations:
  • What are the security policies governing the data in the Data Warehouse? How compliant is the Data Warehouse to these policies?
  • How will end user access be managed; via groups, individually, application roles, etc?
  • What level of access will administrative and developer resources have in the Data Warehouse?
  • How will end user authorization be managed; via roles, individual grants, etc?
  • What level of data authorization will administrative and developer resources have in the Data Warehouse?
  • What are the sensitive data elements maintained in the Data Warehouse?
  • How is sensitive data managed and stored in the Data Warehouse? How is sensitive data exposed to end users/applications?

Survival Tip:  Traditionally most Data Warehouses have resided behind corporate firewalls. The effect of this is that a lot of Data Warehouse managers are complacent about security concerns, thinking that the Data Warehouse is secured because it's behind the firewall. To put it plainly, security for Data Warehousing professionals is often an afterthought. As the responsibilities of Data Warehouses broaden outside of corporate firewalls, the need to leverage security experts to audit and document security vulnerabilities of the Data Warehouse is increasing. It is recommended that you hire or contract a security expert to assess your Data Warehouse's security vulnerabilities before you experience a security breach. Your chances of survival in the information world will increase drastically if you can proactively resolve a security issue before an undesired user exposes it.

Design Concept:  Disaster and Recovery


Application:  Fortunately, disaster recovery in Data Warehousing is a time-tested concept with lots of proven solutions and techniques available for reference. From a high level, disaster recovery in Data Warehousing is focused on the ability to quickly restore the structure and content of the Data Warehouse to any point in time.
Forming a disaster recovery strategy early in the Data Warehouse's lifecycle is recommended, as is a deliberate and periodic audit of the strategy to ensure it is still a feasible plan in case of necessity. The main elements of a Data Warehouse disaster recovery plan are:

  • Disaster Scenario Plan
    • Use cases of common disasters to use as inputs for the recovery plan
  • Service Level Agreement Requirements
    • Stated thresholds for down times, recovery times, cost impact, etc to use as inputs for the recovery plan
  • Recovery Plan
    • Single document outlining who, what, where, when, and how recovery should be executed during a disaster
    • Some Key Components Include
      • Database Backup Strategy
        • what, when, and how much data will be backed up
      • Backup Persistence Specifications
        • where and how long data will be stored for retrieval
        • moving backups offsite is recommended
  • Emergency Contact List
    • list of key resources with contact information by role to be leveraged during recovery
  • Logistics Plan
    • details of locations, conference lines, etc to be leveraged in case of disaster
  • etc
Note: ETL disaster recovery will be covered during the ETL portion of this blog.
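
As a minimal sketch of the Database Backup Strategy component above, here is a date-stamped nightly backup script. It assumes a PostgreSQL-based warehouse named dw and the standard pg_dump utility; the database name and paths are illustrative:

```python
import datetime
import subprocess

# Illustrative: nightly, date-stamped, compressed logical backup of a
# PostgreSQL-based warehouse ("dw" and the paths are assumptions).
stamp = datetime.date.today().isoformat()
backup_file = f"/backups/dw_{stamp}.dump"

subprocess.run(
    ["pg_dump", "--format=custom", "--file", backup_file, "dw"],
    check=True,  # fail loudly so a failed backup is noticed tonight,
                 # not discovered during a disaster
)

# Per the plan above: copy backup_file offsite (tape, object storage, etc.)
# and periodically TEST a restore; an unrestorable backup is no backup.
```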

Common Design Considerations:
  • What are the recovery and retentions policies governing the data in the Data Warehouse? How compliant is the Data Warehouse to these policies?
  • What are the SLA requirements for recovery of the Data Warehouse? How prepared is the Data Warehouse to meet the SLA requirements for recovery?
  • What tools will be used to store backup data? Will the data be stored offsite in case of a catastrophic event?
  • Who are the key resources that are required for the execution of the recovery plan?
  • How often is the recovery plan reviewed?
  • How will all data elements be restored in case of a disaster?

Survival Tip:  Similar to Data Warehouse security, disaster recovery is a key element in Data Warehousing that doesn't receive the attention it requires. Too many Data Warehouse systems consider themselves downstream systems that can simply be "rebuilt" from the source systems in case of a catastrophic event. It's true that most Data Warehouses leverage database backups for some of their data, but it's rare that all the required data is recoverable from the database backups. Just like Data Warehouse security, leverage a disaster recovery expert and assess the vulnerabilities of your Data Warehouse before you need to recover from a catastrophic event. Your chances of survival in the information world will increase drastically if you can proactively resolve a recovery issue before a disaster strikes.

Survival Tip:  At the risk of embarrassing myself, I want to highlight how unpredictably disasters can strike. When I was a greenhorn in this field, I was asked by a large and mature client to install a reporting portal tool in the Production Data Warehouse environment. I had done this a few times before in non-Production environments, so we all felt this was a low risk task. Unfortunately, during the installation I selected the Production Data Warehouse schema as the installation location and accidentally dropped and recreated the schema, causing all data, sessions, etc to be instantaneously lost. Fortunately, the client had very good recovery capabilities and the Production Data Warehouse was restored. The incident left an unmistakable impression on me. The thought that you never know when a disaster will strike, or how it will occur, has been ingrained in me ever since. It's true that the client left themselves vulnerable by granting me the permissions to drop the entire schema, but this incident goes to show that disaster and recovery planning, with periodic audit points, is an absolute necessity.

Design Concept:  Transparency


Application: Transparency in Data Warehousing is typically centered on the ETL process that loads data into the Data Warehouse. We will discuss ETL transparency in a later blog post. 
For this post, Transparency refers to understanding how, what, and when data is accessed from the Data Warehouse. It is recommended that you create transparency into your Data Warehouse by ensuring that administrative users and the management team can see:

  • Usage patterns of data elements
  • Historical trends of usage
  • Historical trends of performance statistics, such as query response time, etc
  • Ranked lists of users (or applications) by usage
  • Ranked lists of queries by performance
  • Alerting on long running, dangerous, or out of control queries
  • Storage availability
  • List of performance bottlenecks in the Data Warehouse
  • etc
Transparency for Data Warehousing should provide insight into the operations of the Data Warehouse and should help the owners and operators of the Data Warehouse proactively detect, analyze, and resolve potential operations issues.
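
Assuming the warehouse is instrumented with a query log (the query_log table and its columns below are illustrative assumptions), several of the items above reduce to simple queries (Python with sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed instrumentation: every query is logged with who ran it and duration
CREATE TABLE query_log (user_name TEXT, sql_text TEXT, started_at TEXT, elapsed_s REAL);
INSERT INTO query_log VALUES
    ('alice', 'SELECT ...', '2012-05-31 09:00', 2.1),
    ('bob',   'SELECT ...', '2012-05-31 09:05', 184.0),
    ('alice', 'SELECT ...', '2012-05-31 09:10', 3.4);
""")

# Ranked list of users by total query time
print(conn.execute("""
    SELECT user_name, COUNT(*) AS queries, SUM(elapsed_s) AS total_s
    FROM query_log GROUP BY user_name ORDER BY total_s DESC
""").fetchall())

# Alert candidates: queries running longer than a stated threshold
ALERT_THRESHOLD_S = 120
print(conn.execute(
    "SELECT user_name, elapsed_s FROM query_log WHERE elapsed_s > ?",
    (ALERT_THRESHOLD_S,),
).fetchall())
```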

Common Design Considerations:
  • What metrics should be used to manage the operations of the Data Warehouse?
  • Is there visibility into the actual values for stated SLAs?
  • How can the operations and management teams detect issues before they occur?
  • Can the Data Warehouse team proactively determine the potential effects of adding new/planned functionality requirements to the Data Warehouse?

Survival Tip: Building transparency into any technology system is a best practice. This is also true for Data Warehousing. Be sure to include money and resources in the planning phases of building the Data Warehouse to provide transparency. Not building adequate transparency into your Data Warehouse will cause a lot of pain as the Data Warehouse grows in terms of users and complexity of requests.

Design Concept:  Capacity and Scalability


Application: If Transparency has been included in the Data Warehouse design, then chances are that you are ahead of the game with regards to Capacity and Scalability. In a Data Warehouse, Capacity and Scalability refer to ensuring there are enough machine resources (CPU, RAM, Storage, Network Bandwidth) to satisfy the user requests to the Data Warehouse. 
Be sure to focus deliberate attention on the movement of data within and out of the Data Warehouse, commonly referred to as I/O. Also, be sure to focus deliberate attention on the persistence layer used to store data, logs, temporary files, internal database files, etc. Historically the biggest Capacity and Scalability issues in a Data Warehouse have been disk space and I/O. 
The Capacity and Scalability efforts for Data Warehouse environments should focus on the following, in this order: 
  1. Disk Space
    1. Amount of physical storage available to the Data Warehouse.
    2. Incorporate backup requirements and other offline functional disk space requirements in the disk space capacity plan.
  2. Disk I/O Capacity
    1. Data moving to the physical storage layer of the Data Warehouse
    2. Data Warehouse design will determine most of the I/O capacity issues
  3. Network I/O
    1. Data moving across the network to answer end user/application queries
  4. Memory (RAM)
    1. Memory is used to “cache” data in a faster persistence layer compared to disk.
    2. Some Data Warehouses only use memory for persistence
  5. CPU
    1. Amount of processing capabilities to analyze queries, execute data retrieval commands, aggregate data, execute result commands, etc
It is recommended that every Data Warehouse has a Scalability plan that explicitly states performance thresholds for each of the categories stated above. The Scalability plan should describe the actions that should be taken when the thresholds are reached over a sustained period of time.
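
A Scalability plan's thresholds can be captured as data so that the checks are repeatable. The thresholds and actions below are illustrative assumptions, ordered per the priorities above:

```python
# Illustrative scalability-plan thresholds (sustained utilization, 0..1)
# and the planned action when each is breached; all numbers are assumptions.
THRESHOLDS = {
    "disk_space": (0.80, "order storage / purge per retention policy"),
    "disk_io":    (0.75, "review indexing, partitioning, data movement"),
    "network_io": (0.70, "capacity study; consider result-set limits"),
    "memory":     (0.85, "tune caches or add RAM"),
    "cpu":        (0.85, "review worst queries; add processing capacity"),
}

def check_capacity(sustained_utilization: dict) -> None:
    """Print the planned action for every metric over its threshold."""
    for metric, value in sustained_utilization.items():
        limit, action = THRESHOLDS[metric]
        if value >= limit:
            print(f"{metric} at {value:.0%} (limit {limit:.0%}): {action}")

check_capacity({"disk_space": 0.91, "disk_io": 0.60,
                "network_io": 0.65, "memory": 0.88, "cpu": 0.40})
```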


Common Design Considerations:
  • Is there enough capacity to support the load placed onto the movement of data to the physical storage layer?
  • Is the network able to handle the amount of data flowing over it when users access data from the Data Warehouse?
  • Is there enough disk space to handle the 5 year projected growth of the Data Warehouse?
  • What are the average, minimum, and peak usage patterns of the Data Warehouse?

Design Concept:  Performance


Application:  If you’ve followed the guidance in the Capacity and Scalability section as well as the Transparency section of this blog, then performance should not be an issue for your Data Warehouse.  As mentioned in the Capacity and Scalability section above, there are two main areas where performance can suffer in a Data Warehouse environment: 

  1.  Performance of loading data into the Data Warehouse
    1. We will cover this topic in depth during the ETL blog.
  2. Performance of retrieving data from the Data Warehouse
Oddly enough, the strategies to mitigate these two main types of performance issues are very similar. The first step in ensuring performance in a Data Warehouse is to ensure you have visibility into the actual performance of the Data Warehouse, i.e. Transparency. The next step is to create a Capacity and Scalability plan by analyzing the statistics captured from the Transparency initiative. Performance problems will exist in your Data Warehouse. Having visibility and a deliberate process/plan to mitigate performance issues is a necessity for Data Warehousing.
In addition to the Capacity and Scalability plan, here are some practical performance tips and tricks for Data Warehousing:

  1. Ensure good design
    1. The best hardware in the world cannot improve performance if the fundamental design of a Data Warehouse is flawed.
    2. Refer to the tips mentioned in the Integration and Information Persistence sections of this blog for tips on proper Data Warehouse design.
    3. A good design should encompass:
      1. Minimal data movement
      2. Excellent Indexing
      3. Proper partitioning and other database level performance enhancing techniques
  2. Hire a great Tool Administrator
    1. Hire an expert who knows the most intimate details of the tool that your Data Warehouse is using for the physical storage area.
  3. Enforce defensive policies
    1. Be smart with your user policies and ensure that you have controls to "kill" harmful queries (see the sketch after this list).
  4. Choose the right tool for the job
    1. Be sure to align the Data Warehouse tool suite to the use case. A "real-time" Data Warehouse shouldn't rely on a tool built for offline processing.
  5. Set expectations
    1. Be sure to properly set expectations through SLAs and other means so end users do not perceive a performance issue if one doesn’t exist.
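
As a sketch of the defensive policies mentioned in item 3 above, here are two PostgreSQL-style controls: a statement timeout that caps query runtime for a role, and a sweep that terminates runaways. Syntax and catalog column names vary by RDBMS, so treat these as illustrative:

```python
# Illustrative defensive controls, shown as PostgreSQL-style SQL held in
# Python strings; a DBA or a scheduled job would execute these.

# 1. Cap how long any single statement may run for a given role:
SET_TIMEOUT = "ALTER ROLE analyst_role SET statement_timeout = '60s';"

# 2. Find and terminate queries that have escaped the cap:
KILL_RUNAWAYS = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';
"""
```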
Common Design Considerations:
  • What are the 10 highest performance risks to the Data Warehouse?
  • What are the 10 worst performance bottlenecks in the Data Warehouse?
  • How can the Data Warehouse “protect itself” from potentially harmful end user processes?
  • What are the performance SLA terms for the Data Warehouse?

Survival Tip:  Outside of trust in the data, one of the biggest reasons that Data Warehouses are labeled as failures is end user performance. All Data Warehouses are conceived with good performance in mind. However, somewhere along the implementation, tactical decisions are made, timelines are squeezed, and the original performance strategies/assumptions are compromised. Ensuring good performance in Data Warehousing requires the full team to drive towards this goal. Every decision during implementation should include the question: what will this do to performance?

Design Concept:  Maintenance, Bugs, and Upgrades


Application: One of the nicest things about Analytical Data Warehouses is that they are designed and built around batch-oriented, latency-tolerant data processing techniques. End users of Data Warehouse systems expect large windows of "refresh" time when the Data Warehouse is either unavailable or unreliable. These inherent (and, if you're doing your job, explicitly stated) daily load windows provide excellent opportunities to perform maintenance on the system.
Be sure though that you have an SLA in place with an explicitly stated strategy on handling:

  • Routine Maintenance
    • Stated allotment of time dedicated to weekly or monthly Data Warehouse maintenance activities
    • A few examples of routine maintenance activities are:  Backups, Purge Processes, Index Maintenance, etc
  • Large Software Upgrades
    • Stated time based strategy explaining when major “dot” releases will be available from the Data Warehouse “stack” vendors.
    • This part of the SLA should set expectations with Data Warehouse end users, owners, and operators as to when major releases will be available from the software vendors.
    • The purpose of setting expectations is to ensure deliberate planning occurs to minimize the impact of an upgrade on the business.
  • A Bug/Defect Resolution Strategy
    • Stated strategy with “turn-around” time expectations based on the severity of bugs.
Common Design Considerations:
  • What are the business expectations for “up time” for the Data Warehouse? What is the business impact if these expectations cannot be satisfied?
  • What is the required time to fix an issue by issue severity? 
  • What are the daily, weekly, monthly, and annual Data Warehouse maintenance tasks? How much time is required for each task? Who are the resources involved in performing each task?

Development and Implementation Concepts Applied


There are countless articles, books, blogs, white papers, etc on best practices for implementing an Analytical Data Warehouse. Most processes are implementation approaches based on either the Information Factory or Data Bus approaches. This is why it is important for Information Managers and Architects to understand these two fundamental Data Warehousing methodologies and to explicitly identify the methods from each approach to be used in the Data Warehousing ecosystem.
There are a couple of key items that must be deliberately managed and addressed to help ensure the success of a Data Warehouse project. These items include:

  1. Starting with a Project Charter
    1. This may seem like Project Management 101, but in a Data Warehouse, clarity of vision and ownership is key to ensuring success.
    2. A lack of vision, something I like to call the purpose vacuum, will cause immense harm to the tactical, day-to-day implementation process of implementing a Data Warehouse.
    3. The vision shouldn’t be too detailed at this point, but it should be clear and include:
      1. The goals of the Data Warehouse
      2. Owner and roles for the team
      3. Inception level statements of requirements prioritized by business value impact
      4. Conceptual Diagram of Data Contents 
    4. The risks of not starting with a clear vision are inflated costs, missed expectations, and the most detrimental issue in Data Warehousing: a lack of TRUST in the end result.
  2. Choosing a Flexible Development Methodology and Framework
    1. Data Warehousing projects are expensive due to the amount of effort it takes to secure, analyze, import, transform, and load data into the Data Warehouse. 
    2. The effort required to meet a request from end users and system owners typically causes shock to the end users and system owners, which damages the reputation and value of the Data Warehouse.
    3. Selecting a development methodology and framework that is purpose-built to support rapid deployment of changes/new functionality for Data Warehouses will greatly increase the chances of a successful Data Warehouse.
    4. Several methodologies exist for incorporating agility, not necessarily Agile, into the Data Warehouse process. Google Data Warehousing Agility for more information on this topic.
    5. Several tools/frameworks exist that help Data Warehouse operators react to change. Wherescape is a very interesting tool concept to help alleviate some of the pain that comes with managing a Data Warehousing environment in a fluctuating business environment. Note, I have not yet seen any of Wherescape's products used in a production deployment.  http://en.wikipedia.org/wiki/Wherescape
  3. Rapid, Iterative, and Continuously-Communicated Requirements Gathering Process
    1. The purpose of Requirements gathering in a Data Warehousing initiative is to ensure that the right data is captured at the right level, i.e. grain, and presented to end users at the right time.
    2. To optimize the cost of development for Data Warehousing, these three questions need to be answered as efficiently as possible.
    3. Note the answers to these questions are going to change, always and rapidly.
    4. Following a traditional IT oriented process where requirements are gathered, turned into massively large Technical Specifications and then handed over to a development team is not the most efficient method to use when capturing Data Warehousing requirements.
    5. Requirements gathering for a Data Warehousing process should employ iterative techniques, visualizations, and when appropriate use cases to rapidly capture requirements and provide feedback to end users. It’s important to leverage a requirements process that allows for continual feedback.
    6. It is also very important that the Data Warehouse deployment/implementation/development methodology can empower an iterative, continuous feedback loop approach to requirements gathering.
  4. Start Managing the Content Early
    1. Data/Information Management is not a sexy topic, but it is incredibly valuable.
    2. Organizations that have a strong Information Management organization in place are more likely to produce successful Data Warehouses. This is because Information Management artifacts can be leveraged as solid inputs into the Data Warehouse.
    3. Examples of Information Management artifacts that positively impact Data Warehousing are:
      1. Conceptual Diagrams
      2. Well Defined Stewardship Community
      3. Reference and Master Data Management Capabilities
      4. Information Architecture Standards
      5. Data Dictionary (Business Semantics and Technical Standards)
  5. Know the Data
    1. Understanding the data is key to building a successful Data Warehouse
    2. Prove assumptions empirically through Data Profiling prior to placing items into Production.
    3. Prove that the Data Model will satisfy the business needs as well as what is dictated by the incoming data sources.
Survival Tip:  One of the oldest "tricks of the trade" for Data Warehousing is to always have high level executive buy-in before beginning a Data Warehouse initiative. This is as true today as it was in the 90s when Data Warehousing first gained wide adoption. If you are asked to participate in a Data Warehousing initiative that lacks a good, capable executive sponsor, be prepared for a long and bumpy ride that will probably end in failure. Oh, and the same advice holds true if there is no quantified business value associated with the Data Warehousing initiative.

Opportunities for Improvement and Future Direction

The goal of Analytical Data Warehouses and Data Marts is to enhance organizational decision making by combining data from disparate Operational systems and transforming that data into information. The value of Data Warehouses has been immense and correlates to the benefits of better organizational decision making. However, the cost of delivering Data Warehouse functionality has been very high for years. Data Warehousing systems have also been notoriously rigid in terms of their ability to react to changes in business requirements. Both the high cost and the rigidity of Data Warehouses have eroded some of the positive financial gains that were promised from these systems. In addition to these factors, the sheer volume of data combined with the complexity of statistically oriented data requirements is exposing some fundamental scalability flaws in traditional Analytical Data Warehouse methodologies.

The Analytical Data Warehouse market has identified several opportunities for improvement that are targeted at alleviating the challenges discussed above. These opportunities for improvement include:

  1. Reduce the reliance on specialized hardware and software (RDBMS) to minimize the cost associated with Analytical Data Warehouses.
  2. Enhance rapid scalability in terms of both data storage and end user processing.
  3. Eliminate the need to physically extract data from Operational Systems and move it into Analytical Systems.
    1. As discussed in this blog, this data extraction is done to shift the processing burden of analytical queries away from Operational Systems to specialized Analytical Data Warehouses with the goal of protecting the performance capabilities of the Operational Systems.
Technical innovations tagged as "Big Data", "NoSQL", etc are emerging as "battle tested" solutions targeted at delivering on these opportunities for improvement. In fact, some newer technologies promise to enable hosting both large scale Operational systems and Analytical systems on a single technical stack.
These technical innovations have introduced enormous potential into the Data Warehousing market. It is very likely that a business problem that currently requires an Analytical Data Warehouse solution will, in the near future, not require an Analytical Data Warehouse as part of the solution. Think of the potential of being able to provide the rewards of Analytical Data Warehousing without having to "deal" with the pain of Analytical Data Warehousing; that is, not having to create specialized data models, manage specialized infrastructures, or code challenging ETL processes, while enabling the same fact based decision making capabilities.
The opportunities for improvement in Data Warehousing are huge and are becoming more obtainable every day.
Survival Tip:  To increase your chances of survival in the information world, pay attention to the Big Data movement and start playing with some of the new technologies, especially Hadoop. The effort it will take to "catch up" to technology in the coming years is going to increase as the technology advances and becomes less familiar compared to the technologies of today. It won't take you more than 4 – 8 hours a month to read about or play with a new technology. Ideally, try to get yourself involved in these new technologies today within your existing role, either through a POC or other prototyping effort.

Summary

In this post we took a look at the major approaches, methodologies, architectures, design considerations, development approaches, and opportunities for improvement of Analytical Data Warehouses. The purpose of this blog post is to provide deep guidance and references for the professionals who work with or consume information from Analytical Data Warehouses and Data Marts.

The next blog post in this series will be much shorter and will explore the Operational equivalent of an Analytical Data Warehouse, the Operational Data Store.