Friday, June 1, 2012

Data Warehouse Basic Concepts:Part 1

Dimensional Database

A Data Warehouse system is used for OLAP (Online Analytical Processing) as opposed to OLTP. Because of the requirements & nature of OLAP, a Data Warehouse typically uses a relational database for storage but the database design is non-relational. Instead the design follows the Dimensional Database principle, which is very different from Relational Database principles. For example, the following are direct opposites from what is typically done in a relational database design.
  1. Maximum data redundancy
  2. Minimum joins
  3. Minimum normalization
  4. Maximum indexing
  5. Maximum derived data
Components

A Data Warehouse system typically consists of the following components:
  1. Reporting system: used by the end-users to query & view the information in the data warehouse.
  2. ETL pipeline: used to extract raw data, transform them into the dimensional database format, and load them into the data warehouse. 
  3. Data Warehouse: storage and organization of data to be queried.
Raw Data

The raw data is usually extracted from operational systems, sometimes from various sources (e.g. ERP, Accounting, CRM systems etc.).

In many cases, raw data is also consolidated in an "operational data store" (ODS) before being put through an ETL pipeline, and during this stage various operations may be performed (e.g. calculations, aggregation, reconciling & matching data from different sources etc.).

ETL

ETL stands for Extract, Transform, Load.

From the ODS to the ETL pipeline, the first place the data goes is usually the Staging area.

The raw data in the staging area is then transformed. Many operations are done here. They are listed in this article.

After transformation, they are loaded into the data warehouse according to how the fact tables are to be organized within the data warehouse.

Data Warehouse

The data warehouse is responsible for storage of transformed data ready to be queried.

The data warehouse has the following data:
  1. Fact Tables: de-normalized tables containing the facts to be queried. Also contains foreign keys to dimension values.
  2. Data Dictionary: contains information about where to lookup information (e.g. fact tables, fact table partitioning schemes, measures & dimensions etc.)
  3. Master Data: contains dimension values.
References
  1. en.wikipedia.org/wiki/Dimensional_database
  2. en.wikipedia.org/wiki/Operational_data_store
  3. Difference between ODS & Staging

No comments:

Post a Comment