Fact Table and Dimension Table Features

Dimension Table features
  • It provides the context /descriptive information for fact table measure-ments.
  • Provides entry points to data.
  • Structure of Dimension - Surrogate key , one or more other fields that compose the natural key (nk) and set of Attributes.
  • Size of Dimension Table is smaller than Fact Table.
  • In a schema more number of dimensions are presented than Fact Table.
  • Surrogate Key is used to prevent the primary key (pk) violation(store historical data).
  • Values of fields are in numeric and text representation.


Fact Table features
  • It provides measurement of an enterprise.
  • Measurement is the amount determined by observation.
  • Structure of Fact Table - foreign key (fk), Degenerated Dimension and Measurements.
  • Size of Fact Table is larger than Dimension Table.
  • In a schema less number of Fact Tables observed compared to Dimension Tables.
  • Compose of Degenerate Dimension fields act as Primary Key.
  • Values of the fields always in numeric or integer form

Facts abd Dimensions

Facts:

The fact table is at the center of a star schema and holds the primary measurement 
data. They contain the actual numerical measurements that the business is interested in.
Fact tables express the many-to-many relationships between dimensions.
A fact table typically has two types of columns: those that contain measures and those
that are foreign keys to dimension tables. Some key features of a fact table are
  • Multi part Key.  i.e. a composite key with one foreign key for each dimension.
  • Time is a always a part of the key
  • Usually numeric. Keys are surrogate integers and the measures are numeric. 
  • Typically additive. 


Dimensions:

The dimension tables are where the attributes of the dimensions of the business are stored. The best attributes are textual and discrete and used to constraint the fact table. Each of these textual descriptions helps us to describe the member of the respective dimension.  
  • They are the entry points into the fact tables. They determine the grain of the fact  table.
  • They serves as a primary source of query constraints grouping and report labels/row headers.
  • They are relatively shallow in terms of rows but are wide with many large columns.
  • They are not usually time dependent
  • Hierarchical relationships.
  • Robust dimension attributes delivers analytic slicing and dicing capabilities. 
  • Dimension tables are de-normalized. 
  • Examples of Dimensions: Employee, Time Product Customer etc

Types of Dimensions

Slowly Changing Dimensions:
Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.
Rapidly Changing Dimensions:
A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.
Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.
Inferred Dimensions:
While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions:
A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions:
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.
Shrunken Dimensions:
A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.
Static Dimensions:
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

Types of Fact Tables

According to Kimball methodology includes 3 main types of fact tables:

Transaction – the most common type of fact table, used to model a specific business process (typically) at the most granular/atomic level.
E.g Daily Transactions of Sales of an item in a store
Periodic Snapshot – used to model the status of a business process at a specific point in time on a regularly recurring interval. For example, a periodic snapshot fact table might be used to track account balances on a monthly basis. In this case, a “snapshot” of the account balance would be taken at the end of each month – which represents the net of all withdrawal and deposit transactions occurring during the month. Inventory is another common scenario that makes use of periodic snapshots for tracking quantity on hand (by item) at the end of each month. In both examples, the primary “fact” (account balance and quantity on hand) in the two tables are “semi-additive” – which simply means they can’t be aggregated over time.
E.g Transactions over a period of month let's say a month
Accumulating Snapshot – model events in progress for business processes (e.g. Claims Processing for an Insurance Company) that involve a predefined series of steps (e.g. claim submitted, claim reviewed, claim approved/rejected). These tables prove useful in measuring/analyzing the duration between steps in a complete process and discovering bottlenecks.
e.g Transactions show the information of the business which has clearly defined from begin to the end
In addition to the primary fact table types (above), there are a few additional fact tables to consider:
Factless Fact Tables – used to model the events (e.g. student attendance) or conditions (e.g. product promotions). One common misconception is that these tables do not contain facts – which is only partially true. For example, if we are using a factless fact table to model “student attendance” (a textbook example of factless fact tables – pun intended!), our table may also include – in addition to the keys mapping to the student, class, and date dimensions – a single fact called “attendance_count” with the value of 1 for every row. This is not necessary, but it makes it clear what we are measuring and might even make life easier for downstream development of a cube and/or reports.
Note: a factless fact table used to model conditions is known as a coverage fact table in the Kimball methodology.

Slowly Changing Dimensions

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:

  • Vinay is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:


Customer Key
Name
State
1001
Vinay
Illinois 
·     At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.
·         
   There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1
  • The new record replaces the original record. No trace of the old record exists.
Type 2
  • A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3
  • The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data looks like for each of them. Finally, we compare and contrast among the three alternatives.
 

Type 1 Slowly Changing Dimension

·  In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

·         In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Vinay
Illinois
·         After Vinay moved from Illinois to California, the new information replaces the new record, and we have the following table:

Customer Key
Name
State
1001
Vinay
California

Advantages:
·  This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:
·   All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Vinay lived in Illinois before.
·         Usage:
·         About 50% of the time.
·         When to use Type 1:
·         Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2 Slowly Changing Dimension

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
·         In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Vinay
Illinois

·     After Vinay moved from Illinois to California, we add the new information as a new row into the table:
Customer Key
Name
State
1001
Vinay
Illinois
1005
Vinay
California
Advantages:
-      This allows us to accurately keep all historical information.

Disadvantages:
  • This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
  • This necessarily complicates the ETL process.
  • Usage:
  • About 50% of the time.
  • When to use Type 2:
  • Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key
Name
State
1001
Vinay
Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
·                     Customer Key
·                     Name
·                     Original State
·                     Current State
·                     Effective Date
After Vinay moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key
Name
Original State
Current State
Effective Date
1001
Vinay
Illinois
California
15-JAN-2003

Advantages:
  • This does not increase the size of the table, since new information is updated.
  • This allows us to keep some part of history.

Disadvantages:
  • Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Vinay later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
  • Type 3 is rarely used in actual practice.

When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

Types of Facts

Types of Facts
There are different kinds of facts:

1) Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.


2) Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

3) Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

4) Derived facts
  • Derived fact is shown as a kind of fact, along with base fact.
  • A derived fact is created by an inference or a mathematical calculation from terms, facts, other derivations, or even action assertions.
  • A base fact is a fact that is a given in the world and is remembered (stored) in the system.
  • Derived facts are created by performing a mathematical calculation on a number of other facts, and are sometimes referred to as calculated facts.
  • Derived facts may or may not be stored inside the fact table.
  • Base fact is generally defined, when a fact can be picked-up from the source system without any derivation. The derived fact is a data field which is derived from the base fact

5) Textual facts
  • Textual fact consists of one or more characters (codes). They should be strictly avoided in the fact table. Textual codes such as flags and indicators should be stored in dimension tables so they can be included in queries.
  • Textual facts are non-additive, but could be used for counting.
6) Pseudo fact
  • Pseudo fact when summed, a pseudo fact gives no valid result. They typically result when you design event-based fact tables.
  • In SQL Server 2005, you cannot perform a non-logged load directly into a partition. However, you can load into a separate table that we will call the pseudo-partition. Under certain conditions, you can switch the pseudo-partition into the partitioned table as a metadata operation that occurs extremely quickly.
7) Factless fact
  • A fact table with only foreign keys and no facts is called a factless fact table.
  • Is a table which is similar to Fact Table except for having any measure; I mean that this table just has the links to the dimensions.
  • Use of Factless fact table is these tables enable you to track events; indeed they are for recording events.

Fact Table Granularity

Granularity

The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
  1. Determine which dimensions will be included.
  2. Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually goes back to the requirements.

Which Dimensions To Include

Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what are the relevant dimensions.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.

What Level Within Each Dimension To Include
Determining which part of hierarchy the information is stored along each dimension is not an exact science. This is where user requirement (both stated and possibly future) plays a major role.
In the above example, will the supermarket wanting to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.) If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the "dumping the lowest level of detail into the data warehouse" symptom, and only includes what is practically needed. Sometimes this can be more of an art than science, and prior experience will become invaluable here. 

Dimensional Data Model

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.

Star Schema
In the star schema design, a single object (the fact table) sits in the middle and is radically connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

Sample star schema



All measures in the fact table are related to all the dimensions that fact table is related to. In other words, they all have the same level of granularity.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Let's look at an example: Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another. 

Snowflake Schema
The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.




Sample snowflake schema

For example, the Time Dimension that consists of 2 different hierarchies:

1. Year → Month → Day

2. Week → Day
We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown to the right.


The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables. 

Data Warehouse Concepts

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.


In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. 



  1. Subject Oriented
  2. Integrated
  3. Nonvolatile
  4. Time Variant

Subject Oriented :

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.



Integrated : 

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile : 

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.


Time Variant : 

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Contrasting OLTP and Data Warehousing Environments


Figure 1-1 Contrasting OLTP and Data Warehousing Environments
Text description of dwhsg005.gif follows
One major difference between the types of system is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments.

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
  • Workload
    Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
    OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
  • Data modifications
    A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
    In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
  • Schema design
    Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
  • Typical operations
    A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
    A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
  • Historical data
    Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

  • Data Warehouse Architecture (Basic)
  • Data Warehouse Architecture (with a Staging Area)
  • Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse

Text description of dwhsg013.gif follows

In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.


Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area

Text description of dwhsg015.gif follows

Figure 1-3 Architecture of a Data Warehouse with a Staging Area


Data Warehouse Architecture (with a Staging Area and Data Marts)

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
Text description of dwhsg064.gif follows




Hope this blog will make you understand and also dont forget to visit my Facebook Page


OBIEE 11g Developement