We have a fixed number of operations to be applied to operational databases, and we have well-defined techniques for them, such as using normalized data and keeping tables small. These techniques are suitable for delivering a solution. But in the case of a decision-support system, we do not know what queries and operations will need to be executed in the future. Therefore, the techniques applied to operational databases are not suitable for data warehouses.
In this chapter, we'll focus on designing a data warehousing solution built on top of open-system technologies such as Unix and relational databases.
Process Flow in Data Warehouse
There are four major processes that build a data warehouse:

- Extract and load the data.
- Clean and transform the data.
- Back up and archive the data.
- Manage queries and direct them to the appropriate data sources.
Extract and Load Process
- Data extraction takes data from the source systems.
- Data load takes the extracted data and loads it into the data warehouse.
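The two steps above can be sketched in a few lines. The field names (`customer_id`, `name`) and the list-based staging store are illustrative assumptions, not a prescribed schema:

```python
# Minimal extract-and-load sketch: pull selected fields from a source
# system, then append them to a staging (temporary) data store.

def extract(source_rows):
    """Take data from a source system, keeping only the fields we load."""
    return [{"customer_id": r["customer_id"], "name": r["name"]}
            for r in source_rows]

def load(staging, extracted_rows):
    """Load the extracted data into the staging store."""
    staging.extend(extracted_rows)
    return len(extracted_rows)

# Usage with a mock source system.
source = [{"customer_id": 1, "name": "Asha", "internal_flag": "x"},
          {"customer_id": 2, "name": "Ben", "internal_flag": "y"}]
staging = []
loaded = load(staging, extract(source))
print(loaded)       # 2
print(staging[0])   # {'customer_id': 1, 'name': 'Asha'}
```

Note that extraction drops source-internal fields that have no place in the warehouse.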
Points to remember during the extract and load process:
- Controlling the process
- When to Initiate Extract
- Loading the Data
Controlling the process
Controlling the process involves determining when to start data extraction and when to run consistency checks on the data. The controlling process ensures that tools, logic modules, and programs are executed in the correct sequence and at the correct time.

When to Initiate Extract
Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user. For example, in a customer-profiling data warehouse in the telecommunications sector, it is illogical to merge the list of customers as of 8 pm on Wednesday from a customer database with customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
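The telecommunications example above amounts to a simple guard: do not merge extracts whose cut-off instants differ. A minimal sketch, with illustrative source names:

```python
from datetime import datetime

# Refuse to merge extracts taken up to different points in time, so the
# warehouse presents a single, consistent version of the information.

def consistent_cutoffs(extract_times):
    """True only if every source was extracted up to the same instant."""
    return len(set(extract_times.values())) == 1

cutoffs = {
    "customer_db":         datetime(2024, 1, 10, 20, 0),  # Wednesday 8 pm
    "subscription_events": datetime(2024, 1, 9, 20, 0),   # Tuesday 8 pm
}
print(consistent_cutoffs(cutoffs))  # False: do not merge these extracts
```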
Loading the Data
After extraction, the data is loaded into a temporary data store, where it is cleaned up and made consistent.

Note: Consistency checks are executed only when all the data sources have been loaded into the temporary data store.
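The note above can be expressed as a gate: checks run only once every expected source has landed in the temporary store. The source names here are illustrative assumptions:

```python
# Gate the consistency checks on all expected sources being staged.

EXPECTED_SOURCES = {"customer_db", "billing_db", "subscription_events"}

def ready_for_checks(temp_store):
    """Consistency checks may run only when every source is loaded."""
    return EXPECTED_SOURCES.issubset(temp_store.keys())

temp_store = {"customer_db": [], "billing_db": []}
print(ready_for_checks(temp_store))   # False: subscription_events missing
temp_store["subscription_events"] = []
print(ready_for_checks(temp_store))   # True: safe to run the checks
```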
Clean and Transform Process
Once the data is extracted and loaded into the temporary data store, it is time to perform cleaning and transforming. Here is the list of steps involved:

- Clean and transform the loaded data into a structure.
- Partition the data.
- Aggregation
Clean and Transform the loaded data into a structure
Structuring the loaded data speeds up the queries. This can be done in the following ways:

- Make sure the data is consistent within itself.
- Make sure the data is consistent with other data within the same data source.
- Make sure the data is consistent with data in the other source systems.
- Make sure the data is consistent with the data already in the warehouse.
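The four checks above can be sketched on toy row data. The field names, the internal rule (end date not before start date), and the reference sets are illustrative assumptions:

```python
# Sketch of the consistency checks: one internal rule, and one cross-check
# applied against three scopes (same source, other systems, warehouse).

def check_internal(row):
    """Consistent within itself: e.g. end year not before start year."""
    return row["end"] >= row["start"]

def check_against_rows(row, other_rows, key):
    """Consistent with other data: the key must resolve there too."""
    return row[key] in {r[key] for r in other_rows}

row = {"customer_id": 7, "start": 2023, "end": 2024}
same_source  = [{"customer_id": 7}]   # other data in the same source
other_system = [{"customer_id": 7}]   # data in another source system
warehouse    = [{"customer_id": 7}]   # data already in the warehouse

print(check_internal(row))   # True
print(all(check_against_rows(row, rows, "customer_id")
          for rows in (same_source, other_system, warehouse)))   # True
```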
Partition the data
Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. It relies on the fact that the most common queries will analyse a subset or an aggregation of the detailed data.

Backup and Archive the data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to back up the data on a regular basis. Archiving involves removing old data from the system in a format that allows it to be quickly restored whenever required. For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In such a scenario, there is often a requirement to do month-on-month comparisons for this year and last year, in which case some data needs to be restored from the archive.
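The retail retention scheme above (3 years kept, latest 6 months online) reduces to a tiering rule per monthly partition. The tier names are illustrative assumptions:

```python
# Decide where a monthly partition lives, given its age in months.

def tier_for(age_months):
    """Return the storage tier for a partition of this age."""
    if age_months < 6:
        return "online"       # latest 6 months stay queryable directly
    if age_months < 36:
        return "archive"      # restorable for year-on-year comparisons
    return "drop"             # beyond the 3-year retention window

print(tier_for(2))    # online
print(tier_for(13))   # archive: last year's month, restored on demand
print(tier_for(40))   # drop
```

A month-on-month comparison of this year against last year would restore the matching "archive" partitions before running the query.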
Query Management Process
This process performs the following functions:

- It manages the queries.
- It speeds up query execution.
- It directs the queries to the most effective data sources.
- It ensures that all system sources are used in the most effective way.
- It monitors actual query profiles.
- The information gathered by this process is used by the warehouse management process to determine which aggregations to generate.
- This process does not generally operate during the regular load of information into the data warehouse.
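Two of the functions above, directing queries to the most effective source and monitoring query profiles, can be sketched together. The table names and the profile log are illustrative assumptions:

```python
# Route a query to a pre-built aggregation when one matches its grouping,
# otherwise to the detail fact table; record each profile so the warehouse
# management process can decide which aggregations to generate next.

AGGREGATIONS = {("region", "month"): "sales_by_region_month"}
query_profiles = []   # read later by the warehouse management process

def route(group_by_columns):
    """Pick the cheapest source that can answer this grouping."""
    query_profiles.append(tuple(group_by_columns))
    return AGGREGATIONS.get(tuple(group_by_columns), "sales_detail")

print(route(["region", "month"]))   # sales_by_region_month
print(route(["store", "day"]))      # sales_detail (no matching aggregate)
print(query_profiles)               # [('region', 'month'), ('store', 'day')]
```

If the profile log showed `("store", "day")` recurring often, the warehouse manager would be justified in generating a store-by-day aggregation.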