Extracting: Getting Data into the Data Warehouse
Subsystem 1: Data Profiling
Data profiling plays two distinct roles: strategic and tactical.
Subsystem 2: Change Data Capture System
You must have a capability to transfer only the relevant changes to the source data since the last update. Isolating the latest source data is called change data capture (CDC). The idea behind CDC is simple enough: just transfer the data that has changed since the last load. But building a good CDC system is not as easy as it sounds. The key goals for the change data capture subsystem are:
■ Isolate the changed source data to allow selective processing rather than a complete refresh.
■ Capture all changes (deletions, edits, and insertions) made to the source data, including changes made through nonstandard interfaces.
■ Tag changed data with reason codes to distinguish error corrections from true updates.
■ Support compliance tracking with additional metadata.
■ Perform the CDC step as early as possible, preferably before a bulk data transfer to the data warehouse.
Common techniques for capturing changes include the following (a minimal audit-column sketch appears after the list):
- Audit Columns
- Timed Extracts
- Full Diff Compare
- Database Log Scraping
- Message Queue Monitoring
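The audit-column approach is the simplest of these to sketch: select only the rows whose audit timestamp is later than the previous load. The snippet below is a minimal illustration, assuming an illustrative `orders` table with a `last_modified` column maintained by the source application and a DB-API connection with psycopg2-style parameter markers; none of these names come from the text. Note that hard deletes will not surface through audit columns alone, which is one reason the goals above call out deletions explicitly.

```python
from datetime import datetime

def extract_changed_rows(source_conn, last_load_time: datetime):
    """Audit-column CDC: pull only rows touched since the previous load.

    Assumes an illustrative 'orders' table with a 'last_modified' audit
    column and a DB-API connection (psycopg2-style %s parameter markers).
    """
    sql = """
        SELECT order_id, customer_id, order_status, last_modified
        FROM orders
        WHERE last_modified > %s
        ORDER BY last_modified
    """
    with source_conn.cursor() as cur:
        cur.execute(sql, (last_load_time,))
        for row in cur:
            yield row
```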
Subsystem 3: Extract System
Cleaning and Conforming Data
Subsystem 4: Data Cleansing System
Quality Screens
The heart of the ETL architecture is a set of quality screens that act as diagnostic filters in the data flow pipelines. Each quality screen is a test. If the test against the data is successful, nothing happens and the screen has no side effects. But if the test fails, it must drop an error event row into the error event schema and choose to either halt the process, send the offending data into suspension, or merely tag the data.
Responding to Quality Events
The choices are: 1) halting the process; 2) sending the offending record(s) to a suspense file for later processing; and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best, whenever possible.
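As a concrete illustration, here is a minimal column screen that takes the third response: failing rows are tagged and passed through, and each failure drops a row into an error event structure. The field and screen names are illustrative, not a fixed schema from the text.

```python
def screen_null_customer(rows, error_events, batch_id):
    """Column quality screen: customer_id must not be null.

    Failing rows are tagged and passed downstream (response option 3);
    each failure also records an error event row. Names are illustrative.
    """
    for row in rows:
        if row.get("customer_id") is None:
            error_events.append({
                "batch_id": batch_id,
                "screen": "null_customer_id",
                "severity": "warn",
                "record_key": row.get("order_id"),
            })
            row["quality_flag"] = "suspect"   # tag the data, don't halt the job
        yield row
```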
Subsystem 5: Error Event Schema
Subsystem 6: Audit Dimension Assembler
The audit dimension is a special dimension that is assembled in the back room by the ETL system for each fact table, as we discussed in Chapter 6: Order Management. The audit dimension in Figure 19-2 contains the metadata context at the moment when a specific fact table row is created. You might say we have elevated metadata to real data! To visualize how audit dimension rows are created, imagine this shipments fact table is updated once per day from a batch file. Suppose today you have a perfect run with no errors flagged. In this case, you would generate only one audit dimension row, and it would be attached to every fact row loaded today. All the categories, scores, and version numbers would be the same.
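A minimal sketch of that perfect-run scenario follows: one audit row is built from the load's metadata context, and its surrogate key is stamped onto every fact row in the batch. The column names and values are assumptions for illustration only.

```python
from datetime import datetime

def build_audit_row(audit_key, etl_version, error_events):
    """Assemble one audit dimension row capturing this load's metadata context."""
    return {
        "audit_key": audit_key,
        "load_timestamp": datetime.now(),
        "etl_version": etl_version,
        "quality_category": "clean" if not error_events else "flagged",
        "error_count": len(error_events),
    }

# A perfect daily run: one audit row, attached to every fact row loaded today.
fact_rows = [{"ship_key": 1, "qty": 5}, {"ship_key": 2, "qty": 3}]   # illustrative
audit_row = build_audit_row(audit_key=20240115, etl_version="7.3", error_events=[])
for fact in fact_rows:
    fact["audit_key"] = audit_row["audit_key"]
```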
Subsystem 7: Deduplication System
Subsystem 8: Conforming System
Delivering: Prepare for Presentation
Subsystem 9: Slowly Changing Dimension Manager
One of the more important elements of the ETL architecture is the capability to implement slowly changing dimension (SCD) logic. The ETL system must determine how to handle an attribute value that has changed from the value already stored in the data warehouse. If the revised description is determined to be a legitimate and reliable update to previous information, the appropriate SCD technique must be applied (a minimal type 2 sketch follows the list below).
- Type 1: Overwrite
- Type 2: Add New Row
- Type 3: Add New Column (Attribute)
- Type 4: Add Mini-Dimension
- Type 5: Add Mini-Dimension and Type 1 Outrigger
- Type 6: Add Type 1 Attributes to Type 2 Dimension
- Type 7: Dual Type 1 and Type 2 Dimensions
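Of these, type 2 is the workhorse. The sketch below shows the core mechanics: expire the current row for the natural key and insert a new row under a new surrogate key. The housekeeping columns (effective date, expiration date, current-row flag) follow a common convention, but the specific names are assumptions rather than anything prescribed by the text.

```python
from datetime import date

def apply_scd_type2(dimension, natural_key, new_attributes, next_surrogate_key, today=None):
    """SCD type 2: expire the current row for the natural key and add a new row."""
    today = today or date.today()
    for row in dimension:
        if row["natural_key"] == natural_key and row["current_row"]:
            row["current_row"] = False
            row["row_expiration_date"] = today
    new_row = {
        "surrogate_key": next_surrogate_key,     # new surrogate key for the new version
        "natural_key": natural_key,
        **new_attributes,
        "row_effective_date": today,
        "row_expiration_date": date(9999, 12, 31),
        "current_row": True,
    }
    dimension.append(new_row)
    return new_row
```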
Subsystem 11: Hierarchy Manager
Snowflakes or normalized data structures are not recommended for the presentation level. However, the use of a normalized design may be appropriate in the ETL staging area to assist in the maintenance of the ETL data flow for populating and maintaining the hierarchy attributes. The ETL system is responsible for enforcing the business rules to assure the hierarchy is populated appropriately in the dimension table.
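One such business rule is that a fixed hierarchy must be many-to-one: each lower-level value should roll up to exactly one parent. The check below is a small sketch of how the ETL system might enforce that before populating hierarchy attributes; the column names and sample rows are illustrative.

```python
from collections import defaultdict

def many_to_one_violations(rows, child_col, parent_col):
    """Find child values that roll up to more than one parent,
    which would break a many-to-one hierarchy."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child_col]].add(row[parent_col])
    return {child: sorted(p) for child, p in parents.items() if len(p) > 1}

# Illustrative staging rows; the second 'Blades' row violates the hierarchy.
staging_rows = [
    {"subcategory": "Blades", "category": "Razors"},
    {"subcategory": "Blades", "category": "Kitchen"},
]
print(many_to_one_violations(staging_rows, "subcategory", "category"))
```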
Subsystem 12: Special Dimensions Manager
- Date/Time Dimensions
- Junk Dimensions
- Mini-Dimensions
- Shrunken Subset Dimensions
- Small Static Dimensions
- User Maintained Dimensions
Subsystem 13: Fact Table Builders
- Transaction Fact Table Loader
- Periodic Snapshot Fact Table Loader
- Accumulating Snapshot Fact Table Loader
Subsystem 14: Surrogate Key Pipeline
Subsystem 15: Multivalued Dimension Bridge Table Builder
The challenge for the ETL team is building and maintaining the bridge table. As multivalued relationships to the fact row are encountered, the ETL system has the choice of either making each set of observations a unique group or reusing groups when an identical set of observations occurs. Unfortunately, there is no simple answer for the right choice. In the event the multivalued dimension has type 2 attributes, the bridge table must also be time varying, such as a patient’s time variant set of diagnoses.
One of the bridge table constructs presented in Chapter 10: Financial Services was the inclusion of a weighting factor to support properly weighted reporting from the bridge table. In many cases, the weighting factor is a familiar allocation factor, but in other cases, the identification of the appropriate weighting factor can be problematic because there may be no rational basis for assigning it.
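The sketch below illustrates one way to build such a bridge group, using the patient-diagnoses example: identical sets of diagnoses reuse an existing group, and when no rational allocation exists, the weighting factor simply splits equally across the members. Keying group reuse on the sorted member set is one possible implementation choice, not the only one, and all names are illustrative.

```python
def build_bridge_group(diagnosis_keys, existing_groups, next_group_key):
    """Return a group key for a set of diagnoses, reusing an identical group if seen before.

    Weighting factors here are a simple equal split; a real implementation
    might use a more meaningful allocation basis where one exists.
    """
    member_set = tuple(sorted(diagnosis_keys))
    if member_set in existing_groups:
        return existing_groups[member_set], []     # reuse the group, no new bridge rows
    group_key = next_group_key
    existing_groups[member_set] = group_key
    weight = round(1.0 / len(member_set), 6)
    bridge_rows = [
        {"diagnosis_group_key": group_key, "diagnosis_key": dk, "weighting_factor": weight}
        for dk in member_set
    ]
    return group_key, bridge_rows
```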
Subsystem 16: Late Arriving Data Handler
Subsystem 17: Dimension Manager System
■ Implement the common descriptive labels agreed to by the data stewards and stakeholders during the dimension design.
■ Add new rows to the conformed dimension for new source data, generating new surrogate keys.
■ Add new rows for type 2 changes to existing dimension entries, generating new surrogate keys.
■ Modify rows in place for type 1 changes and type 3 changes, without changing the surrogate keys.
■ Update the version number of the dimension if any type 1 or type 3 changes are made.
■ Replicate the revised dimension simultaneously to all fact table providers.
Subsystem 18: Fact Provider System
The fact provider is responsible for receiving conformed dimensions from the dimension managers. Its responsibilities include:
■ Receive or download replicated dimensions from the dimension manager.
■ In an environment in which the dimension cannot simply be replicated but must be locally updated, the fact provider must process dimension records marked as new and current to update current key maps in the surrogate key pipeline and also process any dimension records marked as new but postdated.
■ Add all new rows to fact tables after replacing their natural keys with correct surrogate keys (see the surrogate key pipeline sketch after this list).
■ Modify rows in all fact tables for error correction, accumulating snapshots, and late arriving dimension changes.
■ Remove aggregates that have become invalidated.
■ Recalculate affected aggregates. If the new release of a dimension does not change the version number, aggregates have to be extended to handle only newly loaded fact data. If the version number of the dimension has changed, the entire historical aggregate may have to be recalculated.
■ Quality assure all base and aggregate fact tables. Be satisfied that the aggregate tables are correctly calculated.
■ Bring updated fact and dimension tables online.
■ Inform users that the database has been updated. Tell them if major changes have been made, including dimension version changes, postdated records being added, and changes to historical aggregates.
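The key-replacement step above is the surrogate key pipeline of subsystem 14. The sketch below shows the idea under simple assumptions: the dimension manager supplies key maps from natural key to current surrogate key, every natural key on the fact row is swapped for its surrogate, and rows with an unresolvable key are parked in a suspense list (or could be routed to late-arriving handling, subsystem 16). The map layout and column naming convention are assumptions for illustration.

```python
def surrogate_key_pipeline(fact_rows, key_maps, suspense):
    """Replace natural keys on incoming fact rows with current surrogate keys.

    key_maps is assumed to look like {'customer': {natural_key: surrogate_key}, ...}.
    """
    for fact in fact_rows:
        surrogates = {}
        for dim, key_map in key_maps.items():
            natural = fact[f"{dim}_natural_key"]
            if natural not in key_map:
                suspense.append(fact)          # unresolved key: park the whole row
                break
            surrogates[f"{dim}_key"] = key_map[natural]
        else:
            resolved = {k: v for k, v in fact.items() if not k.endswith("_natural_key")}
            resolved.update(surrogates)
            yield resolved
```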
Subsystem 19: Aggregate Builder
Aggregates are the single most dramatic way to affect performance in a large data warehouse environment. Like indexes, they are specific data structures created to improve performance, and their impact can be significant. The ETL system needs to build and use aggregates effectively without causing significant distraction or consuming extraordinary resources and processing cycles.
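One of the performance pitfalls called out later (under the workflow monitor) is dropping and rebuilding aggregates from scratch rather than incrementally. The fragment below is a minimal sketch of the incremental alternative: fold only newly loaded fact rows into the existing aggregate. The in-memory dictionary representation stands in for a real aggregate table and is purely illustrative.

```python
from collections import defaultdict

def extend_aggregate(aggregate, new_facts, group_cols, measure_col):
    """Fold only the newly loaded fact rows into an existing aggregate
    instead of recomputing it from the full fact history."""
    totals = defaultdict(float)
    for fact in new_facts:
        key = tuple(fact[c] for c in group_cols)
        totals[key] += fact[measure_col]
    for key, amount in totals.items():
        aggregate[key] = aggregate.get(key, 0.0) + amount
    return aggregate
```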
Subsystem 20: OLAP Cube Builder
The relational dimensional schema should be viewed as the foundation for OLAP cubes if you elect to include them in your architecture. The process of feeding data from the dimensional schema is an integral part of the ETL system; the relational schemas are the best and preferred source for OLAP cubes. Because many OLAP systems do not directly address referential integrity or data cleaning, the preferred architecture is to load OLAP cubes after the completion of conventional ETL processes. Note that some OLAP tools are more sensitive to hierarchies than relational schemas. It is important to strongly enforce the integrity of hierarchies within dimensions before loading an OLAP cube. Type 2 SCDs fit an OLAP system well because a new surrogate key is just treated as a new member. Type 1 SCDs that restate history do not fit OLAP well. Overwrites to an attribute value can cause all the cubes using that dimension to be reprocessed in the background, become corrupted, or be dropped. Read this last sentence again.
Subsystem 21: Data Propagation Manager
All the situations previously described require extraction from the DW/BI presentation server, possibly some light transformation, and loading into a target format; in other words, ETL. Data propagation should be considered a part of the ETL system, and ETL tools should be leveraged to provide this capability. What is different in this situation is that the requirements of the target are not negotiable; you must provide the data as specified by the target.
Managing the ETL Environment
One of the goals for the DW/BI system is to build a reputation for providing timely, consistent, and reliable data to empower the business.
■ Reliability. The ETL processes must consistently run. They must run to completion to provide data on a timely basis that is trustworthy at any level of detail.
■ Availability. The data warehouse must meet its service level agreements (SLAs). The warehouse should be up and available as promised.
■ Manageability. A successful data warehouse is never done. It constantly grows and changes along with the business. The ETL processes need to gracefully evolve as well.
Subsystem 22: Job Scheduler
- Job definition
- Job scheduling
- Metadata capture
- Logging
- Notification
Subsystem 23: Backup System
Subsystem 24: Recovery and Restart System
Subsystem 25: Version Control System
Subsystem 26: Version Migration System
Subsystem 27: Workflow Monitor
Typical causes of slow ETL performance to watch for include:
■ Poorly indexed queries against a source system or intermediate table
■ SQL syntax causing wrong optimizer choice
■ Insufficient random access memory (RAM) causing thrashing
■ Sorting in the RDBMS
■ Slow transformation steps
■ Excessive I/O
■ Unnecessary writes followed by reads
■ Dropping and rebuilding aggregates from scratch rather than incrementally
■ Filtering (change data capture) applied too late in the pipeline
■ Untapped opportunities for parallelizing and pipelining
■ Unnecessary transaction logging, especially if doing updates
■ Network traffic and file transfer overhead
Subsystem 28: Sorting System
Certain common ETL processes call for data to be sorted in a particular order, such as aggregating and joining flat file sources. Because sorting is such a fundamental ETL processing capability, it is called out as a separate subsystem to ensure it receives proper attention as a component of the ETL architecture. There are a variety of technologies available to provide sorting capabilities. An ETL tool can undoubtedly provide a sort function, the DBMS can provide sorting via the SQL ORDER BY clause, and there are a number of dedicated sort utilities available.
Subsystem 29: Lineage and Dependency Analyzer
■ Lineage. Beginning with a specific data element in an intermediate table or BI report, identify the source of that data element, other upstream intermediate tables containing that data element and its sources, and all transformations that data element and its sources have undergone.
■ Dependency. Beginning with a specific data element in a source table or an intermediate table, identify all downstream intermediate tables and final BI reports containing that data element or its derivations and all transformations applied to that data element and its derivations.
Subsystem 30: Problem Escalation System
Subsystem 31: Parallelizing/Pipelining System
Parallelizing is a powerful performance technique at every stage of the ETL pipeline. For example, the extraction process can be parallelized by logically partitioning on ranges of an attribute. Verify that the source DBMS handles parallelism correctly and doesn't spawn conflicting processes. If possible, choose an ETL tool that handles parallelizing of intermediate transformation processes automatically. In some tools it is necessary to hand-create parallel processes. This is fine until you add additional processors; then the ETL system can't take advantage of the greater parallelization opportunities unless you modify the ETL modules by hand to increase the number of parallel flows.
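The fragment below is a minimal sketch of the extraction example: partition on ranges of a key attribute and run the partitions concurrently. The partition bounds, table name, and `connect` factory are assumptions, and as the text cautions, the benefit depends on the source DBMS coping well with several simultaneous range scans.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative logical partitions on a customer_id attribute.
PARTITIONS = [(1, 250_000), (250_001, 500_000), (500_001, 750_000), (750_001, 1_000_000)]

def extract_partition(connect, bounds):
    """Extract one key-range partition; 'connect' is a placeholder factory
    yielding a DB-API connection to the source system."""
    low, high = bounds
    with connect() as conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM sales WHERE customer_id BETWEEN %s AND %s", (low, high))
        return cur.fetchall()

def parallel_extract(connect, partitions=PARTITIONS):
    """Run the partition extracts concurrently."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda b: extract_partition(connect, b), partitions))
```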
Subsystem 32: Security System
Subsystem 33: Compliance Manager
The foundation of a compliance system is the interaction of several of the subsystems already described, married to a few key technologies and capabilities:
■ Lineage analysis. Show where a final piece of data came from to prove the original source data plus the transformations, including stored procedures and manual changes. This requires full documentation of all the transforms and the technical ability to rerun the transforms against the original data.
■ Dependency analysis. Show where an original source data element was ever used.
■ Version control. It may be necessary to rerun the source data through the ETL system in effect at the time, requiring the exact version of the ETL system for any given data source.
■ Backup and restore. Of course, the requested data may have been archived years ago and need to be restored for audit purposes. Hopefully, you archived the proper version of the ETL system alongside the data, so both the data and the system can be restored. It may be necessary to prove the archived data hasn't been altered. During the archival process, the data can be hash-coded and the hash and data separated. Have the hash codes archived separately by a trusted third party. Then, when demanded, restore the original data, hash code it again, and compare to the hash codes retrieved from the trusted third party to prove the authenticity of the data (a minimal hashing sketch follows this list).
■ Security. Show who has accessed or modified the data and transforms. Be prepared to show roles and privileges for users. Guarantee the security log can't be altered by using write-once media.
■ Audit dimension. The audit dimension ties runtime metadata context directly with the data to capture quality events at the time of the load.
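The hash-and-compare step from the backup and restore bullet can be sketched in a few lines: compute a digest at archival time, escrow the digest separately, then re-hash on restore and compare. SHA-256 is used here as one reasonable choice; the text does not mandate a specific algorithm.

```python
import hashlib

def hash_archive(path: str) -> str:
    """Compute a SHA-256 digest of an archived file at archival time;
    the digest would be escrowed separately (e.g. with a trusted third party)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # read in 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest()

def verify_archive(path: str, escrowed_digest: str) -> bool:
    """On restore, re-hash the restored data and compare with the escrowed digest."""
    return hash_archive(path) == escrowed_digest
```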