Data Warehousing: Fixes to Common Problems
Avoiding Potential Problems
After the initial project to deliver the data warehouse has finished, the data volumes increase over time. This often leads to ever increasing overnight load times, with the common problem that people cannot run reports until well into the working day because the warehouse is still building.
There is always a symptom of one or more of the following issues:
The initial warehouse was created as a full refresh every day
The quickest, simplest and therefore cheapest data warehouse designs just truncate and reload all or most of the data daily. Fast enough for the first few months, but a year or two later the 5-minute load times have increased to several hours. This is a form of technical debt, a shortcut which needs eventually to be addressed by implementing incremental data loading. Be aware that truncating your data every day will be causing massive fragmentation of your database files and disks and this may lead to further performance issues.
The indexing is non-existent or ill conceived
Despite improvements in database tools, efficient indexing is still easy to get wrong and can cause performance issues. While an index can improve performance of a query, the same index will also create an overhead for an insert. If databases have many indexes on each table, the chances are degrading your load times, and conversely if there is no index it will be sub optimal on query response times.
The data types are inconsistent or incorrect
Often overlooked, the data types and character sets chosen in a data warehouse can have a negative effect on performance and quality. It is common to find warehouses where the data types for a single attribute vary wildly from table to table – the same attribute being stored as a number, varchar or date in different tables. This often happens when the ETL tool can create tables automatically based upon a few rows of data.
There are two clear alternate approaches to building a data warehouse. The Corporate Information Factory (CIF) approach recommended by Bill Inmon and the requirements driven, Dimensional Methodology, Star Schema based approach recommended by Ralph Kimball.
Inmon CIF Approach
In this methodology, disposable data marts (conformed star schemas) are created only after the complete enterprise data warehouse has been created. ‘Enterprise Data Warehouses’ which are not normalized, source data from only one source system, add no real benefit and may be an unnecessary overhead. Often these types of data warehouse are the result of consultancies wanting to sell a larger and more complex project than was required.
Ralph Kimball’s Dimensional Approach
Migrating to Ralph Kimball’s Dimensional approach can help streamline and simplify a failing data warehouse. Removing the need for a 3rd Normal form enterprise data warehouse and its bottom-up requirements driven approach, means that it would reduce load times and concentrate on the business need. In Kimball’s approach, we only need a staging area in which to perform any necessary staging, integration and data quality, and a star schema area containing denormalized data in dimensions and facts. We do not have an enterprise data warehouse.
Inmon Vs Kimball
It is best suited to global companies who must integrate multiple different ERP systems and are often required to add new, yet unknown data sources following acquisitions or migrations from one ERP to another. For these companies, it makes sense to invest in a source independent enterprise database based upon their business.
For most organizations resources and time are an issue, so the enterprise data warehouse is a luxury. In most situations, we can easily integrate data from different source systems directly into the star schemas if required, without the need for an enterprise database