What are the Snowflake and Star schemas used for?
In the practice of building a data warehouse, a developer needs to go through the decision of choosing between a star or snow flake schema.
These schemas pertain to who the data will be stored for. The decision on what schema to choose impacts performance, readability and maintainability so it is probably the key choice needed to be made before a data warehouse project gets underway.
Which schema is better for performance?
The Star schema is in a more de-normalized form and hence tends to be better for performance. Along the same lines the Star schema uses less foreign keys so the query execution time is limited. In almost all cases the data retrieval speed of a Star schema has the Snowflake beat.
Which schema is better for readability?
The Star schema is easier for readability because its query structure is not as complex, on the other hand the Snowflake has a complex query structure and is tougher for readability and implementing changes. The changes to be implemented can be tougher to put into a Snowflake schema because of the tendency to have a lot of joins in the query structure. The Star schema on the other hand uses less joins and tends to have more data redundancy. So for readability the schema to go with would be the star schema.
Which schema is better for maintainability?
Maintainability for a data warehouse is heavily dependent on the amount of redundant data. The more redundancies the more places the maintenance needs to take place. Out of the two schemas the Snowflake has the least data redundancies so is hence the more maintainable choice.
Snowflake vs Star Schema
Now comes a major question that a developer has to face before starting to design a data warehouse. Snowflake or Star schema? We’ve gone over the difference and the choice needs to be made on a case by case basis, but two important factors outside of the above, which are more personal choices, are the number of dimensions in your data and the size of the data.
If the data is relatively small and the end result is more of a DataMart than a data warehouse, then the choice tends to lean towards Star schema. Along the same line, if the relationships inside the data are simple and don’t have many to many relationships then the choice tends to lean towards Star Schema. On the other hand, if you are building a bigger solution with many to many relationships then going with the Snowflake is your best bet.
Another thing that needs to be considered is the number of dimensions in your dimension table. If a single dimension requires more than one table, it’s better to use the Snowflake schema. For example, a star schema would use one date dimension but a Snowflake, can have Dimension date tables that extends out to dimension day of the week, quarter, month…etc. If these branches or snowflakes are needed than the Star schema isn’t the way to go.
Bad habits while implementing Snowflake and Star schemas
There are three key phases to building a data warehouse: planning, implementation, and documentation.
The first stage in creating a data warehouse is planning. Sometimes, especially with deadlines, not enough time is spent on this stage, this can become a routine and is a bad habit.
Besides that, there are three bad habits that tend to occur where the planning stage is done:
- Not enough business cases are considered, the more business cases that are considered for the data warehouse the more functional and useful a data warehouse will be.
- The future growth forecast of the company isn’t taken into account, the size of the organization and its data plays a crucial role in selecting which schema will be used. With scalability not being considered, the data warehouse may have a short shelf life and end up costing more than it helps.
- The current architecture of the data used by the company not being considered. With this some questions tend to go unanswered questions like:
- Is the data available?
- Is the data able to be consolidated?
- Is the current infrastructure able to handle the data warehouses load onto the network?
Implementing a data warehouse is the most technical part of the process and a bad habit is having a haphazard implementation. To avoid this, developers need to pay attention to detail and follow the plan they made during the planning process to a tee. An extra sense of attention to detail needs to be given during the implementation process. Another bad habit that can be found during this phase is the content of the created objects, clear names should be given to the fact and dimension tables. Also the fields within these tables need to be clearly named this will help later down the road in making reports as well as ad-hoc querying the database.
The worst mistake made in the production of a data warehouse is documentation or more accurately the lack there of.
A data warehouse is made to be accessed to write queries from, to make reports from and to organize information. If the setup of the data warehouse is not documented, then this makes the whole process harder and not user-friendly. BI report developers have a harder time accessing the data and using it to make the solutions that the data warehouse was made to provide in the first place.
At the same time the maintenance of the data warehouse becomes harder as the organizational structure, teams and developers change over the course of time. A simple change could take months to implement because no documentation was provided. Losing the company money, time, resources and a business decision making advantage.
The documentation is a vital key to building a successful lasting data warehouse. The documentation should include notes on architecture and database diagrams for organization and querying/updating purposes. Also, the documentation should thoroughly cover the ETL, from an xml standpoint as well as the servers (data sources), parameters, data cleansing techniques and data matching techniques. Lastly it should cover any special features such as slow changing dimensions that were used. This makes it easier for developers to make changes, analysts to create reports, and the organization to see where this data warehouse fits into the scheme of their organization.
These questions not being considered can lead to a data warehouse that begins to be implemented but doesn’t end up being finished. All of the above are bad habits that if not considered can lead to a huge waste of time and resources and in the worst case a failed data warehouse.
Written by Sukhmani Bains