Sunday, October 26, 2008
High Performance OLAP Cubes Presentation - Sydney
Saturday, October 25, 2008
Skinny Fact Tables are good
- Dimensionalise data. For example, if a column contains a character string such as 'Successful' or 'Unsuccessful', put those strings into a dimension table and store only the numeric foreign key, such as 1 or 2 (see the first sketch after this list). Where this sort of opportunity exists, it will have the biggest performance impact, and it can actually make the ETL faster if that is where you do the dimensionalisation, because the fact table being loaded by the rdbms is thinner and requires less IO.
- Use appropriate and concise numeric and date-time data types. If you are storing a number, put it in a numeric column and make it as small as possible without risking overflow; a number between 1 and 10 only needs a tinyint (or its equivalent in other rdbms). If you are storing an integer or decimal, put it in an integer or numeric column, not a floating point data type. The same applies to dates and times.
- Use VARCHAR() where appropriate. Don't use CHAR(50) where VARCHAR(50) will store the same information in less space. Also, when you are loading VARCHAR columns, ensure that you are not padding them with spaces, which some ETL tools will do.
- Avoid VARCHAR(1). For a 1 byte column you are adding 2 bytes of length overhead to record a length you already know is 1; CHAR(1) stores the same information without the overhead.
- Avoid redundant data, that is, any data that can be derived (calculated) from other columns in the same record. For example, if there is a TransactionDateTime column, there is no need for additional TransactionDate, TransactionTime, TransactionYear, TransactionDayOfWeek, TransactionDateKey columns etc. All of those can be derived from TransactionDateTime using functions such as DatePart(hh,TransactionDateTime)*100+DatePart(mi,TransactionDateTime), which gives the time as a 24 hour hhmm number. I make great use of these functions to get foreign keys into date-time dimension tables that might have time granularity down to 5 minute, 15 minute or 1 hour levels. Of course, I design the DateTime dimension table with a primary key that is derived from the DateTime and is not synthetic; I am not a great fan of synthetic primary keys when there is a natural primary key. So it is just a matter of creating a view over the fact table which adds in the derived columns (see the second sketch after this list). A view with these functions tends to be much faster than a SELECT from a fat table with a materialised copy of the redundant columns, simply due to the reduction in IO and the reduced pressure on the rdbms data cache.
- Most rdbms platforms have compression available for columns and data pages; SQL Server added a very good implementation of this in SQL Server 2008. Since very large fact tables tend to be written once, it generally pays to have compression at the column and page level (the first sketch below uses page compression). With numeric data, I am experiencing compression ratios of about 20%-30%, and the cpu cost is almost undetectable (even lower than the cpu cost of read locks).
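To make the first few points concrete, here is a minimal T-SQL sketch of a dimensionalised, skinny fact table. All of the object names (DimOutcome, FactTransaction) are hypothetical, and the DATA_COMPRESSION option is SQL Server 2008 syntax.

-- Hypothetical outcome dimension: the character strings live here once,
-- and the fact table stores only a 1 byte foreign key.
CREATE TABLE DimOutcome (
    OutcomeKey  tinyint     NOT NULL PRIMARY KEY,
    OutcomeName varchar(20) NOT NULL  -- 'Successful', 'Unsuccessful'
);

-- Skinny fact table: concise numeric types, no derived date columns,
-- compressed at the page level (SQL Server 2008 syntax).
CREATE TABLE FactTransaction (
    TransactionDateTime datetime     NOT NULL,
    OutcomeKey          tinyint      NOT NULL
        REFERENCES DimOutcome (OutcomeKey),
    Amount              numeric(9,2) NOT NULL
) WITH (DATA_COMPRESSION = PAGE);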
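And here is a sketch of the view that adds the derived date-time columns back in at query time, again with hypothetical names. The hhmm expression is the one described above, and TransactionDateKey is the natural yyyymmdd style key into a date dimension.

CREATE VIEW vFactTransaction AS
SELECT TransactionDateTime,
       OutcomeKey,
       Amount,
       -- Time of day as a 24 hour hhmm number, e.g. 1935 for 19:35
       DatePart(hh, TransactionDateTime) * 100
         + DatePart(mi, TransactionDateTime)  AS TimeHHMM,
       -- Natural yyyymmdd key into the date dimension
       Year(TransactionDateTime) * 10000
         + Month(TransactionDateTime) * 100
         + Day(TransactionDateTime)           AS TransactionDateKey,
       DatePart(dw, TransactionDateTime)      AS TransactionDayOfWeek
FROM   FactTransaction;

Pointing the cube's data source view at vFactTransaction gives Analysis Services all of the derived columns, while the rdbms only ever reads the skinny base table.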
Keeping the fact tables as skinny as possible makes it much faster to process cubes with very large numbers of fact records. There are also ancillary benefits: a smaller database to back up, better overall rdbms performance due to less data cache pressure, and less network traffic between your cube and data warehouse.
Without very skinny fact tables, I would not have been able to maintain the live weblogs and Perfmon demonstrations on http://RichardLees.com.au/Sites/Demonstrations The Perfmon rdbms is adding about 30,000 records/minute (500 inserts/second), which is only possible on my sata disk drives because the fact tables are skinny. All these inserts, and the continual cube processing, are happening on two desktop PCs with single sata drives.
Sunday, October 12, 2008
Windows Perfmon in a Cube
On this site I have made the information available via Office PerformancePoint. I am a great fan of PerformancePoint, since it provides true thin-client OLAP cube browsing and makes it easy to create dynamic dashboards. For example, on the Perfmon Dashboard, click on Process to see a Perfmon dashboard that presents performance information for Windows processes (applications). The chart at the top left shows the most recent three hours on columns. On rows there are a number of KPIs, such as % Processor Time and IO Data Bytes/sec. In the value display, the chart shows the Windows process that is the highest consumer of that resource for that hour, so if you have a scarce resource, such as memory, you can immediately see which processes are the top consumers. If there is a KPI you are keen to know more about, simply click on the KPI name, for example IO Data Bytes/sec. Notice how the charts on the right hand side adapt to the KPI in focus.
Wednesday, October 8, 2008
Digital Dashboards on Mobile Devices
Note: the real challenge with mobile devices is security. While an Internet page can be restricted with challenge/response authentication, this is not so simple with mobile devices. If anyone knows of a simple workaround for mobile security, I would like to hear about it.
For more live Business Intelligence demonstrations, please see http://RichardLees.com.au/Sites/Demonstrations
Tuesday, October 7, 2008
TechEd SSAS 2008 Performance
Sunday, October 5, 2008
Can SQL Server SSIS and AS process 500 records/second?
See http://RichardLees.com.au/sites/Demonstrations/Shared%20Documents/Windows%20Performance%20Monitor/Perfmon.aspx This site is processing over 500 records/second, every second of the day. In fact, this 500 records/second is small fry and can be achieved with a couple of desktop PCs. Here is a jpg of the current view. If you can read the image, it says that the current server time is Oct 5, 19:19, and the cube is only 19 seconds behind. The top left scorecard is telling us that 31,308 records were processed in the last 60 seconds, and the rest of the scorecard provides indicators for the activity in those 60 seconds (Actual). The Target numbers are the average figures for the last hour. The charts on the right provide trends for the last 60 minutes; note how the last point on the x axis is 19, that is, 19:19. Some of the charts are interactive, so you can drill down and around the cube in an ad hoc fashion.
Also see the SQL Server dashboard, which is based on the same cube and offers more interactive scorecards and charts. For example, when you focus on any of the KPIs, the chart at top right will dynamically change to display trend lines for that KPI.
This PerformancePoint dashboard is giving us close to real time information on a database that has over 500 inserts/second. Interestingly, the dashboard is telling us the performance activity of the SQL Server, SSIS and SSAS processes themselves. See how the machines (they aren't real servers, just our two home PCs) are averaging under 10% cpu.
Go to http://RichardLees.com.au/Sites/Demonstrations to see this and other demonstrations live.
Saturday, October 4, 2008
Live Excel Services Demonstration
http://RichardLees.com.au/sites/Demonstrations/Pages/ExcelServices.aspx
There are a couple of other Excel Services demonstrations on the site, which I find more interesting because they are connected to cubes that are continually updating. These are the Web and Perfmon cubes. If you are interested in real time BI, you should try out the Perfmon Excel Services demonstration. That cube is never more than 180 seconds behind real time, with over 30,000 new facts every 60 seconds.
Friday, October 3, 2008
An Alternative to Sydney Commuting Traffic
An Alternative to a physical Distinct Count Measure
That is not to say that you shouldn't use DistinctCount; it can work well and be extremely useful. However, you should be aware that there is an alternative, which has its own pros and cons. The alternative is to create a dimension on the key, if you haven't got one already, then simply create a calculated member that references the dimension, i.e.
Count(NonEmpty([DimName].[HierarchyName].[LevelName].Members, [Measures].[MyMeasure]))
The advantage of this alternative is that the cost of the distinct count is only experienced at query time; cube processing and other queries will not be impacted by the calculated member. This is unlike a physical distinct count measure, whose cost is paid during processing and can impact other queries, which is why we often put a Distinct Count measure in its own measure group. You can see a demonstration of the calculated member version of distinct count on http://RichardLees.com.au/sites/Demonstrations/Pages/TS_IISLog.aspx The Distinct Resources and Distinct ClientIPs measures are both calculated members and perform very well.
Some time ago, when I suggested this to a developer I was helping with a performance problem, he informed me that my alternative would never work on his cube because there were over 10 million distinct values, and the cost of building a dimension that big would be prohibitive. That showed me his lack of understanding of Distinct Count: his cube with a physical Distinct Count measure would hold the 10 million values (and subsets of the 10 million) at every cell and aggregation, so the 10 million values would be repeated throughout the cube, whereas the dimension stores them only once.
By the way, I am always telling Analysis Services cube designers that they should hone their MDX skills: the better your MDX skills, the better the cubes you will design. It is very similar to relational databases, where you wouldn't expect someone to design a high performing database if they weren't a master of SQL.