Sunday, October 26, 2008

High Performance OLAP Cubes Presentation - Sydney

Anyone interested in designing high performing OLAP cubes (and, to some extent, data warehouses) might want to come along to my public presentation at the Sydney SQL Server session on November 12th, 2008. The presentation is based on one I gave recently at Microsoft TechEd and is free to attend. During this session I try to share as many insights as I can into the techniques and knowledge that enable me to design high performing business intelligence solutions. I will cover the new performance features of SQL Server 2008, but the presentation is broader than that.

For more information and to register, go to http://www.sqlserver.org.au/events/ViewEvent.aspx?EventId=352. To download my TechEd slide deck, go to http://RichardLees.com.au/Pages/Downloads.aspx

Saturday, October 25, 2008

Skinny Fact Tables are good

Many of my customers have billions of rows in their fact tables, which means their cube processing time is, at least partly, dependent on how long it takes the relational database manager to scan the data from disk. With most rdbms' (e.g. SQL Server, Oracle, DB2) this ends up being a question of how fast the disk subsystem can transfer the data. So one way of improving performance without changing hardware is to trim up the fact table. Naturally, incremental processing is useful for these very large fact tables, but everything will go faster with skinnier fact tables.

Here are a few ways that you can trim the fat out of your fact tables (there is a small T-SQL sketch after the list):


  1. Dimensionalise data. For example, if a column contains a character string such as 'Successful' or 'Unsuccessful', then put those strings into a dimension table and simply store the numeric foreign key, such as 1 or 2. If this sort of opportunity exists, it will have the biggest performance impact, and it can actually make the ETL faster, if that is where you do the dimensionalisation. The ETL can be faster because the fact table being loaded by the rdbms is thinner and requires less IO.


  2. Use appropriate and concise numeric and date-time data types. If you are storing a number, put it in a numeric column and make it as small as possible without risking overflow. For example, if you are storing a number between 1 and 10, just use a tinyint (or the equivalent in other rdbms). If you are storing an integer or decimal, put it in an integer or numeric column, not a floating point data type. Similarly for dates and times.


  3. Use VARCHAR() where appropriate. Don't use CHAR(50) where VARCHAR(50) will store the same information in less space. Also, when you are loading VARCHAR columns ensure that you are not padding the column with spaces. This can happen with some ETL tools.


  4. Avoid VARCHAR(1). For a 1 byte value you are adding 2 bytes of length overhead to record a length that you already know is 1; CHAR(1) stores the same information in a single byte.


  5. Avoid redundant data, that is, any data that can be derived (calculated) from other columns in the same record. For example, if there is a TransactionDateTime column, there is no need for additional TransactionDate, TransactionTime, TransactionYear, TransactionDayOfWeek, TransactionDateKey columns etc. All of those can be derived from TransactionDateTime using functions such as DatePart(hh,TransactionDateTime)*100+DatePart(mi,TransactionDateTime) to get the 24 hour time, and so on. I make great use of these functions to get foreign keys into date-time dimension tables that might have granularity down to 5 minute, 15 minute or 1 hour levels. Of course, I design the DateTime dimension table with a primary key that is derived from the DateTime rather than a synthetic one; I am not a great fan of synthetic primary keys when there is a natural primary key. It is then just a matter of creating a view over the fact table which adds in the derived columns. A view with these functions tends to be much faster than a SELECT from a fat table with a materialised copy of the redundant columns, simply because there is less IO and less pressure on the rdbms data cache.


  6. Most rdbms' have compression available for columns and data pages. SQL Server added a very good implementation of this in SQL Server 2008. Since very large fact tables tend to be written once and read many times, it generally pays to compress them (in SQL Server 2008, at the row and page level). With numeric data, I am experiencing about 20% - 30% compression ratios. The cpu cost is almost undetectable (even lower than the cpu cost of read locks).
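To make a few of these ideas concrete, here is a minimal T-SQL sketch. It is only an illustration: the table, column and key names (DimOutcome, FactTransaction etc.) are hypothetical and your own warehouse will differ. It shows a dimensionalised status string (point 1), concise numeric and date-time types (point 2), a view deriving the date and time keys instead of storing them (point 5), and page compression (point 6, SQL Server 2008 onwards).

-- Point 1: repeating strings become a small lookup dimension.
CREATE TABLE dbo.DimOutcome
(
    OutcomeKey  tinyint     NOT NULL PRIMARY KEY,  -- 1 = 'Successful', 2 = 'Unsuccessful'
    OutcomeName varchar(20) NOT NULL
);

-- Points 2 and 6: a skinny fact table with concise types, compressed at the page level.
CREATE TABLE dbo.FactTransaction
(
    TransactionDateTime datetime     NOT NULL,  -- the only date/time column stored
    OutcomeKey          tinyint      NOT NULL,  -- foreign key instead of the character string
    CustomerKey         int          NOT NULL,
    Amount              numeric(9,2) NOT NULL   -- exact numeric, not float
)
WITH (DATA_COMPRESSION = PAGE);

-- Point 5: derive the date and time keys in a view rather than materialising them.
CREATE VIEW dbo.vFactTransaction
AS
SELECT TransactionDateTime,
       OutcomeKey,
       CustomerKey,
       Amount,
       CONVERT(int, CONVERT(char(8), TransactionDateTime, 112)) AS TransactionDateKey,  -- yyyymmdd
       DATEPART(hh, TransactionDateTime) * 100
         + DATEPART(mi, TransactionDateTime)                    AS TransactionTimeKey   -- 24 hour hhmm
FROM dbo.FactTransaction;

The cube's data source view (or partition queries) can then read from dbo.vFactTransaction, so Analysis Services sees the derived keys while the relational table stays skinny.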

Keeping the fact tables as skinny as possible makes it much faster to process cubes with very large numbers of fact records. There are also ancillary benefits, such as a smaller database to back up, better overall rdbms performance due to reduced data cache pressure, and less network traffic between your data warehouse and cube.


Without very skinny fact tables, I would not have been able to maintain the live weblogs and Perfmon demonstrations on http://RichardLees.com.au/Sites/Demonstrations. The Perfmon rdbms is adding about 30,000 records/minute (500 inserts/second), which is only possible on my sata disk drives because the fact tables are skinny. All these inserts and the continual cube processing are happening on two desktop PCs with single sata drives.

Sunday, October 12, 2008

Windows Perfmon in a Cube

Windows Performance Monitor (Perfmon) has been an essential source of performance information for Windows servers for a long time. I remember using it on Windows NT back in 1995. It is such a useful tool, and yet it doesn't seem to have been enhanced much since that first version. I have suggested to Microsoft that it gets a revamp, so the performance data is optionally stored in an OLAP cube if there is a SQL Server database nearby. SQL Server comes with fantastic OLAP technology, which facilitates analysis of these large data volumes.

For an example of what I am talking about, go to http://RichardLees.com.au/Sites/Demonstrations and click on the Perfmon Dashboard. This will take you to a dashboard of Perfmon data from my home servers. (There are just two servers currently, but there could be any number, since machine is a dimension.) First notice how the information is very close to real time. The scorecard tells you the number of seconds between the last data capture and the system time; on my system this is currently set to be <=180 seconds. On this dashboard you will get highlight numbers for CPU consumption, IO latency, process activity, memory consumption and meta information about the Perfmon data. There is a lot more data, information and analysis available from the cube. This information can be browsed ad hoc via any OLAP cube browser, for example Excel, Excel Services, SQL Reporting Services, ProClarity, PerformancePoint, ThinSlicer and many other non-Microsoft browsers.

On this site I have made the information available via Office PerformancePoint. I am a great fan of PerformancePoint since it provides true thin-client OLAP cube browsing and easy-to-create dynamic dashboards. For example, on the Perfmon Dashboard, click on Process to see a Perfmon dashboard that aims to present performance information for Windows processes (applications). The chart at the top left shows the most recent three hours on columns. On rows there are a number of KPIs such as % Processor Time and IO Data Bytes/sec. In the value display, the chart shows the Windows process that is the highest consumer of this resource for that hour. So if you have a scarce resource, such as memory, you can immediately see which processes are the top consumers. If there is a KPI you want to know more about, simply click on the KPI name, for example, click on IO Data Bytes/sec. Notice how the charts on the right hand side have now adapted to the KPI in focus.
There are several other tabs on this dashboard, and there are other dashboards (SQL Server, for example) based on the same Perfmon cube, which you can browse from the Internet.
Since the data is real and close to real-time, I often use it to demonstrate the value of OLAP and PerformancePoint to colleagues and customers. My wife and children frequently use one of the monitored servers to do work and homework. That makes it even more interesting, as during the demonstration we can see what work, or which game, is being played on the server. (Fifa World Cup Soccer has been a top resource consumer of late.)
Cubes really make presenting and browsing this sort of performance information much, much easier. I am sure that one day Microsoft will automatically (optionally) put this Perfmon data in a cube for us. In the meantime, it is not too difficult, and certainly worthwhile, for us to do that for ourselves.
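To give a flavour of what "doing it ourselves" involves, here is a minimal, purely illustrative T-SQL sketch of the kind of narrow schema that Perfmon counter samples can be landed in before the cube is incrementally processed. The names (FactCounterSample, DimMachine, DimCounter) are hypothetical and not necessarily how my own demonstration is built; the counters themselves can be captured with a Perfmon/logman collection and shipped in by SSIS or a small script.

-- Illustrative only: a narrow fact table of counter samples, keyed to machine and counter dimensions.
CREATE TABLE dbo.DimMachine
(
    MachineKey  smallint    NOT NULL PRIMARY KEY,
    MachineName varchar(64) NOT NULL
);

CREATE TABLE dbo.DimCounter
(
    CounterKey   smallint    NOT NULL PRIMARY KEY,
    ObjectName   varchar(64) NOT NULL,  -- e.g. 'Processor'
    CounterName  varchar(64) NOT NULL,  -- e.g. '% Processor Time'
    InstanceName varchar(64) NULL       -- e.g. '_Total'
);

CREATE TABLE dbo.FactCounterSample
(
    SampleDateTime datetime      NOT NULL,
    MachineKey     smallint      NOT NULL,
    CounterKey     smallint      NOT NULL,
    CounterValue   numeric(18,4) NOT NULL
);

A cube over a fact table like this, with machine, counter and date-time dimensions, is all that a dashboard like the one above needs.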

Wednesday, October 8, 2008

Digital Dashboards on Mobile Devices

I have surprised many people by showing them live digital dashboards on mobile devices, built simply with SQL Server Reporting Services (naturally with an OLAP cube underpinning). See http://RichardLees.com.au/Pages/m.aspx for a live demonstration. The dashboard isn't that attractive, but nothing is on a mobile device. The point is that dashboards can be easily delivered to mobile devices and they can be close to real time. The picture on the right is a dashboard that I have just surfed to; it contains information as of 60 seconds ago, that is, up to and including 13:55 today. It tells me the processor utilisation and disk performance metrics for my two home servers. Furthermore, any of the metric headings can be clicked on for more detailed information on that metric.

Note: the real challenge with mobile devices is security. While an Internet page can be restricted with challenge/response authentication, this is not so simple with mobile devices. If anyone knows of a simple workaround for mobile security, I would like to hear about it.

For more live Business Intelligence demonstrations, please see http://RichardLees.com.au/Sites/Demonstrations

Tuesday, October 7, 2008

TechEd SSAS 2008 Performance

For those of you who came to my TechEd presentation on SQL Server 2008 Analysis Services Performance, the PowerPoint slide deck can be found at http://RichardLees.com.au/Pages/Downloads.aspx. Naturally, most of the demonstrations were from http://RichardLees.com.au/Sites/Demonstrations

Sunday, October 5, 2008

Can SQL Server SSIS and AS process 500 records/second?

Absolutely yes.
See http://RichardLees.com.au/sites/Demonstrations/Shared%20Documents/Windows%20Performance%20Monitor/Perfmon.aspx This site is processing over 500 records/second, every second of the day. In fact, 500 records/second is small fry and can be achieved with a couple of desktop PCs. Here is a jpg of the current view. If you can read the image, it says that the current server time is Oct 5, 19:19, and the cube is only 19 seconds behind. The top left scorecard tells us that 31,308 records were processed in the last 60 seconds, and the rest of the scorecard provides indicators for the activity in those 60 seconds (Actual). The Target numbers are the average figures for the last hour. The charts on the right provide trends for the last 60 minutes. Note how the last point on the x axis is 19, that is 19:19. Some of the charts are interactive, so you can drill down and around the cube in an ad hoc fashion.

Also see the SQL Server dashboard, which is based on the same cube and offers more interactive scorecards and charts. For example, when you focus on any of the KPIs, the chart at top right will dynamically change to display trend lines for that KPI.

This PerformancePoint dashboard is giving us close to real time information on a database that has over 500 inserts/second. Interestingly, the dashboard is telling us the performance activity of the SQL Server, SSIS and SSAS processes themselves. See how the machines (they aren't real servers, just our two home PCs) are averaging under 10% cpu.

Go to http://RichardLees.com.au/Sites/Demonstrations to see this and other demonstrations live.


Saturday, October 4, 2008

Live Excel Services Demonstration

Excel Services is a Rich Way of Delivering Thin Client BI

I am a great fan of thin client delivery tools for business intelligence (BI). This is the way we can deliver BI tools to the masses and make a significant impact on organisations and the way people make decisions. Excel Services (a component of Windows SharePoint) is one of these thin client delivery vehicles. The client requires no client side installs, just IE or a compatible browser. The Excel spreadsheet is partially functional, and what I like to do is connect it to an OLAP cube that is continually updating. All the Excel functionality and the OLAP query happen on the server, with just the rendering sent down to the client.

Here is a simple demonstration that you can try out: http://RichardLees.com.au/sites/Demonstrations/Pages/NZCensus_Excel.aspx It is Excel Services with an OLAP cube of an old New Zealand census. Notice how you see a page that looks like Excel, with a table and chart displaying data coming from an OLAP cube. What is interesting is that you can now interact with the OLAP cube. Try clicking on the dropdown button next to Freedom (that's Kiwi for Marital Status). You will be able to select one or more Freedoms, which will result in an updated table and chart. You can try combinations of the filters, for example, married people in the Tasman region who are employed. Notice how fast it is to render the new chart, even though this involves a full Internet round trip with an OLAP query and Excel rendering happening on the server.

There are a couple of other Excel Services demonstrations on the site (see http://RichardLees.com.au/sites/Demonstrations/Pages/ExcelServices.aspx), which I find more interesting because they are connected to cubes that are continually updating. These are the Web and Perfmon cubes. If you are interested in real time BI, you should try out the Perfmon Excel Services demonstration. That cube is never more than 180 seconds behind real time, with over 30,000 new facts every 60 seconds.

One of the reasons why some Microsoft customers really like Excel Services is that once the OLAP cube has been built (and scheduled for continual updating, as in this demonstration), someone with just Excel skills can be responsible for creating and maintaining the "reports", or Excel worksheets, since all Excel Services does is display Excel spreadsheets via a thin client. The Excel author only needs to work normally in Excel and then, rather than saving, publish to PerformancePoint. Quite simple really.
For more real-time and live BI demonstrations go to http://RichardLees.com.au/sites/Demonstrations

Friday, October 3, 2008

An Alternative to Sydney Commuting Traffic

Typically I ride my motorbike to client sites in the Sydney region. The motorbike is fantastic on Sydney roads, as there are many bus and transit lanes that bikes can use. Currently I am working across Sydney harbour at Double Bay. It takes about 1 hour 15 minutes to drive a car from Manly to Double Bay at peak hour, or about 45 minutes on the motorbike. However, I have found another way to beat the Sydney traffic - kayak. The kayak takes about 60 minutes, depending on tide and wind. Since I often go out for a kayak at Manly Beach before work anyway, it seemed logical to simply get out of bed later and kayak across the harbour.

I think I am going to enjoy commuting this summer.

An Alternative to a physical Distinct Count Measure

Many SQL Server OLAP developers learn the hard way that a DistinctCount measure can be expensive on performance - in query time, processing time and cube size. A Distinct Count measure might be appropriate for your cube, but you should be aware that every cell and every aggregation will need to contain all the distinct values at that intersection. That's why a physical DistinctCount measure can be so expensive. It's also why there is a lot of literature on how to optimise the performance of a Distinct Count measure.

That is not to say that you shouldn't use DistinctCount; it can work well and be extremely useful. However, you should be aware that there is an alternative, which has its own pros and cons. The alternative is to create a dimension on the key (if you haven't got one already) and then simply create a Calculated Member that references the dimension, i.e.

count(nonempty([DimName].[HierarchyName].[LevelName].members,[Measures].[MyMeasure]))

The advantage of this alternative is that the cost of the distinct count is only incurred at query time; cube processing and other queries will not be impacted by the calculated member. This is unlike a physical Distinct Count measure, which is why we often put a Distinct Count measure in its own measure group. You can see a demonstration of the calculated member version of distinct count at http://RichardLees.com.au/sites/Demonstrations/Pages/TS_IISLog.aspx The Distinct Resources and Distinct ClientIPs measures are both calculated members and perform very well.

Some time ago, when I was helping a developer with a performance problem and suggested this approach, the developer informed me that my alternative would never work on his cube because there were over 10 million distinct values and the cost of building a dimension that big would be prohibitive. That showed a lack of understanding of Distinct Count: his cube with a physical Distinct Count measure would hold the 10 million values (and subsets of the 10 million) at every cell and aggregation, so the 10 million values would be repeated throughout the cube, whereas the dimension holds each value only once.
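For what it is worth, building such a key dimension in the relational database is typically a single scan of the fact table, even with millions of distinct values. Here is a minimal T-SQL sketch, with hypothetical table and column names (FactWebLog, DimClientIP) standing in for whatever your warehouse actually uses:

-- Hypothetical names; one way to build the key dimension that the calculated member references.
CREATE TABLE dbo.DimClientIP
(
    ClientIP varchar(15) NOT NULL PRIMARY KEY
);

INSERT INTO dbo.DimClientIP (ClientIP)
SELECT DISTINCT ClientIP
FROM dbo.FactWebLog;

Compared with repeating those values at every cell and aggregation of a physical Distinct Count measure, the one-off cost of populating the dimension is modest.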

By the way, I am always telling Analysis Services cube designers that they should hone their MDX skills. The better your MDX skills, the better the cubes you will design. This is very similar to a relational database: you wouldn't expect someone to design a high performing relational database if they weren't a master of SQL.

Thursday, October 2, 2008

Web Analytics using SQL Server and PerformancePoint

The default page for my live demonstrations, http://RichardLees.com.au/Sites/Demonstrations, is currently showing a PerformancePoint dashboard of the web statistics of the site. SQL Server Integration Services is regularly (approximately every 15 minutes) incrementally loading new web log records into SQL Server and incrementally processing the OLAP cube. The PerformancePoint dashboard is continuously available and will show the latest web analytics information, 0-15 minutes behind real time.




For example, this is the scorecard showing key information for the last 3 days. The image shows web data up to and including the 12th hour of October 2nd. Note that this is Sydney, Australia time, so if you are looking from Europe or America it will appear to be in the future. It isn't the future; it is just the current time in Sydney.


The key metrics shown include the number of web hits, average response time for .aspx resources, error rate, number of distinct client sessions, MB transferred and cpu utilisation on my two machines.

There are several other charts and tables on the dashboard. One of my favourites is the table of Cities by Hour for the last 3 days (bottom left), which shows the top city for each hour over the last 3 days. For example, on this day New York generated the most activity to my site from 9AM to 12AM, with Bologna, Dublin, Sacramento, Toronto and others featuring earlier in the day.


Another one of my favourite charts on this dashboard is the Top Referrers Last 3 Months. This table shows an ordered list of the sites that refer the most traffic to my site.


Many of the charts and tables are interactive, so you can right click on numbers, headings, columns, lines etc. to drill down, drill up, drill across, drill through, change chart type, pivot and so on. Essentially, when the chart or table is interactive, you can perform ad hoc queries as you would in any thin client cube browser. Try it out at http://RichardLees.com.au/Sites/Demonstrations. Every few minutes you will also see that the data has incrementally changed.


The purpose behind this demonstration is not to show you the web analytics of my site (although I find that useful); it is to show you how powerful SQL Server and PerformancePoint are in creating a dynamic and useful business intelligence solution. The only software employed on this site is Microsoft SQL Server, Microsoft PerformancePoint and Microsoft SharePoint.