When you’re querying 10,000 rows of data you can be sloppy. It doesn’t actually matter how inefficiently you write your queries, they’ll run in a reasonable amount of time and you’ll extract the insight you needed. That’s because 10,000 rows is tiny and you don’t need much compute power to get those numbers crunched.
However, when you start querying 1 billion rows, things start to get interesting. Your SELECT * statement is a big no-no when you’ve got 100 columns and 1 billion rows – you need to think smart. You need to really streamline your queries to ensure reasonable execution time & resource usage.
Extract only what’s required
Firstly and most simply – reduce the number of columns in your queries & utilize where statements wherever possible. As a very simple example, let’s say that you have a 4 column table called ‘WebStats’, with the columns: Date, TotalPageviews, TotalSessions and BounceRate.
To extract data for reporting to show daily pageviews over the last 10 days, you could write a statement along the lines of SELECT * FROM WebStats – you’d get everything you needed. You could pull it into your chosen visualization tool & draw lots of pretty charts.
However, you’re only looking at pageviews, so you could write a statement like SELECT Date, TotalPageviews FROM WebStats – this would be far more refined, as you’re not pulling back the TotalSessions & BounceRate fields unnecessarily.
But still, this statement extracts all dates – we can refine it further to pull back ONLY the dates in question: SELECT Date, TotalPageviews FROM WebStats WHERE Date BETWEEN XXXXXX AND XXXXX. This keeps the result set as small as possible and ensures the query runs nice and quickly.
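As a rough sketch of the difference, here’s the same WebStats table in an in-memory SQLite database. The table name and columns come from the example above; the dates and numbers are invented purely for illustration.

```python
import sqlite3

# Illustrative in-memory WebStats table (schema from the article; data invented)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE WebStats (Date TEXT, TotalPageviews INTEGER, "
    "TotalSessions INTEGER, BounceRate REAL)"
)
rows = [(f"2024-01-{d:02d}", 1000 + d, 800 + d, 0.4) for d in range(1, 31)]
conn.executemany("INSERT INTO WebStats VALUES (?, ?, ?, ?)", rows)

# Wasteful: every column, every row
all_rows = conn.execute("SELECT * FROM WebStats").fetchall()

# Refined: only the columns and the dates the report actually needs
daily = conn.execute(
    "SELECT Date, TotalPageviews FROM WebStats "
    "WHERE Date BETWEEN ? AND ?",
    ("2024-01-01", "2024-01-10"),
).fetchall()

print(len(all_rows))  # 30 rows x 4 columns
print(len(daily))     # 10 rows x 2 columns
```

On a 30-row toy table the difference is academic, but at a billion rows the refined query is moving a fraction of the data.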
Enrich your data

For those of you who don’t already know, Techopedia defines data enrichment as ‘processes used to enhance, refine or otherwise improve raw data’.
So, why would you want to enrich data? Well, running large join operations or on-the-fly calculations is incredibly inefficient and will result in your queries taking an age to execute. The preferred method is to enrich the data during ETL or shortly after. This way, you’re just querying data that’s already in a table rather than calculating & creating new fields during query execution.
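To make that concrete, here’s a minimal sketch of enriching during ETL, again using SQLite. The RawStats table and the PageviewsPerSession metric are hypothetical examples of my own, not something from the article – the point is simply that the calculation happens once at load time, not on every report query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw table as it might arrive from the source system (data invented)
conn.execute(
    "CREATE TABLE RawStats (Date TEXT, TotalPageviews INTEGER, TotalSessions INTEGER)"
)
conn.executemany(
    "INSERT INTO RawStats VALUES (?, ?, ?)",
    [("2024-01-01", 1200, 400), ("2024-01-02", 900, 300)],
)

# Enrich during ETL: compute PageviewsPerSession once, at load time,
# so downstream reports never pay for the calculation at query time
conn.execute(
    "CREATE TABLE WebStats AS "
    "SELECT Date, TotalPageviews, TotalSessions, "
    "       CAST(TotalPageviews AS REAL) / TotalSessions AS PageviewsPerSession "
    "FROM RawStats"
)

# Report queries now read a plain column -- no on-the-fly math
result = conn.execute(
    "SELECT Date, PageviewsPerSession FROM WebStats ORDER BY Date"
).fetchall()
print(result)  # [('2024-01-01', 3.0), ('2024-01-02', 3.0)]
```

The report query is now a straight column read, which is exactly the shape you want when there are a billion rows behind it.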
By doing this, you’ll speed up the time to execute reports, which will enable you to provide insight to the business in a much more timely manner – and that’s what reports are for, right?
So is that it?
Well no, there are loads of other ways to optimize your queries, but these are the low-hanging fruit – the things we can implement to see rapid gains. Other changes may result in lots of marginal gains – still worth having, but let’s get these big ticket items out of the way first!
Ultimately it comes down to a couple of key things that you can easily influence:
- Data preparation is incredibly important
- Planning queries & the fields you’ll need is vital
If you’re able to plan your queries so that the minimum number of fields are used, and you’re able to do all the heavy lifting during ETL rather than during query execution, you’ll be golden.
If you’re using a tool such as Tableau or QlikView, it may also be worthwhile assessing the overheads incurred by calculating, aggregating and joining within the respective applications – it may well be that running two separate queries and joining them in Tableau is faster or more efficient than joining them in a single SQL query. Probably not, but maybe….