Some things I learnt about SQL (programming)
written by owen on 2015-Jun-16.
I am gonna try my best to impart some of the knowledge I have learnt while using and experimenting with SQL over the years. SQL is a way to store and extract data from an RDBMS. SQL is rigid because it has a specific purpose: it exists to store and output rows of information as quickly as possible - that is it. It does not care what you are doing with the information or how you are using it. All it cares about are keys, indexes, rows and columns. Most relational database management systems (RDBMS) actually care more about rows than they do about columns, but that is a whole other story.
I teach by example whenever I can. So I am gonna drop some SQL on a particular problem, but it should be pretty easy to follow regardless of the DBMS that you use. I am gonna incrementally change the query in the hopes of making it faster as I go along, and I will avoid going into proprietary tricks.
Any SQL query you write should be written with the assumption that one day you will have a billion records and it will take 10 days to show just one record. Just because a query is fast now does not mean it will be fast forever. Eliminating slowdowns early is the best way to ensure that your queries are as efficient as they can be. Throwing hardware at a slow SQL query/database is money down the drain.
Let us begin
For the purpose of this article I am going to aggregate/sum a table of payments. The example table "PAYMTBL" has 1 million records in it with 20 columns, and stores payment records spanning several years. I needed a bar chart for a presentation, so I wrote a query to SUM() the PAYMTBL_AMOUNT, convert the PAYMTBL_TRAN_DATE to a MON-YYYY string, and COUNT(1) the number of transactions in the month of Jan-2015:
/* Query v1 A */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YY'), COUNT(1)
FROM PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
ORDER BY 2
Now we could stop here if you wish, but a skilled SQL writer wants to write the fastest, most accurate SQL possible. Ideally the fastest query is no query at all, but since we need the information, we have to write the query to get it. Technically you could run the query overnight and cache the results to display at a later point, but caching is for babies and is a sure sign of weakness somewhere in the system.
So we push forward. The thing to note about SQL is that anything you do besides SELECT 1 FROM TABLE will make your SQL slow down. Any function you call, any GROUP BY or ORDER BY, slows the query down. Physics: bus speed, light speed, RAM speed, disk IO, stuff like that. You want the SQL to do as little work as possible. The GROUP BY, TO_CHAR, SUM and COUNT(1) are all giving the SQL engine work to do. You could in theory just do a:
/* Query v1 B */
SELECT * FROM PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
But the above code would return 100 thousand rows (with 20 columns each) that I would have to process in my application code. Do not forget that the data also takes up RAM on the computer; 100k rows could easily be 30 MB of RAM depending on the size of each record. Yes, I would get the rows quickly, but I would never be able to process and group them faster than the RDBMS/SQL engine. WHY couldn't I do it faster? The answer is simple: the SQL database is a product of years of research and testing.
Moving on: seeing that I absolutely need the GROUP BY, TO_CHAR, SUM and COUNT(1) features of the SQL database, what else could I optimize? I decided to take the PAYMTBL table out of the main query. WHY? Because any functions you have in the main query will cause the SQL engine to do work on all the rows in the tables at the same level. The fewer rows it has to check, the faster the query (indexes and other things help too, but fewer rows is what you really need).
The below query v2 runs faster:
/* Query v2 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YY'), COUNT(1)
FROM (
select * from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
) X
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
Why does this query run faster? It is simple really: this query is doing less work than the previous one. It all comes back to how many rows the SQL engine has to "touch" with the GROUP BY, TO_CHAR, SUM and COUNT(1). Let's say that only 100k out of the 1 million rows match the condition PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'. By picking up the 100k rows first, we save the SQL engine from calling the GROUP BY and TO_CHAR() on 1 million rows. This is why the second query is faster. It is faster because it is more specific. The more specific you make your SQL, the faster it will be.
We could stop there, but no, we must go on, because today we have 1 million rows and next year we will have 2 million. Or do you not care for efficiency? If you save your users 10 minutes every day, you save them 2 days in a year. Your CSS3 animations are wasting my time.
/* Query v3 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YY'), COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE from PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
) X
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
In query v3 we replace the SELECT * with only the columns we need. The SQL engine is also affected by the number of columns returned. Since our PAYMTBL table has 20 columns, we do not need to bring back all 20 of them; 20 columns times 100k rows is a waste of time and memory space. Remember physics: time and space. Software is affected by time and space. So specifying that we need only PAYMTBL_AMOUNT and PAYMTBL_TRAN_DATE actually makes the query run faster. It is not magic if you think about it.
/* Query v4 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YYYY'), COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE from PAYMTBL
WHERE 1=1 AND TRUNC(PAYMTBL_TRAN_DATE) BETWEEN '01-JAN-2015' AND '31-JAN-2015'
) X
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY')
In query v4 I decided to try adding TRUNC(PAYMTBL_TRAN_DATE) to the inner query. The query slows down. WHY? You can look up "full table scan". It is often best not to modify the columns in the WHERE clause, because changing/converting them makes it harder for the SQL engine to be sure of what the possible values are supposed to be. This goes back to proper DB design for reporting.
Because of the specificity of SQL, the engine is able to do certain tricks that are harder to accomplish when you mess around with the data that it is using. It has something to do with determinism and predictability. If you are looking for a date which is 01-Jan-2014, the SQL/DB engine can automagically assume that the set of records before 2014 is unnecessary, BUT if you convert a date column to a string, the SQL/DB engine will have to use string logic, which does not have the same level of rigidity. Hence you get a slower query and possibly full table scans.
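To make the idea concrete, here is a sketch of the two styles side by side (assuming an Oracle-style RDBMS and an index on PAYMTBL_TRAN_DATE; the specific dates are made up for illustration):

```sql
/* Slow: wrapping the column in TRUNC() hides it from any index
   on PAYMTBL_TRAN_DATE, often forcing a full table scan */
SELECT PAYMTBL_AMOUNT FROM PAYMTBL
WHERE TRUNC(PAYMTBL_TRAN_DATE) = TO_DATE('15-JAN-2015', 'DD-MON-YYYY');

/* Faster: same rows, but the column is left untouched,
   so the engine can still use the index to narrow the range */
SELECT PAYMTBL_AMOUNT FROM PAYMTBL
WHERE PAYMTBL_TRAN_DATE >= TO_DATE('15-JAN-2015', 'DD-MON-YYYY')
AND PAYMTBL_TRAN_DATE < TO_DATE('16-JAN-2015', 'DD-MON-YYYY');
```

Both queries ask for the same day of transactions; only the second one leaves the column alone so the engine can stay deterministic about the range.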
So I scrapped that idea and moved on:
/* Query v5 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD,
COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YYYY') as "MD" from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
) X
GROUP BY MD
In query v5 I realised that I can move the TO_CHAR function into the subquery. This is an example of a pre-emptive optimization that you can do before you run your queries. I could even have added a column with a MON-YYYY value so that I would not have to compute it when I run the query. After testing I realised that I ended up with a faster query. That is another thing with SQL: YOU NEED TO SPEND TIME TESTING on your target RDBMS. You may get different speeds from MySQL than you get from Oracle, MSSQL or Postgres, because all these databases optimize in different ways and try to cheat whenever they can. But one thing is sure: you cannot cheat physics.
Anyway, why is query v5 faster? It is faster because the GROUP BY has to do extra work when it is calling TO_CHAR(); by moving the TO_CHAR() to the inner query we prepare the data beforehand, so all the GROUP BY has to do is group, sum and count, and nothing else.
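The "add a column" idea mentioned above can be sketched like this (a hypothetical example: PAYMTBL_TRAN_MONTH is a made-up column name, and the syntax assumes an Oracle-style RDBMS):

```sql
/* Hypothetical pre-computed column: filled once, so reporting
   queries never have to call TO_CHAR() per row at run time */
ALTER TABLE PAYMTBL ADD PAYMTBL_TRAN_MONTH VARCHAR2(8);
UPDATE PAYMTBL
SET PAYMTBL_TRAN_MONTH = TO_CHAR(PAYMTBL_TRAN_DATE, 'MON-YYYY');

/* The report then groups on the stored string directly */
SELECT SUM(PAYMTBL_AMOUNT), PAYMTBL_TRAN_MONTH, COUNT(1)
FROM PAYMTBL
WHERE PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
GROUP BY PAYMTBL_TRAN_MONTH;
```

The trade-off is that your inserts and updates now have to keep the extra column in sync with the date.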
There is also something else to note: TO_CHAR() in a basic SELECT query is faster than a TO_CHAR() in a SELECT with a GROUP BY, and TO_CHAR() in a WHERE clause is even faster than both. So the part of the query in which you are calling your function is also important. The WHERE clause is often evaluated before the database starts looking for data, while the SELECT and the GROUP BY are usually operating on data while it is being gathered. Understanding what is happening in the SQL engine is critical to writing big complicated queries. Some RDBMSs have an EXPLAIN feature that can tell you what the query is doing.
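On Oracle, for example, checking the plan for the v5 query looks roughly like this (MySQL and Postgres simply prefix the query with EXPLAIN instead):

```sql
/* Ask Oracle to record the execution plan for the query */
EXPLAIN PLAN FOR
SELECT SUM(PAYMTBL_AMOUNT), MD, COUNT(1)
FROM (
  SELECT PAYMTBL_AMOUNT,
         TO_CHAR(PAYMTBL_TRAN_DATE, 'MON-YYYY') AS "MD"
  FROM PAYMTBL
  WHERE PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
) X
GROUP BY MD;

/* Then print it - look for INDEX RANGE SCAN vs TABLE ACCESS FULL */
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the plan says TABLE ACCESS FULL on PAYMTBL when you expected an index, that is your cue to go back and look at what you did to the columns in the WHERE clause.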
/* Query v6 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD, COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YYYY') as "MD" from PAYMTBL
WHERE 1=1 AND PAYMTBL_TRAN_DATE BETWEEN TO_DATE('01-JAN-2015') AND TO_DATE('31-JAN-2015')
) X
GROUP BY MD
This resulted in a query that was 3x slower. I am not sure why, but this might be an example of me doing something that the database I am using does not particularly like. It is even worse if you try TO_DATE(PAYMTBL_TRAN_DATE) - like 47 times worse: a full table scan converting all the values in the column. You will learn to avoid certain things the more you experiment. It is often not a bug in the RDBMS but a feature. It is hard to explain; the RDBMS does some things well by sacrificing other things.
/* Query v7 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD, COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YYYY') as "MD" from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE >= '01-JAN-2015' AND PAYMTBL_TRAN_DATE <= '31-JAN-2015'
) X
GROUP BY MD
This resulted in a query that was 2x slower. The less-than/greater-than signs are slower because they create a problem of mutual exclusivity: the database has to do more work compared to doing a BETWEEN. A BETWEEN gives the SQL engine a specific hint about what you want while at the same time limiting the range it needs to check. Using these specific conditions makes it easier for the SQL engine to know what you want, resulting in a faster query.
Final Query
In the final query I added an ORDER BY to keep my years in order, and I removed the PAYMTBL_TRAN_DATE column because I really did not need to see it. I also changed the date range to encompass the period I needed for my bar chart.
/* Query v8 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD, COUNT(1)
FROM (
select PAYMTBL_AMOUNT,
TO_CHAR( PAYMTBL_TRAN_DATE, 'MON-YYYY') as "MD" from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN TO_DATE('01-JAN-2014') AND TO_DATE('01-FEB-2015')
) X
GROUP BY MD
ORDER BY 2 DESC
Conclusion
A SQL query is basically a small program that you write to tell the database what you want. The more specific you make it, the more efficient it will be. SQL is not object oriented; in fact it can be confusing for OOP programmers. The database wants to keep its structure constant so that it can be fast. OOP, on the other hand, is more concerned with modeling, flexibility and abstraction; speed is not a concern when you are shifting classes around in a pool. SQL is not returning objects - it is returning arrays or sets of data, NOT a graph. You can imagine a database as a skyscraper with floors while objects are more like a tree with leaves - two different things, with different trade-offs.
SQL is not particularly hard to learn. It has a very limited syntax, but it is very useful because what it does is so domain specific. You will learn about SQL by using it a lot and gaining experience in effective ways of overcoming various challenges. Some databases offer different features and have advantages over others. Column databases, for example, are faster than row databases at processing columns of data, and it is near impossible to make a row database faster than a column database at that job (I have tried; physics will not have it).
P.S. Things to note (caveats)
- Your database is not a cache for your object model. Do not let application logic slip into your database design.
- The more specific you make your SQL, the faster it will be.
- Some databases, depending on how old they are, won't allow certain optimizations like sub-queries.
- Avoid doing things that the RDBMS does not like. Often you will only find this out in testing.
- Pre-emptive data optimization is a good thing. If you need to report a piece of data in a certain format, it is often better to create a dedicated column for that format instead of generating it at run time. For example, TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY') could be a static column.
- The people who code the SQL database are better programmers than you and are focused on doing a specific thing. Plus they know more about the data than you do; they specialize in storing and retrieving it. To beat them you would have to load and parse all the rows yourself, and if you are willing to do that you might as well store it yourself and make your own database (which is sometimes better, as in the case of files, pictures and PDFs). Do not put big files in the database; it will be a pain for everybody to manage, and the benefits are most times short term.
- When using SQL you should always know how big your table is, how big it is gonna get and how many columns you have. This is key because SQL operates in a confined space; the more data you have, the slower your query will run - it's PHYSICS. Physics rules us all and we must respect it. So do not ignore the fact that you might run out of storage or that your SQL will get slow over time. If you are adding 10 records a day, and each record is 10 KB, you should know how big your database is gonna be in 2 years and have a plan to deal with it.
- The database exists to store and extract data. Hence whenever you design a database table you should have in the back of your mind how efficiently the data will be extracted and the nature of the reports that will be generated from said data. You cannot design a database in isolation without thinking about what reports will be needed from the data design. Over-normalization is a waste of time if all your reports are slow and take 2 days to run.
- The example query is limited to a month so that I could quickly test the effect that small SQL changes have on the data that I am testing. Your own testing of large vs small data sets might yield different results, but in most cases a bad query on a few rows will be exponentially worse on 1 million rows - NEVER BETTER.
- The WHERE condition PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015' actually means '01-JAN-2015 12:00 AM' to '31-JAN-2015 12:00 AM'. This means that I won't get any data for the day of 31-JAN-2015 itself (except rows stamped exactly at midnight). A more accurate range would be PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND TO_DATE('31-JAN-2015' || ' 23:59:59','DD-MON-YYYY HH24:MI:SS'). The fastest range would be BETWEEN '01-JAN-2015' AND '01-FEB-2015', but this method has user interface complications which you will have to figure out on your own.
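The "more accurate range" from the last caveat, written out as a full query (a sketch assuming an Oracle-style RDBMS, with explicit format masks so it does not depend on session date settings):

```sql
/* BETWEEN is inclusive, so pushing the upper bound to the last
   second of 31-JAN-2015 picks up the whole final day */
SELECT SUM(PAYMTBL_AMOUNT), COUNT(1)
FROM PAYMTBL
WHERE PAYMTBL_TRAN_DATE BETWEEN TO_DATE('01-JAN-2015', 'DD-MON-YYYY')
AND TO_DATE('31-JAN-2015 23:59:59', 'DD-MON-YYYY HH24:MI:SS');
```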
Addendum
Inevitably some queries have to be optimized for speed over flexibility. You may need a separate query by end of month aggregation versus day-to-day querying. You might not be able to have your cake and eat it too.
comments
Cool article Owen, very informative. Some people like myself use trunc(PAYMTBL_TRAN_DATE) BETWEEN '01-JAN-2015' AND '31-JAN-2015' or to_char on the date in the where clause. I actually didn't realise it made the query slower. You learn something new every day. Thumbs up!
by Colleen H 2015-Jun-18
TRUNC is not all bad; it does have a purpose when used properly. It takes the time part off the date, making the comparison a whole date without the time getting mixed in. Without TRUNC you would have to use '31-JAN-2015 23:59:59' to get the last hours included in the results. So don't go changing stuff that works without knowing the side-effects and testing. If it ain't broke.....
by owen 2015-Jun-18