written by owen, published 2007-Jan-01, comment
I've been playing around with the site again, minor changes still. You will notice that there is more ajax all over the place - not the soap. If it gets annoying just tell me.
Smilies
You will also notice that now you can add smilies by clicking the smilie face beside the comment box. Madbull yanked that one out of me. They have nice little descriptions so that you know what each smilie means. When you click on one the box disappears - tell me if this gets annoying.
Profiles
Your last 5 comments show up on your profile now.
Forum
The Forum is all ajax-y as well. It might even be easier to read the site in the forum, try it.
post a comment or feature request 8
written by owen, published 2006-Sep-24, comment
I am gonna try my best to impart some of the knowledge I have learnt while using and experimenting with SQL over the years. SQL a way to store and extra data from a RDMS. SQL is rigid because it has a specific purpose. SQL exists to store and output rows of information as quickly as possible - thats it. It does not care what you are doing with the information or how you are using it. All it cares about are keys, indexs, rows and columns. Most relational databases management systems (DBMS) actually care more about rows than it does about columns but that is a whole other story.
I teach by example whenever I can. So I am gonna drop some T-SQL on a particular problem but it should be pretty easy to follow regardless of the DBMS that you use. I am gonna change the query little by little in the hopes of making it faster as I go along. Any SQL query you write should be written with the assumption that one day you will have a billion records and it will take 2 days to run. Because a query you write now is fast but will not be fast forever. So elimating slow downs early is the best way to ensure that your queries are as efficient as they can be.
Lets begin
For the purpose of this article I am going to aggregate a table of payments. The example table "PAYMTBL" has 1 million records in it with 20 columns. The table stores payment records over a period of several years. I needed a bar chart for a presentation so I wrote a query to SUM() the PAYMTBL_AMOUNT and convert the PAYMTBL_TRAN_DATE to a MON-YYYY string, and I do a Count(1) to cound the number of transactions in the month of Jan-2015;
/* query v1 A */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY'), COUNT(1)
FROM PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
ORDER BY 2
Now you could stop here if you wish but a skilled SQL writer wants to write the fastest most accurate SQL possible. Ideally the fastest query is no query at all. But since we need the information we need to write the query to get it. Technically you could run the query over night and cache the results to display at a later point but caching is for babies and is a sure sign of weakness.
So we push forward. The thing to note about in SQL is that anything you do besides SELECT 1 from TABLE will make your sql slow down. Any function you call, any grouping or order by slows the query down. Physics, bus speed, ram speed, disk IO, stuff like that. You want the SQL to do as little work as possible. The GROUP BY, TO_CHAR, SUM, COUNT(1) are all giving the SQL engine work to do. You could in theory just do a;
/* query v1 B */
SELECT * FROM PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
But the above code would return a 100 thousand rows (with 20 columns each) that I would have to process in my application code. Do not forget that the data also take up RAM on the computer; 100k could easily be 30 mb of RAM depending on the size of each record. Yes, I would get the rows quickly but I would never be able to process and group them faster than the RDBMS/SQL engine. WHY couldn't I do it faster? The answer is simple. The SQL database is a product of years of research and testing.
Moving on, seeing that I absolutely need the GROUP BY, TO_CHAR, SUM, COUNT(1) features of the SQL database. What else could I optimize? I decided to take the PAYMTBL table out of the main query. WHY? because any functions you have in the main query will cause the SQL engine to do work on all the rows that are in the tables that are on the same level. The less rows that it has to check the faster the query (indexs, and other things help too but less rows is what you really need).
The below query v2 runs faster;
/* query v2 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY'), COUNT(1)
FROM (
select * from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
)
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
Why does this query run faster? It is simple really, this query is doing less work than the previous. It all comes back to the how many rows the SQL engine has to "touch" with the GROUP BY, TO_CHAR, SUM, COUNT(1). Lets say that only 100k out of the 1 m rows match the condition "WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2014' AND '31-JAN-2015' ". By picking up 100k rows first we save the SQL engine from calling the GROUP_BY TO_CHAR() on 1million rows. This is why this second query is faster. Its faster because it is more specific. The more specific you make your SQL the faster it will be.
We could stop there but no, we must go on because today we have a 1 million rows, next year we will have 2 million. Unless you do not care for efficiency?
/* query v3 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY'), COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE from PAYMTBL
WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2014' AND '31-JAN-2015'
)
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YY')
In query v3 we replace the SELECT * with only the columns we need. The SQL engine is also affected by the amount of columns returned. So since our PAYMTBL table has 20 columns we do not need to bring back all 20 of them. 20 columns times 100k rows is a waste of time and memory space. Remember physics, time and space. Software is affected by time and space despite. So specifying that we need only PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE actually makes the query run faster. It is not magic if you think about it.
/* query v4 */
SELECT
SUM(PAYMTBL_AMOUNT),
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY'), COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE from PAYMTBL
WHERE 1=1 AND TRUNC(PAYMTBL_TRAN_DATE) BETWEEN '01-JAN-2015' AND '31-JAN-2015' )
GROUP BY TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY')
In query v4 I decided to try adding TRUNC(PAYMTBL_TRAN_DATE) to the inner query. The query slows down. WHY? You can look up full table scan. It is often best not to modify the columns in the WHERE clause because changing/converting them makes it harder for the SQL engine to be sure of what the possible values are suppose to be. This goes back to proper DB design for reporting.
Because of the specificness of SQL the engine is able to do certain tricks that are harder to accomplish when you mess around with the data that it is using. It has something to do with Determinism and Predictability. If you are looking for a date which is 01-Jan-2014 the SQL/DB engine can automagically assume that the set of records before 2013 are unecessary, BUT if you convert a date column to a string the SLQ/DB engine will have to use string logic which doesn't have the same level of rigidity. Hence you get a slower query and possibly full table scans.
So I scraped that idea and move on;
/* query v5 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD,
COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY') as "MD" from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE BETWEEN '01-JAN-2015' AND '31-JAN-2015'
)
GROUP BY MD
In the Query v5 I realised that I can move the TO_CHAR function into the subquery. This is a an example of a pre-emptive optimization that you can do before you run your queries. I could have added a column with a MON-YYYY value so that I would not have to do it when I run the query. After testing I realised that I ended up with a faster query. Thats another thing with SQL; YOU NEED TO SPEND TIME TESTING on your target RDMS. And you may get different speeds from MYSQL, that you get from ORACLE or MSSQL or Postgres because all these databases optimize in different wants and try to cheat whenever they can. But one thing is sure you cannot cheat physics.
Anyway why is Query v5 faster? The reason it is faster is because the GROUP BY has to do work when it is calling TO_CHAR() by moving it to the inner query we prepare the data beforehand so all the GROUP_BY has to do is group, sum, count and nothing else.
There is also something else to note; TO_CHAR() in a basic SELECT query is faster than a TO_CHAR() that is in SELECT-GROUP_BY. Also TO_CHAR() in a WHERE clause is even faster than both. So the part of the query in which you are calling your function is also important. The WHERE clause is often evaluated before the database starts looking for data while the SELECT and the GROUP_BY are usually operating on data while it is being gathered. Understanding what is happening in the SQL engine is critical to writing big complicated queries. Some RDMS have a EXPLAIN feature that can tell you want the query is doing.
/* query v6 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD, COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY') as "MD" from PAYMTBL
WHERE 1=1 AND PAYMTBL_TRAN_DATE BETWEEN TO_DATE('01-JAN-2015') AND TO_DATE('31-JAN-2015')
)
GROUP BY MD
This resulted in a slower query x3. I am not sure why but this might be an example of me doing something that the database that I am using doesn't particularly like me doing. It is even worse if you try to TO_DATE(PAYMTBL_TRAN_DATE) like 47 times worse. full table scan converting all the values in the column. You will learn to avoid certain things the more you experiment. It is often not a bug in the RDMS but a feature. Its hard to explain, the RDMS does somethings well by sacrificing other things.
/* query v7 */
SELECT
SUM(PAYMTBL_AMOUNT),
MD, COUNT(1)
FROM (
select PAYMTBL_AMOUNT, PAYMTBL_TRAN_DATE,
TO_CHAR(PAYMTBL_TRAN_DATE,'MON-YYYY') as "MD" from PAYMTBL WHERE 1=1
AND PAYMTBL_TRAN_DATE > '01-JAN-2015' AND PAYMTBL_TRAN_DATE < '31-JAN-2015'
)
GROUP BY MD
This resulted in a slower query by x2. The less-than/greater-than signs are slower because they create a problem of Mutual exclusivity. The database has to do more work compared to doing a BETWEEN. A BETWEEN gives the SQL engine a specific hint to what you want while at the same time limiting the range it needs to check hence resulting in a faster query. Using these specific classes make its easier for the SQL engine to know what you want.
Conclusion
A SQL query is basically a small program that you write to tell the database what you want. The more specific you make it, the more it efficient it will be. SQL is not Object Oriented in fact it can be confusing for OOP programmers. The database wants to keep its structure constant so that it can be fast. OOP on the other hand is more concerned with modeling, flexibility and abstraction. Speed is not a concern when you are shifting classes around in a pool. SQL is not returning objects - it is returning data arrays or set - NOT a graph. You can imagine a database as a skycraper with floors while objects are more like a tree with leaves - two different things, with different trade offs.
SQL is not particularly hard to learn. It has a very limited synthax but it is very useful because what it does is so domain specific. You will learn about SQL by using it alot and gaining experience in effective ways of over coming various challenges. Some databases offer different features and have advantages over others. Column databases for example are faster than row databases because they are optimized for processing columns of data and its near impossible to make a row database faster than a column database (I have tried, physics will not have it).
P.S. Things to note (caveats)
post a comment or feature request 2
written by owen, published 2006-Apr-02, comment
I made some minor changes to the comment system;
Quick Edits: After you make a comment you can edit it by clicking the "Edit" link. Works even when you are not logged in.
Redux: You Name / Website / Email address is saved when you first make a comment or answer a question. So if you are in the habit of answering multiple questions you won't have to type that in again. Works especially for those who are not are not logged in.
Color Coding: If you are logged in and you make a comment it will show up in a different color. This however is totally superficial - not deep or penetrating emotionally or intellectually.
Pictures: You can now have the option to add pictures to your comments. Click the options text.
History: If you have made a couple comments you will notice is a link in the "message add successfully" box. Click the history link and you will see all you comment and the comment of people post comments after yours. Makes rebuttals easier to track. Keep the cussing out to a minimum.
post a comment or feature request 13