Fabiano Amorim

Window Functions in SQL Server

27 October 2011

When SQL Server introduced Window Functions in SQL Server 2005, it was done in a rather tentative way, with only a handful of functions being introduced. This was frustrating, as they remove the last excuse for cursor-based operations by providing aggregations over a partition of the result set, and imposing an ordered sequence over a partition. Now, with SQL Server 2012, we are soon to enjoy a full range of Window Functions. They are going to make for some much simpler SQL queries…

Hi Folks, I’m going to start a new series of three articles about Window functions, starting off by explaining what they are, and how they are used to provide details of an aggregation.

Window functions

Window functions belong to a type of function known as a ‘set function’, which means a function that applies to a set of rows. The word ‘window’ is used to refer to the set of rows that the function works on.

Windowing functions were added to the SQL:2003 standard, which is managed by the ISO, and were specified in more detail in SQL:2008. For some time, other DBMSs such as Oracle, Sybase and DB2 have had support for window functions. Even the open-source RDBMS PostgreSQL has a full implementation. SQL Server has had only a partial implementation up to now, but a fuller one is coming in SQL Server 2012.

One of the most important benefits of window functions is that we can access the detail of the rows from an aggregation. To see an example, let’s first suppose we have this table and data:

USE tempdb
GO
IF OBJECT_ID('TestAggregation') IS NOT NULL
  DROP TABLE TestAggregation
GO
CREATE TABLE TestAggregation (ID INT, Value Numeric(18,2))
GO

INSERT INTO TestAggregation (ID, Value)
VALUES(1, 50.3), (1, 123.3), (1, 132.9),
      (2, 50.3), (2, 123.3), (2, 132.9), (2, 88.9),
      (3, 50.3), (3, 123.3);
GO

This is what the data looks like:

SELECT * FROM TestAggregation

the initial data

If we sum the column Value grouping by ID, by using a conventional GROUP BY, we would have the following query and result:

SELECT ID, SUM(Value)
  FROM TestAggregation
 GROUP BY ID;

The Query Result

Here we can see a sample of the partitions of rows, or sets that the SUM aggregation function is working on.

The aggregation function

In blue we have partition 1, in green partition 2, and in red partition 3. Because we applied the aggregate function to the column Value, grouping the results by ID, we lose the details of the data. In this case, the details are the values of the rows in partitions 1, 2 and 3.
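The collapsing effect of GROUP BY can be sketched outside SQL. This is a minimal Python illustration (not the article's code, just mirroring the TestAggregation data above): summing Value per ID leaves one aggregated row per partition, and the individual values are gone.

```python
# The TestAggregation data as (ID, Value) pairs.
data = [(1, 50.3), (1, 123.3), (1, 132.9),
        (2, 50.3), (2, 123.3), (2, 132.9), (2, 88.9),
        (3, 50.3), (3, 123.3)]

# GROUP BY ID + SUM(Value): one aggregated row per partition,
# the detail rows are no longer available.
sums = {}
for id_, value in data:
    sums[id_] = round(sums.get(id_, 0) + value, 2)

print(sums)  # {1: 306.5, 2: 395.4, 3: 173.6}
```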

Let’s suppose I need to write a query that returns the total value of sales, the average value of sales and the quantity of sales for each ID, while still returning the actual values of the rows. We might think we could use something like this to return the data:

SELECT ID,
       Value,
       SUM(Value) AS "Sum",
       AVG(Value) AS "Avg",
       COUNT(Value) AS "Quantity"
  FROM TestAggregation
 GROUP BY ID;

Msg 8120, Level 16, State 1, Line 2

Column 'TestAggregation.Value' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Unfortunately, this goes against the way that aggregations work: if you group by something, you lose access to the details.

A very commonly used alternative is to write every aggregation into a subquery, and then correlate with the main query using a join, something like:

SELECT TestAggregation.ID,
       TestAggregation.Value,
       TabSum."Sum",
       TabAvg."Avg",
       TabCount."Quantity"
  FROM TestAggregation
 INNER JOIN (SELECT ID, SUM(Value) AS "Sum"
               FROM TestAggregation
              GROUP BY ID) AS TabSum
    ON TabSum.ID = TestAggregation.ID
 INNER JOIN (SELECT ID, AVG(Value) AS "Avg"
               FROM TestAggregation
              GROUP BY ID) AS TabAvg
    ON TabAvg.ID = TestAggregation.ID
 INNER JOIN (SELECT ID, COUNT(Value) AS "Quantity"
               FROM TestAggregation
              GROUP BY ID) AS TabCount
    ON TabCount.ID = TestAggregation.ID

Data from this query

But a neater and faster solution is this…

SELECT TestAggregation.ID,
       TestAggregation.Value,
       AggregatedValues.[Sum],
       AggregatedValues.[Avg],
       AggregatedValues.Quantity
  FROM TestAggregation
 INNER JOIN (SELECT ID,
                    SUM(Value) AS "Sum",
                    AVG(Value) AS "Avg",
                    COUNT(Value) AS "Quantity"
               FROM TestAggregation
              GROUP BY ID) AS AggregatedValues
    ON AggregatedValues.ID = TestAggregation.ID

A very elegant alternative is to use the OVER() clause with the aggregate functions; it allows us to access the details of the rows that have been aggregated.

For instance we could try to get the result we want with this …

SELECT ID,
       Value,
       SUM(Value) OVER() AS "Sum",
       AVG(Value) OVER() AS "Avg",
       COUNT(Value) OVER() AS "Quantity"
  FROM TestAggregation

…but this query doesn’t return the expected results. In fact, it returns the aggregate values for the entire table rather than for each ID. By using the OVER() clause with the aggregate functions (SUM, AVG and COUNT) we have accessed the details of the grouped data along with a total aggregation of the data, but what we actually want is the aggregates for the data grouped by ID. To do this, we should use the PARTITION BY clause. For instance:

SELECT ID,
       Value,
       SUM(Value) OVER(PARTITION BY ID) AS "Sum",
       AVG(Value) OVER(PARTITION BY ID) AS "Avg",
       COUNT(Value) OVER(PARTITION BY ID) AS "Quantity"
  FROM TestAggregation

The same data, but from a neater query

Now we get the same results as with the subquery alternative, but with much simpler and more elegant code.

In the following picture we can see that, even though the results are grouped by ID, we can still access the details of the aggregated partitions through the OVER clause and PARTITION BY.
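The per-partition semantics can be made concrete outside SQL: for each row, the aggregate is computed over all rows that share the same partition key, and the result is attached to the detail row itself. This is a hedged Python sketch of that idea over the TestAggregation data (illustrative only, not how SQL Server implements it):

```python
from collections import defaultdict

# The TestAggregation data: (ID, Value) pairs.
data = [(1, 50.3), (1, 123.3), (1, 132.9),
        (2, 50.3), (2, 123.3), (2, 132.9), (2, 88.9),
        (3, 50.3), (3, 123.3)]

# Group the values by ID (the PARTITION BY key).
parts = defaultdict(list)
for id_, value in data:
    parts[id_].append(value)

# Attach the partition's SUM, AVG and COUNT to every detail row,
# mirroring SUM(Value) OVER(PARTITION BY ID) and friends.
result = [(id_, value,
           round(sum(parts[id_]), 2),
           round(sum(parts[id_]) / len(parts[id_]), 2),
           len(parts[id_]))
          for id_, value in data]

print(result[0])  # (1, 50.3, 306.5, 102.17, 3)
```

Note how every detail row survives, each carrying its partition's aggregates alongside it: exactly what the PARTITION BY query above returns.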

Set based thinking

It can get tedious to keep hearing about this set-based thing, but in fact it’s not so easy to think in sets. Sometimes it’s very hard to find a set-based solution to a query. Window Functions give us more set-based solutions to awkward problems.

The main point about windowing functions is that they were created to work with sets. SQL Server has never been good at processing queries row by row; that’s why you always hear that ‘cursors are evil’, ‘not good for performance’, ‘you should avoid them’ and so on. SQL Server was not built to work row by row.

Let’s illustrate this. SQL Server takes an average of 50 seconds to run a loop one hundred million times, against around 100 milliseconds for a Win32 application.

DECLARE @i INT = 0, @Time Time(3) = GETDATE()
WHILE @i < 100000000
BEGIN
  SET @i += 1;
END

SELECT CONVERT(Time(3), GETDATE() - @Time) AS "Total Time to run the Loop"

Total time to run the loop

Delphi Code:

procedure TForm1.Button1Click(Sender: TObject);
var
  i : Integer;
  Tempo : TDateTime;
begin
  i := 0;
  Tempo := Now();
  while i < 100000000 do
  begin
    inc(i);
  end;
  // elapsed time is Now() - Tempo
end;

Time to run the Delphi code

Of course it’s an unfair comparison: the compiler of a Win32 application is totally different from SQL Server. What I wanted to show here is that SQL Server was not designed to run row by row.

I was once in London doing training with my colleague Itzik Ben-Gan, and I remember him saying: “There is no row-by-row code for which you cannot write an equivalent set-based version; the problem is that you haven’t figured out how to do it yet.” Yes, it’s a bold statement, but who could tell Itzik it isn’t true? Not I!

Window functions on SQL Server 2005 and 2008

Since SQL Server 2005 we have had support for some window functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE.

In this first article we’ll review how these functions work, and how they can help us write better, more efficient set-based code.

Test database

To test the functions we’ll use a table called Tab1. The code to create the table is the following:

USE TempDB
GO
IF OBJECT_ID('Tab1') IS NOT NULL
  DROP TABLE Tab1
GO
CREATE TABLE Tab1 (Col1 INT)
GO

INSERT INTO Tab1 VALUES(5), (5), (3), (1)
GO

Row_Number()

The ROW_NUMBER function generates a sequence of numbers over a set in a specific order; in simple terms, it returns the sequence number of each row within a set, in the order you specify.

For instance:

-- RowNumber
SELECT Col1,
       ROW_NUMBER() OVER(ORDER BY Col1 DESC) AS "ROW_NUMBER()"
  FROM Tab1

data from this query

The column called “ROW_NUMBER()” is one of a series of numbers created in the order of Col1 descending. The clause OVER(ORDER BY Col1 DESC) is used to specify the order of the sequence for which the number should be created. It is necessary because rows in a relational table have no ‘natural’ order.
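The semantics are easy to sketch outside SQL: sort the rows by the ORDER BY key, then number them sequentially starting at 1. A minimal Python illustration (not the article's code), using the same Tab1 data:

```python
# Sketch of ROW_NUMBER() OVER(ORDER BY Col1 DESC) on the Tab1 data.
rows = [5, 5, 3, 1]  # Col1 values

# Sort descending, then attach a 1-based sequence number to each row.
numbered = [(n, col1)
            for n, col1 in enumerate(sorted(rows, reverse=True), start=1)]

print(numbered)  # [(1, 5), (2, 5), (3, 3), (4, 1)]
```

Note that the tied values 5 and 5 still get distinct numbers; ROW_NUMBER never produces duplicates, which is what distinguishes it from RANK below.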

Rank() & Dense_Rank()

These functions return the position of each row in a ranking within a partition. The rank is calculated as 1 plus the number of preceding rows.

It’s important to mention that the function RANK returns the result with a GAP after a tie, whereas the function DENSE_RANK doesn’t. To understand this better, let’s see some samples.

-- Rank
SELECT Col1,
       RANK() OVER(ORDER BY Col1 DESC) AS "RANK()"
  FROM Tab1
GO

-- Dense_Rank
SELECT Col1,
       DENSE_RANK() OVER(ORDER BY Col1 DESC) AS "DENSE_RANK"
  FROM Tab1

Results of a tie

Notice that in the RANK result we have the values 1, 1, 3 and 4. The value Col1 = 5 is duplicated, so any ordering produces a ‘tie’ between those rows, and they get the same position in the rank. But when the position for the value 3 is calculated, it isn’t 2, because position 2 was already consumed by the second row with the value 5. In this case a gap is generated and the function returns the next value for the rank, which is 3.
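The gap behaviour is easy to reproduce outside SQL. In this Python sketch (illustrative only) over the same Col1 values, RANK is 1 plus the count of preceding rows with a strictly higher value, while DENSE_RANK is 1 plus the count of distinct higher values, so it leaves no gaps:

```python
# RANK vs DENSE_RANK over Col1 = 5, 5, 3, 1, ordered descending.
vals = sorted([5, 5, 3, 1], reverse=True)

# RANK: 1 + number of rows with a strictly higher value (gaps after ties).
rank = [1 + sum(1 for w in vals if w > v) for v in vals]

# DENSE_RANK: 1 + number of distinct higher values (no gaps).
dense = [1 + len({w for w in vals if w > v}) for v in vals]

print(rank)   # [1, 1, 3, 4]
print(dense)  # [1, 1, 2, 3]
```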

NTILE()

The NTILE function is used for calculating summary statistics. It distributes the rows within an ordered partition into a specified number of “buckets” or groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs. It makes it easy to calculate n-tile distributions such as percentiles.

Let’s see a sample:

-- NTILE
SELECT Col1,
       NTILE(3) OVER(ORDER BY Col1 DESC) AS "NTILE(3)"
  FROM Tab1

NTILE Results

In the result above we can see that 4 rows divided into 3 groups gives one row per group, with one row left over; the remaining row is added to the first group. Let’s see another example, this time with no remaining rows.

-- NTILE
SELECT Col1,
       NTILE(2) OVER(ORDER BY Col1 DESC) AS "NTILE(2)"
  FROM Tab1

It is good practice to order on a unique key, to ensure that there are more rows than buckets, and ideally to have an equal number of rows in each bucket.
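The bucket arithmetic can be sketched in Python (a hypothetical re-implementation for illustration, not SQL Server's internals): with r ordered rows and n buckets, every bucket gets r div n rows, and the first r mod n buckets each get one extra row.

```python
# Sketch of NTILE(n): distribute ordered rows into n numbered buckets;
# the first (len(rows) % n) buckets receive one extra row.
def ntile(n, rows):
    base, extra = divmod(len(rows), n)
    out = []
    bucket, count_in_bucket = 1, 0
    size = base + (1 if bucket <= extra else 0)
    for row in rows:
        if count_in_bucket == size:          # current bucket is full
            bucket += 1
            count_in_bucket = 0
            size = base + (1 if bucket <= extra else 0)
        out.append((row, bucket))
        count_in_bucket += 1
    return out

rows = sorted([5, 5, 3, 1], reverse=True)
print(ntile(3, rows))  # [(5, 1), (5, 1), (3, 2), (1, 3)] - remainder in group 1
print(ntile(2, rows))  # [(5, 1), (5, 1), (3, 2), (1, 2)] - even split
```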

The Power of Window functions

Now let’s see some examples where window functions could be used to return some complex queries.

Example 1

Let’s suppose I need to write a query returning those employees who earn more than their department’s average.

Let’s start by creating two sample tables called Departamentos (departments in Portuguese) and Funcionarios (employees in Portuguese):

IF OBJECT_ID('Departamentos') IS NOT NULL
  DROP TABLE Departamentos
GO
CREATE TABLE Departamentos (ID       INT IDENTITY(1,1) PRIMARY KEY,
                            Nome_Dep VARCHAR(200))
GO

INSERT INTO Departamentos(Nome_Dep)
VALUES('Vendas'), ('TI'), ('Recursos Humanos')
GO

IF OBJECT_ID('Funcionarios') IS NOT NULL
  DROP TABLE Funcionarios
GO
CREATE TABLE Funcionarios (ID      INT IDENTITY(1,1) PRIMARY KEY,
                           ID_Dep  INT,
                           Nome    VARCHAR(200),
                           Salario Numeric(18,2))
GO

INSERT INTO Funcionarios (ID_Dep, Nome, Salario)
VALUES(1, 'Fabiano', 2000), (1, 'Amorim', 2500), (1, 'Diego', 9000),
      (2, 'Felipe', 2000), (2, 'Ferreira', 2500), (2, 'Nogare', 11999),
      (3, 'Laerte', 5000), (3, 'Luciano', 23500), (3, 'Zavaschi', 13999)
GO

This is what the data looks like:

Data from this query

To write this query I could do something like this:

SELECT Departamentos.Nome_Dep,
       Funcionarios.Nome AS Funcionario,
       Funcionarios.Salario,
       AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) AS "Média por Departamento",
       Salario - AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) AS "Diferença de Salário"
  FROM Funcionarios
 INNER JOIN Departamentos
    ON Funcionarios.ID_Dep = Departamentos.ID
 ORDER BY 5 DESC

The departmental averages

As we can see, the employees Luciano, Nogare and Diego are receiving much more than the average of their department.

Using the OVER clause, I was able to partition the data per each department, and then I could access the average with the details of the salary.

You can avoid using window functions by using a query like this.

SELECT Departamentos.Nome_Dep,
       Funcionarios.Nome AS Funcionario,
       Funcionarios.Salario,
       [Média por Departamento],
       Salario - [Média por Departamento] AS [Diferença de Salário]
  FROM Funcionarios
 INNER JOIN Departamentos
    ON Funcionarios.ID_Dep = Departamentos.ID
 INNER JOIN (SELECT ID_Dep, AVG(Funcionarios.Salario) AS [Média por Departamento]
               FROM Funcionarios
              GROUP BY ID_Dep) AS [Média]
    ON [Média].ID_Dep = Funcionarios.ID_Dep
 ORDER BY [Diferença de Salário] DESC

Example 2

Another very common problem is “running totals”. To show this I’ll use some very simple data. Let’s start by creating a table called tblLancamentos, with a date column and a numeric value column.

IF OBJECT_ID('tblLancamentos') IS NOT NULL
  DROP TABLE tblLancamentos
GO

-- Entries table to illustrate the running total
CREATE TABLE tblLancamentos (DataLancamento  Date,
                             ValorLancamento FLOAT)
GO

-- Insert the records
INSERT INTO tblLancamentos VALUES ('20080623',  100)
INSERT INTO tblLancamentos VALUES ('20080624', -250)
INSERT INTO tblLancamentos VALUES ('20080625',  380)
INSERT INTO tblLancamentos VALUES ('20080626',  200)
INSERT INTO tblLancamentos VALUES ('20080627', -300)
GO

One of the alternatives to return the running total of the column ValorLancamento ordered by DataLancamento is to write a query like the following:

SELECT DataLancamento,
       ValorLancamento,
       (SELECT SUM(ValorLancamento)
          FROM tblLancamentos
         WHERE DataLancamento <= QE.DataLancamento) AS Saldo
  FROM tblLancamentos AS QE

In the query above we use a correlated subquery against the same table to return the running total. Another easier, quicker and more elegant alternative would be:

SELECT DataLancamento,
       ValorLancamento,
       SUM(ValorLancamento) OVER(ORDER BY DataLancamento) AS Saldo
  FROM tblLancamentos AS QE

The same data from a simpler query

In the query above I’m using the OVER() clause with an ORDER BY. Unfortunately, this is only possible with the next version of SQL Server, SQL Server 2012.
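The running-total semantics of SUM(...) OVER(ORDER BY ...) amount to a cumulative sum over the ordered rows. A short Python sketch (illustrative only, using the tblLancamentos data above):

```python
from itertools import accumulate

# tblLancamentos data, ordered by DataLancamento.
lancamentos = [('2008-06-23',  100), ('2008-06-24', -250),
               ('2008-06-25',  380), ('2008-06-26',  200),
               ('2008-06-27', -300)]

# Running total (Saldo), as in SUM(ValorLancamento) OVER(ORDER BY DataLancamento):
# each row's Saldo is the sum of all values up to and including that row.
saldos = list(accumulate(valor for _, valor in lancamentos))

for (data, valor), saldo in zip(lancamentos, saldos):
    print(data, valor, saldo)
# Saldo column: 100, -150, 230, 430, 130
```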

Conclusion

In the next article I’ll show in more detail the limitations of window functions in SQL Server 2008 R2, and compare the performance of a running aggregation on SQL Server 2012 (Denali).

In the third and final article I’ll show in detail how the window frame works. I’ll also show how the windowing functions LEAD, LAG, FIRST_VALUE, LAST_VALUE, PERCENT_RANK and CUME_DIST work in SQL Server.

That’s all folks, see you soon with the second part of this article.

Fabiano Amorim

Author profile:

Fabiano is fascinated by the SQL Server Query Processor and the way it works to optimize queries, procedures and functions. He graduated as a Technical Processor from Colégio Bezerra de Menezes, SP, Brazil, and has worked for several years with SQL Server, focusing on creating Data Warehouses and optimizing T-SQL code for many companies in Brazil and Argentina. Fabiano is a SQL Server MVP, MCP for SQL Server 2000, and MCTS and MCITP Database Developer for SQL Server 2005 and 2008. He is also actively involved in the SQL Server community through forums such as MSDN and TechNet Brazil, writes articles for Simple-Talk and SQL Server Magazine Brazil, and presents online webcasts and in-person events for Microsoft Brazil. His blog is at http://blogfabiano.com


 