The use of window functions and CTEs in MySQL 8.0 to implement a cumulative total without hacks /Blog of the Flant company /Habr <{short}> <{full}> <div class="post__text post__texthtml post__text_v1" id="postcontentbody"> <img src="ht
Approx. transl. : In this article, the team leader of the UK company Ticketsolve shares a solution to his very specific problem, while demonstrating general approaches to creating socalled accumulating functions using modern MySQL 8.0 features. Its listings are clear and provided with detailed explanations, which helps to understand the essence of the problem, even for those who did not dive into it so deeply.
A common strategy for performing updates using cumulative functions in MySQL is using custom variables and the pattern. UPDATE[]SET mycol = (@myvar: = EXPRESSION (@myvar, mycol))
This pattern does not work well with the optimizer (leading to nondeterministic behavior), so they decided to abandon it. The result is a kind of emptiness because (relatively) complex logic is now more difficult to implement (at least with the same simplicity).
The article will discuss two ways to implement it: using window functions (canonical approach) and using recursive CTEs (general table expressions). previous publication on this topic
The same is true for window functions: I'll comment on queries /concepts in detail, but a general idea still doesn't hurt. A huge number of books and publications are devoted to window functions (which is why I still have not written about them); however, in most examples, calculations are performed either on financial results or on demographic indicators. However, in this article I will use a real case.
For software, I recommend using MySQL ??? (but not required). All expressions must run in the same console in order to reuse @venue_id
In the software world, there is a famous architectural dilemma: should logic be implemented at the application level or at the database level? Although this is a perfectly valid question, in our case, I proceed from the assumption that the logic is should be stay at base level; the reason for this may be, for example, speed requirements (as was the case in our case).
Problem
In this task, we allocate seats in a certain hall (theater).
For business purposes, each location needs to be assigned a socalled "grouping"  an additional number that represents it.
Here is the algorithm for determining the grouping value:
 start at 0 and top left;
 if there is an empty space between the current and the previous one, or this is a new row, then we add 2 to the previous value (if this is not the absolute first place), otherwise, we increase the value by 1;
 assign a grouping to a place;
 go to a new place in the same row or to the next row (if the previous one is over) and repeat from point 2; we continue everything until the places run out.
Algorithm in pseudocode:
current_grouping = 0
for each row:
for each number:
if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
current_grouping + = 2
else
current_grouping + = 1
seat.grouping = current_grouping
In real life, we want the configuration on the left to give the values on the right:
x → ??? ???
y ╭───┬───┬───╮ ╭───┬───┬───╮
↓ 0 │ x │ x │ │ │ 1 │ 2 │ │
├───┼───┼───┤ ├───┼───┼───┤
1 │ x │ │ x │ │ 4 │ │ 6 │
├───┼───┼───┤ ├───┼───┼───┤
2 │ x │ │ │ │ 8 │ │ │
╰───┴───┴───╯ ╰───┴───┴───╯
Preparation
Let the base table have the following minimalist structure:
CREATE TABLE seats (
Id INT AUTO_INCREMENT PRIMARY KEY,
Venue_id INT,
Y INT,
X INT,
`Row` VARCHAR (16), number3 INT108. 3r8. . UNIQUE venue_id_y_x (venue_id, y, x)
);
We don't really need columns
row
and number
On the other hand, we do not want to use a table whose records are completely contained in the index (just to be closer to real problems).
Based on the diagram above, the coordinates of each location are (y, x):
 (? 0), (? 1)
 (? 0), (? 2)
 (? 0)
Please note that we are using at as the first coordinate, since it makes it easier to keep track of the rows.
You must load a large enough number of records to prevent the optimizer from finding unexpected short paths. Of course, we use recursive CTEs:
INSERT INTO seats (venue_id, y, x, `row`, number)
WITH RECURSIVE venue_ids (id) AS
(
SELECT 0
UNION ALL
SELECT id + 1 FROM venue_ids WHERE id + 1 < 100000
)
SELECT /* + SET_VAR (cte_max_recursion_depth = 1M) * /
v.id,
c.y, c.x,
CHAR (ORD ('A') + FLOOR (RAND () * 3) USING ASCII) `row`,
FLOOR (RAND () * 3) `number`
FROM venue_ids v
JOIN (
VALUES
ROW (? 0),
ROW (? 1),
ROW (? 0),
ROW (? 2),
ROW (231108) 3r .) c (y, x)
;
ANALYZE TABLE seats;
A few notes:
 Here, CTE is used in an interesting (hopefully) way: each loop represents a venue_id, but since we want multiple locations to be generated for each venue, we do a cross join with the table containing the locations.
 Used the row constructor from v??? (
VALUES ROW ()
) To represent a (joinable) table without actually creating it.  Generates random values for row and number as placeholders.
 For the sake of simplicity, we did not do any optimizations (for example, data types are wider than necessary; indexes are added before inserting records, etc.)
The old approach is
The good old approach is straightforward and straightforward:
SET @venue_id = 5000;  arbitrary venue id; any (stored) id will do
SET @grouping = 1;
SET @y = 1;
SET @x = 1;
WITH seat_groupings (id, y, x, `grouping`, tmp_y, tmp_x) AS
(
SELECT
Id, y, x,
@Grouping: = @grouping + 1 + (seats.x> @x + 1 OR seats.y! = @Y),
@Y: = seats. y,
@x: = seats.x
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
)
UPDATE
seats s
JOIN seat_groupings sg USING (id)
SET s.grouping = sg.grouping
;
 Query OK, 5 rows affected, 3 warnings (??? sec)
Well that was easy (but don't forget the warnings)!
A small digression: in this case, I use the properties of Boolean arithmetic. The following expressions are equivalent:
SELECT seats.x> @x + 1 OR seats.y! = @Y `increment`;
SELECT IF (
Seats.x> @x + 1 OR seats.y! = @Y,
?
0
) `Increment`;
Some find this intuitive, others not; it's a matter of taste. From now on I will use a more compact expression.
Let's see the result:
SELECT id, y, x, `grouping` FROM seats WHERE venue_id = @venue_id ORDER BY y, x;
 +  +  +  +  +
  id  y  x  grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  0  8 
 +  +  +  +  +
Great approach!
Alas, it has a "minor" drawback: it works great except when it doesn't work
This is because the query optimizer does not necessarily perform calculations from left to right, so assignments (: =) can be performed in the wrong order, leading to the wrong result. People often face this problem after updating MySQL.
In MySQL 8.? this functionality is indeed deprecated:
 Run immediately after UPDATE.

SHOW WARNINGSG
 *************************** 1. row ****************** *********
 Level: Warning
 Code: 1287
 Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable = expression, ', or 'SELECT expression (s) INTO variables (s)'.
[]
Well, let's fix the situation!
Modern Approach # 1: Window Functions
The introduction of window functions has been a highly anticipated event in the MySQL world.
Generally speaking, the "sliding" nature of window functions works well with cumulative functions. However, some complex cumulative functions require the results of the last expression — functionality that windowing functions do not support because they operate on columns.
This does not mean that the problem cannot be solved: it just needs to be rethought.
In our case, the task can be divided into two parts. The grouping for each location can be considered as the sum of two values:
 the ordinal number of each place,
 the cumulative value of the increments of all locations preceding this one.
Those familiar with windowing functions will recognize the typical patterns here.
The sequence number of each seat is a builtin function:
ROW_NUMBER () OVER
But with the cumulative value, everything is much more interesting To calculate it, we perform two actions:
 counting the increment for each location and entryWe put it in the table (or CTE),
 then, for each location, we use the window function to sum the increments for that location.
Let's take a look at SQL:
WITH
increments (id, increment) AS
(
SELECT
Id,
X> LAG (x, ? x  1) OVER tzw + 1 OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE_ venue_id = @ venue_id
WINDOW tzw AS (ORDER BY y, x)
)
SELECT
s.id, y, x,
ROW_NUMBER () OVER tzw + SUM (increment) OVER tzw `grouping`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
;
 +  +  +  +  +
  id  y  x  grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  1  8 
 +  +  +  +  +
Great!
(Note that I am omitting the UPDATE from now on for the sake of simplicity.)
Let's analyze the request.
Highlevel logic
Next CTE is (edited) :
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1 OR y! = LAG (y, ? y) OVER tzw `increment`
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
;
 +  +  +
  id  increment 
 +  +  +
  24887  0 
  27186  0 
  29485  1 
  31784  1 
  34083  1 
 +  +  +
… Calculates the increments for each location compared to the previous one (more about
LAG ()
 later). It works on every record and the one that precedes it and is not cumulative.Now, to calculate the cumulative increments, we'll just use a window function to calculate the sum up to and including each location:
 (CTE here )
SELECT
s.id, y, x,
ROW_NUMBER () OVER tzw `pos.`,
SUM (increment) OVER tzw `cum.incr.`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x);
 +  +  +  +  +  +
  id  y  x  pos.  cum.incr.  (grouping)
 +  +  +  +  +  +
  24887  0  0  1  0  = 1 + 0 (curr.)
  27186  0  1  2  0  = 2 + 0 (# 24887) + 0 (curr.)
  29485  1  0  3  1  = 3 + 0 (# 24887) + 0 (# 27186) + 1 (curr.)
  31784  1  2  4  2  = 4 + 0 (# 24887) + 0 (# 27186) + 1 (# 29485) + 1 (curr.)
  34083  2  1  5  3  = 5 + 0 (# 24887) + 0 (# 27186) + 1 (# 29485) + 1 (# 31784) ↵
 +  +  +  +  +  + + 1 (curr.)
Window function LAG ()
The LAG function in its simplest form (
LAG (x)
) Returns the previous value for a specified column. The classic inconvenience with such functions is handling the first entry (s) in the window. Since there is no previous record, they return NULL. In the case of LAG, you can specify the desired value as the third parameter: LAG (x, ? x  1)  default is `x 1`
LAG (y, ? y)  default is `y`
By specifying the default values, we ensure that the very first place within the window's bounds will have the same logic as for the place following the other (x1) and without changing the row (y).
An alternative solution is to use
IFNULL
, however, the expressions in this case are quite cumbersome:  Both are correct, but terrible!

IFNULL (x> LAG (x) OVER tzw + 1 OR y! = LAG (y) OVER tzw, 0)
IFNULL (x> LAG (x) OVER tzw + ? FALSE) OR IFNULL (y! = LAG (y) OVER tzw, FALSE)
Second parameter in
LAG ()
Is the number of positions to move backward within the window; 1 is the previous value (it is also the default).Technical aspects
Named Windows
Our query uses the same window many times. The following two queries are formally equivalent:
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x);
SELECT
id,
x> LAG (x, ? x  1) OVER (ORDER BY y, x) + 1
OR y! = LAG (y, ? y) OVER (ORDER BY y, x)
FROM seats
WHERE venue_id = @venue_id;
However, the second can lead to suboptimal behavior (which I have encountered  at least in the past): the optimizer can consider the windows to be independent and calculate each separately. For this reason, I advise you to always use named windows (at least when they are repeated).
PARTITION BY statement
Usually windowing functions are performed on a partition. In our case, it will look like this:
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x);  here!
Since the window matches the full set of records (which is filtered by the condition
WHERE
), We do not need to specify it (the partition).But if this query had to be run on the entire table
seats
, then you would have to make sure that the window is reset for every venue_id
Sorting
In the request
ORDER BY
set at window level: SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
In this case, window sorting is separate from SELECT. It is very important! The behavior of this request is
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS ()
ORDER BY y, x
… undefined. Let's turn to See manual :
The query result strings are determined from the FROM clause after the WHERE, GROUP BY, and HAVING clauses have been executed, and the execution within the window occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
Some considerations
In general terms, for this type of problem, it makes sense to calculate the state change for each record and then sum them  instead of representing each record as a function of the previous one.
This solution is more complex than the functionality it replaces, but at the same time it is reliable. Alas, this approach is not always possible or easy to implement. This is where recursive CTEs come into play.
Modern approach # 2: recursive CTEs
This approach requires a bit of trickery due to the limited capabilities of the CTE in MySQL. On the other hand, it is a onesizefitsall, direct solution, so it does not require any rethinking of a global approach.
Let's start with a simplified version of the final request:
 `p_` stands for` Previous` and makes it easier to understand the conditions

WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1 08.) r3r3 . SELECT
s.id, s.venue_id, sy, sx,
p_grouping + 1 + (sx> p_x + 1 OR sy! = P_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (sy, sx)> (p_y, p_x)
ORDER BY s.venue_id, sy, sx
LIMIT 1
)
SELECT * FROM groupings;
Bingo! This query is (relatively) simple, but more importantly, it expresses the cumulative grouping function in the simplest way possible:
p_grouping + 1 + (s.x> p_x + 1 OR s.y! = p_y)
 which is equivalent to the following:
@grouping: = @grouping + 1 + (seats.x> @x + 1 OR seats.y! = @y),
@y: = seats.y,
@x: = seats.x
The logic is clear even for those who are not too familiar with CTE. The first row is the first seat in the hall, in order:
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
In the recursive part, we iterate:
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x> p_x + 1 OR s.y! = p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x)> (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
Condition
WHERE
together with operators ORDER BY
and LIMIT
just finds the next location, that is, the location with the same venue_id
but with b about the largest coordinates (x, y) in the sequence (venue_id, x, y).Part
s.venue_id
in the sort expression is very important! It allows us to use an index.Operator
SELECT
: performs accumulation (calculates
(p_) grouping
),  passes the values for the current location (
s.id
,s.venue_id
,s.y
,s.x
) to the next loop.
We choose
FROM groupings
to fulfill the requirements for CTE recursiveness.What is interesting here is that we use a recursive CTE as an iterator, forming a selection from table
groupings
in a recursive subquery and concatenating it with seats
to find data for postprocessing. JOIN
is formally cross, but due to the operator. LIMIT
only one record is returned.Working version
Unfortunately, the above query doesn't work as
ORDER BY
not currently supported in recursive subqueries. Also, the semantics are LIMIT
in the form in which it is used here is different from the typical one, which is applies to external request :LIMIT is now supported by[]The impact on the resulting dataset is the same as using LIMIT with external SELECT
However, this is not such a serious problem. Let's take a look at the working version:
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1 08.) 3r3r 310r3r3 . SELECT
S.id, s.venue_id, sy, sx,
P_grouping + 1 + (sx> p_x + 1 OR sy! = P_y)
FROM groupings, seats s WHERE s.id = (
SELECT si.id
FROM seats si
WHERE si.venue_id = p_venue_id AND (si.y, si.x)> (p_y, p_x)
ORDER BY si.venue_id, si.y, si.x 3r3r???r3r31108.)
)
SELECT * FROM groupings;
 +  +  +  +  +
  p_id  p_y  p_x  p_grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  0  8 
 +  +  +  +  +
It is a little uncomfortable to use a subquery, but this approach works and the boilerplate is minimal here, since multiple expressions are required anyway.
Here, instead of carrying out the ordering and limiting associated with the union
groupings
and seats
, we do this within a subquery and pass it to the outer query, which then selects only the target record.Reflections on performance
Let's examine the query execution plan using EXPLAIN ANALYZE:
mysql> EXPLAIN ANALYZE WITH RECURSIVE groupings[]
> Table scan on groupings (actual time = ?????? rows = 5 loops = 1)
> Materialize recursive CTE groupings (actual time = ?????? rows = 5 loops = 1)
> Limit: 1 row (s) (actual time = ?????? rows = 1 loops = 1)
> Index lookup on seats using venue_id_y_x (venue_id = (@ venue_id)) (cost = ??? rows = 5) (actual time = ?????? rows = 1 loops = 1)
> Repeat until convergence
> Nested loop inner join (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Scan new records on groupings (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Filter: (s.id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Singlerow index lookup on s using PRIMARY (id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Select # 5 (subquery in condition; dependent)
> Limit: 1 row (s) (actual time = ?????? rows = 1 loops = 9)
> Filter: ((si.y, si.x)> (groupings.p_y, groupings.p_x)) (cost = ??? rows = 5) (actual time = ?????? rows = 1 loops = 9)
> Index lookup on si using venue_id_y_x (venue_id = groupings.p_venue_id) (cost = ??? rows = 5) (actual time = ?????? rows = 4 loops = 9)
The plan is in line with expectations. In this case, the basis of the optimal plan lies in index searches:
> Nested loop inner join (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Singlerow index lookup on s using PRIMARY (id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Index lookup on si using venue_id_y_x (venue_id = groupings.p_venue_id) (cost = ??? rows = 5) (actual time = ?????? rows = 4 loops = 9)
of paramount importance. The performance will drop significantly if you scan the indexes (that is, linearly scan the index records instead of looking for the necessary ones at once).
So for this strategy to work, the linked indices need to be in place and used by the optimizer to the maximum effect
If the restrictions are lifted in the future, the need to use a subquery will disappear, which will greatly simplify the task for the optimizer.
Alternative for nonoptimal plans
In case the optimal plan cannot be determined, use a temporary table:
CREATE TEMPORARY TABLE selected_seats (
Id INT NOT NULL PRIMARY KEY,
Y INT,
X INT,
UNIQUE (y, x)
)
SELECT id, y, x
FROM seats WHERE venue_id = @venue_id;
WITH RECURSIVE
groupings (p_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
) <{short}> <{full}>
[b] Approx. transl.
: In this article, the team leader of the UK company Ticketsolve shares a solution to his very specific problem, while demonstrating general approaches to creating socalled accumulating functions using modern MySQL 8.0 features. Its listings are clear and provided with detailed explanations, which helps to understand the essence of the problem, even for those who did not dive into it so deeply.
A common strategy for performing updates using cumulative functions in MySQL is using custom variables and the pattern. UPDATE[]SET mycol = (@myvar: = EXPRESSION (@myvar, mycol))
This pattern does not work well with the optimizer (leading to nondeterministic behavior), so they decided to abandon it. The result is a kind of emptiness because (relatively) complex logic is now more difficult to implement (at least with the same simplicity).
The article will discuss two ways to implement it: using window functions (canonical approach) and using recursive CTEs (general table expressions). previous publication on this topic
The same is true for window functions: I'll comment on queries /concepts in detail, but a general idea still doesn't hurt. A huge number of books and publications are devoted to window functions (which is why I still have not written about them); however, in most examples, calculations are performed either on financial results or on demographic indicators. However, in this article I will use a real case.
For software, I recommend using MySQL ??? (but not required). All expressions must run in the same console in order to reuse @venue_id
In the software world, there is a famous architectural dilemma: should logic be implemented at the application level or at the database level? Although this is a perfectly valid question, in our case, I proceed from the assumption that the logic is should be stay at base level; the reason for this may be, for example, speed requirements (as was the case in our case).
Problem
In this task, we allocate seats in a certain hall (theater).
For business purposes, each location needs to be assigned a socalled "grouping"  an additional number that represents it.
Here is the algorithm for determining the grouping value:
start at 0 and top left;
if there is an empty space between the current and the previous one, or this is a new row, then we add 2 to the previous value (if this is not the absolute first place), otherwise, we increase the value by 1;
assign a grouping to a place;
go to a new place in the same row or to the next row (if the previous one is over) and repeat from point 2; we continue everything until the places run out.
Algorithm in pseudocode:
current_grouping = 0
for each row:
for each number:
if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
current_grouping + = 2
else
current_grouping + = 1
seat.grouping = current_grouping
In real life, we want the configuration on the left to give the values on the right:
x → ??? ???
y ╭───┬───┬───╮ ╭───┬───┬───╮
↓ 0 │ x │ x │ │ │ 1 │ 2 │ │
├───┼───┼───┤ ├───┼───┼───┤
1 │ x │ │ x │ │ 4 │ │ 6 │
├───┼───┼───┤ ├───┼───┼───┤
2 │ x │ │ │ │ 8 │ │ │
╰───┴───┴───╯ ╰───┴───┴───╯
Preparation
Let the base table have the following minimalist structure:
CREATE TABLE seats (
Id INT AUTO_INCREMENT PRIMARY KEY,
Venue_id INT,
Y INT,
X INT,
`Row` VARCHAR (16), number3 INT108. 3r8. . UNIQUE venue_id_y_x (venue_id, y, x)
);
We don't really need columns row
and number
On the other hand, we do not want to use a table whose records are completely contained in the index (just to be closer to real problems).
Based on the diagram above, the coordinates of each location are (y, x):
(? 0), (? 1)
(? 0), (? 2)
(? 0)
Please note that we are using at as the first coordinate, since it makes it easier to keep track of the rows.
You must load a large enough number of records to prevent the optimizer from finding unexpected short paths. Of course, we use recursive CTEs:
INSERT INTO seats (venue_id, y, x, `row`, number)
WITH RECURSIVE venue_ids (id) AS
(
SELECT 0
UNION ALL
SELECT id + 1 FROM venue_ids WHERE id + 1 < 100000
)
SELECT /* + SET_VAR (cte_max_recursion_depth = 1M) * /
v.id,
c.y, c.x,
CHAR (ORD ('A') + FLOOR (RAND () * 3) USING ASCII) `row`,
FLOOR (RAND () * 3) `number`
FROM venue_ids v
JOIN (
VALUES
ROW (? 0),
ROW (? 1),
ROW (? 0),
ROW (? 2),
ROW (231108) 3r .) c (y, x)
;
ANALYZE TABLE seats;
A few notes:
Here, CTE is used in an interesting (hopefully) way: each loop represents a venue_id, but since we want multiple locations to be generated for each venue, we do a cross join with the table containing the locations.
Used the row constructor from v??? ( VALUES ROW ()
) To represent a (joinable) table without actually creating it.
Generates random values for row and number as placeholders.
For the sake of simplicity, we did not do any optimizations (for example, data types are wider than necessary; indexes are added before inserting records, etc.)
The old approach is
The good old approach is straightforward and straightforward:
SET @venue_id = 5000;  arbitrary venue id; any (stored) id will do
SET @grouping = 1;
SET @y = 1;
SET @x = 1;
WITH seat_groupings (id, y, x, `grouping`, tmp_y, tmp_x) AS
(
SELECT
Id, y, x,
@Grouping: = @grouping + 1 + (seats.x> @x + 1 OR seats.y! = @Y),
@Y: = seats. y,
@x: = seats.x
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
)
UPDATE
seats s
JOIN seat_groupings sg USING (id)
SET s.grouping = sg.grouping
;
 Query OK, 5 rows affected, 3 warnings (??? sec)
Well that was easy (but don't forget the warnings)!
A small digression: in this case, I use the properties of Boolean arithmetic. The following expressions are equivalent:
SELECT seats.x> @x + 1 OR seats.y! = @Y `increment`;
SELECT IF (
Seats.x> @x + 1 OR seats.y! = @Y,
?
0
) `Increment`;
Some find this intuitive, others not; it's a matter of taste. From now on I will use a more compact expression.
Let's see the result:
SELECT id, y, x, `grouping` FROM seats WHERE venue_id = @venue_id ORDER BY y, x;
 +  +  +  +  +
  id  y  x  grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  0  8 
 +  +  +  +  +
Great approach!
Alas, it has a "minor" drawback: it works great except when it doesn't work
This is because the query optimizer does not necessarily perform calculations from left to right, so assignments (: =) can be performed in the wrong order, leading to the wrong result. People often face this problem after updating MySQL.
In MySQL 8.? this functionality is indeed deprecated:
 Run immediately after UPDATE.

SHOW WARNINGSG
 *************************** 1. row ****************** *********
 Level: Warning
 Code: 1287
 Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable = expression, ', or 'SELECT expression (s) INTO variables (s)'.
[]
Well, let's fix the situation!
Modern Approach # 1: Window Functions
The introduction of window functions has been a highly anticipated event in the MySQL world.
Generally speaking, the "sliding" nature of window functions works well with cumulative functions. However, some complex cumulative functions require the results of the last expression — functionality that windowing functions do not support because they operate on columns.
This does not mean that the problem cannot be solved: it just needs to be rethought.
In our case, the task can be divided into two parts. The grouping for each location can be considered as the sum of two values:
the ordinal number of each place,
the cumulative value of the increments of all locations preceding this one.
Those familiar with windowing functions will recognize the typical patterns here.
The sequence number of each seat is a builtin function:
ROW_NUMBER () OVER
But with the cumulative value, everything is much more interesting To calculate it, we perform two actions:
counting the increment for each location and entryWe put it in the table (or CTE),
then, for each location, we use the window function to sum the increments for that location.
Let's take a look at SQL:
WITH
increments (id, increment) AS
(
SELECT
Id,
X> LAG (x, ? x  1) OVER tzw + 1 OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE_ venue_id = @ venue_id
WINDOW tzw AS (ORDER BY y, x)
)
SELECT
s.id, y, x,
ROW_NUMBER () OVER tzw + SUM (increment) OVER tzw `grouping`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
;
 +  +  +  +  +
  id  y  x  grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  1  8 
 +  +  +  +  +
Great!
(Note that I am omitting the UPDATE from now on for the sake of simplicity.)
Let's analyze the request.
Highlevel logic
Next CTE is (edited) :
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1 OR y! = LAG (y, ? y) OVER tzw `increment`
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
;
 +  +  +
  id  increment 
 +  +  +
  24887  0 
  27186  0 
  29485  1 
  31784  1 
  34083  1 
 +  +  +
… Calculates the increments for each location compared to the previous one (more about LAG ()
 later). It works on every record and the one that precedes it and is not cumulative.
Now, to calculate the cumulative increments, we'll just use a window function to calculate the sum up to and including each location:
 (CTE here )
SELECT
s.id, y, x,
ROW_NUMBER () OVER tzw `pos.`,
SUM (increment) OVER tzw `cum.incr.`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x);
 +  +  +  +  +  +
  id  y  x  pos.  cum.incr.  (grouping)
 +  +  +  +  +  +
  24887  0  0  1  0  = 1 + 0 (curr.)
  27186  0  1  2  0  = 2 + 0 (# 24887) + 0 (curr.)
  29485  1  0  3  1  = 3 + 0 (# 24887) + 0 (# 27186) + 1 (curr.)
  31784  1  2  4  2  = 4 + 0 (# 24887) + 0 (# 27186) + 1 (# 29485) + 1 (curr.)
  34083  2  1  5  3  = 5 + 0 (# 24887) + 0 (# 27186) + 1 (# 29485) + 1 (# 31784) ↵
 +  +  +  +  +  + + 1 (curr.)
Window function LAG ()
The LAG function in its simplest form ( LAG (x)
) Returns the previous value for a specified column. The classic inconvenience with such functions is handling the first entry (s) in the window. Since there is no previous record, they return NULL. In the case of LAG, you can specify the desired value as the third parameter:
LAG (x, ? x  1)  default is `x 1`
LAG (y, ? y)  default is `y`
By specifying the default values, we ensure that the very first place within the window's bounds will have the same logic as for the place following the other (x1) and without changing the row (y).
An alternative solution is to use IFNULL
, however, the expressions in this case are quite cumbersome:
 Both are correct, but terrible!

IFNULL (x> LAG (x) OVER tzw + 1 OR y! = LAG (y) OVER tzw, 0)
IFNULL (x> LAG (x) OVER tzw + ? FALSE) OR IFNULL (y! = LAG (y) OVER tzw, FALSE)
Second parameter in LAG ()
Is the number of positions to move backward within the window; 1 is the previous value (it is also the default).
Technical aspects
Named Windows
Our query uses the same window many times. The following two queries are formally equivalent:
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x);
SELECT
id,
x> LAG (x, ? x  1) OVER (ORDER BY y, x) + 1
OR y! = LAG (y, ? y) OVER (ORDER BY y, x)
FROM seats
WHERE venue_id = @venue_id;
However, the second can lead to suboptimal behavior (which I have encountered  at least in the past): the optimizer can consider the windows to be independent and calculate each separately. For this reason, I advise you to always use named windows (at least when they are repeated).
PARTITION BY statement
Usually windowing functions are performed on a partition. In our case, it will look like this:
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x);  here!
Since the window matches the full set of records (which is filtered by the condition WHERE
), We do not need to specify it (the partition).
But if this query had to be run on the entire table seats
, then you would have to make sure that the window is reset for every venue_id
Sorting
In the request ORDER BY
set at window level:
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
In this case, window sorting is separate from SELECT. It is very important! The behavior of this request is
SELECT
id,
x> LAG (x, ? x  1) OVER tzw + 1
OR y! = LAG (y, ? y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS ()
ORDER BY y, x
… undefined. Let's turn to See manual :
The query result strings are determined from the FROM clause after the WHERE, GROUP BY, and HAVING clauses have been executed, and the execution within the window occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
Some considerations
In general terms, for this type of problem, it makes sense to calculate the state change for each record and then sum them  instead of representing each record as a function of the previous one.
This solution is more complex than the functionality it replaces, but at the same time it is reliable. Alas, this approach is not always possible or easy to implement. This is where recursive CTEs come into play.
Modern approach # 2: recursive CTEs
This approach requires a bit of trickery due to the limited capabilities of the CTE in MySQL. On the other hand, it is a onesizefitsall, direct solution, so it does not require any rethinking of a global approach.
Let's start with a simplified version of the final request:
 `p_` stands for` Previous` and makes it easier to understand the conditions

WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1 08.) [b] r3r3 . SELECT
s.id, s.venue_id, sy, sx,
p_grouping + 1 + (sx> p_x + 1 OR sy! = P_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (sy, sx)> (p_y, p_x)
ORDER BY s.venue_id, sy, sx
LIMIT 1
)
SELECT * FROM groupings;
Bingo! This query is (relatively) simple, but more importantly, it expresses the cumulative grouping function in the simplest way possible:
p_grouping + 1 + (s.x> p_x + 1 OR s.y! = p_y)
 which is equivalent to the following:
@grouping: = @grouping + 1 + (seats.x> @x + 1 OR seats.y! = @y),
@y: = seats.y,
@x: = seats.x
The logic is clear even for those who are not too familiar with CTE. The first row is the first seat in the hall, in order:
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
In the recursive part, we iterate:
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x> p_x + 1 OR s.y! = p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x)> (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
Condition WHERE
together with operators ORDER BY
and LIMIT
just finds the next location, that is, the location with the same venue_id
but with b about the largest coordinates (x, y) in the sequence (venue_id, x, y).
Part s.venue_id
in the sort expression is very important! It allows us to use an index.
Operator SELECT
:
performs accumulation (calculates (p_) grouping
),
passes the values for the current location ( s.id
, s.venue_id
, s.y
, s.x
) to the next loop.
We choose FROM groupings
to fulfill the requirements for CTE recursiveness.
What is interesting here is that we use a recursive CTE as an iterator, forming a selection from table groupings
in a recursive subquery and concatenating it with seats
to find data for postprocessing.
JOIN
is formally cross, but due to the operator. LIMIT
only one record is returned.
Working version
Unfortunately, the above query doesn't work as ORDER BY
not currently supported in recursive subqueries. Also, the semantics are LIMIT
in the form in which it is used here is different from the typical one, which is applies to external request :
LIMIT is now supported by[]The impact on the resulting dataset is the same as using LIMIT with external SELECT
However, this is not such a serious problem. Let's take a look at the working version:
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1 08.) 3r3r 310r3r3 . SELECT
S.id, s.venue_id, sy, sx,
P_grouping + 1 + (sx> p_x + 1 OR sy! = P_y)
FROM groupings, seats s WHERE s.id = (
SELECT si.id
FROM seats si
WHERE si.venue_id = p_venue_id AND (si.y, si.x)> (p_y, p_x)
ORDER BY si.venue_id, si.y, si.x 3r3r???r3r31108.)
)
SELECT * FROM groupings;
 +  +  +  +  +
  p_id  p_y  p_x  p_grouping 
 +  +  +  +  +
  24887  0  0  1 
  27186  0  1  2 
  29485  1  0  4 
  31784  1  2  6 
  34083  2  0  8 
 +  +  +  +  +
It is a little uncomfortable to use a subquery, but this approach works and the boilerplate is minimal here, since multiple expressions are required anyway.
Here, instead of carrying out the ordering and limiting associated with the union groupings
and seats
, we do this within a subquery and pass it to the outer query, which then selects only the target record.
Reflections on performance
Let's examine the query execution plan using EXPLAIN ANALYZE:
mysql> EXPLAIN ANALYZE WITH RECURSIVE groupings[]
> Table scan on groupings (actual time = ?????? rows = 5 loops = 1)
> Materialize recursive CTE groupings (actual time = ?????? rows = 5 loops = 1)
> Limit: 1 row (s) (actual time = ?????? rows = 1 loops = 1)
> Index lookup on seats using venue_id_y_x (venue_id = (@ venue_id)) (cost = ??? rows = 5) (actual time = ?????? rows = 1 loops = 1)
> Repeat until convergence
> Nested loop inner join (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Scan new records on groupings (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Filter: (s.id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Singlerow index lookup on s using PRIMARY (id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Select # 5 (subquery in condition; dependent)
> Limit: 1 row (s) (actual time = ?????? rows = 1 loops = 9)
> Filter: ((si.y, si.x)> (groupings.p_y, groupings.p_x)) (cost = ??? rows = 5) (actual time = ?????? rows = 1 loops = 9)
> Index lookup on si using venue_id_y_x (venue_id = groupings.p_venue_id) (cost = ??? rows = 5) (actual time = ?????? rows = 4 loops = 9)
The plan is in line with expectations. In this case, the basis of the optimal plan lies in index searches:
> Nested loop inner join (cost = ??? rows = 2) (actual time = ?????? rows = 2 loops = 2)
> Singlerow index lookup on s using PRIMARY (id = (select # 5)) (cost = ??? rows = 1) (actual time = ?????? rows = 1 loops = 5)
> Index lookup on si using venue_id_y_x (venue_id = groupings.p_venue_id) (cost = ??? rows = 5) (actual time = ?????? rows = 4 loops = 9)
of paramount importance. The performance will drop significantly if you scan the indexes (that is, linearly scan the index records instead of looking for the necessary ones at once).
So for this strategy to work, the linked indices need to be in place and used by the optimizer to the maximum effect
If the restrictions are lifted in the future, the need to use a subquery will disappear, which will greatly simplify the task for the optimizer.
Alternative for nonoptimal plans
In case the optimal plan cannot be determined, use a temporary table:
CREATE TEMPORARY TABLE selected_seats (
Id INT NOT NULL PRIMARY KEY,
Y INT,
X INT,
UNIQUE (y, x)
)
SELECT id, y, x
FROM seats WHERE venue_id = @venue_id;
WITH RECURSIVE
groupings (p_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
) <{short}>
S.id, sy, sx,
P_grouping + 1 + (sx> p_x + 1 OR sy! = P_y)
FROM groupings, seats s WHERE s.id = (
SELECT ss.id
FROM selected_seats ss
WHERE (ss.y, ss.x)> (p_y, p_x)
ORDER BY ss.y, ss.x
LIMIT 1
)
)
SELECT * FROM groupings;
Even if index scans pass in this query, they cost a lot, since table selected_seats
very small.
Conclusion
I am very pleased that the efficient but flawed workflow can now be replaced with the fairly simple functionality introduced in MySQL 8.0.
In the meantime, the development of new features for 8.0 continues, which makes the already successful release even better.
Successful recursion!
P.S. from the translator
Read also on our blog:
" Upgrading MySQL (Percona Server) from 5.7 to ???r3r31102. ";
"
Databases and Kubernetes (review and video of the report) ";
" More developers should know this about databases. ".
It may be interesting
kylanao2ts
Author24072020, 16:24
Publication DateMySQL / SQL / Assembler / 1CBitrix / C# / XML / IT Standards / Yii / CodeIgniter / High performance / Website development / PostgreSQL / Increase Conversion Rate / Payment systems / System Programming / Silverlight
Category Comments: 0
 Views: 36