PostgreSQL并行查詢(xún)體驗(yàn)

IT那活兒發(fā)布于2023-01-11 13:20 / 1496人閱讀

點(diǎn)擊上方藍(lán)字關(guān)注我們

背景介紹

數(shù)據(jù)庫(kù)查詢(xún)性能是評(píng)價(jià)數(shù)據(jù)庫(kù)的核心指標(biāo)，隨著現(xiàn)代服務(wù)器的資源配置越來(lái)越強(qiáng)大，為充分發(fā)揮硬件性能提升數(shù)據(jù)庫(kù)查詢(xún)性能，可以通過(guò)把操作分解成多個(gè)并行執(zhí)行來(lái)實(shí)現(xiàn)。

什么是并行查詢(xún)？

并行查詢(xún)可以利用多個(gè)CPU同時(shí)參與查詢(xún)處理工作，充分發(fā)揮硬件性能，以提高運(yùn)行速度。

如何使用并行查詢(xún)？

在測(cè)試的版本（Version 12.1）中，數(shù)據(jù)庫(kù)默認(rèn)啟用了2路并行。

此功能主要涉及以下幾個(gè)參數(shù)和默認(rèn)值：

max_worker_processes：數(shù)據(jù)庫(kù)支持的最大后臺(tái)進(jìn)程數(shù)，默認(rèn)值為8。

max_parallel_workers：數(shù)據(jù)庫(kù)支持的最大并行查詢(xún)數(shù)，實(shí)際分配的數(shù)量受上面的max_worker_processes參數(shù)限制，默認(rèn)值為8。

max_parallel_workers_per_gather：目前可以啟動(dòng)的并行數(shù)，實(shí)際分配的數(shù)量受上面的max_parallel_workers參數(shù)限制，max_parallel_workers_per_gather默認(rèn)值為2。

worker進(jìn)程數(shù)量越多，查詢(xún)性能越高嗎？

不是，目前在筆者的測(cè)試場(chǎng)景中發(fā)現(xiàn)，worker進(jìn)程可以設(shè)置的很高，但是數(shù)據(jù)庫(kù)會(huì)根據(jù)實(shí)際需要分配worker，比如設(shè)置了max_parallel_workers_per_gather=16，但實(shí)際上執(zhí)行計(jì)劃中可能分配不到這么多的worker。性能不會(huì)隨著并行度的提升而線(xiàn)性增加。建議設(shè)置的數(shù)值不超過(guò)服務(wù)器CPU數(shù)量。

如果服務(wù)器資源比較緊張，建議在個(gè)別重要查詢(xún)的會(huì)話(huà)級(jí)別（如夜間批量同步任務(wù)）中多帶帶提高并發(fā)數(shù)。

怎么看到并行的任務(wù)？

在Linux中可以通過(guò)ps和top命令查看到相應(yīng)的進(jìn)程和相關(guān)連的worker進(jìn)程：

系統(tǒng)中有以下進(jìn)程，第1個(gè)是leader進(jìn)程，2個(gè)worker進(jìn)程：

postgres 3279 2158 2 01:59 ? 00:04:51 postgres: postgres music [local] SELECT

postgres 12501 2158 93 04:47 ? 00:00:02 postgres: parallel worker for PID 3279

postgres 12502 2158 93 04:47 ? 00:00:02 postgres: parallel worker for PID 3279

top中也可以抓的到進(jìn)程的性能數(shù)據(jù)：

top - 04:51:16 up 6:24, 2 users, load average: 0.10, 0.06, 0.10

Tasks: 147 total, 4 running, 143 sleeping, 0 stopped, 0 zombie

%Cpu(s): 46.0 us, 6.5 sy, 0.0 ni, 47.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st

KiB Mem : 14.0/3861364 [|||||||||||||| ]

KiB Swap: 20.1/2097148 [|||||||||||||||||||| ]

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

3279 postgres 20 0 404368 144000 140976 R 71.4 3.7 5:00.69 postgres: postgres music [local] SELECT

12693 postgres 20 0 400824 6056 4608 R 71.4 0.2 0:02.15 postgres: parallel worker for PID 3279

12694 postgres 20 0 400824 6056 4608 R 70.8 0.2 0:02.13 postgres: parallel worker for PID 3279

下面用圖來(lái)描述一下相關(guān)進(jìn)程的關(guān)系，在多路并發(fā)的情況下，會(huì)根據(jù)并發(fā)數(shù)創(chuàng)建新的worker進(jìn)程來(lái)分擔(dān)并發(fā)任務(wù)：

能否看到更詳細(xì)的執(zhí)行情況？

能。在執(zhí)行計(jì)劃中可以看得到。

我們看一個(gè)實(shí)際的執(zhí)行計(jì)劃：

首先將并行度提升到6路：

music=# set max_parallel_workers_per_gather = 6;

SET

看一下實(shí)際的執(zhí)行計(jì)劃

music=# EXPLAIN (ANALYZE,VERBOSE) SELECT * FROM rock WHERE name = Bon_Jovi;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------

Gather (cost=1000.00..471830.76 rows=1 width=25) (actual time=0.270..2811.900 rows=1 loops=1)

Output: id, name, create_time

Workers Planned: 6

Workers Launched: 6

-> Parallel Seq Scan on public.rock (cost=0.00..470830.66 rows=1 width=25) (actual time=2374.703..2775.358 rows=0 loops=7)

Output: id, name, create_time

Filter: ((rock.name)::text = Bon_Jovi::text)

Rows Removed by Filter: 7142857

Worker 0: actual time=2756.906..2756.906 rows=0 loops=1

Worker 1: actual time=2765.581..2765.581 rows=0 loops=1

Worker 2: actual time=2769.584..2769.584 rows=0 loops=1

Worker 3: actual time=2778.855..2778.855 rows=0 loops=1

Worker 4: actual time=2752.952..2752.952 rows=0 loops=1

Worker 5: actual time=2799.014..2799.014 rows=0 loops=1

Planning Time: 0.048 ms

Execution Time: 2811.918 ms

從下往上看，最下層的節(jié)點(diǎn)，顯示了6個(gè)Worker的執(zhí)行信息信息，包括啟動(dòng)時(shí)間、總時(shí)間等。他們都屬于并行順序查詢(xún)節(jié)點(diǎn)（Parallel Seq Scan開(kāi)頭那一行）。再往上提示，一共規(guī)劃了6個(gè)節(jié)點(diǎn)，實(shí)際分配了6個(gè)節(jié)點(diǎn)。第一行是Gather節(jié)點(diǎn)的成本和時(shí)間等信息，Gather節(jié)點(diǎn)匯總了各個(gè)節(jié)點(diǎn)讀取的數(shù)據(jù)。

下面在開(kāi)頭的圖里進(jìn)一步詳細(xì)描述一下該語(yǔ)句的執(zhí)行詳情：

設(shè)置了并行就會(huì)啟動(dòng)并行查詢(xún)？

有時(shí)候會(huì)出現(xiàn)設(shè)置了并發(fā)但是不會(huì)真正啟動(dòng)并發(fā)進(jìn)程的情況：

music=# EXPLAIN SELECT * FROM rock WHERE id = 1;

QUERY PLAN

---------------------------------------------------------------------------------------

Index Scan using idx_rock_id on rock (cost=0.56..8.58 rows=1 width=25)

Index Cond: (id = 1)

(2 rows)

music=# EXPLAIN (ANALYZE,VERBOSE) SELECT * FROM rock WHERE id = 1;

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------

Index Scan using idx_rock_id on public.rock (cost=0.56..8.58 rows=1 width=25) (actual time=0.087..0.088 rows=1 loops=1)

Output: id, name, create_time

Index Cond: (rock.id = 1)

Planning Time: 0.079 ms

Execution Time: 0.111 ms

(5 rows)

采用了索引掃描，由于cost的值比較低，盡管有設(shè)置并發(fā)，但是cost未觸發(fā)到并發(fā)所需的條件，所以說(shuō)不會(huì)啟用并發(fā)。也有可能因?yàn)樵u(píng)估不需要那么多的并發(fā)而啟動(dòng)一部分?jǐn)?shù)量的worker（后文有演示）

觸發(fā)并發(fā)的條件有以下兩個(gè)，當(dāng)以查詢(xún)的代價(jià)觸發(fā)到以下兩個(gè)參數(shù)的值時(shí)，才會(huì)觸發(fā)并行：

parallel_setup_cost = 1000

parallel_tuple_cost = 0.1

在本次查詢(xún)中，索引掃描的cost值總體才到8.58，因此未觸發(fā)并行機(jī)制。

同樣，在bitmap等查詢(xún)場(chǎng)景中，成本達(dá)不到cost值觸發(fā)條件的話(huà)，均不會(huì)啟動(dòng)并發(fā)查詢(xún)，如：

在筆者使用體驗(yàn)過(guò)程中，發(fā)現(xiàn)也有并發(fā)度按需分配的情況：

music=# set max_parallel_workers_per_gather = 6;

SET

music=# EXPLAIN ANALYSE SELECT * FROM rock WHERE id = 1 OR id = 2;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------

Bitmap Heap Scan on rock (cost=9.15..17.16 rows=2 width=25) (actual time=0.019..0.020 rows=2 loops=1)

Recheck Cond: ((id = 1) OR (id = 2))

Heap Blocks: exact=1

-> BitmapOr (cost=9.15..9.15 rows=2 width=0) (actual time=0.015..0.015 rows=0 loops=1)

-> Bitmap Index Scan on idx_rock_id (cost=0.00..4.57 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=1)

Index Cond: (id = 1)

-> Bitmap Index Scan on idx_rock_id (cost=0.00..4.57 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)

Index Cond: (id = 2)

Planning Time: 0.059 ms

Execution Time: 0.040 ms

(10 rows)

索引并行掃描的場(chǎng)景觸發(fā)并行的情況可以看一下以下案例：

music=# EXPLAIN (ANALYZE,VERBOSE) SELECT count(*) FROM rock WHERE id < 10000000;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------

Finalize Aggregate (cost=279145.34..279145.35 rows=1 width=8) (actual time=3482.259..3482.259 rows=1 loops=1)

Output: count(*)

-> Gather (cost=279144.71..279145.32 rows=6 width=8) (actual time=3453.374..3487.724 rows=7 loops=1)

Output: (PARTIAL count(*))

Workers Planned: 6

Workers Launched: 6

-> Partial Aggregate (cost=278144.71..278144.72 rows=1 width=8) (actual time=3436.943..3436.943 rows=1 loops=7)

Output: PARTIAL count(*)

Worker 0: actual time=3427.536..3427.536 rows=1 loops=1

Worker 1: actual time=3440.951..3440.951 rows=1 loops=1

Worker 2: actual time=3433.833..3433.833 rows=1 loops=1

Worker 3: actual time=3438.579..3438.579 rows=1 loops=1

Worker 4: actual time=3438.378..3438.379 rows=1 loops=1

Worker 5: actual time=3426.513..3426.513 rows=1 loops=1

-> Parallel Index Only Scan using idx_rock_id on public.rock (cost=0.56..273988.62 rows=1662437 width=0) (

actual time=2.701..3281.001 rows=1428571 loops=7)

Output: id

Index Cond: (rock.id < 10000000)

Heap Fetches: 9999999

Worker 0: actual time=0.605..3271.022 rows=1466562 loops=1

Worker 1: actual time=0.170..3286.350 rows=1449726 loops=1

Worker 2: actual time=0.424..3291.647 rows=1301496 loops=1

Worker 3: actual time=3.191..3258.715 rows=1658712 loops=1

Worker 4: actual time=12.856..3273.552 rows=1509750 loops=1

Worker 5: actual time=1.608..3284.193 rows=1294323 loops=1

Planning Time: 2.203 ms

Execution Time: 3488.023 ms

(26 rows)

從下往上看分別進(jìn)行了：并行索引掃描---并行聚合---最終聚合

首先啟動(dòng)了6個(gè)worker進(jìn)程進(jìn)行了索引條件掃描，在數(shù)據(jù)庫(kù)估算開(kāi)銷(xiāo)時(shí)認(rèn)為每一個(gè)worker的掃描成本都高于1000，因此觸發(fā)了并發(fā)。當(dāng)每一個(gè)worker將數(shù)據(jù)掃描后，進(jìn)行分別處理后，最終給gather匯總整合輸出。

執(zhí)行過(guò)程可參考下圖：

設(shè)置了并行參數(shù)后，一定會(huì)啟動(dòng)并行嗎？

然后對(duì)幾類(lèi)常見(jiàn)的查詢(xún)場(chǎng)景進(jìn)行了測(cè)試，除了在上面的順序掃描、索引掃描之外，在聚合、bitmap、嵌套循環(huán)、merge join、hash join和索引創(chuàng)建場(chǎng)景，在數(shù)據(jù)量足夠大的場(chǎng)景下，均可觸發(fā)啟動(dòng)并行。

一定會(huì)啟動(dòng)設(shè)定的并行度嗎？

不一定，其中在嵌套循環(huán)（Nest loop）場(chǎng)景中，發(fā)現(xiàn)盡管我們?cè)O(shè)置了6路并發(fā)，但是實(shí)際根據(jù)需要僅啟動(dòng)了5路，在實(shí)戰(zhàn)中遇到此類(lèi)情況，不要疑惑，也屬正常，勤儉節(jié)約是美德：

music=# show max_parallel_workers_per_gather;

max_parallel_workers_per_gather

---------------------------------

(1 row)

music=# EXPLAIN (ANALYZE,VERBOSE) SELECT pop.name FROM rock,pop WHERE rock.id = pop.id AND pop.id < 10000;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------

-------------------

Gather (cost=1000.57..174222.54 rows=18002 width=13) (actual time=5.881..3276.311 rows=19998 loops=1)

Output: pop.name

Workers Planned: 5

Workers Launched: 5

-> Nested Loop (cost=0.56..171422.34 rows=3600 width=13) (actual time=3.189..3258.998 rows=3333 loops=6)

Output: pop.name

Worker 0: actual time=2.591..3259.895 rows=1850 loops=1

Worker 1: actual time=0.180..3251.631 rows=4081 loops=1

Worker 2: actual time=1.344..3261.433 rows=555 loops=1

Worker 3: actual time=8.603..3262.411 rows=3330 loops=1

Worker 4: actual time=0.821..3259.297 rows=4623 loops=1

-> Parallel Seq Scan on public.pop (cost=0.00..141721.20 rows=3600 width=17) (actual time=1.020..3223.285 rows=3333 loops

=6)

Output: pop.id, pop.name

Filter: (pop.id < 10000)

Rows Removed by Filter: 2663334

Worker 0: actual time=0.054..3237.921 rows=1850 loops=1

Worker 1: actual time=0.049..3215.862 rows=4081 loops=1

Worker 2: actual time=0.102..3236.592 rows=555 loops=1

Worker 3: actual time=0.296..3235.327 rows=3330 loops=1

Worker 4: actual time=0.055..3188.732 rows=4623 loops=1

-> Index Only Scan using idx_rock_id on public.rock (cost=0.56..8.24 rows=1 width=4) (actual time=0.008..0.010 r

ows=1 loops=19998)

Output: rock.id

Index Cond: (rock.id = pop.id)

Heap Fetches: 19998

Worker 0: actual time=0.011..0.011 rows=1 loops=1850

Worker 1: actual time=0.008..0.008 rows=1 loops=4081

Worker 2: actual time=0.044..0.044 rows=1 loops=555

Worker 3: actual time=0.007..0.007 rows=1 loops=3330

Worker 4: actual time=0.006..0.015 rows=1 loops=4623

Planning Time: 0.579 ms

Execution Time: 3277.727 ms

(31 rows)

以上語(yǔ)句在實(shí)際工作中一般需要添加以下索引，提高一下查詢(xún)速度，當(dāng)添加玩索引之后，發(fā)現(xiàn)查詢(xún)的代價(jià)下降的很少，以至于不需要觸發(fā)那么多并發(fā)進(jìn)程了：

music=# CREATE INDEX idx_pop_id ON pop(id);

WARNING: concurrent insert in progress within table "b"

CREATE INDEX

music=# set max_parallel_workers_per_gather = 6;

SET

music=# EXPLAIN (ANALYZE,VERBOSE) SELECT pop.name FROM rock,pop WHERE rock.id = pop.id AND pop.id < 10000;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------

-------------------

Gather (cost=1388.64..108804.88 rows=20599 width=13) (actual time=1.316..35.770 rows=19998 loops=1)

Output: pop.name

Workers Planned: 3

Workers Launched: 3

-> Nested Loop (cost=388.64..105744.98 rows=6645 width=13) (actual time=0.341..27.475 rows=5000 loops=4)

Output: pop.name

Worker 0: actual time=0.095..26.337 rows=5365 loops=1

Worker 1: actual time=0.091..26.641 rows=5753 loops=1

Worker 2: actual time=0.101..28.578 rows=3145 loops=1

-> Parallel Bitmap Heap Scan on public.pop (cost=388.08..51229.82 rows=6645 width=17) (actual time=0.298..1.265 rows=5000 loops=4)

Output: pop.id, pop.name

Recheck Cond: (pop.id < 10000)

Heap Blocks: exact=31

Worker 0: actual time=0.044..1.111 rows=5365 loops=1

Worker 1: actual time=0.044..1.205 rows=5753 loops=1

Worker 2: actual time=0.049..0.681 rows=3145 loops=1

-> Bitmap Index Scan on idx_pop_id (cost=0.00..382.93 rows=20599 width=0) (actual time=1.012..1.012 rows=19998 loops=1)

Index Cond: (pop.id < 10000)

-> Index Only Scan using idx_rock_id on public.rock (cost=0.56..8.19 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=19998)

Output: rock.id

Index Cond: (rock.id = pop.id)

Heap Fetches: 19998

Worker 0: actual time=0.004..0.004 rows=1 loops=5365

Worker 1: actual time=0.003..0.004 rows=1 loops=5753

Worker 2: actual time=0.004..0.004 rows=1 loops=3145

Planning Time: 0.264 ms

Execution Time: 37.025 ms

(27 rows)

總結(jié)

1、并行查詢(xún)?cè)趦?yōu)化器評(píng)估執(zhí)行計(jì)劃的代價(jià)后，在設(shè)定的max_parallel_workers_per_gather參數(shù)值范圍內(nèi)按需啟動(dòng)（啟動(dòng)一部分或不啟動(dòng)）work并發(fā)進(jìn)程進(jìn)行查詢(xún)。

2、并行查詢(xún)對(duì)數(shù)據(jù)量大的查詢(xún)效果明顯。

3、并行創(chuàng)建索引的時(shí)候，除了設(shè)置max_parallel_workers_per_gather參數(shù)之外，還需要配置max_parallel_maintenance_workers參數(shù)來(lái)配合處理工作。

4、并行度不是越大越好，盡管可以強(qiáng)制設(shè)置并行，但是建議不要在生產(chǎn)環(huán)境中冒險(xiǎn)。

5、可以在會(huì)話(huà)級(jí)別啟動(dòng)并行，主要用于數(shù)據(jù)量較大、并且及時(shí)性較強(qiáng)的任務(wù)中較好，如果第一次使用，需要密切關(guān)注服務(wù)器整體性能負(fù)載情況。