'用詳細的代碼，手把手傳授給你高效搜索法的絕技'

數據庫數據結構 PostgreSQL SQL JSON 地理讀芯術 2019-08-26

圖片來源：unsplash.com/@lucassankey

人與世界萬物的互動會產生大量的時空數據。那麼，當我們需要隨時調用過去的數據時，改怎麼辦？尤其是面對各種海量、多維度的數據庫，如果沒有高效的搜索方法，我們只能望洋興嘆、束手無策。

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

有些對象以相對較低的頻率生成數據。例如，建築物和道路等惰性物體可能在數年內不會發生任何變化。如果將為這些對象生成的數據寫入數據庫，並按時間範圍查詢數據（例如，查詢日期為2017-07-01至2017-07-02），則可能找不到與這些對象相關的任何數據。原因很簡單，在這段時間內數據庫根本沒有相關數據輸入。

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

示例

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

示例

圖片來源：unsplash.com/@markusspiske

1. 創建對象提要數據表

postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); 
CREATE TABLE

2. 將5000萬條測試數據寫入表中

postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; 
INSERT 0 50000000

3. 創建對象索引

postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); 
CREATE INDEX

4. 創建時間索引

postgres=# create index idx_tbl_2 on tbl using btree (crt_time); 
CREATE INDEX

5. 創建空間索引

postgres=# create index idx_tbl_3 on tbl using gist (pos); 
CREATE INDEX

6. 生成數據佈局以方便後續查詢

postgres=# select min(crt_time),max(crt_time),count(*) from tbl; 
 min | max | count 
----------------------------+----------------------------+---------- 
 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 
(1 row)

7. 創建一個極限KNN查詢函數

create or replace function ff(point, float8, int) returns setof tid as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist, 
 ctid 
 from tbl 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec.ctid; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile; 
postgres=# select * from ff(point (100,100) ,100,100) ; 
 ff 
------------- 
 (407383,11) 
 (640740,9) 
 (26073,51) 
 (642750,34) 
... 
(100 rows) 
Time: 1.061 ms

8. CTID合併檢索

顯示符合以下條件的記錄

( 
c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
c2 < 10 
) 
 and 
pos <-> point (0,0) < 5 
 and 
crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;

首先，分別查看每個條件，找匹配一個條件的記錄數量，以及在索引掃描上所花時長。

1. 54,907條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Heap Blocks: exact=52778 
 Buffers: shared hit=52866 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 Planning time: 0.105 ms 
 Execution time: 94.606 ms 
(10 rows)

2. 95,147條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10; 
 QUERY PLAN 
--------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c2 < 10) 
 Heap Blocks: exact=88681 
 Buffers: shared hit=88734 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.094 ms 
 Execution time: 186.201 ms 
(10 rows)

3. 149930條記錄（為快速獲得結果，PostgreSQL使用位圖進行合併掃描）

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------ 
 Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) 
 Heap Blocks: exact=134424 
 Buffers: shared hit=134565 
 -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.149 ms 
 Execution time: 274.548 ms 
(15 rows)

4. 60,687條記錄（即使運用出色的KNN性能優化，仍然需要耗費195毫秒）。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

讓我們看看不使用KNN優化需要多長時間。

結果非常令人驚訝——極限優化性能提高了一個數量級。

5. 2,640,751條記錄

使用所有索引逐個掃描數據條件，得到ctid並執行ctid掃描。

現在，讓我們來分解這個過程：

首先，讓我們看看時間和對象屬性的合併查詢，成果非常驚人。使用位圖BitmapOr時，查詢可以跳過大多數數據塊，並且掃描時間比單索引掃描要短。

注意，在此步驟中記錄的數量減少到7,847條。

postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 QUERY PLAN 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) 
 Output: ctid 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.203 ms 
 Execution time: 216.697 ms 
(20 rows)

然後，看KNN的掃描時間：

注意，60,687條記錄滿足KNN距離條件，所以接下來將解釋CTID合併掃描與原始掃描之間的性能比較。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

最後，將這些片段合併到ctid中。

select * from ff(point (0,0) ,5,1000000) 
 intersect 
select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 ff 
------------ 
 (1394,8) 
 (3892,50) 
 (6124,45) 
 (7235,8) 
 (7607,45) 
 (11540,8) 
 (13397,31) 
 (14266,36) 
 (18149,7) 
 (19256,44) 
 (24671,62) 
 (26525,64) 
 (30235,48) 
(13 rows) 
Time: 463.012 ms

取得最終紀錄。

select * from tbl where ctid = any 
( 
array( -- array start 
select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
) -- array end 
); 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows) 
Time: 462.715 ms

過程花費462毫秒。

9. 測試原始SQL查詢的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳過掃描

直接編寫SQL查詢，而不是使用多CTID掃描。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision) 
 Rows Removed by Filter: 7834 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.160 ms 
 Execution time: 216.797 ms 
(22 rows)

性能如預期的那樣好，之前解釋過原因。KNN條件以外的條件已經將結果收斂到7,000條記錄，因此沒有必要使用包含KNN條件的索引。（即使使用KNN索引也需要195毫秒，因為有60,687條記錄滿足KNN條件。）

校驗結果：

select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows)

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

示例

圖片來源：unsplash.com/@markusspiske

1. 創建對象提要數據表

postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); 
CREATE TABLE

2. 將5000萬條測試數據寫入表中

postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; 
INSERT 0 50000000

3. 創建對象索引

postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); 
CREATE INDEX

4. 創建時間索引

postgres=# create index idx_tbl_2 on tbl using btree (crt_time); 
CREATE INDEX

5. 創建空間索引

postgres=# create index idx_tbl_3 on tbl using gist (pos); 
CREATE INDEX

6. 生成數據佈局以方便後續查詢

postgres=# select min(crt_time),max(crt_time),count(*) from tbl; 
 min | max | count 
----------------------------+----------------------------+---------- 
 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 
(1 row)

7. 創建一個極限KNN查詢函數

create or replace function ff(point, float8, int) returns setof tid as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist, 
 ctid 
 from tbl 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec.ctid; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile; 
postgres=# select * from ff(point (100,100) ,100,100) ; 
 ff 
------------- 
 (407383,11) 
 (640740,9) 
 (26073,51) 
 (642750,34) 
... 
(100 rows) 
Time: 1.061 ms

8. CTID合併檢索

顯示符合以下條件的記錄

( 
c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
c2 < 10 
) 
 and 
pos <-> point (0,0) < 5 
 and 
crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;

首先，分別查看每個條件，找匹配一個條件的記錄數量，以及在索引掃描上所花時長。

1. 54,907條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Heap Blocks: exact=52778 
 Buffers: shared hit=52866 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 Planning time: 0.105 ms 
 Execution time: 94.606 ms 
(10 rows)

2. 95,147條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10; 
 QUERY PLAN 
--------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c2 < 10) 
 Heap Blocks: exact=88681 
 Buffers: shared hit=88734 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.094 ms 
 Execution time: 186.201 ms 
(10 rows)

3. 149930條記錄（為快速獲得結果，PostgreSQL使用位圖進行合併掃描）

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------ 
 Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) 
 Heap Blocks: exact=134424 
 Buffers: shared hit=134565 
 -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.149 ms 
 Execution time: 274.548 ms 
(15 rows)

4. 60,687條記錄（即使運用出色的KNN性能優化，仍然需要耗費195毫秒）。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

讓我們看看不使用KNN優化需要多長時間。

結果非常令人驚訝——極限優化性能提高了一個數量級。

5. 2,640,751條記錄

使用所有索引逐個掃描數據條件，得到ctid並執行ctid掃描。

現在，讓我們來分解這個過程：

首先，讓我們看看時間和對象屬性的合併查詢，成果非常驚人。使用位圖BitmapOr時，查詢可以跳過大多數數據塊，並且掃描時間比單索引掃描要短。

注意，在此步驟中記錄的數量減少到7,847條。

postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 QUERY PLAN 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) 
 Output: ctid 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.203 ms 
 Execution time: 216.697 ms 
(20 rows)

然後，看KNN的掃描時間：

注意，60,687條記錄滿足KNN距離條件，所以接下來將解釋CTID合併掃描與原始掃描之間的性能比較。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

最後，將這些片段合併到ctid中。

select * from ff(point (0,0) ,5,1000000) 
 intersect 
select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 ff 
------------ 
 (1394,8) 
 (3892,50) 
 (6124,45) 
 (7235,8) 
 (7607,45) 
 (11540,8) 
 (13397,31) 
 (14266,36) 
 (18149,7) 
 (19256,44) 
 (24671,62) 
 (26525,64) 
 (30235,48) 
(13 rows) 
Time: 463.012 ms

取得最終紀錄。

select * from tbl where ctid = any 
( 
array( -- array start 
select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
) -- array end 
); 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows) 
Time: 462.715 ms

過程花費462毫秒。

9. 測試原始SQL查詢的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳過掃描

直接編寫SQL查詢，而不是使用多CTID掃描。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision) 
 Rows Removed by Filter: 7834 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.160 ms 
 Execution time: 216.797 ms 
(22 rows)

校驗結果：

select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows)

分區索引示例

假設前面的查詢條件保持不變，使用分區索引來測試性能。

這是為了演示分區索引的極端效果。在實際場景中，集合級別可能沒有那麼高（例如按天集合或按ID散列集合）。只要集合是可能的，就可以展現出色的性能。

postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) ; 
CREATE INDEX 
Time: 8359.330 ms (00:08.359)

重構極值KNN優化函數

create or replace function ff(point, float8, int) returns setof record as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist 
 from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile;

查詢性能：

postgres=# select * from ff(point (0,0) , 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8); 
 id | info | crt_time | pos | c1 | c2 | c3 | dist 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+------------------- 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311 
(13 rows) 
Time: 0.592 ms

太棒了！查詢時間從200毫秒減少到1毫秒以內。

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

示例

圖片來源：unsplash.com/@markusspiske

1. 創建對象提要數據表

postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); 
CREATE TABLE

2. 將5000萬條測試數據寫入表中

postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; 
INSERT 0 50000000

3. 創建對象索引

postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); 
CREATE INDEX

4. 創建時間索引

postgres=# create index idx_tbl_2 on tbl using btree (crt_time); 
CREATE INDEX

5. 創建空間索引

postgres=# create index idx_tbl_3 on tbl using gist (pos); 
CREATE INDEX

6. 生成數據佈局以方便後續查詢

postgres=# select min(crt_time),max(crt_time),count(*) from tbl; 
 min | max | count 
----------------------------+----------------------------+---------- 
 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 
(1 row)

7. 創建一個極限KNN查詢函數

create or replace function ff(point, float8, int) returns setof tid as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist, 
 ctid 
 from tbl 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec.ctid; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile; 
postgres=# select * from ff(point (100,100) ,100,100) ; 
 ff 
------------- 
 (407383,11) 
 (640740,9) 
 (26073,51) 
 (642750,34) 
... 
(100 rows) 
Time: 1.061 ms

8. CTID合併檢索

顯示符合以下條件的記錄

( 
c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
c2 < 10 
) 
 and 
pos <-> point (0,0) < 5 
 and 
crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;

首先，分別查看每個條件，找匹配一個條件的記錄數量，以及在索引掃描上所花時長。

1. 54,907條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Heap Blocks: exact=52778 
 Buffers: shared hit=52866 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 Planning time: 0.105 ms 
 Execution time: 94.606 ms 
(10 rows)

2. 95,147條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10; 
 QUERY PLAN 
--------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c2 < 10) 
 Heap Blocks: exact=88681 
 Buffers: shared hit=88734 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.094 ms 
 Execution time: 186.201 ms 
(10 rows)

3. 149930條記錄（為快速獲得結果，PostgreSQL使用位圖進行合併掃描）

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------ 
 Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) 
 Heap Blocks: exact=134424 
 Buffers: shared hit=134565 
 -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.149 ms 
 Execution time: 274.548 ms 
(15 rows)

4. 60,687條記錄（即使運用出色的KNN性能優化，仍然需要耗費195毫秒）。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

讓我們看看不使用KNN優化需要多長時間。

結果非常令人驚訝——極限優化性能提高了一個數量級。

5. 2,640,751條記錄

使用所有索引逐個掃描數據條件，得到ctid並執行ctid掃描。

現在，讓我們來分解這個過程：

首先，讓我們看看時間和對象屬性的合併查詢，成果非常驚人。使用位圖BitmapOr時，查詢可以跳過大多數數據塊，並且掃描時間比單索引掃描要短。

注意，在此步驟中記錄的數量減少到7,847條。

postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 QUERY PLAN 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) 
 Output: ctid 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.203 ms 
 Execution time: 216.697 ms 
(20 rows)

然後，看KNN的掃描時間：

注意，60,687條記錄滿足KNN距離條件，所以接下來將解釋CTID合併掃描與原始掃描之間的性能比較。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

最後，將這些片段合併到ctid中。

select * from ff(point (0,0) ,5,1000000) 
 intersect 
select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 ff 
------------ 
 (1394,8) 
 (3892,50) 
 (6124,45) 
 (7235,8) 
 (7607,45) 
 (11540,8) 
 (13397,31) 
 (14266,36) 
 (18149,7) 
 (19256,44) 
 (24671,62) 
 (26525,64) 
 (30235,48) 
(13 rows) 
Time: 463.012 ms

取得最終紀錄。

select * from tbl where ctid = any 
( 
array( -- array start 
select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
) -- array end 
); 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows) 
Time: 462.715 ms

過程花費462毫秒。

9. 測試原始SQL查詢的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳過掃描

直接編寫SQL查詢，而不是使用多CTID掃描。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision) 
 Rows Removed by Filter: 7834 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.160 ms 
 Execution time: 216.797 ms 
(22 rows)

校驗結果：

select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows)

分區索引示例

假設前面的查詢條件保持不變，使用分區索引來測試性能。

postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) ; 
CREATE INDEX 
Time: 8359.330 ms (00:08.359)

重構極值KNN優化函數

create or replace function ff(point, float8, int) returns setof record as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist 
 from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile;

查詢性能：

postgres=# select * from ff(point (0,0) , 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8); 
 id | info | crt_time | pos | c1 | c2 | c3 | dist 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+------------------- 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311 
(13 rows) 
Time: 0.592 ms

太棒了！查詢時間從200毫秒減少到1毫秒以內。

優化方法綜述

優化方法回顧：

1. 為不同的數據類型構建不同的索引。

例如，對空間使用GiST或SP-GiST索引，對時間使用B樹或BRIN索引，對多個對象屬性使用GIN索引。索引的目的是縮小數據掃描的範圍。

2. 方法五提到數據分區。

數據分區的目的是有意地組織數據，這意味著有意地組織數據以滿足搜索需求。例如，如果時間是必需的查詢條件或公共查詢條件，那麼可以按時間（分區）分割數據，以減少需要掃描的數據量。

3. 方法六描述了索引分區。

目的類似於方法五。方法五和方法六的區別在於分區在索引級別使用，因此當執行索引掃描時，數據命中率會直接提高。

4.方法七中的ctid合併掃描類似於PostgreSQL中的多索引bitmapAnd或bitmapOr掃描。

bitmapAnd/bitmapOr跳過不需要掃描的塊，方法七中的ctid合併掃描跳過不需要掃描的行。

合併從多個索引掃描獲得的ctid。跳過不需要掃描的行數。

如果當其他條件為“AND”時，過濾條件可以顯著減少ctid（記錄），則沒有必要使用ctid合併掃描。相反，使用FILTER作為另一個條件。（這將略微增加CPU開銷。）

5. 最好的功夫總是以最大的靈活性、自由和對每一個動作的無限想象為特徵。

PostgreSQL實現多索引BitmapAnd或BitmapOr掃描，顯著提高了多種條件（索引）下的數據命中率。

此外，PostgreSQL具有出色的CBO估計機制，它允許PostgreSQL不總是使用位圖合併掃描的所有索引。這也是為什麼在“測試原始SQL查詢的性能——PostgreSQL多索引BitmapAnd位圖或跳過掃描”一節中描述的性能更好。

6. 如何實現極端優化

採用方法五或六，並使用可修復的條件作為分區鍵來分區數據或索引。

對於其他條件，可以使用PostgreSQL中的多索引BitmapAnd或BitmapOr掃描來提高多條件（索引）的數據命中率。

我們可以看到，按照時間、空間和對象屬性從5,000萬數據塊中進行多維檢索所需的時間減少到了0.592毫秒。

7. 對於空間數據，除了使用GiST索引，我們還可以使用BRIN索引，這降低了成本。有條理地組織數據後，會使濾波性能良好。

圖片來源：unsplash.com/@lucassankey

別擔心，本文將用詳細的代碼，手把手來傳授高效搜索法的絕技！

對象數據分類

對象數據可分為兩種類型：靜態數據（相對靜態，例如建築）和動態數據（例如人的活動和物聯網傳感器的活動）。

按研究需求分類的索引

時空快照搜索

時空行為數據搜索

時空行為數據是指從人的活動等動態對象中獲取數據。

例如，分析特定地區特定時間段內某一人群的特徵，或者分析大學周邊人群在工作日和週末構成的差異。

時空快照不屬於本文的討論範圍。現在，我們看看如何搜索時空行為數據。

數據結構

圖片來源：unsplash.com/@dekubaum

時空行為數據包含三個屬性：時間、空間和對象。

非結構化索引：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
);

除了應用於JSON，結構化數據還可以用於對象描述。例如：

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 c1 int, -- Some property examples 
 c2 int, 
 c3 text, 
 c4 float8, 
 c5 int, 
 c6 date, 
 c7 text, 
 c8 int, 
 c9 int, 
 c10 int 
);

時空行為數據的SQL查詢實例

select * from test 
 where 
 pos <-> ? < ? 
 and crt_time between ? and ? 
 and ( (c1 = ? and c2 between ? and ?) or c10=?) 
 ... 
 ;

優化方法

考慮運用以下知識：

時間序列BRIN索引

crt_time字段是一個時間序列字段，表示生成數據的時間。在PostgreSQL堆存儲中，存儲和該字段的值具有很強的線性相關性。

因此，BRIN索引很合適。

使用BRIN索引來代替分區表進行TPC-H測試。大範圍搜索的性能甚至優於使用分區表時的功能。

create index idx_test_1 on test using brin(crt_time);

空間索引

顯然，空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。

1. 幾何類型的GIST索引

create index idx_test_2 on test using gist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

2. 幾何類型的主索引

create index idx_test_2 on test using spgist(pos);

該索引支持空間KNN搜索和空間位置確定等功能。

3. Geohash和B-tree索引（將經度和緯度轉換為Geohash併為hash值創建B-tree索引）。只需使用表達式索引。

create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );

此索引支持前綴搜索（其能落實編碼地理信息網格中包含的關係）。它屬於有損索引，需要二次過濾。

GiST和SPGiST空間索引能夠找到準確的地理位置信息，優於GEOHASH索引。但是，查詢信息時需要特別注意。

GIN 索引

此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。

例如：

create extension btree_gin;

非結構化索引：

create index idx_test_4 on test using gin( obj );

結構化索引：

create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );

BitmapAnd和BitmapOr

在上一節中，根據數據類型和查詢需求可以為不同的查詢維度選擇相應的索引。

但是，可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引，減少需要掃描的數據庫數量。

Heap, one square = one page: 
+---------------------------------------------+ 
|c____u_____X___u___X_________u___cXcc______u_| 
+---------------------------------------------+ 
Rows marked c match customers pkey condition. 
Rows marked u match username condition. 
Rows marked X match both conditions. 
Bitmap scan from customers_pkey: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
+---------------------------------------------+ 
One bit per heap page, in the same order as the heap 
Bits 1 when condition matches, 0 if not 
Bitmap scan from ix_cust_username: 
+---------------------------------------------+ 
|000001000001000100010000000001000010000000010| bitmap 2 
+---------------------------------------------+ 
Once the bitmaps are created a bitwise AND is performed on them: 
+---------------------------------------------+ 
|100000000001000000010000000000000111100000000| bitmap 1 
|000001000001000100010000000001000010000000010| bitmap 2 
 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 
|000000000001000000010000000000000010000000000| Combined bitmap 
+-----------+-------+--------------+----------+ 
 | | | 
 v v v 
Used to scan the heap only for matching pages: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
The bitmap heap scan then seeks to the start of each page and reads the page: 
+---------------------------------------------+ 
|___________X_______X______________X__________| 
+---------------------------------------------+ 
seek------->^seek-->^seek--------->^ 
 | | | 
 ------------------------ 
 only these pages read

例如：

select * from test where 
 c1 ... 
 and crt_time between ? and ? 
 and test->> c1 in (?, ? ...);

根據統計數據自動使用適當的索引。如果需要，bitmapAnd和bitmapOr將在多個索引上自動執行合併掃描。跳過不需要掃描的頁面，重新檢查命中的頁面。

堆表存儲分級和分區

存儲可以分為一級分區或多級分區：

1. 單一分區

例如，按時間劃分。

create table test( 
 id int8, 
 crt_time timestamp, -- Time 
 pos geometry, -- Location 
 obj jsonb -- Object description 
) 
PARTITION BY range (crt_time) 
; 
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); 
......

2. 多層分區

例如，先按時間分區，然後按Geohash劃分。

create table test_201701 PARTITION OF test for values 
FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); 
... 
create table test_201701_prefix1 PARTITION OF test for values 
FROM ( xxxx1 ) TO ( xxxx2 ); 
-- Generate BOX (GRID) on a map, find corresponding boundaries and use 
-- boundaries as partitioning conditions

使用分區時，如果查詢條件包括分區鍵（如時間和空間範圍），相應的分區將自動定位，這即為需要掃描的數據量。

創建面向對象屬性的GIN索引，以實現高效查詢。

索引分級與分區

與數據一樣，索引在不使用分區表的情況下也支持分區邏輯。

空間索引+時間分區

create index idx_20170101 
on tbl using gist (pos) 
where crt_time between 2017-01-01 and 2017-01-02 ; 
... 
create index idx_20170102 
on tbl using gist (pos) 
where crt_time between 2017-01-02 and 2017-01-03 ; 
...

通過使用前述分區索引，可以在輸入時間範圍後快速定位目標數據，執行空間搜索。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

可以使用更多的索引分區，比如用作搜索條件和商店類型的維度（對象屬性）（假設它是可枚舉的或在範圍相對較小的情況下）。

create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; 
... 
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; 
...

通過使用前面的分區索引，在輸入時間範圍或特定條件以執行空間搜索後，可以快速定位目標數據。

select * from tbl 
 where crt_time between 2017-01-01 and 2017-01-02 -- Time 
 and (pos <-> ?) < ? -- Distance to a point to be searched for 
 and dtype=0 -- Object condition 
 and ? -- Other conditions 
 order by pos <-> ? -- Sort by distance 
 limit ?; -- Number of results to be returned

請注意，前面的SQL查詢可以實現最佳性能優化。

索引組織形式（或索引結構）可以由邏輯分區重新構造，可以用上述類似的索引創建方法覆蓋所有條件。

CTID相交陣列連接掃描

如前所述，BitmapAnd和BitmapOr合併掃描是在多個索引或GIN索引中自動執行的。事實上，這種掃描也可以在SQL中顯式執行。

每個條件滲透對應的CTID。

使用Intersect或Union生成滿足總體需求的CTID。（Intersect對應於“and”條件；union對應於“or”條件。）

生成一個ctid數組。

示例

圖片來源：unsplash.com/@markusspiske

1. 創建對象提要數據表

postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); 
CREATE TABLE

2. 將5000萬條測試數據寫入表中

postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; 
INSERT 0 50000000

3. 創建對象索引

postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); 
CREATE INDEX

4. 創建時間索引

postgres=# create index idx_tbl_2 on tbl using btree (crt_time); 
CREATE INDEX

5. 創建空間索引

postgres=# create index idx_tbl_3 on tbl using gist (pos); 
CREATE INDEX

6. 生成數據佈局以方便後續查詢

postgres=# select min(crt_time),max(crt_time),count(*) from tbl; 
 min | max | count 
----------------------------+----------------------------+---------- 
 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 
(1 row)

7. 創建一個極限KNN查詢函數

create or replace function ff(point, float8, int) returns setof tid as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist, 
 ctid 
 from tbl 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec.ctid; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile; 
postgres=# select * from ff(point (100,100) ,100,100) ; 
 ff 
------------- 
 (407383,11) 
 (640740,9) 
 (26073,51) 
 (642750,34) 
... 
(100 rows) 
Time: 1.061 ms

8. CTID合併檢索

顯示符合以下條件的記錄

( 
c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
c2 < 10 
) 
 and 
pos <-> point (0,0) < 5 
 and 
crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;

首先，分別查看每個條件，找匹配一個條件的記錄數量，以及在索引掃描上所花時長。

1. 54,907條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Heap Blocks: exact=52778 
 Buffers: shared hit=52866 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 Planning time: 0.105 ms 
 Execution time: 94.606 ms 
(10 rows)

2. 95,147條記錄

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10; 
 QUERY PLAN 
--------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (tbl.c2 < 10) 
 Heap Blocks: exact=88681 
 Buffers: shared hit=88734 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.094 ms 
 Execution time: 186.201 ms 
(10 rows)

3. 149930條記錄（為快速獲得結果，PostgreSQL使用位圖進行合併掃描）

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10; 
 QUERY PLAN 
------------------------------------------------------------------------------------------------------------------------------------ 
 Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) 
 Heap Blocks: exact=134424 
 Buffers: shared hit=134565 
 -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 Planning time: 0.149 ms 
 Execution time: 274.548 ms 
(15 rows)

4. 60,687條記錄（即使運用出色的KNN性能優化，仍然需要耗費195毫秒）。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

讓我們看看不使用KNN優化需要多長時間。

結果非常令人驚訝——極限優化性能提高了一個數量級。

5. 2,640,751條記錄

使用所有索引逐個掃描數據條件，得到ctid並執行ctid掃描。

現在，讓我們來分解這個過程：

首先，讓我們看看時間和對象屬性的合併查詢，成果非常驚人。使用位圖BitmapOr時，查詢可以跳過大多數數據塊，並且掃描時間比單索引掃描要短。

注意，在此步驟中記錄的數量減少到7,847條。

postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 QUERY PLAN 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) 
 Output: ctid 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.203 ms 
 Execution time: 216.697 ms 
(20 rows)

然後，看KNN的掃描時間：

注意，60,687條記錄滿足KNN距離條件，所以接下來將解釋CTID合併掃描與原始掃描之間的性能比較。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); 
 QUERY PLAN 
---------------------------------------------------------------------------------------------------------------------- 
 Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) 
 Output: ff 
 Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) 
 Buffers: shared hit=61296 
 Planning time: 0.029 ms 
 Execution time: 195.097 ms 
(6 rows)

最後，將這些片段合併到ctid中。

select * from ff(point (0,0) ,5,1000000) 
 intersect 
select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ); 
 ff 
------------ 
 (1394,8) 
 (3892,50) 
 (6124,45) 
 (7235,8) 
 (7607,45) 
 (11540,8) 
 (13397,31) 
 (14266,36) 
 (18149,7) 
 (19256,44) 
 (24671,62) 
 (26525,64) 
 (30235,48) 
(13 rows) 
Time: 463.012 ms

取得最終紀錄。

select * from tbl where ctid = any 
( 
array( -- array start 
select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl 
 where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
) -- array end 
); 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows) 
Time: 462.715 ms

過程花費462毫秒。

9. 測試原始SQL查詢的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳過掃描

直接編寫SQL查詢，而不是使用多CTID掃描。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1) 
 Output: id, info, crt_time, pos, c1, c2, c3 
 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision) 
 Rows Removed by Filter: 7834 
 Heap Blocks: exact=6983 
 Buffers: shared hit=14343 
 -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1) 
 Buffers: shared hit=7360 
 -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) 
 Buffers: shared hit=141 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) 
 Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) 
 Buffers: shared hit=88 
 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) 
 Index Cond: (tbl.c2 < 10) 
 Buffers: shared hit=53 
 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) 
 Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) 
 Buffers: shared hit=7219 
 Planning time: 0.160 ms 
 Execution time: 216.797 ms 
(22 rows)

校驗結果：

select * from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 and 
 pos <-> point (0,0) < 5; 
 id | info | crt_time | pos | c1 | c2 | c3 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 
(13 rows)

分區索引示例

假設前面的查詢條件保持不變，使用分區索引來測試性能。

postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) ; 
CREATE INDEX 
Time: 8359.330 ms (00:08.359)

重構極值KNN優化函數

create or replace function ff(point, float8, int) returns setof record as 
$
declare 
 v_rec record; 
 v_limit int := $3; 
begin 
 set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number 
 for v_rec in 
 select *, 
 (pos <-> $1) as dist 
 from tbl 
 where 
 crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 
 and ( 
 c1 in (1,2,3,4,100,200,99,88,77,66,55) 
 or 
 c2 < 10 
 ) 
 order by pos <-> $1 
 loop 
 if v_limit <=0 then 
 -- raise notice "Sufficient data obtained" 
 return; 
 end if; 
 if v_rec.dist > $2 then 
 -- raise notice "All matching points returned" 
 return; 
 else 
 return next v_rec; 
 end if; 
 v_limit := v_limit -1; 
 end loop; 
end; 
$
 language plpgsql strict volatile;

查詢性能：

postgres=# select * from ff(point (0,0) , 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8); 
 id | info | crt_time | pos | c1 | c2 | c3 | dist 
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+------------------- 
 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319 
 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376 
 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301 
 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508 
 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247 
 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409 
 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975 
 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952 
 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581 
 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986 
 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676 
 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757 
 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311 
(13 rows) 
Time: 0.592 ms

太棒了！查詢時間從200毫秒減少到1毫秒以內。

優化方法綜述

優化方法回顧：

1. 為不同的數據類型構建不同的索引。

例如，對空間使用GiST或SP-GiST索引，對時間使用B樹或BRIN索引，對多個對象屬性使用GIN索引。索引的目的是縮小數據掃描的範圍。

2. 方法五提到數據分區。

3. 方法六描述了索引分區。

目的類似於方法五。方法五和方法六的區別在於分區在索引級別使用，因此當執行索引掃描時，數據命中率會直接提高。

4.方法七中的ctid合併掃描類似於PostgreSQL中的多索引bitmapAnd或bitmapOr掃描。

bitmapAnd/bitmapOr跳過不需要掃描的塊，方法七中的ctid合併掃描跳過不需要掃描的行。

合併從多個索引掃描獲得的ctid。跳過不需要掃描的行數。

5. 最好的功夫總是以最大的靈活性、自由和對每一個動作的無限想象為特徵。

PostgreSQL實現多索引BitmapAnd或BitmapOr掃描，顯著提高了多種條件（索引）下的數據命中率。

6. 如何實現極端優化

採用方法五或六，並使用可修復的條件作為分區鍵來分區數據或索引。

對於其他條件，可以使用PostgreSQL中的多索引BitmapAnd或BitmapOr掃描來提高多條件（索引）的數據命中率。

我們可以看到，按照時間、空間和對象屬性從5,000萬數據塊中進行多維檢索所需的時間減少到了0.592毫秒。

7. 對於空間數據，除了使用GiST索引，我們還可以使用BRIN索引，這降低了成本。有條理地組織數據後，會使濾波性能良好。

留言點贊關注

我們一起分享AI學習與發展的乾貨

編譯組：李美尼、餘書敏

相關鏈接：

https://dzone.com/articles/efficient-search-on-massive-data-with-multidimensi

如需轉載，請後臺留言，遵守轉載規範

'用詳細的代碼，手把手傳授給你高效搜索法的絕技'

相關推薦