OS Kernel Parameters Every DBA Must Know (Worth Bookmarking)

"

概述

操作系統為了適應更多的硬件環境,許多初始的設置值,寬容度都很高。如果不經調整,這些值可能無法適應HPC,或者硬件稍好些的環境,無法發揮更好的硬件性能,甚至可能影響某些應用軟件的使用,特別是數據庫。今天主要介紹一些DBA不可不知的操作系統內核參數,僅供參考,只針對數據庫方面。


數據庫關心的OS內核參數

這裡以512GB 內存為例

1.參數

fs.aio-max-nr 

支持系統:CentOS 6, 7

"

概述

操作系統為了適應更多的硬件環境,許多初始的設置值,寬容度都很高。如果不經調整,這些值可能無法適應HPC,或者硬件稍好些的環境,無法發揮更好的硬件性能,甚至可能影響某些應用軟件的使用,特別是數據庫。今天主要介紹一些DBA不可不知的操作系統內核參數,僅供參考,只針對數據庫方面。


數據庫關心的OS內核參數

這裡以512GB 內存為例

1.參數

fs.aio-max-nr 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.aio-max-nr = 1xxxxxx 
.
PostgreSQL, Greenplum 均未使用io_setup創建aio contexts. 無需設置。
如果Oracle數據庫,要使用aio的話,需要設置它。
設置它也沒什麼壞處,如果將來需要適應異步IO,可以不需要重新修改這個設置。

2.參數

fs.file-max 

支持系統:CentOS 6, 7

"

概述

操作系統為了適應更多的硬件環境,許多初始的設置值,寬容度都很高。如果不經調整,這些值可能無法適應HPC,或者硬件稍好些的環境,無法發揮更好的硬件性能,甚至可能影響某些應用軟件的使用,特別是數據庫。今天主要介紹一些DBA不可不知的操作系統內核參數,僅供參考,只針對數據庫方面。


數據庫關心的OS內核參數

這裡以512GB 內存為例

1.參數

fs.aio-max-nr 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.aio-max-nr = 1xxxxxx 
.
PostgreSQL, Greenplum 均未使用io_setup創建aio contexts. 無需設置。
如果Oracle數據庫,要使用aio的話,需要設置它。
設置它也沒什麼壞處,如果將來需要適應異步IO,可以不需要重新修改這個設置。

2.參數

fs.file-max 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.file-max = 7xxxxxxx 
.
PostgreSQL 有一套自己管理的VFS,真正打開的FD與內核管理的文件打開關閉有一套映射的機制,所以真實情況不需要使用那麼多的file handlers。
max_files_per_process 參數。
假設1GB內存支撐100個連接,每個連接打開1000個文件,那麼一個PG實例需要打開10萬個文件,一臺機器按512G內存來算可以跑500個PG實例,則需要5000萬個file handler。
以上設置綽綽有餘。

3.參數

kernel.core_pattern 

支持系統:CentOS 6, 7

參數解釋

"

概述

操作系統為了適應更多的硬件環境,許多初始的設置值,寬容度都很高。如果不經調整,這些值可能無法適應HPC,或者硬件稍好些的環境,無法發揮更好的硬件性能,甚至可能影響某些應用軟件的使用,特別是數據庫。今天主要介紹一些DBA不可不知的操作系統內核參數,僅供參考,只針對數據庫方面。


數據庫關心的OS內核參數

這裡以512GB 內存為例

1.參數

fs.aio-max-nr 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.aio-max-nr = 1xxxxxx 
.
PostgreSQL, Greenplum 均未使用io_setup創建aio contexts. 無需設置。
如果Oracle數據庫,要使用aio的話,需要設置它。
設置它也沒什麼壞處,如果將來需要適應異步IO,可以不需要重新修改這個設置。

2.參數

fs.file-max 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.file-max = 7xxxxxxx 
.
PostgreSQL 有一套自己管理的VFS,真正打開的FD與內核管理的文件打開關閉有一套映射的機制,所以真實情況不需要使用那麼多的file handlers。
max_files_per_process 參數。
假設1GB內存支撐100個連接,每個連接打開1000個文件,那麼一個PG實例需要打開10萬個文件,一臺機器按512G內存來算可以跑500個PG實例,則需要5000萬個file handler。
以上設置綽綽有餘。

3.參數

kernel.core_pattern 

支持系統:CentOS 6, 7

參數解釋

數據庫DBA不可不知的操作系統內核參數,值得收藏

推薦設置

kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p 
.
這個目錄要777的權限,如果它是個軟鏈,則真實目錄需要777的權限
mkdir /xxx
chmod 777 /xxx
注意留足夠的空間

4.參數

kernel.sem 

支持系統:CentOS 6, 7

參數解釋

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096 每組多少信號量 (>=17, PostgreSQL 每16個進程一組, 每組需要17個信號量) ,
2147483647 總共多少信號量 (2^31-1 , 且大於4096*512000 ) ,
2147483646 每個semop()調用支持多少操作 (2^31-1),
512000 多少組信號量 (假設每GB支持100個連接, 512GB支持51200個連接, 加上其他進程, > 51200*2/16 綽綽有餘)
.
# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"
.
# ipcs -s -l
------ Semaphore Limits --------
max number of arrays = 512000
max semaphores per array = 4096
max semaphores system wide = 2147483647
max ops per semop call = 2147483646
semaphore max value = 32767

推薦設置

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096可能能夠適合更多的場景, 所以大點無妨,關鍵是512000 arrays也夠了。

5.參數

kernel.shmall = 107374182 
kernel.shmmax = 274877906944
kernel.shmmni = 819200

支持系統:CentOS 6, 7

參數解釋

假設主機內存 512GB 
.
shmmax 單個共享內存段最大 256GB (主機內存的一半,單位字節)
shmall 所有共享內存段加起來最大 (主機內存的80%,單位PAGE)
shmmni 一共允許創建819200個共享內存段 (每個數據庫啟動需要2個共享內存段。 將來允許動態創建共享內存段,可能需求量更大)
.
# getconf PAGE_SIZE
4096

推薦設置

kernel.shmall = 107374182 
kernel.shmmax = 274877906944
kernel.shmmni = 819200
.
9.2以及以前的版本,數據庫啟動時,對共享內存段的內存需求非常大,需要考慮以下幾點
Connections:\t(1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers:\t(1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions:\t(770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers:\t(block_size + 208) * shared_buffers
WAL buffers:\t(wal_block_size + 8) * wal_buffers
Fixed space requirements:\t770 kB
.
以上建議參數根據9.2以前的版本設置,後期的版本同樣適用。

6.參數

net.core.netdev_max_backlog 

支持系統:CentOS 6, 7

參數解釋

netdev_max_backlog 
------------------
Maximum number of packets, queued on the INPUT side,
when the interface receives packets faster than kernel can process them.

推薦設置

net.core.netdev_max_backlog=1xxxx 
.
INPUT鏈表越長,處理耗費越大,如果用了iptables管理的話,需要加大這個值。

7.參數

net.core.rmem_default 
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

支持系統:CentOS 6, 7

參數解釋

rmem_default 
------------
The default setting of the socket receive buffer in bytes.
.
rmem_max
--------
The maximum receive socket buffer size in bytes.
.
wmem_default
------------
The default setting (in bytes) of the socket send buffer.
.
wmem_max
--------
The maximum send socket buffer size in bytes.

推薦設置

net.core.rmem_default = 262144 
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304

8.參數

net.core.somaxconn 

支持系統:CentOS 6, 7

參數解釋

somaxconn - INTEGER 
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 128.
\tSee also tcp_max_syn_backlog for additional tuning for TCP sockets.

推薦設置

net.core.somaxconn=4xxx 

9.參數

net.ipv4.tcp_max_syn_backlog 

支持系統:CentOS 6, 7

參數解釋

tcp_max_syn_backlog - INTEGER 
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.

推薦設置

net.ipv4.tcp_max_syn_backlog=4xxx 
pgpool-II 使用了這個值,用於將超過num_init_child以外的連接queue。
所以這個值決定了有多少連接可以在隊列裡面等待。

10.參數

net.ipv4.tcp_keepalive_intvl=20 
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60

支持系統:CentOS 6, 7

參數解釋

tcp_keepalive_time - INTEGER 
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.

推薦設置

net.ipv4.tcp_keepalive_intvl=20 
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
.
連接空閒60秒後, 每隔20秒發心跳包, 嘗試3次心跳包沒有響應,關閉連接。 從開始空閒,到關閉連接總共歷時120秒。

11.參數

net.ipv4.tcp_mem=8388608 12582912 16777216 

支持系統:CentOS 6, 7

參數解釋

tcp_mem - vector of 3 INTEGERs: min, pressure, max 
單位 page
min: below this number of pages TCP is not bothered about its
memory appetite.
.
pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumption falls
under "min".
.
max: number of pages allowed for queueing by all TCP sockets.
.
Defaults are calculated at boot time from amount of available
memory.
64GB 內存,自動計算的值是這樣的
net.ipv4.tcp_mem = 1539615 2052821 3079230
.
512GB 內存,自動計算得到的值是這樣的
net.ipv4.tcp_mem = 49621632 66162176 99243264
.
這個參數讓操作系統啟動時自動計算,問題也不大

推薦設置

net.ipv4.tcp_mem=8388608 12582912 16777216 
.
這個參數讓操作系統啟動時自動計算,問題也不大

12.參數

net.ipv4.tcp_fin_timeout 

支持系統:CentOS 6, 7

參數解釋

tcp_fin_timeout - INTEGER 
The length of time an orphaned (no longer referenced by any
application) connection will remain in the FIN_WAIT_2 state
before it is aborted at the local end. While a perfectly
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
Cf. tcp_max_orphans
Default: 60 seconds

推薦設置

net.ipv4.tcp_fin_timeout=5 
.
加快殭屍連接回收速度

13.參數

net.ipv4.tcp_synack_retries 

支持系統:CentOS 6, 7

參數解釋

tcp_synack_retries - INTEGER 
Number of times SYNACKs for a passive TCP connection attempt will
be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to 31seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for a passive TCP connection will happen after 63seconds.

推薦設置

net.ipv4.tcp_synack_retries=2 
.
縮短tcp syncack超時時間

14.參數

net.ipv4.tcp_syncookies 

支持系統:CentOS 6, 7

參數解釋

tcp_syncookies - BOOLEAN 
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
Default: 1
.
Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
.
syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.
.
If you want to test which effects syncookies have to your
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.

推薦設置

net.ipv4.tcp_syncookies=1 
.
防止syn flood攻擊

15.參數

net.ipv4.tcp_timestamps 

支持系統:CentOS 6, 7

參數解釋

tcp_timestamps - BOOLEAN 
Enable timestamps as defined in RFC1323.

推薦設置

net.ipv4.tcp_timestamps=1 
.
tcp_timestamps 是 tcp 協議中的一個擴展項,通過時間戳的方式來檢測過來的包以防止 PAWS(Protect Against Wrapped Sequence numbers),可以提高 tcp 的性能。

16.參數

net.ipv4.tcp_tw_recycle 
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets

支持系統:CentOS 6, 7

參數解釋

tcp_tw_recycle - BOOLEAN 
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed.
\tThis limit exists only to prevent simple DoS attacks,
\tyou _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.

推薦設置

net.ipv4.tcp_tw_recycle=0 
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
.
net.ipv4.tcp_tw_recycle和net.ipv4.tcp_timestamps不建議同時開啟

17.參數

net.ipv4.tcp_rmem 
net.ipv4.tcp_wmem

支持系統:CentOS 6, 7

參數解釋

tcp_wmem - vector of 3 INTEGERs: min, default, max 
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
Default: 1 page
.
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
It is usually lower than net.core.wmem_default.
Default: 16K
.
max: Maximal amount of memory allowed for automatically tuned
send buffers for TCP sockets. This value does not override
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
Default: between 64K and 4MB, depending on RAM size.
.
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 1 page
.
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.
.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
Default: between 87380B and 6MB, depending on RAM size.

推薦設置

net.ipv4.tcp_rmem=8192 87380 16777216 
net.ipv4.tcp_wmem=8192 65536 16777216
.
許多數據庫的推薦設置,提高網絡性能

18.參數

net.nf_conntrack_max 
net.netfilter.nf_conntrack_max

支持系統:CentOS 6

參數解釋

nf_conntrack_max - INTEGER 
Size of connection tracking table.
\tDefault value is nf_conntrack_buckets value * 4.

推薦設置

net.nf_conntrack_max=1xxxxxx 
net.netfilter.nf_conntrack_max=1xxxxxx

19.參數

vm.dirty_background_bytes 
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs

支持系統:CentOS 6, 7

參數解釋

=====================================================
.
dirty_background_bytes
.
Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.
.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
.
=====================================================
.
dirty_background_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.
.
=====================================================
.
dirty_bytes
.
Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.
.
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.
.
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
.
=====================================================
.
dirty_expire_centisecs
.
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
.
=====================================================
.
dirty_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
.
=====================================================
.
dirty_writeback_centisecs
.
The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
.
Setting this to zero disables periodic writeback altogether.

推薦設置

vm.dirty_background_bytes = 4096000000 
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
.
減少數據庫進程刷髒頁的頻率,dirty_background_bytes根據實際IOPS能力以及內存大小設置

20.參數

vm.extra_free_kbytes 

支持系統:CentOS 6

參數解釋

extra_free_kbytes 
.
This parameter tells the VM to keep extra free memory
between the threshold where background reclaim (kswapd) kicks in,
and the threshold where direct reclaim (by allocating processes) kicks in.
.
This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations,
for example a realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
.
目標是儘量讓後臺進程回收內存,比用戶進程提早多少kbytes回收,因此用戶進程可以快速分配內存。

推薦設置

vm.extra_free_kbytes=4xxxxxx 

21.參數

vm.min_free_kbytes 

支持系統:CentOS 6, 7

參數解釋

min_free_kbytes: 
.
This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
.
Setting this too high will OOM your machine instantly.

推薦設置

vm.min_free_kbytes = 2xxxxxx # vm.min_free_kbytes 建議每32G內存分配1G vm.min_free_kbytes
.
防止在高負載時系統無響應,減少內存分配死鎖概率。

22.參數

vm.mmap_min_addr 

支持系統:CentOS 6, 7

參數解釋

mmap_min_addr 
.
This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

推薦設置

vm.mmap_min_addr=6xxxx 
.
防止內核隱藏的BUG導致的問題

23.參數

vm.overcommit_memory 
vm.overcommit_ratio

支持系統:CentOS 6, 7

參數解釋

====================================================
.
overcommit_kbytes:
.
When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.
.
Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).
.
====================================================
.
overcommit_memory:
.
This value contains a flag that enables memory overcommitment.
.
When this flag is 0,
the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.
.
When this flag is 1,
the kernel pretends there is always enough memory until it actually runs out.
.
When this flag is 2,
the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.
.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
.
The default value is 0.
.
See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.
.
=====================================================
.
overcommit_ratio:
.
When overcommit_memory is set to 2,
the committed address space is not permitted to exceed
swap + this percentage of physical RAM.
See above.

推薦設置

vm.overcommit_memory = 0 
vm.overcommit_ratio = 90
.
vm.overcommit_memory = 0 時 vm.overcommit_ratio可以不設置

24.參數

vm.swappiness 

支持系統:CentOS 6, 7

參數解釋

swappiness 
.
This control is used to define how aggressive the kernel will swap
memory pages.
Higher values will increase agressiveness, lower values
decrease the amount of swap.
.
The default value is 60.

推薦設置

vm.swappiness = 0 

25.參數

vm.zone_reclaim_mode 

支持系統:CentOS 6, 7

參數解釋

zone_reclaim_mode: 
.
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.
.
This is value ORed together of
.
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
.
zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.
.
zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.
.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
.
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

推薦設置

vm.zone_reclaim_mode=0 
.
不使用NUMA

26.參數

net.ipv4.ip_local_port_range 

支持系統:CentOS 6, 7

參數解釋

ip_local_port_range - 2 INTEGERS 
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number. The default values are
32768 and 61000 respectively.
.
ip_local_reserved_ports - list of comma separated ranges
Specify the ports which are reserved for known third-party
applications. These ports will not be used by automatic port
assignments (e.g. when calling connect() or bind() with port
number 0). Explicit port allocation behavior is unchanged.
.
The format used for both input and output is a comma separated
list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
10). Writing to the file will clear all previously reserved
ports and update the current list with the one given in the
input.
.
Note that ip_local_port_range and ip_local_reserved_ports
settings are independent and both are considered by the kernel
when determining which ports are available for automatic port
assignments.
.
You can reserve ports which are not in the current
ip_local_port_range, e.g.:
.
$ cat /proc/sys/net/ipv4/ip_local_port_range
32000 61000
$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
8080,9148
.
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
include the reserved ports.
.
Default: Empty

推薦設置

net.ipv4.ip_local_port_range=40000 65535 
.
限制本地動態端口分配範圍,防止佔用監聽端口。

27.參數

 vm.nr_hugepages 

支持系統:CentOS 6, 7

參數解釋

=====================================================
nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
=====================================================
nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
.
The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free: www
HugePages_Rsvd: xxx
HugePages_Surp: yyy
Hugepagesize: zzz kB
.
where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free is the number of huge pages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made. Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.
HugePages_Surp is short for "surplus," and is the number of huge pages in
the pool above the value in /proc/sys/vm/nr_hugepages. The
maximum number of surplus huge pages is controlled by
/proc/sys/vm/nr_overcommit_hugepages.
.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
.
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of 'nr_hugepages'.

推薦設置

如果要使用PostgreSQL的huge page,建議設置它。 
大於數據庫需要的共享內存即可。

28.參數

 fs.nr_open

支持系統:CentOS 6, 7

參數解釋

nr_open:
This denotes the maximum number of file-handles a process can
allocate. Default value is 1024*1024 (1048576) which should be
enough for most machines. Actual limit depends on RLIMIT_NOFILE
resource limit.
它還影響security/limits.conf 的文件句柄限制,單個進程的打開句柄不能大於fs.nr_open,所以要加大文件句柄限制,首先要加大nr_open

推薦設置

對於有很多對象(表、視圖、索引、序列、物化視圖等)的PostgreSQL數據庫,建議設置為2000萬,
例如fs.nr_open=20480000

數據庫關心的資源限制

1. 通過/etc/security/limits.conf設置,或者ulimit設置

2. 通過/proc/$pid/limits查看當前進程的設置

# - core - limits the core file size (KB) 
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files 建議設置為1000萬 , 但是必須設置sysctl, fs.nr_open大於它,否則會導致系統無法登陸。
# - nproc - max number of processes
以上四個是非常關心的配置
....
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority

數據庫關心的IO調度規則

1. 目前操作系統支持的IO調度策略包括cfq, deadline, noop 等。

從這裡可以看到它的調度策略

cat /sys/block/磁盤/queue/scheduler 
"

Overview

To accommodate the widest possible range of hardware, operating systems ship with very tolerant default values. Left untuned, these defaults may not suit HPC or even moderately capable hardware: they can keep the hardware from delivering its full performance and can even hamper certain applications, databases in particular. This article introduces OS kernel parameters every DBA should know. It is for reference only and focuses on the database side.


OS kernel parameters databases care about

The examples below use a host with 512GB of RAM.

1. Parameter

fs.aio-max-nr

Supported systems: CentOS 6, 7

Parameter explanation

aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures.

Recommended setting

fs.aio-max-nr = 1xxxxxx 
.
PostgreSQL and Greenplum do not call io_setup to create aio contexts, so for them no setting is needed.
For Oracle, if you want to use aio, you must set it.
Setting it does no harm either way: if you move to asynchronous I/O later, you will not have to touch this setting again.
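.
To see how close the system is to the ceiling, the live counter can be compared against the limit (both are exposed under /proc/sys/fs):
.
# Current total of aio events requested via io_setup() across all contexts
cat /proc/sys/fs/aio-nr
# The ceiling; io_setup() fails with EAGAIN once aio-nr reaches it
cat /proc/sys/fs/aio-max-nr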

2. Parameter

fs.file-max

Supported systems: CentOS 6, 7

Parameter explanation

The value in file-max denotes the maximum number of file-handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit.

Recommended setting

fs.file-max = 7xxxxxxx 
.
PostgreSQL maintains a virtual file descriptor layer of its own: the FDs it really holds open are mapped onto the kernel's file opens and closes, so in practice it does not need that many file handles. See the max_files_per_process parameter.
Assume 1GB of RAM supports 100 connections and each connection opens 1,000 files; one PG instance then needs to open 100,000 files. Sizing by 512GB of RAM, one machine could run 500 PG instances, which would need 50 million file handles.
The setting above leaves plenty of headroom.
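.
To check real consumption against the limit, file-nr reports three fields (allocated handles, allocated-but-unused handles, and the file-max ceiling):
.
# Three fields: allocated, allocated-but-unused, maximum (file-max)
cat /proc/sys/fs/file-nr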

3. Parameter

kernel.core_pattern

Supported systems: CentOS 6, 7

Parameter explanation

core_pattern is used to specify a core dumpfile pattern name. The template may contain % specifiers, including %e (executable filename), %u (uid), %t (UNIX time of dump), %s (signal number), and %p (pid) — the ones used in the recommended setting below.

Recommended setting

kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p 
.
This directory must have mode 777; if it is a symlink, the target directory needs mode 777:
mkdir /xxx
chmod 777 /xxx
Be sure to leave enough free space for core dumps.
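.
A quick way to confirm the pattern is active (/xxx is the placeholder directory from above; substitute your real dump directory):
.
sysctl -w kernel.core_pattern="/xxx/core_%e_%u_%t_%s.%p"
cat /proc/sys/kernel/core_pattern
# The database process also needs a nonzero core-size rlimit:
ulimit -c unlimited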

4. Parameter

kernel.sem

Supported systems: CentOS 6, 7

Parameter explanation

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096 — max semaphores per set (>= 17; PostgreSQL uses one set per 16 server processes, and each set needs 17 semaphores),
2147483647 — max semaphores system-wide (2^31-1, and greater than 4096*512000),
2147483646 — max operations per semop() call (2^31-1),
512000 — max number of semaphore sets (assuming 100 connections per GB of RAM, 512GB supports 51,200 connections; even counting other processes, > 51200*2/16 leaves plenty of headroom)
.
# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"
.
# ipcs -s -l
------ Semaphore Limits --------
max number of arrays = 512000
max semaphores per array = 4096
max semaphores system wide = 2147483647
max ops per semop call = 2147483646
semaphore max value = 32767

Recommended setting

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096 per set should fit most scenarios, so a somewhat large value does no harm; the key point is that 512000 sets is also enough.
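.
As a sanity check, the number of semaphore sets PostgreSQL needs can be estimated from the backend count; a rough sketch (the 51200 figure reuses the connection estimate above):
.
procs=51200                      # total backends across all instances (estimate from above)
sets=$(( (procs + 15) / 16 ))    # one 17-semaphore set per 16 processes, rounded up
echo "semaphore sets needed: $sets of 512000 allowed"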

5. Parameter

kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200

Supported systems: CentOS 6, 7

Parameter explanation

Assume the host has 512GB of RAM.
.
shmmax — max size of a single shared memory segment: 256GB (half of host RAM, in bytes)
shmall — max combined size of all shared memory segments (80% of host RAM, in PAGES)
shmmni — up to 819200 shared memory segments may be created (each database instance needs 2 segments at startup; once segments can be created dynamically, demand may be higher)
.
# getconf PAGE_SIZE
4096

Recommended setting

kernel.shmall = 107374182 
kernel.shmmax = 274877906944
kernel.shmmni = 819200
.
In 9.2 and earlier, the database's shared-memory demand at startup is very large; account for the following:
Connections: (1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers: (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions: (770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers: (block_size + 208) * shared_buffers
WAL buffers: (wal_block_size + 8) * wal_buffers
Fixed space requirements: 770 kB
.
The recommendations above were sized against pre-9.3 behavior, but they are equally suitable for later releases.
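.
The recommended numbers fall out of the RAM size directly; this sketch reproduces the arithmetic (half of RAM in bytes for shmmax, 80% of RAM in pages for shmall):
.
mem_bytes=$(( 512 * 1024**3 ))         # 512GB host, as in the running example
page_size=$(getconf PAGE_SIZE)         # typically 4096
echo "kernel.shmmax = $(( mem_bytes / 2 ))"                    # 274877906944
echo "kernel.shmall = $(( mem_bytes * 8 / 10 / page_size ))"   # 107374182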

6. Parameter

net.core.netdev_max_backlog

Supported systems: CentOS 6, 7

Parameter explanation

netdev_max_backlog 
------------------
Maximum number of packets, queued on the INPUT side,
when the interface receives packets faster than kernel can process them.

Recommended setting

net.core.netdev_max_backlog=1xxxx 
.
The longer the INPUT-side queue, the more processing it costs; if you manage traffic with iptables, raise this value.

7. Parameter

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

Supported systems: CentOS 6, 7

Parameter explanation

rmem_default 
------------
The default setting of the socket receive buffer in bytes.
.
rmem_max
--------
The maximum receive socket buffer size in bytes.
.
wmem_default
------------
The default setting (in bytes) of the socket send buffer.
.
wmem_max
--------
The maximum send socket buffer size in bytes.

Recommended setting

net.core.rmem_default = 262144 
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304

8. Parameter

net.core.somaxconn

Supported systems: CentOS 6, 7

Parameter explanation

somaxconn - INTEGER 
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 128.
See also tcp_max_syn_backlog for additional tuning for TCP sockets.

Recommended setting

net.core.somaxconn=4xxx 

9. Parameter

net.ipv4.tcp_max_syn_backlog

Supported systems: CentOS 6, 7

Parameter explanation

tcp_max_syn_backlog - INTEGER 
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.

Recommended setting

net.ipv4.tcp_max_syn_backlog=4xxx 
pgpool-II uses this value to queue connections beyond num_init_children,
so it determines how many connections can wait in the queue.

10. Parameter

net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60

Supported systems: CentOS 6, 7

Parameter explanation

tcp_keepalive_time - INTEGER 
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.

Recommended setting

net.ipv4.tcp_keepalive_intvl=20 
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
.
After a connection has been idle for 60 seconds, a keepalive probe is sent every 20 seconds; after 3 probes with no response, the connection is closed. From the start of idleness to closing the connection takes 120 seconds in total.
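.
PostgreSQL can also override the kernel defaults on its own server sockets; the corresponding GUCs mirror the sysctls above (the values shown match this recommendation):
.
# postgresql.conf
tcp_keepalives_idle = 60        # idle seconds before the first probe
tcp_keepalives_interval = 20    # seconds between probes
tcp_keepalives_count = 3        # unanswered probes before the connection is dropped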

11. Parameter

net.ipv4.tcp_mem=8388608 12582912 16777216

Supported systems: CentOS 6, 7

Parameter explanation

tcp_mem - vector of 3 INTEGERs: min, pressure, max 
Unit: pages.
min: below this number of pages TCP is not bothered about its
memory appetite.
.
pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumption falls
under "min".
.
max: number of pages allowed for queueing by all TCP sockets.
.
Defaults are calculated at boot time from amount of available
memory.
With 64GB of RAM, the auto-computed values are:
net.ipv4.tcp_mem = 1539615 2052821 3079230
.
With 512GB of RAM, the auto-computed values are:
net.ipv4.tcp_mem = 49621632 66162176 99243264
.
Leaving this for the OS to compute automatically at boot is also fine.

Recommended setting

net.ipv4.tcp_mem=8388608 12582912 16777216 
.
Leaving this for the OS to compute automatically at boot is also fine.

12. Parameter

net.ipv4.tcp_fin_timeout

Supported systems: CentOS 6, 7

Parameter explanation

tcp_fin_timeout - INTEGER 
The length of time an orphaned (no longer referenced by any
application) connection will remain in the FIN_WAIT_2 state
before it is aborted at the local end. While a perfectly
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
Cf. tcp_max_orphans
Default: 60 seconds

Recommended setting

net.ipv4.tcp_fin_timeout=5 
.
Speeds up reclamation of zombie (orphaned) connections.

13. Parameter

net.ipv4.tcp_synack_retries

Supported systems: CentOS 6, 7

Parameter explanation

tcp_synack_retries - INTEGER 
Number of times SYNACKs for a passive TCP connection attempt will
be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to 31seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for a passive TCP connection will happen after 63seconds.

Recommended setting

net.ipv4.tcp_synack_retries=2 
.
Shortens the TCP SYN-ACK timeout.

14. Parameter

net.ipv4.tcp_syncookies

Supported systems: CentOS 6, 7

Parameter explanation

tcp_syncookies - BOOLEAN 
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
Default: 1
.
Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
.
syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.
.
If you want to test which effects syncookies have to your
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.

Recommended setting

net.ipv4.tcp_syncookies=1 
.
Protects against SYN flood attacks.

15. Parameter

net.ipv4.tcp_timestamps

Supported systems: CentOS 6, 7

Parameter explanation

tcp_timestamps - BOOLEAN 
Enable timestamps as defined in RFC1323.

Recommended setting

net.ipv4.tcp_timestamps=1 
.
tcp_timestamps is a TCP extension that uses timestamps to validate incoming segments and protect against PAWS (Protect Against Wrapped Sequence numbers); it can improve TCP performance.

16. Parameter

net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets

Supported systems: CentOS 6, 7

Parameter explanation

tcp_tw_recycle - BOOLEAN 
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed.
This limit exists only to prevent simple DoS attacks,
you _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.

Recommended setting

net.ipv4.tcp_tw_recycle=0 
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
.
Enabling net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps together is not recommended.

17. Parameter

net.ipv4.tcp_rmem
net.ipv4.tcp_wmem

Supported systems: CentOS 6, 7

Parameter explanation

tcp_wmem - vector of 3 INTEGERs: min, default, max 
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
Default: 1 page
.
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
It is usually lower than net.core.wmem_default.
Default: 16K
.
max: Maximal amount of memory allowed for automatically tuned
send buffers for TCP sockets. This value does not override
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
Default: between 64K and 4MB, depending on RAM size.
.
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 1 page
.
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.
.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
Default: between 87380B and 6MB, depending on RAM size.

Recommended setting

net.ipv4.tcp_rmem=8192 87380 16777216 
net.ipv4.tcp_wmem=8192 65536 16777216
.
A common recommendation for databases; improves network performance.

18. Parameter

net.nf_conntrack_max
net.netfilter.nf_conntrack_max

Supported systems: CentOS 6

Parameter explanation

nf_conntrack_max - INTEGER 
Size of connection tracking table.
Default value is nf_conntrack_buckets value * 4.

Recommended setting

net.nf_conntrack_max=1xxxxxx 
net.netfilter.nf_conntrack_max=1xxxxxx

19. Parameter

vm.dirty_background_bytes
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs

Supported systems: CentOS 6, 7

Parameter explanation

=====================================================
.
dirty_background_bytes
.
Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.
.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
.
=====================================================
.
dirty_background_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.
.
=====================================================
.
dirty_bytes
.
Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.
.
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.
.
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
.
=====================================================
.
dirty_expire_centisecs
.
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
.
=====================================================
.
dirty_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
.
=====================================================
.
dirty_writeback_centisecs
.
The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
.
Setting this to zero disables periodic writeback altogether.

Recommended setting

vm.dirty_background_bytes = 4096000000 
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
.
Reduces how often database processes themselves must flush dirty pages; size dirty_background_bytes according to the actual IOPS capability and memory size.
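.
While tuning, the outstanding dirty memory can be watched live in /proc/meminfo:
.
# Dirty: pages waiting for writeback; Writeback: pages being written right now
grep -E '^(Dirty|Writeback):' /proc/meminfo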

20. Parameter

vm.extra_free_kbytes

Supported systems: CentOS 6

Parameter explanation

extra_free_kbytes 
.
This parameter tells the VM to keep extra free memory
between the threshold where background reclaim (kswapd) kicks in,
and the threshold where direct reclaim (by allocating processes) kicks in.
.
This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations,
for example a realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
.
The goal is to have background reclaim do as much of the work as possible, kicking in this many kbytes earlier than direct reclaim by user processes, so that user processes can allocate memory quickly.

Recommended setting

vm.extra_free_kbytes=4xxxxxx 

21. Parameter

vm.min_free_kbytes

Supported systems: CentOS 6, 7

Parameter explanation

min_free_kbytes: 
.
This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
.
Setting this too high will OOM your machine instantly.

Recommended setting

vm.min_free_kbytes = 2xxxxxx # suggested: reserve 1GB per 32GB of RAM
.
Prevents the system from becoming unresponsive under heavy load; reduces the chance of memory-allocation deadlocks.
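.
The 1GB-per-32GB rule reduces to a one-liner; note that a 64GB host yields 2097152 kB (matching the 2xxxxxx placeholder above), while the 512GB example host would call for 16777216 kB:
.
mem_gb=64      # substitute the host's RAM in GB
echo "vm.min_free_kbytes = $(( mem_gb / 32 * 1024 * 1024 ))"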

22. Parameter

vm.mmap_min_addr

Supported systems: CentOS 6, 7

Parameter explanation

mmap_min_addr 
.
This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

Recommended setting

vm.mmap_min_addr=6xxxx 
.
Guards against problems caused by latent kernel bugs.

23. Parameter

vm.overcommit_memory
vm.overcommit_ratio

Supported systems: CentOS 6, 7

Parameter explanation

====================================================
.
overcommit_kbytes:
.
When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.
.
Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).
.
====================================================
.
overcommit_memory:
.
This value contains a flag that enables memory overcommitment.
.
When this flag is 0,
the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.
.
When this flag is 1,
the kernel pretends there is always enough memory until it actually runs out.
.
When this flag is 2,
the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.
.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
.
The default value is 0.
.
See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.
.
=====================================================
.
overcommit_ratio:
.
When overcommit_memory is set to 2,
the committed address space is not permitted to exceed
swap + this percentage of physical RAM.
See above.

Recommended setting

vm.overcommit_memory = 0 
vm.overcommit_ratio = 90
.
With vm.overcommit_memory = 0, vm.overcommit_ratio need not be set.

24. Parameter

vm.swappiness

Supported systems: CentOS 6, 7

Parameter explanation

swappiness 
.
This control is used to define how aggressive the kernel will swap
memory pages.
Higher values will increase aggressiveness, lower values
decrease the amount of swap.
.
The default value is 60.

Recommended setting

vm.swappiness = 0 

25. Parameter

vm.zone_reclaim_mode

Supported systems: CentOS 6, 7

Parameter explanation

zone_reclaim_mode: 
.
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.
.
This is value ORed together of
.
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
.
zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.
.
zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.
.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
.
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

Recommended setting

vm.zone_reclaim_mode=0 
.
Disables NUMA zone reclaim (allocations are satisfied from other zones/nodes).

26. Parameter

net.ipv4.ip_local_port_range

Supported systems: CentOS 6, 7

Parameter explanation

ip_local_port_range - 2 INTEGERS 
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number. The default values are
32768 and 61000 respectively.
.
ip_local_reserved_ports - list of comma separated ranges
Specify the ports which are reserved for known third-party
applications. These ports will not be used by automatic port
assignments (e.g. when calling connect() or bind() with port
number 0). Explicit port allocation behavior is unchanged.
.
The format used for both input and output is a comma separated
list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
10). Writing to the file will clear all previously reserved
ports and update the current list with the one given in the
input.
.
Note that ip_local_port_range and ip_local_reserved_ports
settings are independent and both are considered by the kernel
when determining which ports are available for automatic port
assignments.
.
You can reserve ports which are not in the current
ip_local_port_range, e.g.:
.
$ cat /proc/sys/net/ipv4/ip_local_port_range
32000 61000
$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
8080,9148
.
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
include the reserved ports.
.
Default: Empty

Recommended setting

net.ipv4.ip_local_port_range=40000 65535 
.
Constrains the local ephemeral-port range so dynamic port assignment cannot occupy listening ports.
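.
If a service must listen on a port inside the ephemeral range, the ip_local_reserved_ports knob quoted above can fence it off explicitly (port 40000 here is purely illustrative):
.
sysctl -w net.ipv4.ip_local_port_range="40000 65535"
# Hypothetical: keep automatic assignment away from a listener on 40000
sysctl -w net.ipv4.ip_local_reserved_ports="40000"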

27.參數

 vm.nr_hugepages 

支持系統:CentOS 6, 7

參數解釋

=====================================================
nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
=====================================================
nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
.
The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free: www
HugePages_Rsvd: xxx
HugePages_Surp: yyy
Hugepagesize: zzz kB
.
where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free is the number of huge pages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made. Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.
HugePages_Surp is short for "surplus," and is the number of huge pages in
the pool above the value in /proc/sys/vm/nr_hugepages. The
maximum number of surplus huge pages is controlled by
/proc/sys/vm/nr_overcommit_hugepages.
.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
.
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of 'nr_hugepages'.

Recommended setting

If you want PostgreSQL to use huge pages, set this parameter.
It simply needs to exceed the shared memory the database requires.
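.
A reasonable starting point is shared_buffers divided by the huge page size, plus some slack; the 10% margin below is an assumption, not a PostgreSQL rule:
.
shared_buffers_mb=32768                                  # illustrative: 32GB shared_buffers
hp_kb=$(awk '/^Hugepagesize/ {print $2}' /proc/meminfo)  # usually 2048 (2MB pages)
pages=$(( shared_buffers_mb * 1024 / hp_kb ))
echo "vm.nr_hugepages = $(( pages + pages / 10 ))"       # +10% headroom (assumed)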

28. Parameter

fs.nr_open

Supported systems: CentOS 6, 7

Parameter explanation

nr_open:
This denotes the maximum number of file-handles a process can
allocate. Default value is 1024*1024 (1048576) which should be
enough for most machines. Actual limit depends on RLIMIT_NOFILE
resource limit.
It also bounds the file-handle limits in security/limits.conf: a single process cannot open more handles than fs.nr_open, so to raise the per-process file-handle limit you must raise nr_open first.

Recommended setting

For PostgreSQL databases with many objects (tables, views, indexes, sequences, materialized views, and so on), about 20 million is recommended,
e.g. fs.nr_open=20480000
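.
Order matters when raising the limits: lift the kernel ceiling first, then the per-process one:
.
# 1) raise the kernel's per-process ceiling
sysctl -w fs.nr_open=20480000
# 2) only then can limits.conf / ulimit -n exceed the old 1048576 default
ulimit -n 10240000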

Resource limits databases care about

1. Set them via /etc/security/limits.conf, or via ulimit

2. Inspect a running process's limits via /proc/$pid/limits

# - core - limits the core file size (KB) 
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files; 10 million is recommended, but sysctl fs.nr_open must first be set larger than it, or logins to the system will fail.
# - nproc - max number of processes
The four entries above are the critical ones.
....
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
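
A minimal sketch of matching limits.conf entries for a dedicated database OS user (the user name postgres and the exact values are illustrative):

# /etc/security/limits.conf
postgres  soft  nofile   10240000
postgres  hard  nofile   10240000
postgres  soft  nproc    unlimited
postgres  hard  nproc    unlimited
postgres  soft  memlock  unlimited
postgres  hard  memlock  unlimited
postgres  soft  core     unlimited
postgres  hard  core     unlimited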

I/O scheduling rules databases care about

1. Current operating systems support I/O scheduling policies including cfq, deadline, and noop.

A disk's scheduling policy can be read here:

cat /sys/block/<disk>/queue/scheduler
(the active policy is shown in brackets, e.g. noop [deadline] cfq)

To change it:

echo deadline > /sys/block/sda/queue/scheduler 

Or change it via the boot parameters:

grub.conf (append to the kernel boot line):
elevator=deadline

Across many benchmark results, databases run more consistently with the deadline scheduler.
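
A small sketch that applies deadline to every sd* disk for the current boot (echo does not survive a reboot, hence the grub entry above):

for q in /sys/block/sd*/queue/scheduler; do
    echo deadline > "$q"
done
cat /sys/block/sd*/queue/scheduler   # the active policy is shown in [brackets]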


There are more knobs besides these, such as disabling transparent huge pages, disabling NUMA, and SSD alignment; space is limited, so we stop here, and the settings above are essentially sufficient. More devops and DBA content will follow; feel free to follow along if you are interested.

"

概述

操作系統為了適應更多的硬件環境,許多初始的設置值,寬容度都很高。如果不經調整,這些值可能無法適應HPC,或者硬件稍好些的環境,無法發揮更好的硬件性能,甚至可能影響某些應用軟件的使用,特別是數據庫。今天主要介紹一些DBA不可不知的操作系統內核參數,僅供參考,只針對數據庫方面。


數據庫關心的OS內核參數

這裡以512GB 內存為例

1.參數

fs.aio-max-nr 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.aio-max-nr = 1xxxxxx 
.
PostgreSQL, Greenplum 均未使用io_setup創建aio contexts. 無需設置。
如果Oracle數據庫,要使用aio的話,需要設置它。
設置它也沒什麼壞處,如果將來需要適應異步IO,可以不需要重新修改這個設置。

2.參數

fs.file-max 

支持系統:CentOS 6, 7

數據庫DBA不可不知的操作系統內核參數,值得收藏

參數解釋

推薦設置

fs.file-max = 7xxxxxxx 
.
PostgreSQL 有一套自己管理的VFS,真正打開的FD與內核管理的文件打開關閉有一套映射的機制,所以真實情況不需要使用那麼多的file handlers。
max_files_per_process 參數。
假設1GB內存支撐100個連接,每個連接打開1000個文件,那麼一個PG實例需要打開10萬個文件,一臺機器按512G內存來算可以跑500個PG實例,則需要5000萬個file handler。
以上設置綽綽有餘。

3.參數

kernel.core_pattern 

支持系統:CentOS 6, 7

參數解釋

數據庫DBA不可不知的操作系統內核參數,值得收藏

推薦設置

kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p 
.
這個目錄要777的權限,如果它是個軟鏈,則真實目錄需要777的權限
mkdir /xxx
chmod 777 /xxx
注意留足夠的空間

4.參數

kernel.sem 

支持系統:CentOS 6, 7

參數解釋

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096 每組多少信號量 (>=17, PostgreSQL 每16個進程一組, 每組需要17個信號量) ,
2147483647 總共多少信號量 (2^31-1 , 且大於4096*512000 ) ,
2147483646 每個semop()調用支持多少操作 (2^31-1),
512000 多少組信號量 (假設每GB支持100個連接, 512GB支持51200個連接, 加上其他進程, > 51200*2/16 綽綽有餘)
.
# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"
.
# ipcs -s -l
------ Semaphore Limits --------
max number of arrays = 512000
max semaphores per array = 4096
max semaphores system wide = 2147483647
max ops per semop call = 2147483646
semaphore max value = 32767

推薦設置

kernel.sem = 4096 2147483647 2147483646 512000 
.
4096可能能夠適合更多的場景, 所以大點無妨,關鍵是512000 arrays也夠了。

5.參數

kernel.shmall = 107374182 
kernel.shmmax = 274877906944
kernel.shmmni = 819200

支持系統:CentOS 6, 7

參數解釋

假設主機內存 512GB 
.
shmmax 單個共享內存段最大 256GB (主機內存的一半,單位字節)
shmall 所有共享內存段加起來最大 (主機內存的80%,單位PAGE)
shmmni 一共允許創建819200個共享內存段 (每個數據庫啟動需要2個共享內存段。 將來允許動態創建共享內存段,可能需求量更大)
.
# getconf PAGE_SIZE
4096

推薦設置

kernel.shmall = 107374182 
kernel.shmmax = 274877906944
kernel.shmmni = 819200
.
9.2以及以前的版本,數據庫啟動時,對共享內存段的內存需求非常大,需要考慮以下幾點
Connections:\t(1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers:\t(1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions:\t(770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers:\t(block_size + 208) * shared_buffers
WAL buffers:\t(wal_block_size + 8) * wal_buffers
Fixed space requirements:\t770 kB
.
以上建議參數根據9.2以前的版本設置,後期的版本同樣適用。

6.參數

net.core.netdev_max_backlog 

支持系統:CentOS 6, 7

參數解釋

netdev_max_backlog 
------------------
Maximum number of packets, queued on the INPUT side,
when the interface receives packets faster than kernel can process them.

推薦設置

net.core.netdev_max_backlog=1xxxx 
.
INPUT鏈表越長,處理耗費越大,如果用了iptables管理的話,需要加大這個值。

7.參數

net.core.rmem_default 
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

支持系統:CentOS 6, 7

參數解釋

rmem_default 
------------
The default setting of the socket receive buffer in bytes.
.
rmem_max
--------
The maximum receive socket buffer size in bytes.
.
wmem_default
------------
The default setting (in bytes) of the socket send buffer.
.
wmem_max
--------
The maximum send socket buffer size in bytes.

推薦設置

net.core.rmem_default = 262144 
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304

8.參數

net.core.somaxconn 

支持系統:CentOS 6, 7

參數解釋

somaxconn - INTEGER 
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 128.
\tSee also tcp_max_syn_backlog for additional tuning for TCP sockets.

推薦設置

net.core.somaxconn=4xxx 

9.參數

net.ipv4.tcp_max_syn_backlog 

支持系統:CentOS 6, 7

參數解釋

tcp_max_syn_backlog - INTEGER 
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.

推薦設置

net.ipv4.tcp_max_syn_backlog=4xxx 
pgpool-II 使用了這個值,用於將超過num_init_child以外的連接queue。
所以這個值決定了有多少連接可以在隊列裡面等待。

10.參數

net.ipv4.tcp_keepalive_intvl=20 
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60

支持系統:CentOS 6, 7

參數解釋

tcp_keepalive_time - INTEGER 
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.

推薦設置

net.ipv4.tcp_keepalive_intvl=20 
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
.
連接空閒60秒後, 每隔20秒發心跳包, 嘗試3次心跳包沒有響應,關閉連接。 從開始空閒,到關閉連接總共歷時120秒。

11.參數

net.ipv4.tcp_mem=8388608 12582912 16777216 

支持系統:CentOS 6, 7

參數解釋

tcp_mem - vector of 3 INTEGERs: min, pressure, max 
單位 page
min: below this number of pages TCP is not bothered about its
memory appetite.
.
pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumption falls
under "min".
.
max: number of pages allowed for queueing by all TCP sockets.
.
Defaults are calculated at boot time from amount of available
memory.
64GB 內存,自動計算的值是這樣的
net.ipv4.tcp_mem = 1539615 2052821 3079230
.
512GB 內存,自動計算得到的值是這樣的
net.ipv4.tcp_mem = 49621632 66162176 99243264
.
這個參數讓操作系統啟動時自動計算,問題也不大

推薦設置

net.ipv4.tcp_mem=8388608 12582912 16777216 
.
這個參數讓操作系統啟動時自動計算,問題也不大

12.參數

net.ipv4.tcp_fin_timeout 

支持系統:CentOS 6, 7

參數解釋

tcp_fin_timeout - INTEGER 
The length of time an orphaned (no longer referenced by any
application) connection will remain in the FIN_WAIT_2 state
before it is aborted at the local end. While a perfectly
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
Cf. tcp_max_orphans
Default: 60 seconds

推薦設置

net.ipv4.tcp_fin_timeout=5 
.
加快殭屍連接回收速度

13.參數

net.ipv4.tcp_synack_retries 

支持系統:CentOS 6, 7

參數解釋

tcp_synack_retries - INTEGER 
Number of times SYNACKs for a passive TCP connection attempt will
be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to 31seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for a passive TCP connection will happen after 63seconds.

推薦設置

net.ipv4.tcp_synack_retries=2 
.
縮短tcp syncack超時時間

14.參數

net.ipv4.tcp_syncookies 

Supported systems: CentOS 6, 7

Parameter explanation

tcp_syncookies - BOOLEAN 
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
Default: 1
.
Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
.
syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.
.
If you want to test which effects syncookies have to your
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.

Recommended setting

net.ipv4.tcp_syncookies=1 
.
Protects against SYN flood attacks.

15. Parameter

net.ipv4.tcp_timestamps 

Supported systems: CentOS 6, 7

Parameter explanation

tcp_timestamps - BOOLEAN 
Enable timestamps as defined in RFC1323.

Recommended setting

net.ipv4.tcp_timestamps=1 
.
tcp_timestamps is a TCP extension (RFC 1323): the timestamps carried in each packet enable PAWS (Protect Against Wrapped Sequence numbers) to reject stale segments, and allow finer RTT estimation, which can improve TCP performance.

16. Parameter

net.ipv4.tcp_tw_recycle 
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets

Supported systems: CentOS 6, 7

Parameter explanation

tcp_tw_recycle - BOOLEAN 
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed.
This limit exists only to prevent simple DoS attacks,
you _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.

Recommended setting

net.ipv4.tcp_tw_recycle=0 
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
.
net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps should not be enabled at the same time: the combination is known to drop legitimate connections from clients behind NAT.
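.
To gauge how close the system gets to tcp_max_tw_buckets, the current TIME-WAIT count can be read directly (the tw field below):
.
# grep TCP /proc/net/sockstat
TCP: inuse xxx orphan xxx tw xxx alloc xxx mem xxx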

17. Parameter

net.ipv4.tcp_rmem 
net.ipv4.tcp_wmem

Supported systems: CentOS 6, 7

Parameter explanation

tcp_wmem - vector of 3 INTEGERs: min, default, max 
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
Default: 1 page
.
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
It is usually lower than net.core.wmem_default.
Default: 16K
.
max: Maximal amount of memory allowed for automatically tuned
send buffers for TCP sockets. This value does not override
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
Default: between 64K and 4MB, depending on RAM size.
.
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 1 page
.
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.
.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
Default: between 87380B and 6MB, depending on RAM size.

Recommended setting

net.ipv4.tcp_rmem=8192 87380 16777216 
net.ipv4.tcp_wmem=8192 65536 16777216
.
A setting recommended by many databases; it improves network performance.
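.
Values set with sysctl -w are lost on reboot; a minimal sketch for persisting these (and the other settings in this article):
.
# cat >> /etc/sysctl.conf <<EOF
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
EOF
# sysctl -p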

18. Parameter

net.nf_conntrack_max 
net.netfilter.nf_conntrack_max

Supported systems: CentOS 6

Parameter explanation

nf_conntrack_max - INTEGER 
Size of connection tracking table.
Default value is nf_conntrack_buckets value * 4.

Recommended setting

net.nf_conntrack_max=1xxxxxx 
net.netfilter.nf_conntrack_max=1xxxxxx

19. Parameter

vm.dirty_background_bytes 
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs

Supported systems: CentOS 6, 7

Parameter explanation

=====================================================
.
dirty_background_bytes
.
Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.
.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
.
=====================================================
.
dirty_background_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.
.
=====================================================
.
dirty_bytes
.
Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.
.
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.
.
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
.
=====================================================
.
dirty_expire_centisecs
.
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
.
=====================================================
.
dirty_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
.
=====================================================
.
dirty_writeback_centisecs
.
The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
.
Setting this to zero disables periodic writeback altogether.

Recommended setting

vm.dirty_background_bytes = 4096000000 
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
.
Reduces how often database processes must flush dirty pages themselves. Set dirty_background_bytes according to the machine's actual IOPS capability and memory size.
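.
To judge whether these thresholds suit the storage, watch how much dirty memory accumulates (with the values above, background writeback starts around 4GB of dirty data):
.
# grep -E 'Dirty|Writeback' /proc/meminfo
Dirty:     xxx kB
Writeback: xxx kB
.
If Dirty keeps climbing toward the dirty_ratio ceiling, the storage cannot keep up and the thresholds should be lowered.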

20. Parameter

vm.extra_free_kbytes 

Supported systems: CentOS 6

Parameter explanation

extra_free_kbytes 
.
This parameter tells the VM to keep extra free memory
between the threshold where background reclaim (kswapd) kicks in,
and the threshold where direct reclaim (by allocating processes) kicks in.
.
This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations,
for example a realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
.
The goal is to have the background reclaimer (kswapd) do as much of the reclaiming as possible, kicking in this many kbytes earlier than direct reclaim by user processes, so that user processes can allocate memory quickly.

Recommended setting

vm.extra_free_kbytes=4xxxxxx 

21. Parameter

vm.min_free_kbytes 

Supported systems: CentOS 6, 7

Parameter explanation

min_free_kbytes: 
.
This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
.
Setting this too high will OOM your machine instantly.

Recommended setting

vm.min_free_kbytes = 2xxxxxx # recommendation: reserve 1GB of vm.min_free_kbytes per 32GB of RAM
.
Prevents the system from becoming unresponsive under high load and reduces the chance of memory-allocation deadlock.
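.
The rule of thumb works out to RAM/32: a 64GB machine reserves 2GB (vm.min_free_kbytes = 2097152, the shape of the 2xxxxxx placeholder above), while the 512GB example machine reserves 16GB:
.
512GB / 32 = 16GB = 16 * 1024 * 1024 kB
# sysctl -w vm.min_free_kbytes=16777216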

22. Parameter

vm.mmap_min_addr 

Supported systems: CentOS 6, 7

Parameter explanation

mmap_min_addr 
.
This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

Recommended setting

vm.mmap_min_addr=6xxxx 
.
Guards against problems caused by latent kernel NULL-pointer-dereference bugs.

23. Parameter

vm.overcommit_memory 
vm.overcommit_ratio

Supported systems: CentOS 6, 7

Parameter explanation

====================================================
.
overcommit_kbytes:
.
When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.
.
Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).
.
====================================================
.
overcommit_memory:
.
This value contains a flag that enables memory overcommitment.
.
When this flag is 0,
the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.
.
When this flag is 1,
the kernel pretends there is always enough memory until it actually runs out.
.
When this flag is 2,
the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.
.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
.
The default value is 0.
.
See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.
.
=====================================================
.
overcommit_ratio:
.
When overcommit_memory is set to 2,
the committed address space is not permitted to exceed
swap + this percentage of physical RAM.
See above.

Recommended setting

vm.overcommit_memory = 0 
vm.overcommit_ratio = 90
.
When vm.overcommit_memory = 0, vm.overcommit_ratio need not be set.
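.
For reference, if strict accounting (mode 2) were chosen instead, the ceiling works out to roughly CommitLimit = swap + RAM * overcommit_ratio / 100; on the 512GB example machine with ratio 90 and no swap that is about 460GB. The live numbers are visible in /proc/meminfo:
.
# grep -i commit /proc/meminfo
CommitLimit:  xxx kB
Committed_AS: xxx kB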

24. Parameter

vm.swappiness 

Supported systems: CentOS 6, 7

Parameter explanation

swappiness 
.
This control is used to define how aggressive the kernel will swap
memory pages.
Higher values will increase aggressiveness, lower values
decrease the amount of swap.
.
The default value is 60.

Recommended setting

vm.swappiness = 0 
.
Minimizes the kernel's tendency to swap out database memory.

25. Parameter

vm.zone_reclaim_mode 

Supported systems: CentOS 6, 7

Parameter explanation

zone_reclaim_mode: 
.
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.
.
This is value ORed together of
.
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
.
zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.
.
zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.
.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
.
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

Recommended setting

vm.zone_reclaim_mode=0 
.
Disables NUMA zone reclaim, so allocations fall through to other nodes instead of reclaiming locally.
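.
To confirm the setting and see how memory is spread across NUMA nodes (numactl may need to be installed separately):
.
# cat /proc/sys/vm/zone_reclaim_mode
0
# numactl --hardware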

26. Parameter

net.ipv4.ip_local_port_range 

Supported systems: CentOS 6, 7

Parameter explanation

ip_local_port_range - 2 INTEGERS 
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number. The default values are
32768 and 61000 respectively.
.
ip_local_reserved_ports - list of comma separated ranges
Specify the ports which are reserved for known third-party
applications. These ports will not be used by automatic port
assignments (e.g. when calling connect() or bind() with port
number 0). Explicit port allocation behavior is unchanged.
.
The format used for both input and output is a comma separated
list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
10). Writing to the file will clear all previously reserved
ports and update the current list with the one given in the
input.
.
Note that ip_local_port_range and ip_local_reserved_ports
settings are independent and both are considered by the kernel
when determining which ports are available for automatic port
assignments.
.
You can reserve ports which are not in the current
ip_local_port_range, e.g.:
.
$ cat /proc/sys/net/ipv4/ip_local_port_range
32000 61000
$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
8080,9148
.
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
include the reserved ports.
.
Default: Empty

Recommended setting

net.ipv4.ip_local_port_range=40000 65535 
.
Restricts the local dynamic (ephemeral) port allocation range, preventing outbound connections from occupying ports the server needs to listen on.
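.
If a service must listen on a port that falls inside the dynamic range, an alternative sketch (assuming PostgreSQL's default port 5432) is to reserve that single port via the ip_local_reserved_ports mechanism described above, instead of shrinking the whole range:
.
# sysctl -w net.ipv4.ip_local_reserved_ports=5432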

27. Parameter

 vm.nr_hugepages 

Supported systems: CentOS 6, 7

Parameter explanation

=====================================================
nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
=====================================================
nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
.
The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free: www
HugePages_Rsvd: xxx
HugePages_Surp: yyy
Hugepagesize: zzz kB
.
where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free is the number of huge pages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made. Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.
HugePages_Surp is short for "surplus," and is the number of huge pages in
the pool above the value in /proc/sys/vm/nr_hugepages. The
maximum number of surplus huge pages is controlled by
/proc/sys/vm/nr_overcommit_hugepages.
.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
.
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of 'nr_hugepages'.

Recommended setting

If PostgreSQL's huge pages are to be used, setting this parameter is recommended. 
It just needs to be larger than the shared memory the database requires.
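.
A minimal sizing sketch, assuming 2MB huge pages and a hypothetical instance with shared_buffers = 32GB (verify the page size with grep Hugepagesize /proc/meminfo):
.
32GB / 2MB = 16384 pages
# sysctl -w vm.nr_hugepages=17000    # 16384 plus headroom for other shared memory segments
.
Then set huge_pages = on in postgresql.conf (PostgreSQL 9.4 and later); startup fails if the pool is too small, which confirms huge pages are actually in use.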

28. Parameter

 fs.nr_open

Supported systems: CentOS 6, 7

Parameter explanation

nr_open:
This denotes the maximum number of file-handles a process can
allocate. Default value is 1024*1024 (1048576) which should be
enough for most machines. Actual limit depends on RLIMIT_NOFILE
resource limit.
It also caps the file-handle limits in security/limits.conf: a single process's open-file limit cannot exceed fs.nr_open, so to raise the per-process file-handle limit, raise nr_open first.

Recommended setting

For a PostgreSQL database with many objects (tables, views, indexes, sequences, materialized views, and so on), a value of about 20 million is recommended,
for example fs.nr_open=20480000

Resource limits databases care about

1. Set them via /etc/security/limits.conf, or with ulimit

2. View a running process's current limits via /proc/$pid/limits

# - core - limits the core file size (KB) 
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files (recommended: 10 million; sysctl fs.nr_open must first be set larger than this, otherwise the system will refuse logins; see the sketch after this list)
# - nproc - max number of processes
The four items above are the ones to watch most closely.
....
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
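
A minimal /etc/security/limits.conf sketch covering the four items above (the values are illustrative, not mandated; nofile assumes fs.nr_open has already been raised beyond it, see parameter 28):

* soft nofile 10240000
* hard nofile 10240000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited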

IO scheduling policies databases care about

1. The IO scheduling policies currently supported by the OS include cfq, deadline, noop, and others.

The scheduler currently in effect can be seen here:

cat /sys/block/<disk>/queue/scheduler 

To change it:

echo deadline > /sys/block/sda/queue/scheduler 

Or change the boot parameter:

grub.conf 
elevator=deadline

Many test results show that databases run more stably with the deadline scheduler.
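
The echo command above lasts only until reboot; a sketch for making the choice persistent with a udev rule (the file name and device pattern are illustrative):

# /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"

The elevator=deadline boot parameter achieves the same thing for all block devices at once.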


There are still other parameters, such as disabling transparent huge pages, disabling NUMA, SSD alignment, and so on, but space is limited, so we stop here; the settings above are basically sufficient. More devops and DBA content will follow, so feel free to stay tuned.


"
