OS Kernel Parameters Every DBA Should Know


Background

To accommodate as wide a range of hardware as possible, operating systems ship with many default values that are deliberately lenient.

Left untuned, these defaults may not suit HPC workloads or even moderately well-equipped machines.

They can keep the hardware from performing at its best and may even interfere with certain applications, databases in particular.

OS kernel parameters that databases care about

The examples below assume a host with 512 GB of RAM.

Parameter

fs.aio-max-nr

Supported systems

CentOS 6, 7

Explanation

aio-nr & aio-max-nr:  
.
aio-nr is the running total of the number of events specified on the  
io_setup system call for all currently active aio contexts.  
.
If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN.  
.
Note that raising aio-max-nr does not result in the pre-allocation or re-sizing  
of any kernel data structures.  
.
aio-nr & aio-max-nr:  
.
aio-nr shows the current system-wide number of asynchronous io requests.  
.
aio-max-nr allows you to change the maximum value aio-nr can grow to.  

Recommended setting

fs.aio-max-nr = 1xxxxxx
.
PostgreSQL and Greenplum do not use io_setup to create aio contexts, so they do not require this setting.
Oracle does need it if the database is configured to use asynchronous I/O.
Setting it does no harm either way; if you adopt asynchronous I/O later, you will not have to revisit this parameter.
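A quick way to check how close the system is to the ceiling is to compare aio-nr with aio-max-nr. A minimal sketch (the 1048576 value is only an illustrative choice, not a mandated one):

# current number of aio events in use vs. the configured maximum
cat /proc/sys/fs/aio-nr
cat /proc/sys/fs/aio-max-nr
# raise the ceiling at runtime and persist it across reboots
sysctl -w fs.aio-max-nr=1048576
echo "fs.aio-max-nr = 1048576" >> /etc/sysctl.conf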
Parameter

fs.file-max

Supported systems

CentOS 6, 7

Explanation

file-max & file-nr:  
.
The value in file-max denotes the maximum number of file handles that the Linux kernel will allocate. 
.
When you get lots of error messages about running out of file handles, 
you might want to increase this limit.  
.
Historically, the kernel was able to allocate file handles dynamically, 
but not to free them again.   
.
The three values in file-nr denote :    
the number of allocated file handles ,   
the number of allocated but unused file handles ,   
the maximum number of file handles.   
.
Linux 2.6 always reports 0 as the number of free  
file handles -- this is not an error, it just means that the  
number of allocated file handles exactly matches the number of  
used file handles.  
.
Attempts to allocate more file descriptors than file-max are reported with printk, 
look for "VFS: file-max limit <number> reached".  

Recommended setting

fs.file-max = 7xxxxxxx
.
PostgreSQL maintains its own virtual file descriptor layer; the files it logically holds open are mapped onto actual kernel opens and closes, so in practice it needs far fewer file handles than it appears to.
See also the max_files_per_process parameter.
Assume 1 GB of RAM supports 100 connections and each connection opens 1,000 files: one PostgreSQL instance then opens 100,000 files. A 512 GB machine could run roughly 500 such instances, which would need 50 million file handles.
The setting above leaves ample headroom.
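To see current consumption against the limit, file-nr reports the three counters described above. A minimal sketch:

# allocated handles, allocated-but-unused handles, and the maximum
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max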
Parameter

kernel.core_pattern

Supported systems

CentOS 6, 7

Explanation

core_pattern:  
.
core_pattern is used to specify a core dumpfile pattern name.  
. max length 128 characters; default value is "core"  
. core_pattern is used as a pattern template for the output filename;  
  certain string patterns (beginning with '%') are substituted with  
  their actual values.  
. backward compatibility with core_uses_pid:  
        If core_pattern does not include "%p" (default does not)  
        and core_uses_pid is set, then .PID will be appended to  
        the filename.  
. corename format specifiers:  
        %<NUL>  '%' is dropped  
        %%      output one '%'  
        %p      pid  
        %P      global pid (init PID namespace)  
        %i      tid  
        %I      global tid (init PID namespace)  
        %u      uid  
        %g      gid  
        %d      dump mode, matches PR_SET_DUMPABLE and  
                /proc/sys/fs/suid_dumpable  
        %s      signal number  
        %t      UNIX time of dump  
        %h      hostname  
        %e      executable filename (may be shortened)  
        %E      executable path  
        %<OTHER> both are dropped  
. If the first character of the pattern is a '|', the kernel will treat  
  the rest of the pattern as a command to run.  The core dump will be  
  written to the standard input of that program instead of to a file.  

Recommended setting

kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p
.
The target directory must have mode 777; if it is a symlink, the real directory it points to must have mode 777.
mkdir /xxx
chmod 777 /xxx
Leave enough free space for the dumps.
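To confirm the pattern actually produces dumps, you can crash a throwaway process. A sketch, assuming /xxx is the placeholder directory above, the shell's core size limit is lifted, and no userspace core collector (abrt, systemd-coredump) intercepts the dump:

# allow core files in this shell, crash a dummy process, then look for the dump
ulimit -c unlimited
sleep 300 &
kill -SIGSEGV $!
ls -l /xxx/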
Parameter

kernel.sem

Supported systems

CentOS 6, 7

Explanation

kernel.sem = 4096 2147483647 2147483646 512000  
.
4096: semaphores per set (must be >= 17; PostgreSQL puts every 16 processes into one set, and each set needs 17 semaphores),
2147483647: total semaphores system-wide (2^31-1, and larger than 4096*512000),
2147483646: maximum operations per semop() call (2^31-2),
512000: number of semaphore sets (assume 100 connections per GB of RAM, so 512 GB supports 51,200 connections; even counting other processes, anything above 51200*2/16 is more than enough)
.
# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"  
.
# ipcs -s -l  
  ------ Semaphore Limits --------  
max number of arrays = 512000  
max semaphores per array = 4096  
max semaphores system wide = 2147483647  
max ops per semop call = 2147483646  
semaphore max value = 32767  

Recommended setting

kernel.sem = 4096 2147483647 2147483646 512000
.
A per-set size of 4096 should suit most scenarios, and erring on the large side does no harm; the key point is that 512000 arrays is also plenty.
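As a rough sanity check, you can estimate the semaphore-set demand from the planned connection count and compare it with the configured limits. A sketch using the 51,200-connection assumption above:

# PostgreSQL uses one semaphore set (of 17 semaphores) per 16 server processes
max_connections=51200
echo "semaphore sets needed: $(( (max_connections + 15) / 16 ))"
# configured limits and current usage (the usage listing includes a few header lines)
ipcs -s -l
ipcs -s | wc -l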
Parameter

kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200

Supported systems

CentOS 6, 7

Explanation

Assume the host has 512 GB of RAM.
.
shmmax: maximum size of a single shared memory segment, 256 GB here (half of physical RAM, in bytes)
shmall: maximum combined size of all shared memory segments (80% of physical RAM, in pages)
shmmni: up to 819200 shared memory segments may be created (each database instance needs 2 segments at startup; once dynamically created shared memory segments come into play, demand may be higher)
.
# getconf PAGE_SIZE
4096

Recommended setting

kernel.shmall = 107374182  
kernel.shmmax = 274877906944  
kernel.shmmni = 819200  
.
In PostgreSQL 9.2 and earlier, the server's System V shared memory requirement at startup is substantial; budget for the following:
Connections:    (1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers:    (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions:    (770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers:    (block_size + 208) * shared_buffers
WAL buffers:    (wal_block_size + 8) * wal_buffers
Fixed space requirements:    770 kB
.
The recommended values above are sized for the 9.2-and-earlier requirements; they are equally suitable for later releases.
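A small sketch for deriving shmmax and shmall from the physical RAM of the target host, following the half-of-RAM and 80%-of-RAM rationale above:

# shmmax: half of RAM, in bytes; shmall: 80% of RAM, in pages
mem_bytes=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
page_size=$(getconf PAGE_SIZE)
echo "kernel.shmmax = $(( mem_bytes / 2 ))"
echo "kernel.shmall = $(( mem_bytes * 8 / 10 / page_size ))"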
Parameter

net.core.netdev_max_backlog

Supported systems

CentOS 6, 7

Explanation

netdev_max_backlog  
  ------------------  
Maximum number  of  packets,  queued  on  the  INPUT  side,  
when the interface receives packets faster than kernel can process them.  

Recommended setting

net.core.netdev_max_backlog=1xxxx
.
The longer the INPUT-side queue grows, the more it costs to process; if iptables is in use, increase this value.
Parameter

net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max

Supported systems

CentOS 6, 7

Explanation

rmem_default  
  ------------  
The default setting of the socket receive buffer in bytes.  
.
rmem_max  
  --------  
The maximum receive socket buffer size in bytes.  
.
wmem_default  
  ------------  
The default setting (in bytes) of the socket send buffer.  
.
wmem_max  
  --------  
The maximum send socket buffer size in bytes.  

Recommended setting

net.core.rmem_default = 262144  
net.core.rmem_max = 4194304  
net.core.wmem_default = 262144  
net.core.wmem_max = 4194304  
Parameter

net.core.somaxconn

Supported systems

CentOS 6, 7

Explanation

somaxconn - INTEGER  
        Limit of socket listen() backlog, known in userspace as SOMAXCONN.  
        Defaults to 128.  
    See also tcp_max_syn_backlog for additional tuning for TCP sockets.  

Recommended setting

net.core.somaxconn=4xxx  
Parameter

net.ipv4.tcp_max_syn_backlog

Supported systems

CentOS 6, 7

Explanation

tcp_max_syn_backlog - INTEGER  
        Maximal number of remembered connection requests, which have not  
        received an acknowledgment from connecting client.  
        The minimal value is 128 for low memory machines, and it will  
        increase in proportion to the memory of machine.  
        If server suffers from overload, try increasing this number.  

Recommended setting

net.ipv4.tcp_max_syn_backlog=4xxx
pgpool-II relies on this value to queue connections beyond num_init_children,
so it determines how many connections can wait in the queue.
Parameter

net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60

Supported systems

CentOS 6, 7

Explanation

tcp_keepalive_time - INTEGER  
        How often TCP sends out keepalive messages when keepalive is enabled.  
        Default: 2hours.  
.
tcp_keepalive_probes - INTEGER  
        How many keepalive probes TCP sends out, until it decides that the  
        connection is broken. Default value: 9.  
.
tcp_keepalive_intvl - INTEGER  
        How frequently the probes are send out. Multiplied by  
        tcp_keepalive_probes it is time to kill not responding connection,  
        after probes started. Default value: 75sec i.e. connection  
        will be aborted after ~11 minutes of retries.  

Recommended setting

net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
.
After a connection has been idle for 60 seconds, a keepalive probe is sent every 20 seconds; after 3 unanswered probes the connection is closed. From the start of the idle period to closing the connection takes 120 seconds in total.
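If you prefer to scope this to the database rather than the whole host, PostgreSQL exposes equivalent per-connection settings. A sketch of matching values in postgresql.conf:

# postgresql.conf equivalents of the kernel keepalive settings above
tcp_keepalives_idle = 60
tcp_keepalives_interval = 20
tcp_keepalives_count = 3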
Parameter

net.ipv4.tcp_mem=8388608 12582912 16777216

Supported systems

CentOS 6, 7

Explanation

tcp_mem - vector of 3 INTEGERs: min, pressure, max  
Unit: pages.
        min: below this number of pages TCP is not bothered about its  
        memory appetite.  
.
        pressure: when amount of memory allocated by TCP exceeds this number  
        of pages, TCP moderates its memory consumption and enters memory  
        pressure mode, which is exited when memory consumption falls  
        under "min".  
.
        max: number of pages allowed for queueing by all TCP sockets.  
.
        Defaults are calculated at boot time from amount of available  
        memory.  
On a 64 GB host, the values computed automatically at boot look like this:
net.ipv4.tcp_mem = 1539615      2052821 3079230  
.
On a 512 GB host, the automatically computed values are:
net.ipv4.tcp_mem = 49621632     66162176        99243264  
.
Letting the OS compute this parameter automatically at boot is also perfectly acceptable.

Recommended setting

net.ipv4.tcp_mem=8388608 12582912 16777216
.
Letting the OS compute this parameter automatically at boot is also perfectly acceptable.
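Because tcp_mem is expressed in pages, a quick conversion helps to sanity-check the numbers. A sketch for the "max" value above, assuming 4 KB pages:

# convert the tcp_mem "max" value from pages to gigabytes
pages=16777216
echo "$(( pages * 4096 / 1024 / 1024 / 1024 )) GB"    # prints 64 GB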
Parameter

net.ipv4.tcp_fin_timeout

Supported systems

CentOS 6, 7

Explanation

tcp_fin_timeout - INTEGER  
        The length of time an orphaned (no longer referenced by any  
        application) connection will remain in the FIN_WAIT_2 state  
        before it is aborted at the local end.  While a perfectly  
        valid "receive only" state for an un-orphaned connection, an  
        orphaned connection in FIN_WAIT_2 state could otherwise wait  
        forever for the remote to close its end of the connection.  
        Cf. tcp_max_orphans  
        Default: 60 seconds  

Recommended setting

net.ipv4.tcp_fin_timeout=5
.
Speeds up reclamation of orphaned connections.
Parameter

net.ipv4.tcp_synack_retries

Supported systems

CentOS 6, 7

Explanation

tcp_synack_retries - INTEGER  
        Number of times SYNACKs for a passive TCP connection attempt will  
        be retransmitted. Should not be higher than 255. Default value  
        is 5, which corresponds to 31seconds till the last retransmission  
        with the current initial RTO of 1second. With this the final timeout  
        for a passive TCP connection will happen after 63seconds.  

Recommended setting

net.ipv4.tcp_synack_retries=2
.
Shortens the SYN-ACK retransmission timeout.
Parameter

net.ipv4.tcp_syncookies

Supported systems

CentOS 6, 7

Explanation

tcp_syncookies - BOOLEAN  
        Only valid when the kernel was compiled with CONFIG_SYN_COOKIES  
        Send out syncookies when the syn backlog queue of a socket  
        overflows. This is to prevent against the common 'SYN flood attack'  
        Default: 1  
.
        Note, that syncookies is fallback facility.  
        It MUST NOT be used to help highly loaded servers to stand  
        against legal connection rate. If you see SYN flood warnings  
        in your logs, but investigation shows that they occur  
        because of overload with legal connections, you should tune  
        another parameters until this warning disappear.  
        See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.  
.
        syncookies seriously violate TCP protocol, do not allow  
        to use TCP extensions, can result in serious degradation  
        of some services (f.e. SMTP relaying), visible not by you,  
        but your clients and relays, contacting you. While you see  
        SYN flood warnings in logs not being really flooded, your server  
        is seriously misconfigured.  
.
        If you want to test which effects syncookies have to your  
        network connections you can set this knob to 2 to enable  
        unconditionally generation of syncookies.  

Recommended setting

net.ipv4.tcp_syncookies=1
.
Protects against SYN flood attacks.
Parameter

net.ipv4.tcp_timestamps

Supported systems

CentOS 6, 7

Explanation

tcp_timestamps - BOOLEAN  
        Enable timestamps as defined in RFC1323.  

Recommended setting

net.ipv4.tcp_timestamps=1
.
tcp_timestamps is a TCP protocol extension; timestamps let the stack detect wrapped sequence numbers (PAWS, Protect Against Wrapped Sequence numbers) and can improve TCP performance.
Parameter

net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets

Supported systems

CentOS 6, 7

Explanation

tcp_tw_recycle - BOOLEAN  
        Enable fast recycling TIME-WAIT sockets. Default value is 0.  
        It should not be changed without advice/request of technical  
        experts.  
.
tcp_tw_reuse - BOOLEAN  
        Allow to reuse TIME-WAIT sockets for new connections when it is  
        safe from protocol viewpoint. Default value is 0.  
        It should not be changed without advice/request of technical  
        experts.  
.
tcp_max_tw_buckets - INTEGER
        Maximal number of timewait sockets held by system simultaneously.
        If this number is exceeded time-wait socket is immediately destroyed
        and warning is printed. 
    This limit exists only to prevent simple DoS attacks, 
    you _must_ not lower the limit artificially, 
        but rather increase it (probably, after increasing installed memory),  
        if network conditions require more than default value. 

Recommended setting

net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
.
net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps should not be enabled at the same time.
Parameter

net.ipv4.tcp_rmem
net.ipv4.tcp_wmem

Supported systems

CentOS 6, 7

Explanation

tcp_wmem - vector of 3 INTEGERs: min, default, max  
        min: Amount of memory reserved for send buffers for TCP sockets.  
        Each TCP socket has rights to use it due to fact of its birth.  
        Default: 1 page  
.
        default: initial size of send buffer used by TCP sockets.  This  
        value overrides net.core.wmem_default used by other protocols.  
        It is usually lower than net.core.wmem_default.  
        Default: 16K  
.
        max: Maximal amount of memory allowed for automatically tuned  
        send buffers for TCP sockets. This value does not override  
        net.core.wmem_max.  Calling setsockopt() with SO_SNDBUF disables  
        automatic tuning of that socket's send buffer size, in which case  
        this value is ignored.  
        Default: between 64K and 4MB, depending on RAM size.  
.
tcp_rmem - vector of 3 INTEGERs: min, default, max  
        min: Minimal size of receive buffer used by TCP sockets.  
        It is guaranteed to each TCP socket, even under moderate memory  
        pressure.  
        Default: 1 page  
.
        default: initial size of receive buffer used by TCP sockets.  
        This value overrides net.core.rmem_default used by other protocols.  
        Default: 87380 bytes. This value results in window of 65535 with  
        default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit  
        less for default tcp_app_win. See below about these variables.  
.
        max: maximal size of receive buffer allowed for automatically  
        selected receiver buffers for TCP socket. This value does not override  
        net.core.rmem_max.  Calling setsockopt() with SO_RCVBUF disables  
        automatic tuning of that socket's receive buffer size, in which  
        case this value is ignored.  
        Default: between 87380B and 6MB, depending on RAM size.  

Recommended setting

net.ipv4.tcp_rmem=8192 87380 16777216
net.ipv4.tcp_wmem=8192 65536 16777216
.
These values are recommended by many databases and improve network performance.
Parameter

net.nf_conntrack_max
net.netfilter.nf_conntrack_max

Supported systems

CentOS 6

Explanation

nf_conntrack_max - INTEGER  
        Size of connection tracking table.  
    Default value is nf_conntrack_buckets value * 4.  

Recommended setting

net.nf_conntrack_max=1xxxxxx  
net.netfilter.nf_conntrack_max=1xxxxxx  
Parameter

vm.dirty_background_bytes
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs

Supported systems

CentOS 6, 7

Explanation

==============================================================  
.
dirty_background_bytes  
.
Contains the amount of dirty memory at which the background kernel  
flusher threads will start writeback.  
.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only  
one of them may be specified at a time. When one sysctl is written it is  
immediately taken into account to evaluate the dirty memory limits and the  
other appears as 0 when read.  
.
==============================================================  
.
dirty_background_ratio  
.
Contains, as a percentage of total system memory, the number of pages at which  
the background kernel flusher threads will start writing out dirty data.  
.
==============================================================  
.
dirty_bytes  
.
Contains the amount of dirty memory at which a process generating disk writes  
will itself start writeback.  
.
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be  
specified at a time. When one sysctl is written it is immediately taken into  
account to evaluate the dirty memory limits and the other appears as 0 when  
read.  
.
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any  
value lower than this limit will be ignored and the old configuration will be  
retained.  
.
==============================================================  
.
dirty_expire_centisecs  
.
This tunable is used to define when dirty data is old enough to be eligible  
for writeout by the kernel flusher threads.  It is expressed in 100'ths  
of a second.  Data which has been dirty in-memory for longer than this  
interval will be written out next time a flusher thread wakes up.  
.
==============================================================  
.
dirty_ratio  
.
Contains, as a percentage of total system memory, the number of pages at which  
a process which is generating disk writes will itself start writing out dirty  
data.  
.
==============================================================  
.
dirty_writeback_centisecs  
.
The kernel flusher threads will periodically wake up and write `old' data  
out to disk.  This tunable expresses the interval between those wakeups, in  
100'ths of a second.  
.
Setting this to zero disables periodic writeback altogether.  
.
==============================================================  

Recommended setting

vm.dirty_background_bytes = 4096000000
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
.
These settings reduce how often database processes have to flush dirty pages themselves; size dirty_background_bytes according to the actual IOPS capability and the amount of RAM.
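To judge whether the background threshold suits your workload, watch how much dirty data accumulates under load. A minimal sketch:

# dirty and under-writeback memory, sampled every 5 seconds
while true; do
    grep -E 'Dirty|Writeback' /proc/meminfo
    sleep 5
done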
Parameter

vm.extra_free_kbytes

Supported systems

CentOS 6

Explanation

extra_free_kbytes  
.
This parameter tells the VM to keep extra free memory 
between the threshold where background reclaim (kswapd) kicks in, 
and the threshold where direct reclaim (by allocating processes) kicks in.  
.
This is useful for workloads that require low latency memory allocations  
and have a bounded burstiness in memory allocations, 
for example a realtime application that receives and transmits network traffic  
(causing in-kernel memory allocations) with a maximum total message burst  
size of 200MB may need 200MB of extra free memory to avoid direct reclaim  
related latencies.  
.
The goal is to have background reclaim (kswapd) start this many kilobytes earlier than direct reclaim by user processes, so that user processes can allocate memory quickly.

Recommended setting

vm.extra_free_kbytes=4xxxxxx  
Parameter

vm.min_free_kbytes

Supported systems

CentOS 6, 7

Explanation

min_free_kbytes:  
.
This is used to force the Linux VM to keep a minimum number  
of kilobytes free.  The VM uses this number to compute a  
watermark[WMARK_MIN] value for each lowmem zone in the system.  
Each lowmem zone gets a number of reserved free pages based  
proportionally on its size.  
.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC  
allocations; if you set this to lower than 1024KB, your system will  
become subtly broken, and prone to deadlock under high loads.  
.
Setting this too high will OOM your machine instantly.  

Recommended setting

vm.min_free_kbytes = 2xxxxxx
.
Helps keep the system responsive under heavy load and reduces the chance of memory-allocation deadlocks.
Parameter

vm.mmap_min_addr

Supported systems

CentOS 6, 7

Explanation

mmap_min_addr  
.
This file indicates the amount of address space  which a user process will  
be restricted from mmapping.  Since kernel null dereference bugs could  
accidentally operate based on the information in the first couple of pages  
of memory userspace processes should not be allowed to write to them.  By  
default this value is set to 0 and no protections will be enforced by the  
security module.  Setting this value to something like 64k will allow the  
vast majority of applications to work correctly and provide defense in depth  
against future potential kernel bugs.  

Recommended setting

vm.mmap_min_addr=6xxxx
.
Defends against problems caused by latent kernel NULL-dereference bugs.
Parameter

vm.overcommit_memory
vm.overcommit_ratio

Supported systems

CentOS 6, 7

Explanation

==============================================================  
.
overcommit_kbytes:  
.
When overcommit_memory is set to 2, the committed address space is not  
permitted to exceed swap plus this amount of physical RAM. See below.  
.
Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one  
of them may be specified at a time. Setting one disables the other (which  
then appears as 0 when read).  
.
==============================================================  
.
overcommit_memory:  
.
This value contains a flag that enables memory overcommitment.  
.
When this flag is 0, 
the kernel attempts to estimate the amount  
of free memory left when userspace requests more memory.  
.
When this flag is 1, 
the kernel pretends there is always enough memory until it actually runs out.  
.
When this flag is 2, 
the kernel uses a "never overcommit"  
policy that attempts to prevent any overcommit of memory.  
Note that user_reserve_kbytes affects this policy.  
.
This feature can be very useful because there are a lot of  
programs that malloc() huge amounts of memory "just-in-case"  
and don't use much of it.  
.
The default value is 0.  
.
See Documentation/vm/overcommit-accounting and  
security/commoncap.c::cap_vm_enough_memory() for more information.  
.
==============================================================  
.
overcommit_ratio:  
.
When overcommit_memory is set to 2, 
the committed address space is not permitted to exceed 
      swap + this percentage of physical RAM.  
See above.  
.
==============================================================  

Recommended setting

vm.overcommit_memory = 0
vm.overcommit_ratio = 90
.
With vm.overcommit_memory = 0, vm.overcommit_ratio can be left unset.
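If you do switch to vm.overcommit_memory = 2, the effective ceiling and the amount currently committed can be read from /proc/meminfo. A sketch:

# CommitLimit = swap + overcommit_ratio% of RAM; Committed_AS is what has been promised so far
grep -E 'CommitLimit|Committed_AS' /proc/meminfo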
Parameter

vm.swappiness

Supported systems

CentOS 6, 7

Explanation

swappiness  
.
This control is used to define how aggressive the kernel will swap  
memory pages.  
Higher values will increase agressiveness, lower values  
decrease the amount of swap.  
.
The default value is 60.  

Recommended setting

vm.swappiness = 0  
Parameter

vm.zone_reclaim_mode

Supported systems

CentOS 6, 7

Explanation

zone_reclaim_mode:  
.
Zone_reclaim_mode allows someone to set more or less aggressive approaches to  
reclaim memory when a zone runs out of memory. If it is set to zero then no  
zone reclaim occurs. Allocations will be satisfied from other zones / nodes  
in the system.  
.
This is value ORed together of  
.
1       = Zone reclaim on  
2       = Zone reclaim writes dirty pages out  
4       = Zone reclaim swaps pages  
.
zone_reclaim_mode is disabled by default.  For file servers or workloads  
that benefit from having their data cached, zone_reclaim_mode should be  
left disabled as the caching effect is likely to be more important than  
data locality.  
.
zone_reclaim may be enabled if it's known that the workload is partitioned  
such that each partition fits within a NUMA node and that accessing remote  
memory would cause a measurable performance reduction.  The page allocator  
will then reclaim easily reusable pages (those page cache pages that are  
currently not used) before allocating off node pages.  
.
Allowing zone reclaim to write out pages stops processes that are  
writing large amounts of data from dirtying pages on other nodes. Zone  
reclaim will write out dirty pages if a zone fills up and so effectively  
throttle the process. This may decrease the performance of a single process  
since it cannot use all of system memory to buffer the outgoing writes  
anymore but it preserve the memory on other nodes so that the performance  
of other processes running on other nodes will not be affected.  
.
Allowing regular swap effectively restricts allocations to the local  
node unless explicitly overridden by memory policies or cpuset  
configurations.  

Recommended setting

vm.zone_reclaim_mode=0
.
Disables NUMA-local zone reclaim.
Parameter

net.ipv4.ip_local_port_range

Supported systems

CentOS 6, 7

Explanation

ip_local_port_range - 2 INTEGERS
        Defines the local port range that is used by TCP and UDP to
        choose the local port. The first number is the first, the
        second the last local port number. The default values are
        32768 and 61000 respectively.
.
ip_local_reserved_ports - list of comma separated ranges
        Specify the ports which are reserved for known third-party
        applications. These ports will not be used by automatic port
        assignments (e.g. when calling connect() or bind() with port
        number 0). Explicit port allocation behavior is unchanged.
.
        The format used for both input and output is a comma separated
        list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
        10). Writing to the file will clear all previously reserved
        ports and update the current list with the one given in the
        input.
.
        Note that ip_local_port_range and ip_local_reserved_ports
        settings are independent and both are considered by the kernel
        when determining which ports are available for automatic port
        assignments.
.
        You can reserve ports which are not in the current
        ip_local_port_range, e.g.:
.
        $ cat /proc/sys/net/ipv4/ip_local_port_range
        32000   61000
        $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
        8080,9148
.
        although this is redundant. However such a setting is useful
        if later the port range is changed to a value that will
        include the reserved ports.
.
        Default: Empty

Recommended setting

net.ipv4.ip_local_port_range=40000 65535
.
Restricts the range used for local ephemeral ports so that dynamic allocations do not collide with listening ports.
Parameter

vm.nr_hugepages

Supported systems

CentOS 6, 7

Explanation

==============================================================
nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
==============================================================
nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
.
The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free:  www
HugePages_Rsvd:  xxx
HugePages_Surp:  yyy
Hugepagesize:    zzz kB
.
where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free  is the number of huge pages in the pool that are not yet
                allocated.
HugePages_Rsvd  is short for "reserved," and is the number of huge pages for
                which a commitment to allocate from the pool has been made,
                but no allocation has yet been made.  Reserved huge pages
                guarantee that an application will be able to allocate a
                huge page from the pool of huge pages at fault time.
HugePages_Surp  is short for "surplus," and is the number of huge pages in
                the pool above the value in /proc/sys/vm/nr_hugepages. The
                maximum number of surplus huge pages is controlled by
                /proc/sys/vm/nr_overcommit_hugepages.
.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
.
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
pages in the kernel's huge page pool.  "Persistent" huge pages will be
returned to the huge page pool when freed by a task.  A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of 'nr_hugepages'.

Recommended setting

If you want PostgreSQL to use huge pages, set this parameter.
It just needs to cover the shared memory the database requires.
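A sketch for sizing the pool from the instance's shared_buffers; the 32 GB figure is purely an illustrative assumption, and a small cushion is added for other shared structures:

# huge pages needed for a 32 GB shared_buffers, plus ~5% slack
shared_buffers_mb=32768
hugepage_kb=$(awk '/Hugepagesize/ {print $2}' /proc/meminfo)
pages=$(( shared_buffers_mb * 1024 / hugepage_kb ))
echo "vm.nr_hugepages = $(( pages + pages / 20 ))"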

Resource limits that databases care about

  1. Set them via /etc/security/limits.conf, or with ulimit.
  2. Check a running process's effective limits via /proc/$pid/limits.
#        - core - limits the core file size (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - nproc - max number of processes
The four settings above are the ones to pay the most attention to (see the sketch after this list).
....
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
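A sketch of limits.conf entries for a database OS user; the postgres user name and the numbers are illustrative assumptions, not mandated values:

# /etc/security/limits.conf
postgres    soft    core       unlimited
postgres    hard    core       unlimited
postgres    soft    nofile     1024000
postgres    hard    nofile     1024000
postgres    soft    nproc      unlimited
postgres    hard    nproc      unlimited
postgres    soft    memlock    unlimited
postgres    hard    memlock    unlimited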

I/O scheduling policies that databases care about

  1. The I/O schedulers currently supported by the OS include cfq, deadline, and noop.
/kernel-doc-xxx/Documentation/block
-r--r--r-- 1 root root   674 Apr  8 16:33 00-INDEX
-r--r--r-- 1 root root 55006 Apr  8 16:33 biodoc.txt
-r--r--r-- 1 root root   618 Apr  8 16:33 capability.txt
-r--r--r-- 1 root root 12791 Apr  8 16:33 cfq-iosched.txt
-r--r--r-- 1 root root 13815 Apr  8 16:33 data-integrity.txt
-r--r--r-- 1 root root  2841 Apr  8 16:33 deadline-iosched.txt
-r--r--r-- 1 root root  4713 Apr  8 16:33 ioprio.txt
-r--r--r-- 1 root root  2535 Apr  8 16:33 null_blk.txt
-r--r--r-- 1 root root  4896 Apr  8 16:33 queue-sysfs.txt
-r--r--r-- 1 root root  2075 Apr  8 16:33 request.txt
-r--r--r-- 1 root root  3272 Apr  8 16:33 stat.txt
-r--r--r-- 1 root root  1414 Apr  8 16:33 switching-sched.txt
-r--r--r-- 1 root root  3916 Apr  8 16:33 writeback_cache_control.txt

For the detailed rules of each scheduler, consult the wiki or the kernel documentation listed above.

The scheduler currently in effect for a device can be read from sysfs:

cat /sys/block/vdb/queue/scheduler 
noop [deadline] cfq 

To change it:

echo deadline > /sys/block/hda/queue/scheduler

Or set it via the kernel boot parameters:

grub.conf
elevator=deadline

In many test results, databases run more stably with the deadline scheduler; a sketch for persisting the choice per device follows.
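Besides the boot parameter, a udev rule can pin the scheduler for matching devices across reboots. A sketch, assuming SCSI-style device names (adjust the KERNEL pattern, e.g. vd[a-z] for virtio disks):

# /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"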

Other items

  1. Disable transparent huge pages (see the sketch after this list).
  2. Disable NUMA.
  3. Align SSD partitions.
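A sketch for disabling transparent huge pages at runtime; to survive reboots, add transparent_hugepage=never to the kernel boot line or run these commands from a startup script (on some CentOS 6 kernels the sysfs path is under redhat_transparent_hugepage instead):

# disable THP and its defragmentation at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag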