The clients should run on a separate host from the server. The client host(s) should be close to the server host (preferably on the same subnet).
It can be useful to collect metrics on specific threads: for example, to see how a worker thread, the listener, or the database threads behave.
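To pick a thread to monitor, first list the LWPs of the server. This is a generic sketch (not part of the original procedure); it falls back to the current shell's PID for illustration when ns-slapd is not running.

```shell
# List the LWPs (threads) of ns-slapd with their names and CPU usage,
# so a worker or listener tid can be picked for thread-scoped tools.
PID=$(pidof ns-slapd || echo $$)
ps -L -o lwp,comm,pcpu -p "$PID"
```

The chosen tid can then be passed to thread-scoped commands such as perf top -t <tid>.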
NUMA: on a multi-node host, ns-slapd can be bound to a single NUMA node by editing the systemd unit:
cat /usr/lib/systemd/system/dirsrv@.service
...
ExecStart=numactl --cpunodebind=0 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /run/dirsrv/slapd-%i.pid
systemctl daemon-reload
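Editing the packaged unit file works but gets overwritten on package upgrade. A sketch of an alternative, assuming standard paths (adjust the NUMA node to your topology):

```shell
# Hedged sketch: create a systemd drop-in so the numactl binding survives
# upgrades. The empty ExecStart= clears the packaged value before replacing it.
mkdir -p /etc/systemd/system/dirsrv@.service.d
cat > /etc/systemd/system/dirsrv@.service.d/numa.conf <<'EOF'
[Service]
ExecStart=
ExecStart=numactl --cpunodebind=0 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-%i -i /run/dirsrv/slapd-%i.pid
EOF
systemctl daemon-reload
```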
Tuning: 10K database entries. entry cache / db cache = 1G/400MB. workers = 8. client = --bindDN 'cn=Directory Manager' --baseDN dc=example,dc=com --scope sub --filter '(uid=use[1-10000])' --attribute givenName --attribute sn --attribute mail --numThreads 30
Run the clients on another host, otherwise the server takes all the CPU and ends up waiting for the clients.
Shared structures are structures (locks, counters, …) that are accessed by many threads. A shared structure becomes hotter and hotter as a CPU accessing it has to retrieve it more and more frequently from the cache of another CPU that modified it. A lock is typically a hot structure accessed from all CPUs; it can create contention, and this is normal. A lock lives in a cache line, so a CPU may need to acquire a cache line that currently sits, modified, in another CPU's cache (a HITM event). A problem arises when the same cache line contains several hot structures: there is useless extra contention on structure A due to false sharing of the cache line with an unrelated structure B. To prevent this behavior we should monitor cache line heat and align hot structures so that each of them fits into a dedicated cache line.
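The cache line size to align and pad to can be checked on the host (64 bytes on most x86_64 machines); this quick check is an addition here, not part of the original procedure:

```shell
# Query the L1 data cache line size; hot structures should be aligned to a
# multiple of this value so that two of them never share a line.
getconf LEVEL1_DCACHE_LINESIZE
```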
A tool to monitor cache line heat is perf c2c.
perf c2c record --call-graph dwarf -p `pidof ns-slapd` -o perf_call_c2c # let it run 4-5 s, then stop with ^C
perf c2c report -i perf_call_c2c --stdio > perf_call_c2c.rep
vi perf_call_c2c.rep
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Total Tot ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
# Index Address Node PA cnt records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........
#
0 0x7fa6882e1f80 0 1624 4789 6.50% 648 648 0 1211 667 544 0 0 0 3578 2380 125 1 424 0
#
# database mutex: __db_tas_mutex_lock
#
1 0x7fa69aa23340 0 1464 2331 4.33% 432 432 0 582 582 0 0 0 0 1749 695 503 1 118 0
#
# 389DS plugins global lock
#
2 0x7fa69aaf5480 0 3237 4606 3.84% 383 383 0 1076 1053 23 0 0 0 3530 1828 1164 0 155 0
#
# 389DS entrycache lock
#
3 0x7fa6882e1440 0 743 1682 1.85% 185 185 0 686 540 146 0 0 0 996 721 32 0 58 0
#
# database mutex: __db_tas_mutex_lock
#
4 0x7fa69aaf5500 0 395 1042 1.81% 181 181 0 361 356 5 0 0 0 681 30 428 0 42 0
#
# 389DS entrycache lock
#
5 0x7fa6882e1f40 0 586 2732 1.54% 154 154 0 1570 1553 17 0 0 0 1162 0 969 0 39 0
#
# database mutex: __db_tas_mutex_lock
#
6 0x7fa65d462140 0 87 175 1.51% 151 151 0 10 5 5 0 0 0 165 1 0 0 13 0
#
# database cursor: __db_cursor_int
#
7 0x7fa669c722c0 0 1 431 1.49% 149 149 0 275 128 147 0 0 0 156 1 2 0 4 0
#
# database memory pool: _memp_fput
#
8 0x7fa669c72200 0 172 329 1.44% 144 144 0 8 8 0 0 0 0 321 163 7 0 7 0
#
# database memory pool: _memp_fget
#
9 0x7fa695ad9c00 0 740 1243 1.34% 134 134 0 318 318 0 0 0 0 925 449 295 1 46 0
#
# 389 VLV index
#
The following are examples of commands to monitor the activity at the host level.
Under heavy SRCH load, the server shows a high syscall rate (futex): perf top -g -d 10
Samples: 2M of event 'cycles', 4000 Hz, Event count (approx.): 496734082490 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
- 22.15% 0.27% [kernel] [k] entry_SYSCALL_64_after_hwframe
- 21.89% entry_SYSCALL_64_after_hwframe
- do_syscall_64
+ 4.15% __x64_sys_futex
+ 3.83% __x64_sys_setsockopt
+ 3.60% __x64_sys_poll
+ 2.77% __x64_sys_sendto
+ 1.88% ksys_write
+ 1.24% __x64_sys_recvfrom
+ 0.70% syscall_slow_exit_work
- 21.89% 2.93% [kernel] [k] do_syscall_64
- 18.96% do_syscall_64
+ 4.15% __x64_sys_futex
+ 3.83% __x64_sys_setsockopt
+ 3.60% __x64_sys_poll
+ 2.77% __x64_sys_sendto
+ 1.88% ksys_write
+ 1.24% __x64_sys_recvfrom
+ 0.69% syscall_slow_exit_work
- 0.52% __poll
+ 4.93% 0.08% libc-2.28.so [.] __poll
+ 4.79% 0.06% libc-2.28.so [.] __GI___setsockopt
+ 4.75% 4.60% libdb-5.3.so [.] __db_tas_mutex_lock
+ 4.19% 0.02% perf [.] hist_entry_iter__add
+ 4.14% 0.13% [kernel] [k] __x64_sys_futex
+ 4.01% 0.14% [kernel] [k] do_idle
+ 4.00% 0.14% [kernel] [k] do_futex
+ 3.82% 0.00% [unknown] [.] 0000000000000000
+ 3.79% 0.03% [kernel] [k] __x64_sys_setsockopt
+ 3.76% 0.05% [kernel] [k] __sys_setsockopt
+ 3.64% 0.05% libpthread-2.28.so [.] __libc_send
+ 3.57% 0.25% libpthread-2.28.so [.] pthread_cond_signal@@GLIBC_2.3.2
+ 3.56% 0.08% [kernel] [k] __x64_sys_poll
Monitoring the work queue with 'cn=monitor' (-b "cn=monitor" -s base opsinitiated opscompleted readwaiters) has too much impact (it locks the connections/conntable). Prefer:
gdb -ex 'set confirm off' -ex 'set pagination off' -ex 'print work_q_size' -ex 'print work_q_size_max' -ex 'print *ops_initiated' -ex 'print *ops_completed' -ex 'quit' /usr/sbin/ns-slapd `pidof ns-slapd`
pstack impacts the server heavily (it almost stops it). It can be useful if a thread is spending a lot of time in the same area of code.
The cpudist tool (from bcc) shows how much time each thread spends off-CPU. Here the listener (tid 38701) is pretty active but can sleep for up to 255 µs:
/usr/share/bcc/tools/cpudist -POL -p `pidof ns-slapd` 1 1
tid = 38701 ns-slapd
usecs : count distribution
0 -> 1 : 837 |** |
2 -> 3 : 15145 |************************************** |
4 -> 7 : 15696 |****************************************|
8 -> 15 : 12510 |******************************* |
16 -> 31 : 4895 |************ |
32 -> 63 : 1854 |**** |
64 -> 127 : 0 | |
128 -> 255 : 16 | |
tid = 38756 ns-slapd
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 198 |****************************************|
After selecting a worker LWP (using pstack), you can trace the syscalls issued by that thread for 3-4 seconds, for example with perf trace -s -t <tid>:
Summary of events:
ns-slapd (108026), 1784094 events, 100.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
futex 328704 3054.546 0.000 0.009 346.687 17.23%
setsockopt 124702 656.760 0.001 0.005 0.034 0.14%
poll 125437 655.709 0.002 0.005 100.141 15.28%
sendto 124501 464.307 0.001 0.004 0.039 0.14%
write 125026 462.438 0.001 0.004 0.031 0.10%
recvfrom 62799 248.383 0.002 0.004 0.030 0.11%
madvise 3 0.066 0.014 0.022 0.034 26.97%
sched_yield 10 0.043 0.004 0.004 0.004 1.69%
The following command shows a sorted list of the routines (kernel and user) where the server spends most of its time.
perf top -t 65457 (for a worker)
Samples: 486K of event 'cycles', 4000 Hz, Event count (approx.): 257233768218 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
2.84% [kernel] [k] do_syscall_64
2.50% libslapd.so.0.1.0 [.] slapi_pblock_get
2.09% libpthread-2.28.so [.] __pthread_mutex_lock
2.08% libdb-5.3.so [.] __db_tas_mutex_lock
1.43% libdb-5.3.so [.] __db_tas_mutex_unlock
1.32% libpthread-2.28.so [.] __pthread_mutex_unlock_usercnt
1.30% libjemalloc.so.2 [.] free
1.30% libslapd.so.0.1.0 [.] attrlist_find
1.25% libslapd.so.0.1.0 [.] slapi_pblock_set
1.08% libc-2.28.so [.] vfprintf
1.06% [kernel] [k] entry_SYSCALL_64
1.03% libslapd.so.0.1.0 [.] slapi_log_error
0.99% [kernel] [k] syscall_return_via_sysret
0.94% [kernel] [k] __fget
futex is the syscall issued when an operation (lock/unlock) on a mutex needs to notify other threads. A way to find the 10 hottest mutexes in contention:
perf trace -t <tid> --call-graph dwarf -e futex -o /tmp/perf_trace_dwarf_futex
grep futex /tmp/perf_trace_dwarf_futex |sed -e 's/^.* futex/futex/' -e 's/op:.*$//' |sort |uniq -c |sort -n | tail -10
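As an illustration of what this pipeline aggregates, here is a hypothetical three-line perf-trace sample (addresses borrowed from the c2c table above; the exact perf trace line format can differ between versions). The output is one line per futex address, prefixed by its call count, hottest last.

```shell
# Feed a fabricated sample through the same sed/sort/uniq pipeline:
# each output line is "<count> futex(uaddr: <address>, ", sorted by count.
cat <<'EOF' | sed -e 's/^.* futex/futex/' -e 's/op:.*$//' | sort | uniq -c | sort -n
 0.000 ( 0.010 ms): ns-slapd/1234 futex(uaddr: 0x7fa6882e1f80, op: WAIT|PRIVATE) = 0
 0.015 ( 0.002 ms): ns-slapd/1234 futex(uaddr: 0x7fa69aa23340, op: WAKE|PRIVATE) = 1
 0.020 ( 0.004 ms): ns-slapd/1234 futex(uaddr: 0x7fa6882e1f80, op: WAKE|PRIVATE) = 1
EOF
```

The mutex at 0x7fa6882e1f80 ends up last with a count of 2, i.e. the hottest in this sample.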
https://github.com/389ds/389-ds-base/issues/4308: performance search rate: checking if an entry is a referral is expensive
https://github.com/389ds/389-ds-base/issues/4315: performance search rate: nagle triggers high rate of setsocketopt
https://github.com/389ds/389-ds-base/issues/4316: performance search rate: useless poll on network send callback