perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise) with details such as d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details such as i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS
is per SMT thread, i.e. each SMT hardware thread contains standalone IBS
units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
through the Linux perf utility. The following directories are created at
boot time if IBS is supported by both the hardware and the kernel:

        /sys/bus/event_source/devices/ibs_op/
        /sys/bus/event_source/devices/ibs_fetch/

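A quick way to confirm this on a given machine is to test for the two sysfs
paths listed above. The sketch below is illustrative only: the function name
`ibs_pmus` and its optional sysfs-root argument are assumptions made for this
example and are not part of perf.

```shell
#!/bin/sh
# Report which IBS PMUs the running kernel exposes.
# The optional first argument (an alternative sysfs root) is a hypothetical
# convenience for testing; it defaults to the path documented above.
ibs_pmus() {
    root="${1:-/sys/bus/event_source/devices}"
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$root/$pmu" ]; then
            echo "$pmu: present"
        else
            echo "$pmu: absent"
        fi
    done
}

ibs_pmus
```

If either PMU is reported as absent, the corresponding 'perf record -e'
examples in this page are not expected to work on that system.
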
The IBS Op PMU supports two events: cycles and micro ops. The IBS Fetch PMU
supports one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability, so using them requires
the CAP_SYS_ADMIN or CAP_PERFMON privilege.

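As a rough illustration of the privilege requirement, the sketch below checks
whether an effective-capability mask (the hexadecimal CapEff value from
/proc/<pid>/status) contains CAP_SYS_ADMIN (capability number 21) or
CAP_PERFMON (capability number 38). The helper name `has_ibs_caps` is an
assumption made for this example. It inspects only the capability mask; other
system settings may also affect access.

```shell
#!/bin/sh
# Succeed if a CapEff-style hex mask includes CAP_SYS_ADMIN (bit 21)
# or CAP_PERFMON (bit 38). The helper name is hypothetical.
has_ibs_caps() {
    mask=$((0x$1))
    [ $(( (mask >> 21) & 1 )) -eq 1 ] || [ $(( (mask >> 38) & 1 )) -eq 1 ]
}

# Effective capabilities of the current process tree (Linux only).
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null || true)

if has_ibs_caps "${cap_eff:-0}"; then
    echo "capability mask allows IBS profiling"
else
    echo "CAP_SYS_ADMIN or CAP_PERFMON is missing"
fi
```
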
IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with a precise IP, i.e. the IP recorded with an IBS sample
has no skid, whereas the IP recorded by the regular core PMU will have some
skid (the sample was generated at IP X but perf records it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages, such
as a plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to the IBS Op PMU when
the precise_ip attribute is set:

        -e cpu-cycles:p becomes -e ibs_op//
        -e r076:p becomes -e ibs_op//
        -e r0C1:p becomes -e ibs_op/cnt_ctl=1/

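The forwarding table above can be restated as a small lookup. This is only a
sketch for scripts that want to show which IBS event a precise core event ends
up as; the function name `ibs_equivalent` is an assumption, and it covers just
the three spellings listed above.

```shell
#!/bin/sh
# Print the IBS Op event that a precise core PMU event is forwarded to.
# Only the three event spellings listed above are recognised; anything else
# returns a non-zero status. The function name is hypothetical.
ibs_equivalent() {
    case "$1" in
        cpu-cycles:p|r076:p) echo 'ibs_op//' ;;
        r0C1:p)              echo 'ibs_op/cnt_ctl=1/' ;;
        *)                   return 1 ;;
    esac
}

ibs_equivalent cpu-cycles:p
ibs_equivalent r0C1:p
```
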
EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

        # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

        # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per process (upstream v6.2 onward), existing process (PID 1234), uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), newly launched command (ls), uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

        # perf report
        /* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

        # perf script

Raw dump of IBS registers when profiled with --raw-samples

        # perf report -D
        /* Look for PERF_RECORD_SAMPLE */

Example register raw dump:

        ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
                Val 1 CntCtl 0=cycles CurCnt 707
        IbsOpRip:       ffffffff8204aea7
        ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
                BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
        ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
        ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
        IbsDCLinAd:     ff110008a5398920
        IbsDCPhysAd:    00000008a5398920

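To show how the decoded fields relate to the raw value, the sketch below
unpacks the ibs_op_ctl word from the dump above. The bit positions are an
assumption based on the kernel's ibs_op_ctl bitfield layout (MaxCnt stored in
units of 16 in bits 15:0, with extension bits in 26:20; L3MissOnly, En, Val
and CntCtl in bits 16 to 19; CurCnt in bits 58:32), and the function name
`decode_ibs_op_ctl` is hypothetical. Treat the raw dump printed by perf as
authoritative.

```shell
#!/bin/sh
# Decode a raw ibs_op_ctl value given as hex digits without a 0x prefix.
# The field layout used here is an assumption (see the text above).
decode_ibs_op_ctl() {
    v=$((0x$1))
    max_lo=$(( v & 0xffff ))            # bits 15:0, MaxCnt[19:4]
    l3miss=$(( (v >> 16) & 1 ))         # bit 16, L3MissOnly
    en=$(( (v >> 17) & 1 ))             # bit 17, En
    val=$(( (v >> 18) & 1 ))            # bit 18, Val
    cnt_ctl=$(( (v >> 19) & 1 ))        # bit 19, CntCtl (0=cycles, 1=uOps)
    max_ext=$(( (v >> 20) & 0x7f ))     # bits 26:20, MaxCnt[26:20]
    cur=$(( (v >> 32) & 0x7ffffff ))    # bits 58:32, CurCnt
    max_cnt=$(( ((max_ext << 16) | max_lo) << 4 ))
    echo "MaxCnt $max_cnt L3MissOnly $l3miss En $en Val $val CntCtl $cnt_ctl CurCnt $cur"
}

# Value taken from the example dump above.
decode_ibs_op_ctl 000002c30006186a
```

For the value in the dump this prints MaxCnt 100000, L3MissOnly 0, En 1,
Val 1, CntCtl 0 and CurCnt 707, matching the fields perf decoded.
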
IBS applied in a real-world use case

A ~90% regression was observed in tbench with a specific scheduler hint,
which was counterintuitive. IBS profiles of the good and bad runs, captured
using perf, helped identify the exact cause of the problem:

https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with the Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

        # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

        # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

Random enable adds a small degree of variability to the sample period. This
helps in cases like long-running loops, where the PMU keeps tagging the same
instruction over and over because of the fixed sample period.

etc.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

        # perf mem record -c 100000 -- make
        # perf mem report

A normal perf mem report output will provide a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

        # perf mem report -F mem,sample,snoop
        Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
        Memory access                           Samples  Snoop
        N/A                                     1903343  N/A
        L1 hit                                  1056754  N/A
        L2 hit                                    75231  N/A
        L3 hit                                     9496  HitM
        L3 hit                                     2270  N/A
        RAM hit                                    8710  N/A
        Remote node, same socket RAM hit           3241  N/A
        Remote core, same node Any cache hit       1572  HitM
        Remote core, same node Any cache hit        514  N/A
        Remote node, same socket Any cache hit     1216  HitM
        Remote node, same socket Any cache hit      350  N/A
        Uncached hit                                 18  N/A

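An aggregated report like the one above lends itself to simple
post-processing. The sketch below totals the samples whose Snoop column is
HitM. It assumes the report has been captured as plain text in the same
three-column layout, and the function name `hitm_samples` is hypothetical.

```shell
#!/bin/sh
# Sum the Samples column of every row whose Snoop column is HitM.
# Relies on Snoop being the last whitespace-separated field and Samples the
# one before it, as in the report layout shown above.
hitm_samples() {
    awk '$NF == "HitM" && $(NF-1) ~ /^[0-9]+$/ { sum += $(NF-1) }
         END { print sum + 0 }'
}

# Demo on three rows copied from the report above.
hitm_samples <<'EOF'
L3 hit 9496 HitM
L3 hit 2270 N/A
Remote core, same node Any cache hit 1572 HitM
EOF
```

Fed the complete table above, the same function adds up the three HitM rows
(9496, 1572 and 1216).
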
Please refer to their man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]