## Identifying Embedded Real-Time Latency Issues: I-cache and Locks

Finding and fixing the largest causes of latency in a real-time Linux system is a somewhat well known craft. Finding the last 10% of the causes of excessive latency can be a black art. This talk explores the black art, providing insights into the impact of I-cache and locks on real-time latency on an SMP ARM embedded system. As a result of this talk, the audience will be able to identify some signatures of I-cache and lock issues and will have learned some alternative approaches to investigate latency.

This talk is accessible to all technical levels. Those with a basic knowledge of computer architecture will understand the information. Experts will gain valuable insights to add to their toolkit.

Frank Rowand, Sony Corporation of America September 28, 2010

#### Overview

I am going to describe part of an investigation of a relatively small code path that contributes significantly to the maximum interrupts disabled duration for some of my test systems.

I will discuss tools, some of the detective work, and some of the conclusions.

## The Well Known Craft

Minimizing maximum interrupts disabled duration is often a crucial part of ensuring adequate real time latency.

### The Well Known Craft

Measuring interrupts disabled duration

Can use existing tools

The next slides are an example report of the ftrace irqsoff tracer, with the latency-format option enabled, from Documentation/trace/ftrace.txt

latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | task: ls-4339 (uid:0 nice:0 policy:0 rt\_prio:0) => started at: \_\_alloc\_pages\_internal => ended at: \_\_alloc\_pages\_internal # ----> CPU# / \_----=> irqs-off # / ----=> need-resched # || / \_---=> hardirq/softirq # # || / \_--=> preempt-depth # # delay pid ||||| time | # caller cmd #  $\mathbf{i}$ ls-4339 0...1 Ous+: get\_page\_from\_freelist (\_\_alloc\_pages\_internal) ls-4339 0d..1 3us : rmqueue\_bulk (get\_page\_from\_freelist) ls-4339 0d..1 3us : \_spin\_lock (rmqueue\_bulk) ls-4339 0d..1 4us : add\_preempt\_count (\_spin\_lock) 4us : \_\_\_rmqueue (rmqueue\_bulk) ls-4339 0d..2 ls-4339 0d..2 5us : \_\_rmqueue\_smallest (\_\_rmqueue) ls-4339 5us : \_\_mod\_zone\_page\_state (\_\_rmgueue\_smallest) 0d..2 6us : \_\_\_rmqueue (rmqueue\_bulk) ls-4339 0d..2 6us : \_\_rmqueue\_smallest (\_\_rmqueue) ls-4339 0d..2 ls-4339 0d..2 7us : \_\_mod\_zone\_page\_state (\_\_rmqueue\_smallest) 7us : rmqueue (rmqueue bulk) ls-4339 0d..2 8us : rmgueue smallest ( rmgueue) ls-4339 0d..2 [...] 46us : \_\_\_rmgueue\_smallest (\_\_\_rmgueue) ls-4339 0d..2 1s - 4339 $0d \dots 2$ 47us : mod zone page state ( rmqueue smallest)

latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)

| #  |           |     | delay  | y                                                        |
|----|-----------|-----|--------|----------------------------------------------------------|
| #  | cmd pid   |     | time   | caller                                                   |
| #  | $\land$ / |     | \      | /                                                        |
|    | ls-4339   | 01  | 0us+:  | <pre>get_page_from_freelist (alloc_pages_internal)</pre> |
|    | ls-4339   | 0d1 | 3us :  | rmqueue_bulk (get_page_from_freelist)                    |
|    | ls-4339   | 0d1 | 3us :  | _spin_lock (rmqueue_bulk)                                |
|    | ls-4339   | 0d1 | 4us :  | add_preempt_count (_spin_lock)                           |
|    | ls-4339   | 0d2 | 4us :  | rmqueue (rmqueue_bulk)                                   |
|    | ls-4339   | 0d2 | 5us :  | rmqueue_smallest (rmqueue)                               |
|    | ls-4339   | 0d2 | 5us :  | mod_zone_page_state (rmqueue_smallest)                   |
|    | ls-4339   | 0d2 | 6us :  | rmqueue (rmqueue_bulk)                                   |
|    | ls-4339   | 0d2 | 6us :  | rmqueue_smallest (rmqueue)                               |
|    | ls-4339   | 0d2 |        | mod_zone_page_state (rmqueue_smallest)                   |
|    | ls-4339   | 0d2 | 7us :  | rmqueue (rmqueue_bulk)                                   |
|    | ls-4339   | 0d2 | 8us :  | rmqueue_smallest (rmqueue)                               |
| [. | ]         |     |        |                                                          |
|    | ls-4339   |     |        | rmqueue_smallest (rmqueue)                               |
|    | ls-4339   |     |        | mod_zone_page_state (rmqueue_smallest)                   |
|    | ls-4339   |     |        | rmqueue (rmqueue_bulk)                                   |
|    | ls-4339   | 0d2 | 48us : | rmqueue_smallest (rmqueue)                               |
|    | ls-4339   | 0d2 |        | mod_zone_page_state (rmqueue_smallest)                   |
|    | ls-4339   | 0d2 | 49us : | _spin_unlock (rmqueue_bulk)                              |
|    | ls-4339   |     |        | <pre>sub_preempt_count (_spin_unlock)</pre>              |
|    | ls-4339   | 0d1 | 50us : | <pre>get_page_from_freelist (alloc_pages_internal)</pre> |
|    | ls-4339   | 0d2 | 51us : | <pre>trace_hardirqs_on (alloc_pages_internal)</pre>      |

# My Motivation, part 1

Minimizing maximum interrupts disabled duration is often a crucial part of ensuring adequate real time latency.

A significant contributor to maximum interrupts disabled duration in many of my projects has been do\_local\_timer().

The following graphs show an example distribution of do\_local\_timer() durations from a modified 2.6.23.17-rt14.

## My Motivation, part 1

Things to notice about the graphs:

- The tails -- the elusive 10%
- The consistency between test runs





# Capturing do\_local\_timer() data

Custom tool (the "lite tracer")

Similar to the ftrace irqsoff tracer, but lower overhead and additional features

- Reports each new high-water critical period (eg IRQs disabled duration)
- Reports frequency distribution of duration of instrumented code paths (histogram)

# Digression: do\_local\_timer() info

- Triggered on each cpu by a per cpu irq source (on the boards used for this presentation)
- Periodic, once every 10 msec (100 times/sec)
- Thus ~10 msec of other activity which can impact the i-cache between each occurrence of do\_local\_timer()
- Executes in interrupt context with interrupts disabled

## Moving Forward

More recent kernel: modified 2.6.29.6-rt14

A different processor on a different board

do\_local\_timer() is still an important portion of maximum interrupts disabled duration

The do\_local\_timer() duration frequency graph looks different, yet still has a tail



## My Motivation, part 2

My colleagues added PMU counters to the lite tracer reports of maximum IRQ disabled duration:

#mark : ----- 9 max: 145 ----# start eip0: c03a8294 \_\_irq\_svc+0x34
# 6 do\_local\_timer+0x40 284718 284102 27556
# 96 do\_local\_timer+0x80 317730 308527 30419
# 98 \_text+0x98 318015 308634 30486
# 138 \_text+0xc0 332232 317411 32512
# end eip0: c03a5528 preempt\_schedule\_irq+0x48

Performance Monitor (PMU) counters: cycle count Instruction cache stall cycles Instructions executed

#### My Motivation, part 2

# 6 do\_local\_timer+0x40 284718 284102 27556 # 96 do\_local\_timer+0x80 317730 308527 30419

elapsed usec 96 - 6 = 90

PMU counters:

cycle count317730 - 284718 = 33012I-stall cycles308527 - 284102 = 2442574%Instructions30419 - 27556 = 2863

cycles per instruction 33012 / 2863 = 11.5I-stall cycles per instruction 24425 / 2863 = 8.5

# My First PMU Data

Collecting do\_local\_timer() PMU data for some instances of maximum interrupts disabled provided the following data and graph

| #mark | : 1         | max:        | 4   |  |
|-------|-------------|-------------|-----|--|
| #mark | : 2         | max:        | 7   |  |
| #mark | : 3         | max:        | 8   |  |
| #mark | : 4         | max:        | 30  |  |
| #mark | : 5         | max:        | 97  |  |
| #     | 6 do_local  | _timer+0x40 |     |  |
| #     | 93 do_local | _timer+0x80 |     |  |
| #mark | : 6         | max:        | 126 |  |
| #     | 6 do_local  | _timer+0x40 |     |  |
| #     | 93 do_local | _timer+0x80 |     |  |
| #mark | : 7         | max:        | 132 |  |
| #     | 6 do_local  | _timer+0x40 |     |  |
| #     | 92 do_local | _timer+0x80 |     |  |
| #mark | : 8         | max:        | 138 |  |
| #     | 6 do_local  | _timer+0x40 |     |  |
| #     | 93 do_local | _timer+0x80 |     |  |
| #mark | : 9         | max:        | 145 |  |
| #     | 6 do_local  | _timer+0x40 |     |  |
| #     | 96 do_local | _timer+0x80 |     |  |
| #mark | : 10        | max:        | 147 |  |
| #     | 7 do_local  | _timer+0x40 |     |  |
| #     | 98 do_local | _timer+0x80 |     |  |

| usec           | ccnt                                      | i-stall<br>cycles<br>                     |  |  |  |  |  |
|----------------|-------------------------------------------|-------------------------------------------|--|--|--|--|--|
| 95<br>93<br>94 | 35641<br>35242<br>34644<br>34242<br>34107 | 26480<br>23675<br>26020<br>25255<br>23934 |  |  |  |  |  |
|                |                                           |                                           |  |  |  |  |  |
| 85<br>85       | 31006<br>30911                            | 23436<br>23472                            |  |  |  |  |  |
|                | 30576                                     | 22745                                     |  |  |  |  |  |
|                | 29142                                     | 20106                                     |  |  |  |  |  |
|                | 26209                                     | 17026                                     |  |  |  |  |  |



#### Observations

I-cache misses and I-cache stall cycles for the do\_local\_timer() code path are large when do\_local\_timer() duration is large.

I-cache misses and I-cache stall cycles for the do\_local\_timer() code path are smaller when do\_local\_timer() duration is smaller.

### Observations, Question

I-cache misses and I-cache stall cycles for the do\_local\_timer() code path are large when do\_local\_timer() duration is large.

I-cache misses and I-cache stall cycles for the do\_local\_timer() code path are smaller when do\_local\_timer() duration is smaller.

Are I-cache misses the cause of long do\_local\_timer() duration?

### More Instrumentation Needed

Reports of PMU data each time a new maximum interrupts disabled duration occurs provides information about a very small fraction of all executions of do\_local\_timer().

### More Instrumentation Needed

Reports of PMU data each time a new maximum interrupts disabled duration occurs provides information about a very small fraction of all executions of do\_local\_timer().

Solution:

Modify the lite tracer to collect PMU data about every occurrence of do\_local\_timer().

#### **Previous Instrumentation**

The lite tracer already collected the duration of an instrumented code path (in this case, do\_local\_timer()).



### Added Instrumentation

Collect delta of the PMU counters for the instrumented code path.

Examples of PMU counters:

- cycle count
- I-cache misses
- I-cache miss stall cycles
- D-cache misses
- D-cache miss stall cycles
- instructions executed





#### Conclusions

The graphs of duration (usec) and I-cache stall cycles appear similar.

### Conclusions, Theories

The graphs of duration (usec) and I-cache stall cycles appear similar.

Increased do\_local\_timer() duration may be caused by increased number of instructions.

Increased do\_local\_timer() duration may be caused by increased I-cache stall cycles.

#### Theories – How To Test?

The graphs of duration (usec) and I-cache stall cycles appear similar.

Increased do\_local\_timer() duration may be caused by increased number of instructions.

Measure number of instructions executed.





#### Theories – How To Test?

Increased do\_local\_timer() duration may be caused by increased number of instructions.

Measure number of instructions executed.

For 99.7% of the events, the number of instructions executed was in the range 2700 ... 3000.

The large range of do\_local\_timer() duration is not directly explained by number of instructions.

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Invalidate the I-cache before calling do\_local\_timer(). Do I-cache stall cycles increase?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Invalidate the I-cache before calling do\_local\_timer(). Do I-cache stall cycles increase?

Decrease I-cache stall cycles by "pre-fetching" do\_local\_timer() text while interrupts are enabled. Do I-cache stall cycles decrease?









# What about IRQs disabled?

Duration of do\_local\_timer() appears to be correlated to I-cache miss stall cycles.

# What about IRQs disabled?

Duration of do\_local\_timer() appears to be correlated to I-cache miss stall cycles.

Does this theory help solve the original problem of large maximum interrupt disabled duration?

(Only have data for baseline and worst case. "Pre-fetch" case was not pursued because the processor being tested does not have an instruction to pre-load cache.)





Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Previous results are consistent with the conclusions, but not proof.

Need more new instrumentation...

Add a trace of PMU data to the lite tracer.



The next graphs are scattergrams of cycle count vs I-cache stall cycles





Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

With this data, I was finally felt there was proof of the theory.

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Further instrumentation and analysis of the PMU trace data results in a more nuanced and complex understanding of the do\_local\_timer() duration graphs, with implications for further improvements.

Unfortunately time constraints preclude including that analysis in this presentation.

## How To Fix?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

The focus of this presentation is finding the causes of the performance problem, not fixing them. But a few possible solutions will be listed.

Then an actual solution will be presented that will lead to more questions about the data.

# How To Fix (1)?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Reduce number of I-cache misses by increasing I-cache size.

(Exercise for the student: why are L1 caches typically so small?)

# How To Fix (2)?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Reduce number of I-cache misses by locking critical period code into I-cache.

Not tested because the processor on the target board does not implement cache locking.

# How To Fix (3)?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Decrease **cost of I-cache miss** by placing the critical section code in high speed memory.

Not tested because the processor on the target board does not implement high speed memory.

# How To Fix (4)?

Increased do\_local\_timer() duration may be caused by increased I-cache miss stall cycles.

Decrease cost of I-cache miss by adding an L2 cache.

## Move to different hardware

The previous tests were conducted on a processor that does not have an L2 cache.

Will try some experiments on a processor with an L2 cache.

The previous graphs are for a 400 Mhz processor.

The following graphs are for a 210 Mhz processor.















#### Result

L2 cache is effective for this scenario.

The L1 I-cache misses are often serviced from the L2 cache, which has a much lower latency than memory.

But there is still a long tail on the latency graph.

## What is causing the tail?

The Well Known Craft may provide some theories, but not a conclusive answer.











## Number of Instructions Tail

instruction count tail outlier =~ 2.9 \* peak

I-cache miss stall tail outlier =~ 1.2 \* peak

This suggests that the instruction tail has a low I-cache miss stall rate.

## Number of Instructions Tail

instruction count tail outlier =~ 2.9 \* peak

I-cache miss stall tail outlier =~ 1.2 \* peak

This suggests that the instruction tail has a low I-cache miss stall rate.

This is consistent with code spinning in a loop.

## Number of Instructions Tail

instruction count tail outlier =~ 2.9 \* peak

I-cache miss stall tail outlier =~ 1.2 \* peak

This suggests that the instruction tail has a low I-cache miss stall rate.

This is consistent with code spinning in a loop.

This is consistent with waiting on a spinlock.



Adding the L2 cache reduces the magnitude of the number of instructions tail, but does not eliminate it.



# The Well Known Craft (mostly)

#### /proc/lock\_stat

| lock_stat version 0.3.2                             |                   |                             |                                                                                   |                                      |                                      |                             |
|-----------------------------------------------------|-------------------|-----------------------------|-----------------------------------------------------------------------------------|--------------------------------------|--------------------------------------|-----------------------------|
| class name                                          | content<br>bounce | ion<br>count                | min                                                                               | - wait tir<br>avg                    | ne<br>max                            | total                       |
| (raw_spinlock_t *)(&rq->lock)<br>  cpu 0<br>  cpu 1 | : 0               | 94<br><mark>35</mark><br>59 | 7.32<br><mark>7.32</mark><br>9.45                                                 | 14.18<br><mark>14.05</mark><br>14.26 | 32.22<br><mark>32.22</mark><br>30.09 | 1333.29<br>491.73<br>841.56 |
| (raw_spinlock_t *)(&rq->lock)                       |                   | 94                          | [ <c016cbbc>] task_rq_lock+0x38/0x78</c016cbbc>                                   |                                      |                                      |                             |
| (raw_spinlock_t *)(&rq->lock)                       |                   | 94                          | [ <c0< td=""><td>16cbbc&gt;] 1</td><td>task_rq_lock</td><td>+0x38/0x78</td></c0<> | 16cbbc>] 1                           | task_rq_lock                         | +0x38/0x78                  |

# The Well Known Craft (mostly)

/proc/lock\_stat

But modified to

- only collect data during do\_local\_timer()
- report wait and hold time histograms



#### Result

The rq->lock wait time tail is consistent with the theory that the number of instructions tail is caused by lock contention. Both tails have a small number of events spread randomly across the tail, with a small number of occurrences of any specific value.

But this is not proof.



#### trace\_irqs\_off\_106\_f-34\_c-0\_1\_S-0 cpu 0



 $trace_irqs_off_106_f-34_c-0_1_S-1$ 

# Where Is The Lock Attempt?

Determined through existing tools

- /proc/lock\_stat
   provided the site of the lock
- ftrace function trace provided the call graph from do\_local\_timer() to the site of the lock

# Where Is The Lock Attempt?

Extracted from the ftrace call graph:

```
run_local_timers()
raise_softirq(TIMER_SOFTIRQ)
wakeup_softirqd()
wake_up_process()
try_to_wake_up()
task_rq_lock()
spin_lock_irqsave(&rq->lock, *flags)
```

# How To Eliminate Lock Wait?

Experiment to determine whether contended rq->lock is the cause of the number of instructions tail.

# How To Eliminate Lock Wait?

Experiment to determine whether contended rq->lock is the cause of the number of instructions tail.

While trying to acquire the rq->lock, on this specific code path, give up after a short number of failed attempts. Thus do not wake up the TIMER\_SOFTIRQ process.

The TIMER\_SOFTIRQ wakeup will be woken up at the next tick, so some timers will occur late.









































#### Result

Reducing lock contention greatly shrank the number of instructions executed tail.

The experimental implementation is not likely to be accepted in mainline, but shows that reducing rq->lock contention can reduce maximum interrupts disabled duration in some scenarios.

The correct solution is likely an enhancement of the scheduler locking algorithms.

## Implications

Does eliminating or reducing the magnitude of the number of instruction tail reduce the maximum interrupts disabled duration?

On some test boards: YES

On other test boards: NO









# Reality Check - Life Is Not Simple

The next slides show that when the same test is measured many times, there are several distinct patterns that can be seen.

This presentation has ignored such complexity and only presented a single pattern of most graphs.

When you measure your own system, expect to have to analyze the messier picture.





Existing kernel instrumentation sometimes provides imprecise or insufficient data for finding the last 10% of the causes of excessive latency.

This may result in guessing about the causes (aka "informed theories").

The data required to confirm or disprove the informed theories can be acquired by extending existing tools or creating new tools.

Improved tools can also provide the insights needed to be able to ask the correct question.

Hardware performance monitor data can be extremely useful.

This presentation provided some examples of making inferences, creating theories, ruling out incorrect theories, collecting data to confirm correct theories, and gaining new insights from enhanced data collection.

I-cache issues can have a negative impact on maximum interrupts disabled duration.

An L2 cache can decrease maximum interrupts disabled duration.

runqueue lock contention can have a negative impact on maximum interrupts disabled duration.

**Older Kernels** 

- "Roll Your Own"
- Out of Tree Tools

perfmon2 http://perfmon2.sourceforge.net/

Newer Kernels

- "Roll Your Own"

Newer Kernels

- perf

\$KERNEL\_SRC/tools/perf

\$KERNEL\_SRC/tools/perf/Documentation/

https://perf.wiki.kernel.org/index.php/Main\_Page

Newer Kernels

- perf

It should be possible to create a PMU trace of do\_local\_timer() using the perf kernel API.

I have not tried this yet.

#### Questions?

#### How to get a copy of the slides

#### 1) leave a business card with me

2) frank.rowand@am.sony.com