Early context tracking patch set: fixing perf & ftrace losing events

Some time ago, while using perf to check the automaton model, I noticed that perf was losing events. The same was reproducible with ftrace.

Steve pointed to a problem in the identification of the context execution used by the recursion control.

Currently, recursion control uses the preempt_counter to identify the current context. The NMI/HARD/SOFT IRQ counters are set in the preempt_counter in the irq_enter/exit functions.

In a trace, they are set like this:

 0)   ==========> |
 0)               |  do_IRQ() {		/* First C function */
 0)               |    irq_enter() {
 0)               |      		/* set the IRQ context. */
 0)   1.081 us    |    }
 0)               |    handle_irq() {
 0)               |     		/* IRQ handling code */
 0) + 10.290 us   |    }
 0)               |    irq_exit() {
 0)               |      		/* unset the IRQ context. */
 0)   6.657 us    |    }
 0) + 18.995 us   |  }
 0)   <========== |

As one can see, functions (and events) that take place before the set and after unset the preempt_counter are identified in the wrong context, causing the miss interpretation that recursion is taking place. When this happens, events are dropped.

To resolve this problem, the set/unset of the IRQ/NMI context needs to be done before the execution of the first C execution, and after its return. By doing so, and using this method to identify the context in the trace recursion protection, no more events are lost.

A possible solution is to use a per-cpu variable set and unset in the entry point of NMI/IRQs, before calling the C handler.

This possible solution is presented in this patch series as a proof of concept, for x86_64. Let’s see what kind of comments we will receive!

New tracepoints for External IRQ and NMIs

The irq_vectors class of tracepoints adds two events for each interrupt: One at the entry and one at the return of each interrupt handler. They are very useful in the analysis of latency, as they point to the occurrence of an interrupt, and how long did it take to run.

However, there are two cases of interrupt which these tracepoints were not present: NMIs and regular device interrupt. So I suggested two new tracepoints, for these cases.

The output of these new tracepoints looks like this:

 idle-0     [000] d.h.   102.890935: external_interrupt_entry: vector=35
 idle-0     [000] d.h.   102.890960: external_interrupt_exit: vector=35
 idle-0     [000] d.Z.   179.594315: nmi_entry: vector=2
 idle-0     [000] d.Z.   179.594396: nmi_exit: vector=2

The discussion of the patch set can be found here:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1969152.html

V5: x86/jump_label: Bound IPIs sent when updating a static key

Today I’ve sent the v5 of the x86/jump_label: Bound IPIs sent when updating a static key patch set. The list of changes from the previous version is:

  • Renamed jump_label_can_update_check() to jump_label_can_update() (Borislav Petkov)
  • Switch jump_label_can_update() return values to bool (Borislav Petkov)
  • Accept the fact that some lines will be > 80 characters (Borislav Petkov)
  • Local variables are now in the inverted Christmas three order (Borislav Petkov)
  • Removed superfluous helpers. (Borislav Petkov)
  • Renamed text_to_poke to text_patch_loc, and handler to detour (Borislav Petkov & Steven Rostedt)
  • Re-applied the suggestion of not using BUG() from steven (Masami Hiramatsu)
  • arch_jump_label_transform_queue() now returns 0 on success and
    -ENOSPC/EINVAL on errors (Masami Hiramatsu)

I think we are not far from an acceptance. The code introduces a lot of optimizations suggested by Masami, Steven, and Borislav! But it might have one or two rounds of small fixes/polishment.

Here is the link for the submission:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1968209.html

Another bug found with the model

While running the model against the 4.19-rt kernel, the following non-possible event chain was faced:

+ T   32019    2564.541336    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541336    [000]             local_irq_enable ->   q13544 
+ T   32019    2564.541337    [000]               preempt_enable ->   q19599 
+ T   32019    2564.541337    [000]                   write_lock ->   q21325 
+ T   32019    2564.541337    [000]                   mutex_lock ->   q16217 
+ T   32019    2564.541338    [000]               mutex_acquired ->   q21325 
+ T   32019    2564.541338    [000]    sched_set_state_sleepable ->    q8561 
+ T   32019    2564.541338    [000]            local_irq_disable ->    q3392 
+ T   32019    2564.541338    [000]              preempt_disable ->   q20336 
+ T   32019    2564.541339    [000]     sched_set_state_runnable ->    q9970 
+ T   32019    2564.541339    [000]             local_irq_enable ->   q15342 
+ T   32019    2564.541339    [000]               preempt_enable ->   q21325 
+ T   32019    2564.541339    [000]               write_acquired ->   q19599 
+ T   32019    2564.541340    [000]            local_irq_disable ->   q14441 
+ T   32019    2564.541340    [000]              preempt_disable ->    q8250 
+ T   32019    2564.541342    [000]             local_irq_enable ->   q13544 
+ I   32019    2564.541344    [000]         hw_local_irq_disable ->   q18001 
+ I   32019    2564.541345    [000]                   mutex_lock
        2564.541345 event mutex_lock is not expected in state q18001
===== reseting model =====

In summary, a thread was running, then IRQs are disabled by the hardware (due to an interrupt), and a mutex lock was tried. This operation is not permitted.

The raw trace of these events shows the following events:

bash 32019 [000]  2564.541337:                   lock:rw_sem_request: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541337:                 lock:rt_mutex_request: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                lock:rt_mutex_acquired: 0xffff9e992418fa98F __down_write_common+0x40
bash 32019 [000]  2564.541338:                 sched:sched_set_state: sleepable
bash 32019 [000]  2564.541338:                preemptirq:irq_disable: caller=__down_write_common+0x8b parent=(nil)F
bash 32019 [000]  2564.541338:            preemptirq:preempt_disable: caller=__down_write_common+0x95 parent=__down_write_common+0x95
bash 32019 [000]  2564.541339:                 sched:sched_set_state: runnable
bash 32019 [000]  2564.541339:                 preemptirq:irq_enable: caller=__down_write_common+0x337 parent=(nil)F
bash 32019 [000]  2564.541339:             preemptirq:preempt_enable: caller=__down_write_common+0x34b parent=__down_write_common+0x34b
bash 32019 [000]  2564.541339:                  lock:rw_sem_acquired: write 0xffff9e992418fa90F unlink_file_vma+0x2d
bash 32019 [000]  2564.541340:                   lock:rw_sem_release: write 0xffff9e992418fa90F free_pgtables+0xbe
bash 32019 [000]  2564.541340:                preemptirq:irq_disable: caller=__up_write+0x2c parent=(nil)F
bash 32019 [000]  2564.541340:            preemptirq:preempt_disable: caller=__up_write+0x36 parent=__up_write+0x36
bash 32019 [000]  2564.541342:                 preemptirq:irq_enable: caller=__up_write_unlock+0x75 parent=(nil)F
bash 32019 [000]  2564.541344:                preemptirq:irq_disable: caller=trace_hardirqs_off_thunk+0x1a parent=interrupt_entry+0xda
bash 32019 [000]  2564.541344:         irq_vectors:local_timer_entry: vector=236
bash 32019 [000]  2564.541345:                 lock:rt_mutex_request: pendingb_lock+0x0 queue_work_on+0x41

And voilà! a real-time mutex is being taken in the timer IRQ.

This violates two specifications, that specifies that it is not possible to take mutexes in with the interrupts disabled and that IRQ handlers do not take mutexes, as in Figure bellow:


Mutex cannot be taken with interrupts disabled.

Operations that are disabled in the interrupt handlers.


It is possible then to observe the code path that takes the rt_spin_lock (a mutex, in the practice):

 0)   ==========> |
 0)               |  smp_apic_timer_interrupt() {
 0)               |    __irq_enter_early_tp() {
 0)               |      rcu_irq_enter() {
 0)   0.690 us    |        rcu_nmi_enter();
 0)   2.032 us    |      }
 0)   3.361 us    |    }
 0)               |    irq_enter() {
 0)               |      rcu_irq_enter() {
 0)   0.635 us    |        rcu_nmi_enter();
 0)   1.792 us    |      }
 0)               |      tick_irq_enter() {
 0)   0.792 us    |        tick_check_oneshot_broadcast_this_cpu();
 0)   0.897 us    |        ktime_get();
 0)               |        update_ts_time_stats.constprop.32() {
 0)   0.594 us    |          nr_iowait_cpu();
 0)   1.926 us    |        }
 0)               |        tick_do_update_jiffies64.part.17() {
 0)   0.668 us    |          preempt_count_add();
 0)   0.594 us    |          preempt_count_add();
 0)               |          do_timer() {
 0)   0.676 us    |            calc_global_load();
 0)   1.808 us    |          }
 0)   0.638 us    |          preempt_count_sub();
 0)   0.551 us    |          preempt_count_sub();
 0)               |          update_wall_time() {
 0)               |            timekeeping_advance() {
 0)   0.567 us    |              preempt_count_add();
 0)   0.554 us    |              ntp_tick_length();
 0)   0.552 us    |              ntp_tick_length();
 0)   0.560 us    |              preempt_count_add();
 0)               |              timekeeping_update() {
 0)   0.552 us    |                ntp_get_next_leap();
 0)   0.769 us    |                update_vsyscall();
 0)               |                raw_notifier_call_chain() {
 0)   0.554 us    |                  notifier_call_chain();
 0)   1.621 us    |                }
 0)   0.565 us    |                update_fast_timekeeper();
 0)   0.576 us    |                update_fast_timekeeper();
 0)   7.533 us    |              }
 0)   0.614 us    |              preempt_count_sub();
 0)   0.608 us    |              preempt_count_sub();
 0) + 16.109 us   |            }
 0) + 17.268 us   |          }
 0) + 25.610 us   |        }
 0) + 32.431 us   |      }
 0)               |      irqtime_account_irq() {
 0)   0.547 us    |        in_serving_softirq();
 0)   2.045 us    |      }
 0)   0.561 us    |      preempt_count_add();
 0) + 39.845 us   |    }
 0)               |    hrtimer_interrupt() {
 0)   0.626 us    |      preempt_count_add();
 0)   0.786 us    |      ktime_get_update_offsets_now();
 0)               |      __hrtimer_run_queues.constprop.28() {
 0)   1.021 us    |        __remove_hrtimer();
 0)   0.574 us    |        preempt_count_sub();
 0)               |        watchdog_timer_fn() {
 0)               |          completion_done() {
 0)   0.541 us    |            preempt_count_add();
 0)   0.554 us    |            preempt_count_sub();
 0)   2.946 us    |          }
 0)               |          queue_work_on() {
 0)   0.597 us    |            migrate_disable();
 0)               |            rt_spin_lock() {
 0)   0.554 us    |              migrate_disable();
 0)               |              /* rt_mutex_request: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)               |              /* rt_mutex_acquired: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   3.350 us    |            }
 0)               |            __queue_work() {
 0)   0.542 us    |              __rcu_read_lock();
 0)   0.593 us    |              get_work_pool();
 0)               |              rt_spin_lock() {
 0)   0.562 us    |                migrate_disable();
 0)               |                /* rt_mutex_request: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)               |                /* rt_mutex_acquired: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   2.446 us    |              }
 0)               |              insert_work() {
 0)   0.563 us    |                get_pwq.isra.32();
 0)               |                wake_up_worker() {
 0)   0.547 us    |                  preempt_count_add();
 0)               |                  wake_up_process() {
 0)               |                    try_to_wake_up() {
 0)   0.560 us    |                      preempt_count_add();
 0)   0.558 us    |                      preempt_count_add();
 0)               |                      update_rq_clock() {
 0)   0.590 us    |                        kvm_steal_clock();
 0)   1.820 us    |                      }
 0)               |                      activate_task() {
 0)               |                        enqueue_task_fair() {
 0)               |                          enqueue_entity() {
 0)   0.580 us    |                            update_curr();
 0)   0.626 us    |                            update_cfs_group();
 0)   0.587 us    |                            __enqueue_entity();
 0)   4.352 us    |                          }
 0)   0.553 us    |                          hrtick_update();
 0)   6.598 us    |                        }
 0)   7.895 us    |                      }
 0)               |                      check_preempt_curr() {
 0)   0.629 us    |                        resched_curr();
 0)   2.035 us    |                      }
 0)   0.555 us    |                      preempt_count_sub();
 0)   0.548 us    |                      preempt_count_sub();
 0) + 19.316 us   |                    }
 0) + 20.400 us   |                  }
 0)   0.567 us    |                  preempt_count_sub();
 0) + 23.781 us   |                }
 0) + 26.107 us   |              }
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: cpu_worker_pools+0x0/0x680 return_to_handler+0x0/0x36 */
 0)   0.580 us    |                migrate_enable();
 0)   2.343 us    |              }
 0)   0.576 us    |              __rcu_read_unlock();
 0) + 36.708 us   |            }
 0)               |            queue_delayed_work_on.part.49() {
 0)               |              rt_spin_unlock() {
 0)               |                /* rt_mutex_release: pendingb_lock+0x0/0x40 return_to_handler+0x0/0x36 */
 0)   0.547 us    |                migrate_enable();
 0)   2.034 us    |              }
 0)   0.968 us    |              migrate_enable();
 0)   4.576 us    |            }
 0) + 48.171 us   |          }
 0)   0.704 us    |          ktime_get();
 0)   0.552 us    |          hrtimer_forward();
 0)   0.622 us    |          __touch_watchdog();
 0) + 57.152 us   |        }
 0)   0.543 us    |        preempt_count_add();
 0)   0.771 us    |        enqueue_hrtimer();
 0) + 63.902 us   |      }
 0)               |      __hrtimer_get_next_event() {
 0)   1.038 us    |        __hrtimer_next_event_base();
 0)   0.673 us    |        __hrtimer_next_event_base();
 0)   3.437 us    |      }
 0)   0.558 us    |      preempt_count_sub();
 0)               |      tick_program_event() {
 0)               |        clockevents_program_event() {
 0)   0.658 us    |          ktime_get();
 0)   8.836 us    |          lapic_next_deadline();
 0) + 11.393 us   |        }
 0) + 12.489 us   |      }
 0) + 86.907 us   |    }
 0)               |    irq_exit() {
 0)   0.722 us    |      irqtime_account_irq();
 0)   0.602 us    |      preempt_count_sub();
 0)   0.572 us    |      idle_cpu();
 0)               |      rcu_irq_exit() {
 0)   0.588 us    |        rcu_nmi_exit();
 0)   1.648 us    |      }
 0)   6.500 us    |    }
 0)               |    __irq_exit_late_tp() {
 0)               |      rcu_irq_exit() {
 0)   0.572 us    |        rcu_nmi_exit();
 0)   1.624 us    |      }
 0)   2.690 us    |    }
 0) ! 146.366 us  |  }
 0)   <========== |

In the single core, the following chain of events can take place:

smp_apic_timer_interrupt(){
	hrtimer_interrupt() {
		__hrtimer_run_queues() {
			watchdog_timer_fn() {
				stop_one_cpu_nowait() {
					#ifdef !CONFIG_SMP
						schedule_work() {
							queue_work() {
								queue_work_on() {
								/* phew, long, ah!?.... */	
									local_lock_irqsave() {
										__local_lock_irqsave() {
											__local_lock_irq() {
												spin_lock_irqsave() {
													rt_spin_lock()...
														/* from here on you already know...*/
														/* a lot of } */

And this can cause a schedling while in atomic. Tracking down, the commit that introduced the stop_one_cpu_nowait() to the path was:

commit 9cf57731b63e37ed995b46690adc604891a9a28f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 7 10:52:03 2018 +0200

    watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work
    
    Oleg suggested to replace the "watchdog/%u" threads with
    cpu_stop_work. That removes one thread per CPU while at the same time
    fixes softlockup vs SCHED_DEADLINE.
    
    But more importantly, it does away with the single
    smpboot_update_cpumask_percpu_thread() user, which allows
    cleanups/shrinkage of the smpboot interface.
    
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
</mingo@kernel.org></tglx@linutronix.de></peterz@infradead.org></torvalds@linux-foundation.org></peterz@infradead.org></oleg@redhat.com></peterz@infradead.org>

later modified by:

commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
    
    When scheduling is delayed for longer than the softlockup interrupt
    period it is possible to double-queue the cpu_stop_work, causing list
    corruption.
    
    Cure this by adding a completion to track the cpu_stop_work's
    progress.
    
    Reported-by: kernel test robot <lkp@intel.com>
    Tested-by: Rong Chen <rong.a.chen@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
</mingo@kernel.org></tglx@linutronix.de></peterz@infradead.org></torvalds@linux-foundation.org></peterz@infradead.org></rong.a.chen@intel.com></lkp@intel.com></peterz@infradead.org>

The BUG report was made, and you can find it here:
https://www.spinics.net/lists/linux-rt-users/msg20376.html

ISORC 2019 Material

Here is the companion material for the paper: Untangling the Intricacies of Thread Synchronization in the PREEMPT_RT Linux Kernel by Daniel Bristot de Oliveira, Rômulo Silva de Oliveira, and Tommaso Cucinotta, to be presented at the ISORC 2019.

NOTE: The latest version of the model and patches can be found here: http://bristot.me/linux-task-model/

The model was developed using the Supremica tool. You can find more information about Supremica here.

The source code of the model can be found here. It can be opened with Supremica. From Supremica it is possible to visualize, edit, analyze, simulate and export the model.

In the Linux part, you can find the patch with perf task_model and tracepoints here. It applies in the version v4.14.15-rt13 of the development tree of the PREEMPT RT.

It is possible to use a virtual machine to run the experiments. Here is the definition of a QEMU/KVM virtual machine and the kernel config file that works with this VM

Linux Plumbers 2018 report

It was a success!

This year I ended up making five talks! All related to real-time Linux, as expected. As feedback, I think that the most important are:

– The idea of using automata model for explaining Linux’s dynamics and, mainly, to verify the behavior of the system was very welcome by the community. They want even more: To extend the idea to other subsystems of the kernel.

– The semi-partitioned scheduler is a wish for Linux!

– We need new metrics for the RT, and we need to integrate this with probabilistic execution time analysis. I was already planning to do that, but now I am even more motivated!

Deliver at least part of these ideas is my main motivation for the next year, and I hope to talk about them at the Linux Plumbers 2019!

Here are the slides of my presentations:

Mind the gap between real-time Linux and real-time theory

SCHED_DEADLINE desiderata and slightly crazy ideas

How can we catch problems that can break the PREEMPT_RT preemption model?

Beyond the latency: New metrics for the real-time kernel

The talks were recorded, I will post links here as soon as they post it!

Real-time Linux Summit 2018 – My slides

Hi!

Here are the slides of my presentation in this year’s Real-time Linux Summit.

The title is “Mind the gap: between real-time Linux and real-time theory (Part I).” It is about the creation of a formal model for tasks on Linux, in the intersection of the viewpoint of real-time Linux developers and real-time researchers.

I received excellent feedback from the Linux community. There is interest in applying my method for the verification of the PREEMPT-RT kernel, even for areas that I was not expecting, like static code analysis.

For a computer scientist, seeing practical results of applying a formal method (automata) to a real and complex system is the proof that those long hours studying theory are not in vain :-).

Modeling the Behavior of Threads in the PREEMPT_RT Linux Kernel Using Automata: All material

From the communication:

“On behalf of the Program Committee for EWiLi 2018, I am pleased to inform you that the following submission has been accepted for publication in the workshop: Modeling the Behavior of Threads in the PREEMPT_RT Linux Kernel Using Automata”

The best thing is that all the revisor’s comments are constructive. All three revisors gave an “accepted” on the paper (a paper can be: accepted, weakly accepted, borderline, weakly rejected, rejected)!

You can find the paper in this link: Paper

The presentation in this link: Slides

The model in this link: Model

The code is in this link: Code

All open-souce!

An Empirical Study on the Adequacy of MBPTA for Tasks Executed on a Complex Computer Architecture with Linux

The paper An Empirical Study on the Adequacy of MBPTA for Tasks Executed on a Complex Computer Architecture with Linux, which I am co-author, was awared with traveling funding by  the IEEE IES S&YP!

Karila and Luis Fernando made a video about the paper, check this out: