]>
Commit | Line | Data |
---|---|---|
db4df481 | 1 | ================ |
b30505c8 | 2 | Futex Requeue PI |
db4df481 | 3 | ================ |
b30505c8 DH |
4 | |
5 | Requeueing of tasks from a non-PI futex to a PI futex requires | |
6 | special handling in order to ensure the underlying rt_mutex is never | |
7 | left without an owner if it has waiters; doing so would break the PI | |
8 | boosting logic [see rt-mutex-desgin.txt] For the purposes of | |
9 | brevity, this action will be referred to as "requeue_pi" throughout | |
10 | this document. Priority inheritance is abbreviated throughout as | |
11 | "PI". | |
12 | ||
13 | Motivation | |
14 | ---------- | |
15 | ||
16 | Without requeue_pi, the glibc implementation of | |
17 | pthread_cond_broadcast() must resort to waking all the tasks waiting | |
18 | on a pthread_condvar and letting them try to sort out which task | |
19 | gets to run first in classic thundering-herd formation. An ideal | |
20 | implementation would wake the highest-priority waiter, and leave the | |
21 | rest to the natural wakeup inherent in unlocking the mutex | |
22 | associated with the condvar. | |
23 | ||
db4df481 | 24 | Consider the simplified glibc calls:: |
b30505c8 | 25 | |
db4df481 MCC |
26 | /* caller must lock mutex */ |
27 | pthread_cond_wait(cond, mutex) | |
28 | { | |
29 | lock(cond->__data.__lock); | |
30 | unlock(mutex); | |
31 | do { | |
32 | unlock(cond->__data.__lock); | |
33 | futex_wait(cond->__data.__futex); | |
34 | lock(cond->__data.__lock); | |
35 | } while(...) | |
36 | unlock(cond->__data.__lock); | |
37 | lock(mutex); | |
38 | } | |
b30505c8 | 39 | |
db4df481 MCC |
40 | pthread_cond_broadcast(cond) |
41 | { | |
42 | lock(cond->__data.__lock); | |
43 | unlock(cond->__data.__lock); | |
44 | futex_requeue(cond->data.__futex, cond->mutex); | |
45 | } | |
b30505c8 DH |
46 | |
47 | Once pthread_cond_broadcast() requeues the tasks, the cond->mutex | |
48 | has waiters. Note that pthread_cond_wait() attempts to lock the | |
49 | mutex only after it has returned to user space. This will leave the | |
50 | underlying rt_mutex with waiters, and no owner, breaking the | |
51 | previously mentioned PI-boosting algorithms. | |
52 | ||
53 | In order to support PI-aware pthread_condvar's, the kernel needs to | |
54 | be able to requeue tasks to PI futexes. This support implies that | |
55 | upon a successful futex_wait system call, the caller would return to | |
56 | user space already holding the PI futex. The glibc implementation | |
db4df481 | 57 | would be modified as follows:: |
b30505c8 DH |
58 | |
59 | ||
db4df481 MCC |
60 | /* caller must lock mutex */ |
61 | pthread_cond_wait_pi(cond, mutex) | |
62 | { | |
63 | lock(cond->__data.__lock); | |
64 | unlock(mutex); | |
65 | do { | |
66 | unlock(cond->__data.__lock); | |
67 | futex_wait_requeue_pi(cond->__data.__futex); | |
68 | lock(cond->__data.__lock); | |
69 | } while(...) | |
70 | unlock(cond->__data.__lock); | |
71 | /* the kernel acquired the mutex for us */ | |
72 | } | |
b30505c8 | 73 | |
db4df481 MCC |
74 | pthread_cond_broadcast_pi(cond) |
75 | { | |
76 | lock(cond->__data.__lock); | |
77 | unlock(cond->__data.__lock); | |
78 | futex_requeue_pi(cond->data.__futex, cond->mutex); | |
79 | } | |
b30505c8 DH |
80 | |
81 | The actual glibc implementation will likely test for PI and make the | |
82 | necessary changes inside the existing calls rather than creating new | |
83 | calls for the PI cases. Similar changes are needed for | |
84 | pthread_cond_timedwait() and pthread_cond_signal(). | |
85 | ||
86 | Implementation | |
87 | -------------- | |
88 | ||
89 | In order to ensure the rt_mutex has an owner if it has waiters, it | |
90 | is necessary for both the requeue code, as well as the waiting code, | |
91 | to be able to acquire the rt_mutex before returning to user space. | |
92 | The requeue code cannot simply wake the waiter and leave it to | |
93 | acquire the rt_mutex as it would open a race window between the | |
94 | requeue call returning to user space and the waiter waking and | |
95 | starting to run. This is especially true in the uncontended case. | |
96 | ||
97 | The solution involves two new rt_mutex helper routines, | |
98 | rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which | |
99 | allow the requeue code to acquire an uncontended rt_mutex on behalf | |
100 | of the waiter and to enqueue the waiter on a contended rt_mutex. | |
101 | Two new system calls provide the kernel<->user interface to | |
40a35503 | 102 | requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI. |
b30505c8 DH |
103 | |
104 | FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() | |
105 | and pthread_cond_timedwait()) to block on the initial futex and wait | |
106 | to be requeued to a PI-aware futex. The implementation is the | |
107 | result of a high-speed collision between futex_wait() and | |
108 | futex_lock_pi(), with some extra logic to check for the additional | |
109 | wake-up scenarios. | |
110 | ||
40a35503 | 111 | FUTEX_CMP_REQUEUE_PI is called by the waker |
b30505c8 DH |
112 | (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and |
113 | possibly wake the waiting tasks. Internally, this system call is | |
114 | still handled by futex_requeue (by passing requeue_pi=1). Before | |
115 | requeueing, futex_requeue() attempts to acquire the requeue target | |
116 | PI futex on behalf of the top waiter. If it can, this waiter is | |
117 | woken. futex_requeue() then proceeds to requeue the remaining | |
118 | nr_wake+nr_requeue tasks to the PI futex, calling | |
119 | rt_mutex_start_proxy_lock() prior to each requeue to prepare the | |
120 | task as a waiter on the underlying rt_mutex. It is possible that | |
121 | the lock can be acquired at this stage as well, if so, the next | |
122 | waiter is woken to finish the acquisition of the lock. | |
123 | ||
40a35503 | 124 | FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but |
b30505c8 DH |
125 | their sum is all that really matters. futex_requeue() will wake or |
126 | requeue up to nr_wake + nr_requeue tasks. It will wake only as many | |
127 | tasks as it can acquire the lock for, which in the majority of cases | |
128 | should be 0 as good programming practice dictates that the caller of | |
129 | either pthread_cond_broadcast() or pthread_cond_signal() acquire the | |
40a35503 | 130 | mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that |
b30505c8 DH |
131 | nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for |
132 | signal. |