/*
 * File descriptors management functions.
 *
 * Copyright 2000-2014 Willy Tarreau <w@1wt.eu>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
/*
 * This code implements an events cache for file descriptors. It remembers the
 * readiness of a file descriptor after a return from poll() and the fact that
 * an I/O attempt failed on EAGAIN. Events in the cache which are still marked
 * ready and active are processed just as if they were reported by poll().
 *
 * This serves multiple purposes. First, it significantly improves performance
 * by avoiding subscribing to polling unless absolutely necessary, so most
 * events are processed without polling at all, especially send() which
 * benefits from the socket buffers. Second, it is the only way to support
 * edge-triggered pollers (eg: EPOLL_ET). And third, it enables I/O operations
 * that are backed by invisible buffers. For example, SSL is able to read a
 * whole socket buffer and not deliver it to the application buffer because
 * it's full. Unfortunately, it won't be reported by a poller anymore until
 * some new activity happens. The only way to call it again is thus to keep
 * this readiness information in the cache and to access it without polling
 * once the FD is enabled again.
 */
/*
 * One interesting feature of the cache is that it maintains the principle
 * of speculative I/O introduced in haproxy 1.3 : the first time an event is
 * enabled, the FD is considered ready so that the I/O attempt is performed
 * via the cache without polling. Polling happens only when EAGAIN is
 * first met. This avoids polling for HTTP requests, especially when the
 * defer-accept mode is used. It also avoids polling for sending short data
 * such as requests to servers or short responses to clients.
 */
/*
 * The cache consists in a list of active events and a list of updates.
 * Active events are events that are expected to come and that we must report
 * to the application until it asks to stop or asks to poll. Updates are new
 * requests for changing an FD state. Updates are the only way to create new
 * events. This is important because it means that the number of cached events
 * cannot increase between updates and will only grow one at a time while
 * processing updates. All updates must always be processed, though events
 * might be processed by small batches if required.
 *
 * There is no direct link between the FD and the updates list. There is only a
 * bit in the fdtab[] to indicate that a file descriptor is already present in
 * the updates list. Once an fd is present in the updates list, it will have to
 * be considered even if its changes are reverted in the middle or if the fd is
 * replaced.
 */
/*
 * It is important to understand that as long as all expected events are
 * processed, they might starve the polled events, especially because polled
 * I/O starvation quickly induces more cached I/O. One solution to this
 * consists in only processing a part of the events at once, but one drawback
 * is that unhandled events will still wake the poller up. Using an edge-
 * triggered poller such as EPOLL_ET will solve this issue though.
 */
/*
 * Since we do not want to scan all the FD list to find cached I/O events,
 * we store them in a list consisting in a linear array holding only the FD
 * indexes right now. Note that a closed FD cannot exist in the cache, because
 * it is closed by fd_delete() which in turn calls fd_release_cache_entry()
 * which always removes it from the list.
 *
 * The FD array has to hold a back reference to the cache. This reference is
 * always valid unless the FD is not in the cache and is not updated, in which
 * case the reference points to index 0.
 */
/*
 * The event state for an FD, as found in fdtab[].state, is maintained for each
 * direction. The state field is built this way, with R bits in the low nibble
 * and W bits in the high nibble for ease of access and debugging :
 *
 *       7    6    5    4   3    2    1    0
 *     [ 0 | PW | RW | AW | 0 | PR | RR | AR ]
 *
 *     A* = active     *R = read
 *     P* = polled     *W = write
 *     R* = ready
 */
/*
 * An FD is marked "active" when there is a desire to use it.
 * An FD is marked "polled" when it is registered in the polling.
 * An FD is marked "ready" when it has not faced a new EAGAIN since last wake-up
 * (it is a cache of the last EAGAIN regardless of polling changes).
 *
 * We have 8 possible states for each direction based on these 3 flags :
 *
 *   +---+---+---+----------+---------------------------------------------+
 *   | P | R | A | State    | Description                                 |
 *   +---+---+---+----------+---------------------------------------------+
 *   | 0 | 0 | 0 | DISABLED | No activity desired, not ready.             |
 *   | 0 | 0 | 1 | MUSTPOLL | Activity desired via polling.               |
 *   | 0 | 1 | 0 | STOPPED  | End of activity without polling.            |
 *   | 0 | 1 | 1 | ACTIVE   | Activity desired without polling.           |
 *   | 1 | 0 | 0 | ABORT    | Aborted poll(). Not frequently seen.        |
 *   | 1 | 0 | 1 | POLLED   | FD is being polled.                         |
 *   | 1 | 1 | 0 | PAUSED   | FD was paused while ready (eg: buffer full) |
 *   | 1 | 1 | 1 | READY    | FD was marked ready by poll()               |
 *   +---+---+---+----------+---------------------------------------------+
 */
/*
 * The transitions are pretty simple :
 *   - fd_want_*() : set flag A
 *   - fd_stop_*() : clear flag A
 *   - fd_cant_*() : clear flag R (when facing EAGAIN)
 *   - fd_may_*()  : set flag R (upon return from poll())
 *   - sync()      : if (A) { if (!R) P := 1 } else { P := 0 }
 *
 * The PAUSED, ABORT and MUSTPOLL states are transient for level-triggered
 * pollers and are fixed by the sync() which happens at the beginning of the
 * poller. For event-triggered pollers, only the MUSTPOLL state will be
 * transient and ABORT will lead to PAUSED. The ACTIVE state is the only stable
 * one which has P != A.
 */
/*
 * The READY state is a bit special as activity on the FD might be notified
 * both by the poller and by the cache. But it is needed for some multi-layer
 * protocols (eg: SSL) where connection activity is not 100% linked to FD
 * activity. Also some pollers might prefer to implement it as ACTIVE if
 * enabling/disabling the FD is cheap. The READY and ACTIVE states are the
 * two states for which a cache entry is allocated.
 */
/*
 * The state transitions look like the diagram below. Only the 4 right states
 * have polling enabled :
 *
 *     (POLLED=0)        (POLLED=1)
 *
 *   +----------+  sync  +-------+
 *   | DISABLED | <----- | ABORT |         (READY=0, ACTIVE=0)
 *   +----------+        +-------+
 *  clr |  ^           set |  ^
 *      |  |               |  |
 *      v  | set           v  | clr
 *   +----------+  sync  +--------+
 *   | MUSTPOLL | -----> | POLLED |        (READY=0, ACTIVE=1)
 *   +----------+        +--------+
 *            ^          poll |  ^
 *            |               |  |
 *            | EAGAIN        v  | EAGAIN
 *   +--------+           +-------+
 *   | ACTIVE |           | READY |        (READY=1, ACTIVE=1)
 *   +--------+           +-------+
 *  clr |  ^           set |  ^
 *      |  |               |  |
 *      v  | set           v  | clr
 *   +---------+  sync  +--------+
 *   | STOPPED | <------ | PAUSED |        (READY=1, ACTIVE=0)
 *   +---------+        +--------+
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

#include <proto/fd.h>
#include <proto/port_range.h>

struct poller cur_poller;
int nbpollers = 0;

unsigned int *fd_cache = NULL; // FD events cache
unsigned int *fd_updt = NULL;  // FD updates list
int fd_cache_num = 0;          // number of events in the cache
int fd_nbupdt = 0;             // number of updates in the list
/* Deletes an FD from the fdsets, and recomputes the maxfd limit.
 * The file descriptor is also closed.
 */
void fd_delete(int fd)
{
	if (fdtab[fd].linger_risk) {
		/* this is generally set when connecting to servers */
		setsockopt(fd, SOL_SOCKET, SO_LINGER,
			   (struct linger *) &nolinger, sizeof(struct linger));
	}

	if (cur_poller.clo)
		cur_poller.clo(fd);

	fd_release_cache_entry(fd);

	port_range_release_port(fdinfo[fd].port_range, fdinfo[fd].local_port);
	fdinfo[fd].port_range = NULL;
	close(fd);
	fdtab[fd].owner = NULL;

	while ((maxfd-1 >= 0) && !fdtab[maxfd-1].owner)
		maxfd--;
}
/* Scan and process the cached events. This should be called right after
 * the poller.
 */
void fd_process_cached_events()
{
	int fd, entry, e;

	for (entry = 0; entry < fd_cache_num; ) {
		fd = fd_cache[entry];
		e = fdtab[fd].state;

		/* Principle: events which are marked FD_EV_ACTIVE are processed
		 * with their usual I/O callback. The callback may remove the
		 * events from the cache or tag them for polling. Changes will be
		 * applied on next round. Cache entries with no more activity are
		 * automatically scheduled for removal.
		 */
		fdtab[fd].ev &= FD_POLL_STICKY;

		if ((e & (FD_EV_READY_R | FD_EV_ACTIVE_R)) == (FD_EV_READY_R | FD_EV_ACTIVE_R))
			fdtab[fd].ev |= FD_POLL_IN;

		if ((e & (FD_EV_READY_W | FD_EV_ACTIVE_W)) == (FD_EV_READY_W | FD_EV_ACTIVE_W))
			fdtab[fd].ev |= FD_POLL_OUT;

		if (fdtab[fd].iocb && fdtab[fd].owner && fdtab[fd].ev)
			fdtab[fd].iocb(fd);
		else
			fd_release_cache_entry(fd);

		/* If the fd was removed from the cache, it has been
		 * replaced by the next one that we don't want to skip !
		 */
		if (entry < fd_cache_num && fd_cache[entry] != fd)
			continue;
		entry++;
	}
}
/* Check the events attached to a file descriptor, update its cache
 * accordingly, and call the associated I/O callback. If new updates are
 * detected, the function tries to process them as well in order to save
 * wakeups after accept().
 */
void fd_process_polled_events(int fd)
{
	int new_updt, old_updt;

	/* First thing to do is to mark the reported events as ready, in order
	 * for them to later be continued from the cache without polling if
	 * they have to be interrupted (eg: recv fills a buffer).
	 */
	if (fdtab[fd].ev & (FD_POLL_IN | FD_POLL_HUP | FD_POLL_ERR))
		fd_may_recv(fd);

	if (fdtab[fd].ev & (FD_POLL_OUT | FD_POLL_ERR))
		fd_may_send(fd);

	if (fdtab[fd].cache) {
		/* This fd is already cached, no need to process it now. */
		return;
	}

	if (unlikely(!fdtab[fd].iocb || !fdtab[fd].ev)) {
		fdtab[fd].ev = 0;
		return;
	}

	/* Save number of updates to detect creation of new FDs. */
	old_updt = fd_nbupdt;
	fdtab[fd].iocb(fd);

	/* One or more fd might have been created during the iocb().
	 * This mainly happens with new incoming connections that have
	 * just been accepted, so we'd like to process them immediately
	 * for better efficiency, as it saves one useless task wakeup.
	 * Second benefit, if at the end the fds are disabled again, we can
	 * safely destroy their update entry to reduce the scope of later
	 * scans. This is the reason we scan the new entries backwards.
	 */
	for (new_updt = fd_nbupdt; new_updt > old_updt; new_updt--) {
		fd = fd_updt[new_updt - 1];

		fdtab[fd].ev &= FD_POLL_STICKY;

		if ((fdtab[fd].state & FD_EV_STATUS_R) == (FD_EV_READY_R | FD_EV_ACTIVE_R))
			fdtab[fd].ev |= FD_POLL_IN;

		if ((fdtab[fd].state & FD_EV_STATUS_W) == (FD_EV_READY_W | FD_EV_ACTIVE_W))
			fdtab[fd].ev |= FD_POLL_OUT;

		if (fdtab[fd].ev && fdtab[fd].iocb && fdtab[fd].owner)
			fdtab[fd].iocb(fd);

		/* we can remove this update entry if it's the last one and is
		 * unused, otherwise we don't touch anything, especially given
		 * that the FD might have been closed already.
		 */
		if (new_updt == fd_nbupdt && !fd_recv_active(fd) && !fd_send_active(fd)) {
			fdtab[fd].updated = 0;
			fd_nbupdt--;
		}
	}
}
/* disable the specified poller */
void disable_poller(const char *poller_name)