ORNL - tipparajuv@ornl.gov
/* --------------------------------------------------------------------------- *\
   PORTALS_USE_RENDEZ_VOUS
   =======================
   When the number of PEs gets very large, the data server is required to have
   buffer space available for all possible incoming messages, which is defined
   by PORTALS_MAX_DESCRIPTORS = (MAX_BUFS+MAX_SMALL_BUFS).

   For each PE, the DS must have at least:
      min_memory_per_pe = PORTALS_MAX_BUFS*PORTALS_BUF_SIZE +
                          PORTALS_MAX_SMALL_BUFS*PORTALS_SMALL_BUF_SIZE

   This becomes a memory constraint at large core counts.

   Rendez-vous messaging is one mechanism to get around requiring the DS to
   have buffer space for all messages.  When rendez-vous (RZV) messaging is
   enabled, the messages that use the large buffers no longer send the entire
   buffer "eagerly".  Instead, only the data request (request_header_t) gets
   sent to the data server.  When the data server is ready to handle the
   request, it "pulls" the entire buffer over via a portals_get operation.

   One can immediately see that this can lead to a slowdown in performance,
   since the data server is idle while it pulls the data over.  This is the
   price paid for removing the buffering for those messages.  Ideally, while
   the DS is pulling the message, it could be processing another request.
   This double-buffering technique needs to be programmed in.  Care must be
   taken to ensure proper ARMCI behavior: the next request handled cannot be
   from the same PE, nor can it be a FENCE operation ... all other (?)
   requests/operations can be double buffered.
\* --------------------------------------------------------------------------- */
# define PORTALS_USE_RENDEZ_VOUS
/* --------------------------------------------------------------------------- *\
   PORTALS_LIMIT_REMOTE_REQUESTS_BY_NODE
   =====================================
   Another means to reduce the buffer space needed by the data server is to
   limit the number of cores that can talk to the data server at any given
   moment.  When this option is turned on, only 1 request per node is allowed
   to be in the buffer of any given data server.  On a 10-core node, the size
   of the buffer required by the data server is reduced by more than an order
   of magnitude.  You get more than an order of magnitude because you don't
   need to reserve space for any of the small buffers, since you can only have
   one small or one large buffer from any given node in the DS buffer at any
   one time.  Another major benefit is that you can increase MAX_BUFS and
   MAX_SMALL_BUFS to increase concurrency without affecting the DS's buffer
   size.

   Can be used with PORTALS_USE_RENDEZ_VOUS.

   notes: every request needs to respond with an ack, even gets.
          acks actually send data when we limit remote requests ... the ack
          response is needed to signal that the outstanding request has
          been finished by the data server ... the ack zeros out the index
          in the active_requests_by_node array.
\* --------------------------------------------------------------------------- */
# define PORTALS_LIMIT_REMOTE_REQUESTS_BY_NODE_TURNED_OFF
/* --------------------------------------------------------------------------- *\
   PORTALS_AFFINITY
   ================
   When initializing compute processes and data servers, the affinity passed
   in by aprun/alps is ignored.

   Compute processes are bound strictly to a particular core.  Cores are
   evenly divided between sockets, keeping the last core (mask = 1 << (ncpus-1))
   free for the data server.

   If the node is not fully subscribed, then the data server is bound to the
   last core on the node (mask = 1 << (ncpus-1)); otherwise, the data server
   is "free floating" (mask = (1 << ncpus)-1) on a fully subscribed node.
\* --------------------------------------------------------------------------- */
# define PORTALS_AFFINITY
# define PORTALS_AFFINITY_NSOCKETS 2
/* --------------------------------------------------------------------------- *\
   Use MDMD copy instead of PtlGetRegion for on-node "local" transfers
\* --------------------------------------------------------------------------- */
# define CRAY_USE_MDMD_COPY
/* --------------------------------------------------------------------------- *\
   ORNL_USE_DS_FOR_REMOTE_GETS
   ===========================
   Vinod informed us of a modification that can be made to enable the use of
   the data server for remote gets.  Without this option, direct gets are
   used.  This can cause severe network congestion, because many armci_gets
   are not stride 1.  The data server packs those gets into contiguous blocks
   and sends them back as a single put.  The direct gets, however, require a
   separate network transaction for each contiguous segment of the request.

   Unfortunately, there is a small bug in the DS for remote gets.  This bug
   may cause the program to abort or print out the following message:

      %d: server wrote data at unexpected offset %d

   This bug is actively being worked on @ CRAY and ORNL.
\* --------------------------------------------------------------------------- */
# define ORNL_USE_DS_FOR_REMOTE_GETS
# define CRAY_USE_ARMCI_CLIENT_BUFFERS