Specification of the Grid Engine Master monitoring
Extended work with the existing profiling in the SGE 6.0 system, which is
used in the scheduler and the master, showed that it is not sufficient for
understanding bottlenecks in the master. Furthermore, it does not keep any
statistics on the kind of messages processed. This information is vital for
improving the performance of the master and for understanding the issues
customers are facing. This document describes the kind of data that is
monitored and how the output is formatted.
We currently have two different ways to get performance information:
- qping - health monitoring and commlib load
- qmaster profiling - measures different sections in the master

The qmaster profiling output is printed to the master messages file, while
qping has its own way of printing information. The monitoring should use both
ways to ensure easy processing of the data.
The monitoring should be capable of monitoring an unlimited number of threads
without too many pitfalls.
The health monitoring is currently implemented on top of the profiling
library. Because the health monitoring and the performance monitoring are so
similar, this should be changed: the health monitoring should be built
on top of the new monitoring.
The implementation documentation, including how to use the monitoring in
one's own code, is found in the files 'source/uti/sge_monitor.[c|h]'.
The monitoring is only implemented for the qmaster. Therefore, the
configuration edited with qconf -mconf got two new switches for the
qmaster_params, documented in sge_conf(5):
> Specifies the time interval at which the monitoring information should be printed. The
> monitoring is disabled by default and can be enabled by specifying an interval.
> The monitoring is per thread and is written to the messages file or displayed by
> the "qping -f" command line tool. Example: MONITOR_TIME=0:0:10 generates and prints
> the monitoring information approximately every 10 seconds. The specified
> time is a guideline and not a fixed interval. The interval actually used is printed and
> can be anything between 9 and 20 seconds in this example.
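
As an illustration of the interval format, the following Python sketch converts a value such as 0:0:10 into seconds. The function name parse_monitor_time is hypothetical and not part of Grid Engine, and reading the value as hours:minutes:seconds is an assumption based on the example above.

```python
def parse_monitor_time(value):
    """Convert an 'h:m:s' interval string into seconds.

    Hypothetical helper for illustration only. It assumes the value has
    exactly three colon-separated, non-negative integer fields.
    """
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError("expected h:m:s, got %r" % value)
    hours, minutes, seconds = (int(p) for p in parts)
    if min(hours, minutes, seconds) < 0:
        raise ValueError("negative field in %r" % value)
    return hours * 3600 + minutes * 60 + seconds


print(parse_monitor_time("0:0:10"))  # 10
```

Since the configured time is only a guideline, a consumer of the data should rely on the "time:" value printed with each result rather than on the configured interval.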
> The monitoring information is logged into the messages files by default. In addition,
> it is provided to qping and can be requested by it. The messages files can become
> quite big if the monitoring is enabled all the time. Therefore, this switch allows
> disabling the logging into the messages files, so that the monitoring data is only
> available via "qping -f".
Event Master Thread (EDT):
Timed Event Thread (TET):
Each monitoring result for a thread is printed on one line.
TET: runs: 0.10r/s out: 0.00m/s APT: 0.0001s/m idle: 100.00% \
    wait: 0.00% time: 10.00s
MT(1): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s \
    GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s \
    event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% \
    wait: 0.00% time: 7.01s
MT(2): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s \
    GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s \
    event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% \
    wait: 0.00% time: 7.01s
EDT: runs: 2.00r/s out: 0.00m/s APT: 0.0007s/m idle: 99.87% \
    wait: 0.00% time: 5.00s
Line format: "<NAME><NR>: <RUNS>: <OPTIONAL> <EXTENDED_DATA>"
<NR>: This is optional; it is only present if more threads share
<RUNS>: The number of thread loops per second.
<OPTIONAL>: The printed data depends on the configuration and the thread.
However, it is always enclosed in brackets
(Example: "(reports...)"),
<EXTENDED_DATA>: "<OUT>: <APT>: <IDLE>: <WAIT>: <TIME>:"
<OUT>: The number of messages sent per second.
<APT>: Average processing time for the thread loop.
<IDLE>: The percentage of the time spent waiting for an event,
<WAIT>: The percentage of the time spent waiting for important
<TIME>: The length of the time frame the measuring was done in
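
To make the line format concrete, here is a minimal Python sketch that splits one monitoring line into its fields. It is not part of Grid Engine: the names LINE_RE and parse_monitor_line are hypothetical, and the regular expression is derived only from the sample output shown earlier, so it assumes the field labels (runs, out, APT, idle, wait, time) and units appear exactly as in those samples.

```python
import re

# Pattern derived from the sample lines; every unit suffix is an assumption
# taken from those samples, not from the implementation.
LINE_RE = re.compile(
    r"^(?P<name>[A-Z]+)(?:\((?P<nr>\d+)\))?:\s+"
    r"runs:\s+(?P<runs>[\d.]+)r/s\s+"
    r"(?:\((?P<optional>.*)\)\s+)?"
    r"out:\s+(?P<out>[\d.]+)m/s\s+"
    r"APT:\s+(?P<apt>[\d.]+)s/m\s+"
    r"idle:\s+(?P<idle>[\d.]+)%\s+"
    r"wait:\s+(?P<wait>[\d.]+)%\s+"
    r"time:\s+(?P<time>[\d.]+)s$"
)


def parse_monitor_line(line):
    """Return the fields of one monitoring line as a dictionary."""
    # Drop the "\" continuation markers used in this document and
    # collapse all whitespace so the result is a single line.
    text = " ".join(line.replace("\\", " ").split())
    match = LINE_RE.match(text)
    if match is None:
        raise ValueError("not a monitoring line: %r" % text)
    fields = match.groupdict()
    result = {
        "name": fields["name"],
        "nr": int(fields["nr"]) if fields["nr"] else None,
        "optional": fields["optional"],  # raw text inside the brackets
    }
    for key in ("runs", "out", "apt", "idle", "wait", "time"):
        result[key] = float(fields[key])
    return result


sample = (
    "MT(1): runs: 0.14r/s (execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s "
    "GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s "
    "event-acks: 0.00/s) out: 0.00m/s APT: 0.0000s/m idle: 100.00% "
    "wait: 0.00% time: 7.01s"
)
parsed = parse_monitor_line(sample)
print(parsed["name"], parsed["nr"], parsed["runs"], parsed["time"])
```

The optional part is kept as raw text here because its content depends on the thread and the configuration.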
<RUNS> : Number of messages processed per second
<OUT> : Number of messages sent per second
<IDLE> : Percentage waited in the commlib for a new message to process
<WAIT> : Percentage waited for the global lock or at the safe guard

<OPTIONAL> : will have two settings: a default and an extended monitoring
<execd>: number of messages from the execd, in detail
  l: - number of (l)oad reports
  j: - number of (j)ob reports
  c: - number of execd (c)onfig version compares
  p: - number of (p)rocessor reports
  a: - number of (a)cknowledges

<GDI>: number of GDI requests, in detail
  a: - number of (a)dd requests
  g: - number of (g)et requests
  m: - number of (m)odify requests
  d: - number of (d)elete requests
  c: - number of (c)opy requests
  t: - number of (t)rigger requests
  p: - number of verify (p)ermission requests

<event-acks>: number of event client acknowledges and execd acknowledges
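
The one-letter counters can be expanded into readable names with a small lookup table. The Python sketch below is an illustration only: the names EXECD_KEYS, GDI_KEYS and parse_optional are hypothetical, and the layout of the bracketed text is assumed to match the MT sample lines shown earlier.

```python
import re

# Letter-to-meaning tables, copied from the lists above.
EXECD_KEYS = {
    "l": "load reports",
    "j": "job reports",
    "c": "config version compares",
    "p": "processor reports",
    "a": "acknowledges",
}
GDI_KEYS = {
    "a": "add requests",
    "g": "get requests",
    "m": "modify requests",
    "d": "delete requests",
    "c": "copy requests",
    "t": "trigger requests",
    "p": "verify permission requests",
}

# Assumed layout: "execd (k:v,...)/s GDI (k:v,...)/s event-acks: v/s".
GROUP_RE = re.compile(r"(?P<group>execd|GDI) \((?P<body>[^)]*)\)/s")
ACKS_RE = re.compile(r"event-acks: (?P<value>[\d.]+)/s")


def parse_optional(optional):
    """Expand the optional part of an MT line into per-second counters."""
    result = {}
    for match in GROUP_RE.finditer(optional):
        group = match.group("group")
        names = EXECD_KEYS if group == "execd" else GDI_KEYS
        counters = {}
        for item in match.group("body").split(","):
            key, value = item.split(":")
            key = key.strip()
            # Unknown letters are kept under their own key.
            counters[names.get(key, key)] = float(value)
        result[group] = counters
    acks = ACKS_RE.search(optional)
    if acks:
        result["event-acks"] = float(acks.group("value"))
    return result


optional = (
    "execd (l:0.00,j:0.00,c:0.00,p:0.00,a:0.00)/s "
    "GDI (a:0.00,g:0.00,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s "
    "event-acks: 0.00/s"
)
for group, counters in parse_optional(optional).items():
    print(group, counters)
```

Note that the letters are only unique within a group: "a", "c" and "p" mean different things for execd messages and for GDI requests, which is why the sketch keeps one table per group.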
3.2 Event Master Thread (EDT): (Under construction)
<RUNS> : Number of times the EDT was triggered
<OUT> : Number of messages sent per second to the event client
<IDLE> : Percentage waited for the next trigger
<WAIT> : Percentage waited for the global lock (other important locks
need to be determined).

<clients> : Number of connected clients
<mod> : Number of modified event clients per second
<ack> : Number of processed acknowledges per second
<blocked> : Number of blocked event clients
<busy> : Number of busy event clients
<events> : Number of new events to send out
<added> : Number of assigned events to the different event clients
<skipt> : Number of ignored events, no subscriptions
3.3 Timed Event Thread (TET):
<RUNS> : Number of times the TET was triggered (new events and
<OUT> : Number of messages sent per second to the event client
<IDLE> : Percentage waited for the next event
<WAIT> : Percentage waited for the global lock

<pending> : Number of pending timed events
<executed> : Number of executed events per second.