v0.17
- kclient: fix multiple mds mdsmap decoding
- kclient: fix mon subscription renewal
- crush: fix crush map creation with empty buckets (occurs on larger clusters)
- osdmap: fix encoding bug (crashes kclient); make kclient not crash
- msgr: simplified policy, failure model
- mon: less push, more pull
- mon: request routing
- mon cluster expansion
- osd: fix pg parsing, restarts on larger clusters

v0.18
- osd: basic ENOSPC handling
- big endian fixes (required protocol/disk format change)
- osd: improved object -> pg hash function; selectable
- crush: selectable hash function(s)
- mds restart bug fixes
- kclient: mds reconnect bug fixes
- fixed mds log trimming bug
- fixed mds cap vs snap deadlock
- filestore: faster flushing
- uclient,kclient: snapshot fixes
- mds: fix recursive accounting bug
- uclient: fixes for 32bit clients
- auth: 'none' security framework
- mon: "safely" bail on write errors (e.g. ENOSPC)
- mds: fix replay/reconnect race (caused (fast) client reconnect to fail)
- mds: misc journal replay, session fixes

v0.19
- ms_dispatch fairness
- kclient: bad fsid deadlock fix
- tids in fixed msg header (protocol change)
- feature bits during connection handshake
- remove erank from ceph_entity_addr
- disk format, compat/incompat bits
- journal format improvements
- kclient: cephx
- improved truncation
- cephx: lots of fixes
- mkcephfs: cephx support
- debian: packaging fixes

v0.20
- osd: new filestore, journaling infrastructure
- msgr: wire protocol improvements (lower per-message overhead)
- mds: reduced memory utilization (still more to do!)
- mds: many single mds fixes
- mds: many clustered mds fixes
- auth: many auth_x cleanups, fixes, improvements
- kclient: many bug fixes
- librados: some cleanup, c++ api now usable

v0.21

- qa: snap test.  maybe walk through 2.6.* kernel trees?
- osd: rebuild pg log
- osd: handle storage errors
- rebuild mds hierarchy
- kclient: retry alloc on ENOMEM when reading from connection?

filestore
- throttling
- flush objects onto primary during recovery
- audit queue_transaction calls for dependencies
- convert apply_transaction calls in handle_map to queue?
  - need an osdmap cache layer?

clients need to handle MClassAck, or anything that uses it will crash

bugs
- mds states
  - closing -> opening transition
- mds prepare_force_open_sessions, then import aborts.. session is still OPENING but no client_session is sent...
- rm -r failure (on kernel tree)
- dbench 1, restart mds (may take a few times), dbench will error out.

- kclient lockdep warning
[ 1615.328733] =======================================================
[ 1615.331050] [ INFO: possible circular locking dependency detected ]
[ 1615.331050] 2.6.34-rc2 #22
[ 1615.331050] -------------------------------------------------------
[ 1615.331050] fixdep/3263 is trying to acquire lock:
[ 1615.331050]  (&osdc->request_mutex){+.+...}, at: [<ffffffffa007b66c>] ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.331050] 
[ 1615.331050] but task is already holding lock:
[ 1615.331050]  (&mm->mmap_sem){++++++}, at: [<ffffffff810208c0>] do_page_fault+0x104/0x278
[ 1615.331050] 
[ 1615.331050] which lock already depends on the new lock.
[ 1615.331050] 
[ 1615.331050] 
[ 1615.331050] the existing dependency chain (in reverse order) is:
[ 1615.331050] 
[ 1615.331050] -> #3 (&mm->mmap_sem){++++++}:
[ 1615.331050]        [<ffffffff81059fd3>] validate_chain+0xa4d/0xd28
[ 1615.331050]        [<ffffffff8105aa7f>] __lock_acquire+0x7d1/0x84e
[ 1615.331050]        [<ffffffff8105ab84>] lock_acquire+0x88/0xa5
[ 1615.331050]        [<ffffffff81094daf>] might_fault+0x90/0xb3
[ 1615.331050]        [<ffffffff81390d1e>] memcpy_fromiovecend+0x54/0x8e
[ 1615.331050]        [<ffffffff813b6ea7>] ip_generic_getfrag+0x2a/0x8f
[ 1615.331050]        [<ffffffff813b5da2>] ip_append_data+0x5f6/0x971
[ 1615.331050]        [<ffffffff813d35bf>] udp_sendmsg+0x4e8/0x603
[ 1615.331050]        [<ffffffff813d91e3>] inet_sendmsg+0x46/0x53
[ 1615.331050]        [<ffffffff813878c1>] sock_sendmsg+0xd4/0xf5
[ 1615.331050]        [<ffffffff81387e0f>] sys_sendto+0xdf/0x107
[ 1615.331050]        [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
[ 1615.331050] 
[ 1615.331050] -> #2 (sk_lock-AF_INET){+.+.+.}:
[ 1615.331050]        [<ffffffff81059fd3>] validate_chain+0xa4d/0xd28
[ 1615.331050]        [<ffffffff8105aa7f>] __lock_acquire+0x7d1/0x84e
[ 1615.331050]        [<ffffffff8105ab84>] lock_acquire+0x88/0xa5
[ 1615.331050]        [<ffffffff8138a562>] lock_sock_nested+0xeb/0xff
[ 1615.331050]        [<ffffffff813da29d>] inet_stream_connect+0x2b/0x25c
[ 1615.331050]        [<ffffffffa006eea6>] try_write+0x26e/0x102c [ceph]
[ 1615.331050]        [<ffffffffa00705ba>] con_work+0x126/0x6bc [ceph]
[ 1615.529553]        [<ffffffff8104774e>] worker_thread+0x1e8/0x2fa
[ 1615.529553]        [<ffffffff8104a4aa>] kthread+0x7d/0x85
[ 1615.529553]        [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
[ 1615.529553] 
[ 1615.529553] -> #1 (&con->mutex){+.+.+.}:
[ 1615.529553]        [<ffffffff81059fd3>] validate_chain+0xa4d/0xd28
[ 1615.529553]        [<ffffffff8105aa7f>] __lock_acquire+0x7d1/0x84e
[ 1615.529553]        [<ffffffff8105ab84>] lock_acquire+0x88/0xa5
[ 1615.529553]        [<ffffffff81425727>] mutex_lock_nested+0x62/0x32c
[ 1615.529553]        [<ffffffffa0070cd3>] ceph_con_send+0xb3/0x244 [ceph]
[ 1615.529553]        [<ffffffffa007b591>] __send_request+0x108/0x196 [ceph]
[ 1615.529553]        [<ffffffffa007b794>] ceph_osdc_start_request+0x175/0x278 [ceph]
[ 1615.529553]        [<ffffffffa006029d>] ceph_writepages_start+0xb23/0x112a [ceph]
[ 1615.529553]        [<ffffffff810849aa>] do_writepages+0x1f/0x28
[ 1615.529553]        [<ffffffff810ca5e8>] writeback_single_inode+0xb6/0x1f5
[ 1615.529553]        [<ffffffff810cad9b>] writeback_inodes_wb+0x2d1/0x378
[ 1615.529553]        [<ffffffff810cafa8>] wb_writeback+0x166/0x1e0
[ 1615.529553]        [<ffffffff810cb154>] wb_do_writeback+0x83/0x1d3
[ 1615.529553]        [<ffffffff810cb2d2>] bdi_writeback_task+0x2e/0x9b
[ 1615.529553]        [<ffffffff8108fd73>] bdi_start_fn+0x71/0xd2
[ 1615.529553]        [<ffffffff8104a4aa>] kthread+0x7d/0x85
[ 1615.529553]        [<ffffffff81003794>] kernel_thread_helper+0x4/0x10
[ 1615.529553] 
[ 1615.529553] -> #0 (&osdc->request_mutex){+.+...}:
[ 1615.529553]        [<ffffffff81059cbf>] validate_chain+0x739/0xd28
[ 1615.529553]        [<ffffffff8105aa7f>] __lock_acquire+0x7d1/0x84e
[ 1615.529553]        [<ffffffff8105ab84>] lock_acquire+0x88/0xa5
[ 1615.529553]        [<ffffffff81425727>] mutex_lock_nested+0x62/0x32c
[ 1615.529553]        [<ffffffffa007b66c>] ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.529553]        [<ffffffffa007d8b6>] ceph_osdc_readpages+0x123/0x222 [ceph]
[ 1615.529553]        [<ffffffffa005f4b7>] ceph_readpages+0x193/0x456 [ceph]
[ 1615.529553]        [<ffffffff81085bd1>] __do_page_cache_readahead+0x17d/0x1f5
[ 1615.529553]        [<ffffffff81085c65>] ra_submit+0x1c/0x20
[ 1615.529553]        [<ffffffff81085fab>] ondemand_readahead+0x264/0x277
[ 1615.529553]        [<ffffffff81086092>] page_cache_sync_readahead+0x33/0x35
[ 1615.529553]        [<ffffffff8107f0d7>] filemap_fault+0x143/0x31f
[ 1615.529553]        [<ffffffff810913bf>] __do_fault+0x50/0x415
[ 1615.529553]        [<ffffffff810934d9>] handle_mm_fault+0x334/0x6a6
[ 1615.529553]        [<ffffffff810209af>] do_page_fault+0x1f3/0x278
[ 1615.529553]        [<ffffffff814281ff>] page_fault+0x1f/0x30
[ 1615.529553] 
[ 1615.529553] other info that might help us debug this:
[ 1615.529553] 
[ 1615.529553] 1 lock held by fixdep/3263:
[ 1615.529553]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810208c0>] do_page_fault+0x104/0x278
[ 1615.529553] 
[ 1615.529553] stack backtrace:
[ 1615.529553] Pid: 3263, comm: fixdep Not tainted 2.6.34-rc2 #22
[ 1615.529553] Call Trace:
[ 1615.529553]  [<ffffffff81058f49>] print_circular_bug+0xb3/0xc1
[ 1615.529553]  [<ffffffff81059cbf>] validate_chain+0x739/0xd28
[ 1615.529553]  [<ffffffff810099d7>] ? native_sched_clock+0x37/0x71
[ 1615.824177]  [<ffffffff8105aa7f>] __lock_acquire+0x7d1/0x84e
[ 1615.824177]  [<ffffffff8105ab84>] lock_acquire+0x88/0xa5
[ 1615.824177]  [<ffffffffa007b66c>] ? ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.824177]  [<ffffffffa007b66c>] ? ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.824177]  [<ffffffff81425727>] mutex_lock_nested+0x62/0x32c
[ 1615.824177]  [<ffffffffa007b66c>] ? ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.824177]  [<ffffffffa007b66c>] ceph_osdc_start_request+0x4d/0x278 [ceph]
[ 1615.824177]  [<ffffffffa007d8b6>] ceph_osdc_readpages+0x123/0x222 [ceph]
[ 1615.824177]  [<ffffffffa005f4b7>] ceph_readpages+0x193/0x456 [ceph]
[ 1615.824177]  [<ffffffff810099d7>] ? native_sched_clock+0x37/0x71
[ 1615.824177]  [<ffffffff81056580>] ? get_lock_stats+0x19/0x4c
[ 1615.824177]  [<ffffffff81085bd1>] __do_page_cache_readahead+0x17d/0x1f5
[ 1615.824177]  [<ffffffff81085ad0>] ? __do_page_cache_readahead+0x7c/0x1f5
[ 1615.824177]  [<ffffffff8107d848>] ? find_get_page+0xd9/0x12d
[ 1615.824177]  [<ffffffff81085c65>] ra_submit+0x1c/0x20
[ 1615.916887]  [<ffffffff81085fab>] ondemand_readahead+0x264/0x277
[ 1615.916887]  [<ffffffff81086092>] page_cache_sync_readahead+0x33/0x35
[ 1615.931403]  [<ffffffff8107f0d7>] filemap_fault+0x143/0x31f
[ 1615.931403]  [<ffffffff810913bf>] __do_fault+0x50/0x415
[ 1615.931403]  [<ffffffff8105aa99>] ? __lock_acquire+0x7eb/0x84e
[ 1615.946963]  [<ffffffff810934d9>] handle_mm_fault+0x334/0x6a6
[ 1615.946963]  [<ffffffff810209af>] do_page_fault+0x1f3/0x278
[ 1615.946963]  [<ffffffff814281ff>] page_fault+0x1f/0x30

- kclient: moonbeamer gets this with iozone -a...
[17608.696906] ------------[ cut here ]------------
[17608.701761] WARNING: at lib/kref.c:43 kref_get+0x23/0x2a()
[17608.707584] Hardware name: PDSMi
[17608.711056] Modules linked in: aes_x86_64 aes_generic ceph fan ac battery ehci_hcd container uhci_hcd thermal button processor
        
[17608.723268] Pid: 19594, comm: sync Not tainted 2.6.32-rc2 #12
[17608.729255] Call Trace:
[17608.731835]  [<ffffffff81252a6d>] ? kref_get+0x23/0x2a
[17608.737209]  [<ffffffff81044f9a>] warn_slowpath_common+0x77/0x8f
[17608.743457]  [<ffffffff81044fc1>] warn_slowpath_null+0xf/0x11
[17608.749395]  [<ffffffff81252a6d>] kref_get+0x23/0x2a
[17608.754634]  [<ffffffffa006e9be>] ceph_mdsc_sync+0xd7/0x305 [ceph]
[17608.761061]  [<ffffffff8145d8b6>] ? mutex_unlock+0x9/0xb
[17608.766618]  [<ffffffffa0078bf3>] ? ceph_osdc_sync+0xe7/0x173 [ceph]
[17608.773199]  [<ffffffffa004d906>] ceph_syncfs+0x4e/0xc8 [ceph]
[17608.779235]  [<ffffffff810f6084>] __sync_filesystem+0x5e/0x72
[17608.785195]  [<ffffffff810f6142>] sync_filesystems+0xaa/0x101
[17608.791122]  [<ffffffff810f61e0>] sys_sync+0x12/0x2e
[17608.796331]  [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[17608.802560] ---[ end trace 36481c4089b3d493 ]---
[17608.807461] BUG: unable to handle kernel paging request at 0000001200000635
[17608.811372] IP: [<ffffffff81252a2c>] kref_put+0x31/0x4f
[17608.811372] PGD 11c7c8067 PUD 0 
[17608.811372] Oops: 0002 [#1] PREEMPT SMP 
[17608.811372] last sysfs file: /sys/kernel/uevent_seqnum
[17608.811372] CPU 0 
[17608.811372] Modules linked in: aes_x86_64 aes_generic ceph fan ac battery ehci_hcd container uhci_hcd thermal button processor
        
[17608.811372] Pid: 19594, comm: sync Tainted: G        W  2.6.32-rc2 #12 PDSMi
[17608.811372] RIP: 0010:[<ffffffff81252a2c>]  [<ffffffff81252a2c>] kref_put+0x31/0x4f
[17608.811372] RSP: 0018:ffff8801137f5e28  EFLAGS: 00010206
[17608.811372] RAX: 0000001200000495 RBX: 0000001200000635 RCX: 0000000000000000
[17608.811372] RDX: 0000000000000000 RSI: 0000000000000040 RDI: 0000001200000635
[17608.811372] RBP: ffff8801137f5e38 R08: 0000000000000000 R09: ffff8801137f5e28
[17608.811372] R10: ffff8801137f5e38 R11: ffffffff8145c5ad R12: ffffffffa00675ae
[17608.811372] R13: ffff880119d02508 R14: ffff880119d02510 R15: 0000000000055fad
[17608.811372] FS:  00007fadcfcd96e0(0000) GS:ffff88002f000000(0000) knlGS:0000000000000000
[17608.811372] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17608.811372] CR2: 0000001200000635 CR3: 00000001136cd000 CR4: 00000000000006f0
[17608.811372] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17608.811372] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        
[17608.811372] Process sync (pid: 19594, threadinfo ffff8801137f4000, task ffff88011bf70480)
[17608.811372] Stack:
[17608.811372]  ffff880103726e18 ffff880103726c00 ffff8801137f5e58 ffffffffa00733f1
[17608.811372] <0> ffff880103726e18 ffffffffa00733a2 ffff8801137f5e78 ffffffff81252a3e
[17608.811372] <0> 0000000000000000 ffff880103726c00 ffff8801137f5ef8 ffffffffa006ea19
[17608.811372] Call Trace:
[17608.811372]  [<ffffffffa00733f1>] ceph_mdsc_release_request+0x4f/0x12a [ceph]
[17608.811372]  [<ffffffffa00733a2>] ? ceph_mdsc_release_request+0x0/0x12a [ceph]
[17608.811372]  [<ffffffff81252a3e>] kref_put+0x43/0x4f
[17608.811372]  [<ffffffffa006ea19>] ceph_mdsc_sync+0x132/0x305 [ceph]
[17608.811372]  [<ffffffff8145d8b6>] ? mutex_unlock+0x9/0xb
[17608.811372]  [<ffffffffa0078bf3>] ? ceph_osdc_sync+0xe7/0x173 [ceph]
[17608.811372]  [<ffffffffa004d906>] ceph_syncfs+0x4e/0xc8 [ceph]
[17608.811372]  [<ffffffff810f6084>] __sync_filesystem+0x5e/0x72
[17608.811372]  [<ffffffff810f6142>] sync_filesystems+0xaa/0x101
[17608.811372]  [<ffffffff810f61e0>] sys_sync+0x12/0x2e
        
[17608.811372]  [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[17608.811372] Code: 49 89 f4 4d 85 e4 be 40 00 00 00 53 48 89 fb 74 0e 49 81 fc 2d 09 0d 81 75 11 be 41 00 00 00 48 c7 c7 62 f8 5c 81 e8 86 25 df ff <f0> ff 0b 0f 94 c0 31 d2 84 c0 74 0b 48 89 df 41 ff d4 ba 01 00 
[17608.811372] RIP  [<ffffffff81252a2c>] kref_put+0x31/0x4f
[17608.811372]  RSP <ffff8801137f5e28>
[17608.811372] CR2: 0000001200000635
[17609.074355] ---[ end trace 36481c4089b3d494 ]---


?- bonnie++ -u root -d /mnt/ceph/ -s 0 -n 1
  Using uid:0, gid:0.
  Create files in sequential order...done.
  Stat files in sequential order...Expected 1024 files but only got 0
  Cleaning up test directory after error.

- osd pg split breaks if not all osds are up...

- mds recovery flag set on inode that didn't get recovered??
- mislinked directory?  (cpusr.sh, mv /c/* /c/t, more cpusr, ls /c/t)


filestore performance notes
- write ordering options
  - fs only (no journal)
  - fs, journal
  - fs + journal in parallel
  - journal sync, then fs
- and the issues
  - latency
  - effect of a btrfs hang
  - unexpected error handling (EIO, ENOSPC)
  - impact on ack, sync ordering semantics.
  - how to throttle request stream to disk io rate
  - rmw vs delayed mode

- if journal is on fs, then
  - throttling isn't an issue, but
  - fs stalls are also journal stalls

- fs only
  - latency: commits are bad.
  - hang: bad.
  - errors: could be handled, aren't
  - acks: supported
  - throttle: fs does it
  - rmw: pg toggles mode
- fs, journal
  - latency: good, unless fs hangs
  - hang: bad.  latency spikes.  overall throughput drops.
  - errors: could probably be handled, aren't.
  - acks: supported
  - throttle: btrfs does it (by hanging), which leads to a (necessary) latency spike
  - rmw: pg toggles mode
- fs | journal
  - latency: good
  - hang: no latency spike.  fs throughput may drop, to the extent btrfs throughput necessarily will.
  - errors: not detected until later.  could journal addendum record.  or die (like we do now)
  - acks: could be flexible.. maybe supported, maybe not.  will need some extra locking smarts?
  - throttle: ??
  - rmw: rmw must block on prior fs writes.
- journal, fs (writeahead)
  - latency: good (commit only, no acks)
  - hang: same as |
  - errors: same as |
  - acks: never.
  - throttle: ??
  - rmw: rmw must block on prior fs writes.
  * JournalingObjectStore interface needs work?  (ordering sketch below)
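
  A rough sketch (not the real ObjectStore/JournalingObjectStore interface) of
  when the ack ("applied, readable") and commit ("durable") callbacks can fire
  under the orderings above; the class names and synchronous stubs are invented:

  // illustration only: fs-only, parallel (fs | journal), writeahead journal modes
  #include <functional>
  #include <iostream>
  #include <string>

  using Callback = std::function<void()>;
  struct Transaction { std::string desc; };

  struct FakeFS      { void apply(const Transaction& t)  { std::cout << "fs apply: " << t.desc << "\n"; }
                       void sync()                        { std::cout << "fs sync\n"; } };
  struct FakeJournal { void commit(const Transaction& t) { std::cout << "journal commit: " << t.desc << "\n"; } };

  enum class Mode { FS_ONLY, PARALLEL, WRITEAHEAD };

  struct Store {
    FakeFS fs;
    FakeJournal journal;
    Mode mode;

    void queue_transaction(const Transaction& t, Callback on_ack, Callback on_commit) {
      switch (mode) {
      case Mode::FS_ONLY:        // no journal: commit has to wait for an fs sync
        fs.apply(t);  on_ack();
        fs.sync();    on_commit();
        break;
      case Mode::PARALLEL:       // fs | journal: both proceed; durable once both finish
        journal.commit(t);
        fs.apply(t);
        on_ack();  on_commit();
        break;
      case Mode::WRITEAHEAD:     // journal, then fs: commit-only semantics, no ack
        journal.commit(t);
        on_commit();
        fs.apply(t);             // can lag far behind the commit
        break;
      }
    }
  };

  int main() {
    Store s{ {}, {}, Mode::WRITEAHEAD };
    s.queue_transaction({"write pg 1.ef obj foo"},
                        []{ std::cout << "ack\n"; },
                        []{ std::cout << "commit\n"; });
  }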



greg
- csync data import/export tool?
- uclient: readdir from cache
- mds: basic auth checks

later
- document on-wire protocol
- client reconnect after long eviction; and slow delayed reconnect
- repair
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
- rename over old files should flush data, or revert back to old contents
- clean up SimpleMessenger interface and usage a little. Can probably unify
	some/all of shutdown, wait, destroy. Possibly move destroy into put()
	and make get/put usage more consistent/stringently mandated (rough sketch below).
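
  Hypothetical sketch of folding destroy into put() as suggested above; the
  class/method names follow the note, the internals are invented:

  #include <atomic>
  #include <iostream>

  class SimpleMessenger {
    std::atomic<int> nref{1};                   // creator holds the initial ref
  public:
    SimpleMessenger* get() { ++nref; return this; }
    void put() {
      if (--nref == 0) {                        // last ref dropped:
        shutdown();                             //   stop accepter/dispatch threads
        wait();                                 //   join them
        delete this;                            //   destroy lives here, not at call sites
      }
    }
    void shutdown() { std::cout << "shutdown\n"; }
    void wait()     { std::cout << "wait\n"; }
  };

  int main() {
    SimpleMessenger* m = new SimpleMessenger;   // nref = 1
    m->get();                                   // nref = 2
    m->put();                                   // nref = 1
    m->put();                                   // nref = 0 -> shutdown, wait, delete
  }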

rados
- snapc interface
  - mon: allocate snapid, adjust pool_t seq
  - librados: snapc manipulation, set pool handle snapc
- make rest interface superset of s3?
  - create/delete snapshots
  - list, access snapped version
- perl swig wrapper
- 'rados call foo.bar'?
- merge pgs
- autosize pg_pools?

repair
- namespace reconstruction tool
- repair pg (rebuild log)  (online or offline?  ./cosd --repair_pg 1.ef?)
- repair file ioctl?
- are we concerned about
  - scrubbing
  - reconstruction after loss of subset of cdirs
  - reconstruction after loss of md log
- data object 
  - path backpointers?
  - parent dir pointer?
- mds scrubbing

kclient
- mdsc: preallocate reply(ies?)
- mdsc: mempool for cap writeback?
- osdc: combine request, request+reply messages into single pool.
- ENOMEM
  - message pools
  - sockets?  (this can actually generate a lockdep warning :/)
- fs-portable file layout virtual xattr (see Andreas' -fsdevel thread)
- statlite
- add cap to release if we get fouled up in fill_inode et al?
- fix up ESTALE handling
- don't retry on ENOMEM on non-nofail requests in kick_requests
- make cap import/export more efficient?
- flock, fcntl locks
- ACLs
  - init security xattrs
- should we try to ref CAP_PIN on special inodes that are open?  
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- inotify for updates from other clients?

vfs issues
- a getattr mask would be really nice

filestore
- make min sync interval self-tuning (ala xfs, ext3?)
- get file csum?
- clonerange writeahead journal data into objects?

btrfs
- clone compressed inline extents
- ioctl to pull out data csum?

osd
- separate reads/writes into separate op queues?
- gracefully handle ENOSPC
- gracefully handle EIO?
- what to do with lost objects.. continue peering?
- segregate backlog from log ondisk?
- preserve pg logs on disk for longer period
- make scrub interruptible
- optionally separate osd interfaces (ips) for clients and osds (replication, peering, etc.)
- pg repair
- pg split should be a work queue
- optimize remove wrt recovery pushes?

uclient
- fix client_lock vs other mutex with C_SafeCond
- clean up check_caps to more closely mirror kclient logic
- readdir from cache
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- hadoop: clean up assert usage

mds
- special case commit of stray dir to avoid having to commit, re-commit strays?
  - once we commit one stray, we have to re-commit later to remove it, which means we commit other new ones.
- put inode dirty fields into dirty_bits_t to reduce per-inode memory footprint
- don't sync log on every clientreplay request?
- pass issued, wanted into eval(lock) when eval() already has it?  (and otherwise optimize eval paths..)
- add an up:shadow mode?
  - tail the mds log as it is written
  - periodically check head so that we trim, too
- handle slow client reconnect (i.e. after mds has gone active)
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- add FILE_CAP_EXTEND capability bit
- dir fragment
  - maybe just take dftlock for now, to keep it simple.
- dir merge
- snap
  - hard link backpointers
    - anchor source dir
    - build snaprealm for any hardlinked file
    - include snaps for all (primary+remote) parents
  - how do we properly clean up inodes when doing a snap purge?
    - when they are mid-recover?  see 136470cf7ca876febf68a2b0610fa3bb77ad3532
  - what if a recovery is queued, or in progress, and the inode is then cowed?  can that happen?  
  - proper handling of cache expire messages during rejoin phase?
    -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing

- clustered
  - on replay, put dirty scatter replicas on lists so that they get flushed?  or does rejoin handle that?
  - linkage vs cdentry replicas and remote rename....
  - rename: importing inode... also journal imported client map?

mon
- don't allow lpg_num expansion and osd addition at the same time?
- how to shrink cluster?
- how to tell osd to cleanly shut down
- paxos needs to clean up old states.
  - default: simple max of (state count, min age), so that we have at least N hours of history, say?  (trim sketch below)
  - osd map: trim only old maps < oldest "in" osd up_from
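
  One possible reading of the proposed default ("keep at least N states and at
  least M hours of history"); illustration only, not actual monitor code:

  #include <algorithm>
  #include <cstdint>
  #include <ctime>
  #include <iostream>
  #include <map>

  using version_t = uint64_t;

  // states: paxos version -> time it was committed
  version_t first_to_keep(const std::map<version_t, time_t>& states,
                          version_t last_committed,
                          version_t min_count, time_t min_age, time_t now) {
    // count floor: never trim into the newest min_count versions
    version_t by_count = last_committed > min_count ? last_committed - min_count + 1 : 1;
    // age floor: oldest version still younger than min_age
    version_t by_age = by_count;
    for (const auto& [v, t] : states)
      if (now - t <= min_age) { by_age = v; break; }
    // "max of (state count, min age)" -> keep the union of both windows,
    // i.e. trim everything older than the smaller of the two floors
    return std::min(by_count, by_age);
  }

  int main() {
    std::map<version_t, time_t> s;
    time_t now = time(nullptr);
    for (version_t v = 1; v <= 100; v++) s[v] = now - (100 - v) * 3600;  // one state/hour
    std::cout << first_to_keep(s, 100, 20, 48 * 3600, now) << "\n";      // keeps v52..v100
  }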

osdmon
- monitor needs to monitor some osds...

pgmon
- check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

crush
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set<int> >)

simplemessenger
- close idle connections?

objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

cas
- chunking: see TTTD (sketch below) in
   ESHGHI, K.
   A framework for analyzing and improving content-based chunking algorithms.
   Tech. Rep. HPL-2005-30(R.1), Hewlett Packard Laboratories, Palo Alto, 2005. 
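
  Very rough sketch of the TTTD idea from that paper (two size thresholds, a
  main and a backup divisor); the "rolling" hash and the parameters here are
  placeholders, not the paper's:

  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <string>
  #include <vector>

  // toy hash over the chunk prefix -- a real implementation would use a
  // sliding-window fingerprint (e.g. Rabin) as in the paper
  static uint32_t roll(uint32_t h, unsigned char c) { return h * 131 + c; }

  std::vector<size_t> tttd_chunks(const std::string& data,
                                  size_t tmin = 2048, size_t tmax = 8192,
                                  uint32_t d = 512, uint32_t dprime = 256) {
    std::vector<size_t> cuts;                  // chunk end offsets
    size_t start = 0;
    while (start < data.size()) {
      size_t end = std::min(start + tmax, data.size());
      size_t cut = end;                        // default: forced break at tmax (or EOF)
      size_t backup = 0;                       // backup breakpoint (0 = none yet)
      uint32_t h = 0;
      bool found = false;
      for (size_t i = start; i < end; i++) {
        h = roll(h, (unsigned char)data[i]);
        if (i + 1 - start < tmin) continue;              // never emit tiny chunks
        if (h % dprime == dprime - 1) backup = i + 1;    // weaker (backup) divisor hit
        if (h % d == d - 1) { cut = i + 1; found = true; break; }  // main divisor hit
      }
      if (!found && end == start + tmax && backup)
        cut = backup;                          // hit tmax: fall back to backup breakpoint
      cuts.push_back(cut);
      start = cut;
    }
    return cuts;
  }

  int main() {
    std::string blob(100000, 'x');
    for (size_t i = 0; i < blob.size(); i += 7) blob[i] = char('a' + i % 23);
    for (size_t c : tttd_chunks(blob)) std::cout << c << "\n";
  }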

radosgw
 - handle location-related requests gracefully
 - logging control (?)
 - parse date/time better
 - upload using POST
 - torrent
 - handle PUT/GET requestPayment gracefully



-- for nicer kclient debug output (everything but messenger, but including msg in/out)
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/ceph/messenger.c -p' > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- --- /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- === /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control