[ZODB-Dev] Unpickler.noload, zc.zodbdgc and multi-database reference bug (IndexError)?

Discussion:

jason.madden

2014-01-30 14:58:47 UTC

Hello ZODB dev,

I was recently trying to GC a large multi-database setup for the first time using zc.zodbdgc. The process wouldn't complete (or really even get started) because of an IndexError being thrown from `zc.zodbdgc.getrefs` (__init__.py line 287). As I traced through it, it began to look like the combination of `cPickle.Unpickler.noload` and multi-database persistent ids (which in ZODB are list objects) fails, generating an empty list instead of the expected [ref type, args] list documented in `ZODB.serialize`. This makes it impossible to correctly GC a multi-database.

I was curious if anyone else had seen this, or maybe I'm just doing something wrong? We solved our problem by using `load` instead of `noload`, but I wondered if there might be a better way?

Details:

I'm working under Python 2.7.6 and 2.7.3 with ZODB 4.0.0, zc.zodbdgc 0.6.1 and eventually zodbpickle 0.5.2. Most of my results were repeated on both Mac OS X and Linux.

p = 'cBTrees.OOBTree\nOOBTree\nq\x01.((((X\x0c\x00\x00\x00Users_1_Prodq\x02]q\x03(U\x01m(U\x0cUsers_1_Prodq\x04U\x08\x00\x00\x00\x00\x00\x00\x00\x01q\x05czope.site.folder\nFolder\nq\x06tq\x07eQX\x0c\x00\x00\x00Users_2_Prodq\x08]q\t(U\x01m(U\x0cUsers_2_Prodq\nU\x08\x00\x00\x00\x00\x00\x00\x00\x01q\x0bh\x06tq\x0ceQX\x0b\x00\x00\x00dataserver2q\r(U\x08\x00\x00\x00\x00\x00\x00\x00\x10q\x0eh\x06tQttttq\x0f.'
import cPickle
import cStringIO
refs = []
u = cPickle.Unpickler(cStringIO.StringIO(p))
u.persistent_load = refs
u.noload()
u.noload()
refs

[[], [], ('\x00\x00\x00\x00\x00\x00\x00\x10', None)]

refs = []
u = cPickle.Unpickler(cStringIO.StringIO(p))
u.persistent_load = refs
u.noload()
u.load()
refs

[['m', ('Users_1_Prod', '\x00\x00\x00\x00\x00\x00\x00\x01', <class 'zope.site.folder.Folder'>)],
['m', ('Users_2_Prod', '\x00\x00\x00\x00\x00\x00\x00\x01', <class 'zope.site.folder.Folder'>)],
('\x00\x00\x00\x00\x00\x00\x00\x10', <class 'zope.site.folder.Folder'>)]

The results are the same using zodbpickle or using an actual callback function instead of the append-directly-to-list shortcut.

If we fix the IndexError by checking the size of the list first, we miss all the cross-db references, meaning that a GC is going to be too aggressive. But using `load` is slower and requires access to all of the classes referenced. If anyone has run into this before or has other suggestions, I'd appreciate hearing them.

Thanks,
Jason

Jim Fulton

2014-01-30 16:14:55 UTC

Permalink

Post by jason.madden
Hello ZODB dev,
I was recently trying to GC a large multi-database setup for the first time using zc.zodbdgc. The process wouldn't complete (or really even get started) because of an IndexError being thrown from `zc.zodbdgc.getrefs` (__init__.py line 287). As I traced through it, it began to look like the combination of `cPickle.Unpickler.noload` and multi-database persistent ids (which in ZODB are list objects) fails, generating an empty list instead of the expected [ref type, args] list documented in `ZODB.serialize`. This makes it impossible to correctly GC a multi-database.
I was curious if anyone else had seen this

I haven't. I'm the author of zodbdgc and I use it regularly, including on
large (for some definition) databases.

Post by jason.madden
, or maybe I'm just doing something wrong? We solved our problem by using `load` instead of `noload`, but I wondered if there might be a better way?
I'm working under Python 2.7.6 and 2.7.3 with ZODB 4.0.0, zc.zodbdgc 0.6.1 and eventually zodbpickle 0.5.2. Most of my results were repeated on both Mac OS X and Linux.

Why are you using zodbpickle? Perhaps that is behaving differently
from cPickle in some
fashion?

Post by jason.madden

[[], [], ('\x00\x00\x00\x00\x00\x00\x00\x10', None)]

refs = []
u = cPickle.Unpickler(cStringIO.StringIO(p))
u.persistent_load = refs
u.noload()
u.load()
refs

[['m', ('Users_1_Prod', '\x00\x00\x00\x00\x00\x00\x00\x01', <class 'zope.site.folder.Folder'>)],
['m', ('Users_2_Prod', '\x00\x00\x00\x00\x00\x00\x00\x01', <class 'zope.site.folder.Folder'>)],
('\x00\x00\x00\x00\x00\x00\x00\x10', <class 'zope.site.folder.Folder'>)]
The results are the same using zodbpickle or using an actual callback function instead of the append-directly-to-list shortcut.
If we fix the IndexError by checking the size of the list first, we miss all the cross-db references, meaning that a GC is going to be too aggressive. But using `load` is slower and requires access to all of the classes referenced. If anyone has run into this before or has other suggestions, I'd appreciate hearing them.

I'd try using ZODB 3.10. I suspect a ZODB 4 incompatibility of some sort.

Unfortunately, I don't have time to dig into this now.

This weekend, I'll at least see if I can make zodbdgc tests pass with ZODB 4.
Perhaps that will shed light.

Jim

--
Jim Fulton
http://www.linkedin.com/in/jimfulton

jason.madden

2014-01-30 17:12:35 UTC

Permalink

Post by Jim Fulton

Post by jason.madden
I was curious if anyone else had seen this

I haven't. I'm the author of zodbdgc and I use it regularly, including on
large (for some definition) databases.

Hi Jim,

I really appreciate your input on this. Thanks for taking the time to respond.

Post by Jim Fulton

Post by jason.madden
I'm working under Python 2.7.6 and 2.7.3 with ZODB 4.0.0, zc.zodbdgc 0.6.1 and eventually zodbpickle 0.5.2. Most of my results were repeated on both Mac OS X and Linux.

Why are you using zodbpickle? Perhaps that is behaving differently
from cPickle in some fashion?

I didn't start out with zodbpickle, I just tried it as an alternative to cPickle after cPickle was failing. The hope was that they would actually behave differently in some fashion, but the observed behaviour is identical.

Post by Jim Fulton
I'd try using ZODB 3.10. I suspect a ZODB 4 incompatibility of some sort.
Unfortunately, I don't have time to dig into this now.
This weekend, I'll at least see if I can make zodbdgc tests pass with ZODB 4.
Perhaps that will shed light.

That's a very good point. I'm not sure what it will take to repeat the test with ZODB 3.10, but I can try (although, since the pickles are already written at this point, I probably wouldn't expect to see a difference? Nothing mutates the actual pickle data loaded from a storage, excluding transformers like zc.zlibstorage).

Along those lines, I tried just downloading zc.zodbdgc and running its tests against ZODB 3.10 (under both Python 2.7.6/OS X and 2.7.3/Linux), and unfortunately, they immediately fail with the exact same IndexError I see when trying to GC our database:

$ mkvirtualenv ZODB3
$ workon ZODB3
$ pip install ZODB3==3.10.5
....
$ wget https://pypi.python.org/packages/source/z/zc.zodbdgc/zc.zodbdgc-0.6.1.tar.gz
...
$ tar -xf zc.zodbdgc-0.6.1.tar.gz && cd zc.zodbdgc-0.6.1
$ python ./setup.py test
...
======================================================================
FAIL: zc.zodbdgc-0.6.1/src/zc/zodbdgc/README.test
Doctest: README.test
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 2201, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for README.test
File "tmp/zc.zodbdgc-0.6.1/src/zc/zodbdgc/README.test", line 0

----------------------------------------------------------------------
File "zc.zodbdgc-0.6.1/src/zc/zodbdgc/README.test", line 194, in README.test
Failed example:
bad = zc.zodbdgc.gc('config', days=2)
Exception raised:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1289, in __run
compileflags, 1) in test.globs
File "<doctest README.test[86]>", line 1, in <module>
bad = zc.zodbdgc.gc('config', days=2)
File "zc.zodbdgc-0.6.1/src/zc/zodbdgc/__init__.py", line 105, in gc
return gc_(close, conf, days, ignore, conf2, fs, untransform, ptid)
File "zc.zodbdgc-0.6.1/src/zc/zodbdgc/__init__.py", line 209, in gc_
for ref in getrefs(data, name, ignore):
File "zc.zodbdgc-0.6.1/src/zc/zodbdgc/__init__.py", line 287, in getrefs
ref = ref[1]
IndexError: list index out of range

The tests failed the same way under ZODB 4.0.0 and ZODB3 3.9.0 (the minimum requirement listed in setup.py).

However, when I tried Python 2.6.8, the tests pass with ZODB 3.10.5 (they don't pass with ZODB 4, but it looks to be unrelated). So it seems that the behaviour of `noload` might have changed between 2.6.x and 2.7.x? Is it possible you're running on Python < 2.7?

Thanks,

Jason

jason.madden

2014-01-30 17:40:53 UTC

Permalink

Post by jason.madden
So it seems that the behaviour of `noload` might have changed between 2.6.x and 2.7.x?

Apologies for replying to myself, but I think I found the root cause.

After some further investigation, I found issue 1101399 (http://bugs.python.org/issue1101399), complaining that noload is broken for subclasses of dict. The fix for this issue was applied to the cPython trunk in October of 2009 without any corresponding tests (http://hg.python.org/releasing/2.7.6/rev/d0f005e6fadd). In releases with this fix (if I'm reading the code correctly), the pickle opcodes append and appends become no-ops, clearing the pickle stack when they are encountered.

A multi-database reference is pickled as a list, so it uses the appends opcode:

23: ] EMPTY_LIST
24: q BINPUT 3
26: ( MARK
27: U SHORT_BINSTRING 'm'
30: ( MARK
31: U SHORT_BINSTRING 'Users_1_Prod'
45: q BINPUT 4
47: U SHORT_BINSTRING '\x00\x00\x00\x00\x00\x00\x00\x01'
57: q BINPUT 5
59: c GLOBAL 'zope.site.folder Folder'
84: q BINPUT 6
86: t TUPLE (MARK at 30)
87: q BINPUT 7
89: e APPENDS (MARK at 26)
90: Q BINPERSID

This fix means that multi-database references are always going to be returned as an empty list under noload (again, if I'm reading the code correctly). This means that multi-references and noload don't work under Python 2.7.x or 3.x with zodbpickle and so consequently neither does an unmodified zc.zodbdgc.

I don't know what the best way forward is. Our solution to use `load` instead seems to work for us, but may not work for everyone. Maybe zodbpickle could revert the fix in its branch and zc.zodbgc could depend on that? I'm happy to help test other ideas.

Thanks again,

Jason

Jim Fulton

2014-01-30 18:19:32 UTC

Permalink

Post by jason.madden

Post by jason.madden
So it seems that the behaviour of `noload` might have changed between 2.6.x and 2.7.x?

Probably because the original code had no tests (because we weren't
doing that then) and
no documentation either (my bad) because this was added for the
specific use case of
efficiently scraping out references in ZODB.

....

Post by jason.madden
This fix means that multi-database references are always going to be returned as an empty list under noload (again, if I'm reading the code correctly). This means that multi-references and noload don't work under Python 2.7.x or 3.x with zodbpickle and so consequently neither does an unmodified zc.zodbdgc.

Sigh. We're still using Python 2.6 for out database servers. :)

Post by jason.madden
I don't know what the best way forward is. Our solution to use `load` instead seems to work for us, but may not work for everyone.

It will probably work, but be slower, which isn't a huge deal since
zc.zodbdgc runs out of
process. We run it on ZRS secondaries, which are otherwise idle.

Post by jason.madden
Maybe zodbpickle could revert the fix in its branch and zc.zodbgc could depend on that? I'm happy to help test other ideas.

That may be the best way forward.

I can't speak with much confidence until I've had a chance to
wade in and refresh my memory on this stuff.

Something I wish I'd done differently in ZODB and contemplated changing on a
number of occasions is the handling of references. I wish I'd stored
them outside the
pickle so they could be analyzed without unpicking (or at least
without unpickling the
application data).

Jim

--
Jim Fulton
http://www.linkedin.com/in/jimfulton