Discussion:
[ZODB-Dev] Does ZODB pipeline load requests?
Dylan Jay
2014-02-19 01:59:33 UTC
Permalink
Hi,

I'm seeing a ZCatalog reindex of a large number of objects take a long time while using only 10% CPU. I'm not sure yet whether this is because the objects are large and the network is saturated, or because the ZEO file reads aren't fast enough. However, looking at the protocol, I didn't see a way for code such as the ZCatalog to give ZEO a hint about what it wants to load next, so the time is taken by network delays rather than by either ZEO or the app. Is that the case?
I'm guessing if it is, it's a fundamental design problem that can't be fixed :(


Dylan Jay

---
www.pretagov.com - Secure SaaS for Government hosted locally.
P: +61-2-9955-2830 +44-87-0392-7071 | linkedin.com/in/djay75
Jim Fulton
2014-02-19 11:44:00 UTC
Permalink
Post by Dylan Jay
Hi,
I'm seeing a ZCatalog reindex of a large number of objects take a long time while using only 10% CPU. I'm not sure yet whether this is because the objects are large and the network is saturated, or because the ZEO file reads aren't fast enough.
How heavily loaded is your storage server, especially %CPU of the
server process?

Also, are the ZODB object or client caches big enough for the job?
Post by Dylan Jay
However, looking at the protocol, I didn't see a way for code such as the ZCatalog to give ZEO a hint about what it wants to load next, so the time is taken by network delays rather than by either ZEO or the app. Is that the case?
It is the case that a ZEO client does one read at a time and that
there's no easy way to pre-load objects.
Post by Dylan Jay
I'm guessing if it is, it's a fundamental design problem that can't be fixed :(
I don't think there's a *fundamental* problem. There are three
issues. The hardest to solve isn't at the storage level. I'll mention
the 2 easiest problems first:

1. The ZEO client implementation only allows one outstanding request at a time,
even on a client with multiple threads. This is merely a clumsy
implementation.

The protocol easily allows for multiple outstanding reads!

2. The storage API doesn't provide a way to read multiple objects at once, or to
otherwise hint that additional objects will be loaded.

Both of these are fairly straightforward to fix. It's just a matter of time. :)

3. You have to be able to predict what data are going to be needed.

This IMO is rather hard, at least at a general level. It's what's left
me somewhat under-motivated to address the first 2 problems.

We really should address problems 1 and 2 to make it possible
for people to experiment with approaches to problem 3.

Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton
Dylan Jay
2014-02-19 14:57:53 UTC
Permalink
Post by Jim Fulton
Post by Dylan Jay
Hi,
I'm seeing a ZCatalog reindex of a large number of objects take a long time while using only 10% CPU. I'm not sure yet whether this is because the objects are large and the network is saturated, or because the ZEO file reads aren't fast enough.
How heavily loaded is your storage server, especially %CPU of the
server process?
no not heavily loaded.
Post by Jim Fulton
Also, are the ZODB object or client caches big enough for the job?
I'm not sure the caches would ever be big enough since it's iterating over 1.7M objects.
Post by Jim Fulton
Post by Dylan Jay
However, looking at the protocol, I didn't see a way for code such as the ZCatalog to give ZEO a hint about what it wants to load next, so the time is taken by network delays rather than by either ZEO or the app. Is that the case?
It is the case that a ZEO client does one read at a time and that
there's no easy way to pre-load objects.
Post by Dylan Jay
I'm guessing if it is, it's a fundamental design problem that can't be fixed :(
I don't think there's a *fundamental* problem. There are three
issues. The hardest to solve isn't at the storage level. I'll mention
the 2 easiest problems first:
1. The ZEO client implementation only allows one outstanding request at a time,
even on a client with multiple threads. This is merely a clumsy
implementation.
The protocol easily allows for multiple outstanding reads!
2. The storage API doesn't provide a way to read multiple objects at once, or to
otherwise hint that additional objects will be loaded.
Both of these are fairly straightforward to fix. It's just a matter of time. :)
3. You have to be able to predict what data are going to be needed.
This IMO is rather hard, at least at a general level. It's what's left
me somewhat under-motivated to address the first 2 problems.
We really should address problems 1 and 2 to make it possible
for people to experiment with approaches to problem 3.
Yeah, I figured it might be the case that it's hard to predict. In this case it's catalog indexing, so I was wondering if something could be done with __iter__ on a BTree? It seems a reasonably good guess that you could start preloading more of those objects once the first few are loaded.
Post by Jim Fulton
Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton
Jim Fulton
2014-02-19 15:27:12 UTC
Permalink
...
Post by Dylan Jay
Yeah, I figured it might be the case that it's hard to predict. In this case it's catalog indexing, so I was wondering if something could be done with __iter__ on a BTree? It seems a reasonably good guess that you could start preloading more of those objects once the first few are loaded.
Iterators certainly seem like a logical place to start.
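As a rough sketch (the prefetch callable below stands in for a hypothetical "these will be loaded soon" hint, which the current API doesn't offer), an iterator wrapper could consume the underlying iterator a window at a time and hint each window before yielding it:

```python
import itertools

def hinted_iter(source, hint, window=50):
    """Yield items from `source`, hinting each upcoming window via `hint`.

    `hint` stands in for a hypothetical "warm the cache for these" call.
    """
    source = iter(source)
    while True:
        batch = list(itertools.islice(source, window))
        if not batch:
            return
        hint(batch)          # warm the cache for the whole window...
        for item in batch:   # ...then let the consumer work through it
            yield item
```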

Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton
Dylan Jay
2014-02-20 00:40:15 UTC
Permalink
Post by Jim Fulton
...
Post by Dylan Jay
Yeah, I figured it might be the case that it's hard to predict. In this case it's catalog indexing, so I was wondering if something could be done with __iter__ on a BTree? It seems a reasonably good guess that you could start preloading more of those objects once the first few are loaded.
Iterators certainly seem like a logical place to start.
As an example, I originally was doing a TTW Zope reindex of a single index.
Due to conflict problems I used a modified version of this: https://github.com/plone/Products.PloneOrg/blob/master/scripts/catalog_rebuild.py (I'd love to integrate something similar into ZCatalog sometime).
Both use iterators, I believe.
I think there could even be an explicit API where you pass in an iterator and a max buffer length, and you get back another iterator. Objects would then load asynchronously, trying to keep ahead of the iterator's consumption.
e.g.

for obj in async_load(myitr, 50):
    dox(obj)

I don't know how that would help with a loop like this, however:

for obj in async_load(myitr, 50):
    dox(obj.getMainObject())
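To be clearer about what I mean by async_load, here's a rough sketch using a background thread and a bounded queue. It's illustrative only: the load callable stands in for the real object load, and in real ZODB a background thread couldn't hand live persistent objects across threads anyway, it could only warm the cache:

```python
import queue
import threading

_END = object()  # sentinel marking the end of the source iterator

def async_load(source, maxlen, load=lambda x: x):
    """Buffered iterator: a background thread applies `load` to items
    from `source`, staying at most `maxlen` items ahead of the consumer.

    `load` is a stand-in for the real object load (identity by default).
    """
    buf = queue.Queue(maxsize=maxlen)

    def fill():
        try:
            for item in source:
                buf.put(load(item))  # blocks once the buffer is full
        finally:
            buf.put(_END)

    threading.Thread(target=fill, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item
```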
Post by Jim Fulton
Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton
Jim Fulton
2014-02-20 13:00:13 UTC
Permalink
Post by Dylan Jay
Post by Jim Fulton
Iterators certainly seem like a logical place to start.
As an example, I originally was doing a TTW Zope reindex of a single index.
Due to conflict problems I used a modified version of this: https://github.com/plone/Products.PloneOrg/blob/master/scripts/catalog_rebuild.py (I'd love to integrate something similar into ZCatalog sometime).
Both use iterators, I believe.
I think there could even be an explicit API where you pass in an iterator and a max buffer length, and you get back another iterator. Objects would then load asynchronously, trying to keep ahead of the iterator's consumption.
e.g.
for obj in async_load(myitr, 50):
    dox(obj)
I like the idea of a wrapper, but I think a) you're pushing the abstraction
too far, and b) this doesn't have to be a ZODB API, at least not initially.

In any case, if the lower-level API exists, it would be straightforward
to implement one like above.
Post by Dylan Jay
I don't know how that would help with a loop like this, however:
for obj in async_load(myitr, 50):
    dox(obj.getMainObject())
Well, this would simply be another custom iterator wrapper.

for ob in main_iterator(myiter):
    ...

Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton