In our graph there are many vertices with more than 100k outgoing edges. I would like to know what the common approaches are for handling the range of situations this creates.
Let's say we have a group_1 defined in our graph, and group_1 has 100k members. We have a few traversals which start from a member_x vertex and compute some stuff. These traversals are quite fast, each finishing within ~2s.
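For context, each per-member traversal has roughly the shape below; the steps and property names are placeholders, since the real computation is more involved:

// Hypothetical per-member computation: memberId, "relates_to" and
// "weight" stand in for our real steps and properties.
Number stuff = g.V(memberId)
        .out("relates_to")
        .values("weight")
        .sum()
        .next();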
But requirements have changed: we now need to aggregate the results of all the individual small traversals into a single number, covering every member of group_1.
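Expressed naively as one traversal, the aggregate we need is something like the following, where "stuff" is a placeholder property standing in for the per-member result:

// Hypothetical one-shot aggregate over all of group_1's members.
Number total = g.V().has("group", y)
        .out("member_of")
        .values("stuff")
        .sum()
        .next();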
At first our approach was to create traversals which emit bundles of member_x vertices using skip and limit, and then sum our stuff with parallel processing at the application level. There are a few problems with this approach, however:
- g.V().has('group',y).out('member_of').skip(0).limit(10): according to the documentation, this traversal can return different results each time, so building bundles this way would simply be incorrect.
- g.V().has('group',y).out('member_of').skip(100_000).limit(10) takes too long because, as we found out, the database still has to visit all 100k vertices up to the offset.
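As far as we understand, paging can be made deterministic by imposing an order before skip/limit, but that only addresses the first problem; a deep skip still does linear work up to the offset:

// Deterministic paging sketch: ordering by id gives stable pages,
// but skip(100_000) must still walk everything before the offset.
List<Vertex> page = g.V().has("group", y)
        .out("member_of")
        .order().by(T.id)
        .skip(100_000).limit(10)
        .toList();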
So, our next approach would be to keep a single traversal that emits members in batches and then, on separate threads, execute parallel traversals which compute the sum over each previously fetched batch:
GraphTraversal<Vertex, Vertex> members = g.V().has("group", y).out("member_of"); // created once
while (members.hasNext()) {
    List<Vertex> batch = members.next(100); // next bundle of up to 100 members
    addMembersToExecutorThread(batch);      // done in an async way
}
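Fleshed out, the batching idea looks roughly like the sketch below. computeStuffFor is a placeholder for our real per-batch computation and the thread count is arbitrary; note that only the main thread drains the shared traversal, since traversals are not thread-safe:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static long sumGroup(GraphTraversalSource g, Object y) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8); // arbitrary pool size
    List<Future<Long>> partials = new ArrayList<>();

    // The traversal is created once and drained in batches of 100;
    // only this thread touches it.
    GraphTraversal<Vertex, Vertex> members = g.V().has("group", y).out("member_of");
    while (members.hasNext()) {
        List<Vertex> batch = members.next(100);
        partials.add(pool.submit(() -> computeStuffFor(batch))); // placeholder per-batch work
    }

    long total = 0;
    for (Future<Long> f : partials) {
        total += f.get(); // wait for each batch's partial sum
    }
    pool.shutdown();
    return total;
}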
So, what are the approaches for scenarios like this? Essentially, we could solve the problem if there were a way to quickly fetch all the ancestors of some vertex, which in our case would be group_1. But it takes a lot of time just to fetch the ids with g.V().has('group',y).out('member_of').properties('members_id').
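The leanest version of that id fetch we can think of pulls element ids instead of a property; depending on the provider, this may avoid materialising each member vertex, since the adjacent vertex id is often stored on the edge itself:

// Fetch member ids only; no vertex properties are touched.
List<Object> memberIds = g.V().has("group", y)
        .out("member_of")
        .id()
        .toList();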
Is there a way to work around this problem? Or maybe we should try to execute such queries on GraphComputer?
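If GraphComputer is the way to go, we assume the OLAP form would look roughly like this, with "stuff" again being a placeholder for the per-member value:

// OLAP sketch: withComputer() runs the traversal on a GraphComputer
// (e.g. SparkGraphComputer for distributed execution).
GraphTraversalSource og = graph.traversal().withComputer();
Number total = og.V().has("group", y)
        .out("member_of")
        .values("stuff")
        .sum()
        .next();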