# Batch Processing
In order to answer queries, JanusGraph has to retrieve data from the storage backend. In general, there are two ways of doing this:
- Once data from the backend is needed, execute a backend query and continue with the result.
- Maintain a list of what data is needed. Once the list reaches a certain size, execute a batched backend query to fetch all of it at once.
The first option tends to be more responsive and consume less memory because the query can emit the first results very early without waiting for larger batches of queries to complete. This is also the option that JanusGraph uses by default. The second option can be configured in multiple ways which are explained below.
## No Batch Processing
This is the default configuration of JanusGraph. In terms of graph traversals, the execution of queries loosely follows the principle of depth-first search.
Use this configuration in use cases where for example ...
- ... each query only accesses few vertices of the graph.
- ... your application does not need the full result set immediately but rather requires a low latency for the first results to arrive.
### Possible limitations
- Traversing large neighborhoods can make the query slow.
Steps to explicitly configure this option:
- Ensure `query.batch` is set to `false`
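As a sketch, this is how the setting could appear in the graph's configuration properties file (the storage settings are placeholder assumptions; only the `query.batch` line is relevant here):

```properties
# Placeholder storage settings, shown only for context
storage.backend=cql
storage.hostname=127.0.0.1
# Disable batch processing (this is the default behavior)
query.batch=false
```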
## Unrestricted Batch Processing
Using this configuration, each step which traverses the graph starting from a vertex (so e.g. `in()`, `outE()`, and `values()`, but not `inV()` or `otherV()`, and also not `valueMap()`, see #2444) becomes a blocking operator, which means that it produces no results until all results of the previous step are known. Only then is a single backend query executed, and the results are passed to the next step. Manual `barrier()` steps do not affect this in any meaningful way.

This way of execution can be thought of as a breadth-first search.
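To illustrate, consider the hypothetical traversal below (the vertex list, edge label, and property key are assumptions, not from the original text). Under unrestricted batch processing, each vertex step issues exactly one batched backend query:

```groovy
// With query.batch=true and query.limit-batch-size=false:
g.V(list_of_vertices).
  out('knows').     // one backend query fetching the neighbors of all start vertices
  values('name')    // one backend query fetching the 'name' property of all neighbors
```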
Use this configuration in use cases where for example ...
- ... your queries are likely to access multiple vertices in each step.
- ... there is a significant network latency between JanusGraph and the storage backend.
### Possible limitations
- Increased memory consumption
- If limit steps occur late in the query, the steps before the limit step may produce unnecessary overhead because they fetch more results than are ultimately needed.
- Performing very large backend queries could stress the storage backend.
Steps to explicitly configure this option:
- Ensure `query.batch` is set to `true`
- Ensure `query.limit-batch-size` is set to `false`
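Combined in a properties file, this configuration could look as follows (the storage setting is a placeholder assumption):

```properties
storage.backend=cql
# Enable batch processing without splitting batches:
# each vertex step issues a single backend query
query.batch=true
query.limit-batch-size=false
```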
## Limited Batch Processing
Using this configuration, each step which traverses the graph starting from a vertex (so e.g. `in()`, `outE()`, and `values()`, but not `inV()` or `otherV()`) first aggregates a number of vertices before executing a batched backend query. This aggregation phase and the backend query phase repeat until all vertices are processed.

In contrast to unrestricted batch processing, where one batch corresponds to one step in the query, this approach can construct multiple batches per step.
### Configuring the batch size
Although the batch size does not necessarily need to be configured, it provides an additional tuning parameter to improve the performance of a query. By default, the batch size of TinkerPop's barrier step is used, which is currently 2500. The batch size of each vertex step can be individually configured by prepending a `barrier(<size>)` step. For example, in the query below, the first `out()` step would use the default batch size of 2500 and the second `out()` step would use a manually configured batch size of 1234:

```groovy
g.V(list_of_vertices).out().barrier(1234).out()
```
For local traversals which start with a vertex step, the limit is best configured outside the local traversal, as seen below:

```groovy
g.V(list_of_vertices).out().barrier(1234).where(__.out())
```

Placed inside the `where()` step, the `barrier(1234)` step would not be allowed to aggregate multiple traversers.
A special case applies to `repeat()` steps. Because the local traversal of a `repeat()` step has two inputs (first, the step before the `repeat()` step and second, the last step of the repeated traversal, which feeds the result back to the beginning), two limits can be configured here:

```groovy
g.V(list_of_vertices).barrier(1234).repeat(__.barrier(2345).out()).times(5)
```

The `barrier(1234)` step in front of the local traversal can only aggregate traversers once they enter the `repeat()` step for the first time. For each iteration, the inner `barrier(2345)` step is used to aggregate traversers from the previous iteration.
Use this configuration in use cases where for example ...
- ... you need to dynamically switch between the previously mentioned workloads.
### Possible limitations
- Increased memory consumption
- The performance of queries depends on the configured batch size. If you switch to this configuration, make sure that the latency and throughput of your queries meet your requirements and if not, tweak the batch size accordingly.
Steps to explicitly configure this option:
- Ensure `query.batch` is set to `true`
- Ensure `query.limit-batch-size` is set to `true`
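A corresponding properties file sketch (the storage setting is a placeholder assumption):

```properties
storage.backend=cql
# Enable batch processing and split batches during aggregation
query.batch=true
# Batches use TinkerPop's default barrier size of 2500
# unless an explicit barrier(<size>) step is prepended
query.limit-batch-size=true
```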