AWS Database Blog
Load balance graph queries using the Amazon Neptune Gremlin Client
[Updated August 2021] The Gremlin Client for Amazon Neptune is now available from Maven Central. Some APIs have changed since this article was published. Please review the demo code in the GitHub repository for the latest examples of how to use the APIs.
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Graph applications built using Neptune use read replicas to horizontally scale read throughput. These applications use the Neptune reader endpoint to distribute connections across the replicas in the cluster.
The reader endpoint provides a convenient means of distributing connection requests across replicas, but it’s not appropriate for all use cases. If an application opens a lot of connections to the reader endpoint at the same time, many of those connections may end up being tied to a single replica, resulting in an uneven distribution of connections. This can be particularly problematic for applications that use Gremlin, because Gremlin clients use long-lived WebSockets connections that send all requests to the instance to which they are connected.
In this post, we describe the Gremlin client for Amazon Neptune, a Java-based Gremlin client that is aware of your cluster topology and fairly distributes connections and requests across instances in a Neptune cluster. The client adapts to changes in cluster topology, such as adding or removing a replica. You can even configure the client to distribute requests across a subset of instances in your cluster, based on role, instance type, Availability Zone, or tags associated with each instance. The Gremlin client for Neptune acts as a drop-in replacement for the standard TinkerPop Java Gremlin client.
Distributing requests across an Amazon Neptune cluster
A Neptune cluster comprises a primary write instance and up to 15 read replicas:
- The primary instance coordinates all write operations to the database. The primary also supports read operations.
- Read replicas support read operations from the same underlying storage volume as the primary instance. Read replicas allow you to scale horizontally for high read workloads. Replicas also act as failover targets for the primary.
You can size instances in the cluster independently of one another, and you can add and remove read replicas while the cluster is running.
Neptune provides a number of different endpoints to efficiently query your cluster:
- The cluster endpoint connects to the current primary instance in your cluster. You can use the cluster endpoint to route write requests to the primary.
- The reader endpoint balances connections across read replicas in your cluster. If your cluster doesn’t have any replicas, the reader endpoint connects to the primary instance. You can use the reader endpoint to create a connection to a replica for read requests.
- Each instance in your cluster also offers an instance endpoint. You can use the instance endpoint if your client application needs to send a request to a specific instance in the cluster.
If you’re building an application that needs to distribute requests across replicas, your first choice should be the reader endpoint. This endpoint continues to balance connections across replicas even if you change the cluster topology by adding or removing replicas, or promoting a replica to become the new primary.
However, in some circumstances, using the reader endpoint can result in an uneven use of cluster resources. The reader endpoint works by periodically changing the host that the DNS entry points to. If a client opens a lot of connections before the DNS entry changes, all the connection requests are sent to a single Neptune instance. The same thing happens if DNS caching occurs in the application layer: the client ends up using the same replica over and over again. If an application opens a lot of connections to the reader endpoint at the same time, many of those connections can end up being tied to a single replica.
The Neptune documentation provides guidance on load balancing across read replicas, but the recommendations aren’t always sufficient to ensure an even distribution of work across the cluster. One way you can distribute requests across instances is to configure the client to connect to one or more instance endpoints, rather than the reader endpoint. The downside of this approach is that it requires the client code to handle changes in the cluster topology by monitoring the cluster and updating the client whenever the membership of the cluster changes.
Gremlin client for Amazon Neptune
To help you better distribute connections and Gremlin requests, we’ve developed the Gremlin client for Amazon Neptune, a Java Gremlin client that is aware of your cluster topology, and which fairly distributes connections and requests across a set of instances in a Neptune cluster. The client works by creating a connection pool for each instance endpoint in a given list of endpoints, and distributing requests (queries, not connections) in a round-robin fashion across these connection pools, thereby ensuring a more even distribution of work and higher read throughput. The client provides support for AWS Identity and Access Management (IAM) database authentication, and for connecting to Neptune via a Network Load Balancer or an Application Load Balancer.
The Gremlin client includes an agent that can periodically refresh the client with endpoint details. Every time this agent fires, it uses the Neptune Management API to get the current cluster topology. It then updates the client with the new endpoint information. You can configure how frequently this agent is triggered. You can also supply a custom endpoint selector to the agent that updates the client with a subset of instances in your cluster based on tags, instance types, instance IDs, role, Availability Zone, and more.
The following Amazon CloudWatch screenshot shows requests from an application using the Gremlin client being distributed over five read replicas in a Neptune cluster. The endpoints in the client are rotated every minute, with requests actively sent to three endpoints at any one time. Partway through the run, we deliberately triggered a failover in the cluster.
Using the GremlinCluster and GremlinClient to connect to Neptune
You create a GremlinCluster and GremlinClient much as you would a normal TinkerPop Cluster and Client. The following code shows how to create a TinkerPop GraphTraversalSource from a GremlinCluster and GremlinClient:
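The following sketch follows the demo code as it was at the time of writing; as the update note above explains, some APIs have since changed, so check the GitHub repository for current signatures. The endpoint names are placeholders for your own instance endpoints.

```java
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

// Build a cluster over an initial list of instance endpoints.
GremlinCluster cluster = GremlinClusterBuilder.build()
        .enableSsl(true)
        .addContactPoints("replica-endpoint-1", "replica-endpoint-2", "replica-endpoint-3")
        .port(8182)
        .create();

// The GremlinClient distributes requests across the endpoints round-robin.
GremlinClient client = cluster.connect();

// Create a GraphTraversalSource from the client, exactly as you would with
// the standard TinkerPop driver.
DriverRemoteConnection connection = DriverRemoteConnection.using(client);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(connection);
```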
You can use the GraphTraversalSource created in this way throughout the lifetime of your application, and across threads, just as you would with a regular client. The underlying GremlinClient ensures that requests are distributed across the current set of endpoints in a round-robin fashion.
The GremlinClient has a refreshEndpoints() method that allows you to configure the client with a new list of endpoint addresses. When the list of endpoints is changed using this method, new requests are distributed across the new set of endpoints.
After you have a reference to a GremlinClient, you can call this refreshEndpoints() method whenever you discover that the cluster topology has changed. For example, you could subscribe to Amazon Simple Notification Service (Amazon SNS) using event subscriptions and refresh the list whenever an instance is added or removed, or when you detect a failover. The following code shows how to update the list of endpoint addresses:
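A minimal sketch, assuming the varargs form of refreshEndpoints() from the demo code at the time of writing; the endpoint names are placeholders.

```java
// Tell the client about the current set of endpoints. New requests are
// distributed across this new set.
client.refreshEndpoints("new-replica-endpoint-1", "new-replica-endpoint-2");
```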
Exception handling and retry logic
The Neptune Gremlin server supports both WebSockets and HTTPS connections. WebSockets connections are often kept alive for long periods, but occasionally an untoward network event can cause the connection to close prematurely. When this happens, subsequent queries that attempt to use that connection fail. How you choose to handle these failures depends on your application’s requirements.
Before we discuss specific exception-handling strategies, it’s worth spending a moment to understand how the underlying TinkerPop connection management code deals with connection issues. The TinkerPop client (on which the GremlinClient depends) maintains a pool of connections. If a connection fails, the client attempts to replace it with a new connection. If the client considers the connection issue to be a result of the Gremlin server itself being unavailable, it immediately fails any attempts to use a connection in its pool, while beginning a background task that tries to reestablish connectivity with the host. In other words, whether it’s a single connection that has failed or the Gremlin server becoming unavailable, the client is designed to automatically attempt to reestablish connectivity to Neptune on behalf of your application.
Given this automatic connection retry behavior—which GremlinClient inherits from the underlying TinkerPop connection management code—you have a couple of options when dealing with connection issues in your application:
- Let the connection exception bubble up, causing the operation that triggered the request to fail. This then leaves the consumer of your application code to decide whether to retry the operation. If the consumer decides to retry, and the TinkerPop client has reestablished connectivity in the interim, the retry request likely succeeds.
- Surround your query with a back-off-and-retry strategy. With this approach, your code waits for the underlying client to reestablish connectivity. This shields consumers of your code from intermittent issues, but at the expense of some additional latency.
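As an illustration of the second option, the following sketch wraps a read query in a hand-rolled exponential back-off loop. The retry bound, the backoff schedule, and the broad RuntimeException catch are illustrative; production code should inspect the cause of the failure (see the exception types discussed below).

```java
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public List<Object> queryWithBackoff(GraphTraversalSource g) throws InterruptedException {
    int maxAttempts = 5; // illustrative bound
    for (int attempt = 1; ; attempt++) {
        try {
            // Any read query will do; this one is just an example.
            return g.V().limit(10).values("code").toList();
        } catch (RuntimeException e) {
            if (attempt == maxAttempts) {
                throw e; // give up and surface the failure to the caller
            }
            // Exponential back-off gives the underlying TinkerPop client time
            // to reestablish connectivity in the background.
            Thread.sleep((long) Math.pow(2, attempt) * 100L);
        }
    }
}
```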
Even if you choose to allow connection issues to bubble up to consumers of your code, there are still some circumstances—related to write requests—where you should consider adopting a back-off-and-retry strategy:
- ConcurrentModificationException – The Neptune transaction semantics mean that write requests can sometimes fail with a ConcurrentModificationException. In these situations, you may want to try an exponential back-off-based retry mechanism.
- ReadOnlyViolationException – Because the cluster topology can change at any moment as a result of both planned and unplanned cluster events, write responsibilities may migrate from one instance in the cluster to another. If the GremlinClient attempts to send a write request to an instance that is no longer the primary, the request fails with a ReadOnlyViolationException. When this happens, you may want to update the GremlinClient with the address of the new primary, and then retry the request. If you use the ClusterEndpointsRefreshAgent, detailed later in this post, to update the GremlinClient with details of the cluster topology, then your back-off-and-retry strategy need only wait for the agent to refresh the client with the primary endpoint information before attempting the request again.
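A hypothetical helper for classifying these retriable failures. Neptune reports these conditions in the error message of the server response, so this sketch simply inspects the exception chain; it's a simplification for illustration, not part of the client's own API.

```java
// Walk the exception chain and look for the retriable Neptune error codes
// discussed above. (Matching on message text is a simplification.)
public boolean isRetriable(Throwable e) {
    for (Throwable t = e; t != null; t = t.getCause()) {
        String msg = t.getMessage();
        if (msg != null && (msg.contains("ConcurrentModificationException")
                || msg.contains("ReadOnlyViolationException"))) {
            return true;
        }
    }
    return false;
}
```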
Using the GremlinClusterBuilder and NeptuneGremlinClusterBuilder
The Gremlin client includes two different cluster builders: GremlinClusterBuilder and NeptuneGremlinClusterBuilder. GremlinClusterBuilder should work with any Gremlin server. You can also use GremlinClusterBuilder if you don’t use IAM database authentication or a load balancer with your Neptune database.
If you do use IAM database authentication, or a Network Load Balancer or Application Load Balancer to connect to your cluster, use the NeptuneGremlinClusterBuilder instead. This builder includes additional methods for enabling IAM database authentication support in the client (such as SigV4 signing of requests) and for adding load balancer endpoint and port details. See the following code:
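The following sketch shows the general shape of a NeptuneGremlinClusterBuilder configuration. The builder method names follow the demo code at the time of writing and may have changed (see the update note at the top of this post); the endpoint address is a placeholder.

```java
// IAM database authentication with a load balancer in front of Neptune.
GremlinCluster cluster = NeptuneGremlinClusterBuilder.build()
        .enableSsl(true)
        .enableIamAuth(true)                          // SigV4-sign requests
        .addContactPoint("my-load-balancer-dns-name") // placeholder address
        .loadBalancerPort(80)                         // port exposed by the load balancer
        .create();

GremlinClient client = cluster.connect();
```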
If you have IAM database authentication enabled for your Neptune database, you must set the SERVICE_REGION environment variable before connecting from your client. For example, export SERVICE_REGION=us-east-1.
Sizing the connection pools
GremlinCluster creates a connection pool per endpoint. The minConnectionPoolSize and maxConnectionPoolSize settings apply to each connection pool. In the following code, we create a GremlinCluster for three replica endpoints, with minConnectionPoolSize and maxConnectionPoolSize both set to 3:
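A sketch of this configuration; the pool-sizing builder methods mirror the standard TinkerPop Cluster builder, and the endpoint names are placeholders.

```java
// Each of the three endpoints gets its own pool of exactly three
// connections, for nine connections in total.
GremlinCluster cluster = GremlinClusterBuilder.build()
        .enableSsl(true)
        .addContactPoints("replica-endpoint-1", "replica-endpoint-2", "replica-endpoint-3")
        .port(8182)
        .minConnectionPoolSize(3)
        .maxConnectionPoolSize(3)
        .create();
```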
This configuration results in the client creating three connections to replica-endpoint-1, three connections to replica-endpoint-2, and three connections to replica-endpoint-3. This is the same behavior as the TinkerPop Cluster when configured with multiple endpoints.
Using the ClusterEndpointsRefreshAgent to update endpoints in the client
You can manually change the endpoints to which the client sends requests while the client is running by using the refreshEndpoints() method of the GremlinClient. But what if you want to refresh the client automatically on a periodic basis? This is where the ClusterEndpointsRefreshAgent can help.
The ClusterEndpointsRefreshAgent allows you to schedule periodic updates to the list of endpoints with which a GremlinClient is configured. By default, the agent uses the Neptune Management API to get the details of your cluster based on its database cluster ID. Because this involves a Management API call, the IAM identity under which you’re running the agent (or the client application hosting the agent) must be authorized to perform rds:DescribeDBClusters, rds:DescribeDBInstances, and rds:ListTagsForResource for your Neptune cluster. The Management API has the same high level of availability as Neptune, which minimizes the risk of your client application not being able to connect to the database.
The following diagram shows how a Java application can use a GremlinClient and a ClusterEndpointsRefreshAgent to maintain a long-lived set of connections to a Neptune cluster. If the cluster topology changes, the ClusterEndpointsRefreshAgent refreshes the endpoints with which the GremlinClient is configured, and new connections and requests are balanced across this new list of endpoints.
The following code shows how to configure a ClusterEndpointsRefreshAgent to refresh a client with a cluster’s available replica endpoints every 60 seconds:
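A sketch based on the API at the time of writing; "my-cluster-id" is a placeholder for your database cluster ID, and the getAddresses() and startPollingNeptuneAPI() method names may differ in current releases.

```java
import java.util.concurrent.TimeUnit;

// The agent discovers endpoints via the Neptune Management API.
ClusterEndpointsRefreshAgent refreshAgent =
        new ClusterEndpointsRefreshAgent("my-cluster-id", EndpointsType.ReadReplicas);

GremlinCluster cluster = GremlinClusterBuilder.build()
        .enableSsl(true)
        .addContactPoints(refreshAgent.getAddresses().get(EndpointsType.ReadReplicas))
        .port(8182)
        .create();

GremlinClient client = cluster.connect();

// Every 60 seconds, fetch the current topology and push the replica
// endpoints into the client.
refreshAgent.startPollingNeptuneAPI(
        addresses -> client.refreshEndpoints(addresses.get(EndpointsType.ReadReplicas)),
        60,
        TimeUnit.SECONDS);
```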
Using the EndpointsSelector to add custom endpoint selection logic
The ClusterEndpointsRefreshAgent constructor accepts an EndpointsSelector that allows you to add custom endpoint selection logic. The following code shows how to select endpoints for all available replicas with a workload tag with the value analytics:
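A sketch of a custom selector. The exact shape of the EndpointsSelector interface and the instance metadata type have changed across versions, so treat the names used here (the instance list, hasTag, isAvailable, getEndpoint) as illustrative rather than the current API.

```java
import java.util.stream.Collectors;

// Select available read replicas tagged workload=analytics.
EndpointsSelector analyticsSelector = (primaryId, replicaIds, instances) ->
        instances.stream()
                .filter(i -> replicaIds.contains(i.getInstanceId()))
                .filter(i -> i.isAvailable())
                .filter(i -> i.hasTag("workload", "analytics"))
                .map(i -> i.getEndpoint())
                .collect(Collectors.toSet());

ClusterEndpointsRefreshAgent refreshAgent =
        new ClusterEndpointsRefreshAgent("my-cluster-id", analyticsSelector);
```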
The EndpointsType enum provides implementations of EndpointsSelector for some common use cases:
- All – Returns all available instance (primary and read replicas) endpoints
- Primary – Returns the primary instance endpoint if it’s available, or the cluster endpoint if the primary instance endpoint isn’t available
- ReadReplicas – Returns all available read replica instance endpoints, or, if there are no replica instance endpoints, the reader endpoint
- ClusterEndpoint – Returns the cluster endpoint
- ReaderEndpoint – Returns the reader endpoint
Using the Neptune Gremlin client in an AWS Lambda function
You should pay attention to a couple of things if you use a GremlinClient and a ClusterEndpointsRefreshAgent in an AWS Lambda function:
- It’s good practice to create a single client and a single traversal source that together live for the duration of the Lambda container, rather than creating a new client and traversal per function invocation.
- In a highly concurrent scenario, you may have many Lambda containers, each with its own ClusterEndpointsRefreshAgent. To avoid being throttled by the Neptune Management API, these agents should proxy their endpoint refresh requests through a single Lambda function that in turn gets its data from the Management API.
Using a single client and a single traversal source per container
The creation of the client-side Java driver stack is a reasonably expensive operation, whether you’re using the regular TinkerPop Gremlin client or the Neptune Gremlin client, and startup times can be further exacerbated in a Lambda function because of the relative lack of CPU cycles. Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).
To improve the performance of your Lambda functions, it’s a good practice to create a single GremlinClient and a single graph traversal source that together live for the duration of the Lambda container, rather than creating a new client and traversal per function invocation. These can be held as member variables of the Lambda function, and reused across function invocations.
There is a downside to this approach, however. If an untoward network event abruptly terminates the connection (or connections) used by the client, subsequent queries issued by the client fail. Fortunately, TinkerPop’s connection management (which the Neptune Gremlin client reuses) automatically attempts to reestablish any connections that fail. Therefore, all your code has to do is back off and retry its Gremlin query.
So, if you choose to use a single GremlinClient and a single graph traversal source per Lambda container—and there are performance benefits to doing so—you should also wrap your queries with a back-off-and-retry strategy.
Proxying endpoint refresh requests through a single Lambda function
By default, the ClusterEndpointsRefreshAgent uses the Neptune Management API to get the details of your cluster based on its database cluster ID. This is fine if your application tier has a relatively small number of clients that between them service many concurrent requests. But in a Lambda environment under load, you may end up with many concurrent Lambda containers, each with a ClusterEndpointsRefreshAgent that is periodically querying the Management API. In these circumstances, the Management API may start throttling cluster information requests.
If your application uses a lot of concurrent GremlinClient and ClusterEndpointsRefreshAgent instances, you should proxy endpoint refresh requests through a single Lambda function that periodically queries the Management API and caches the results on behalf of its clients.
The Gremlin client for Neptune GitHub repo provides all the code necessary to implement a proxying strategy. The repository includes an AWS CloudFormation template that creates a Lambda function that polls the Management API, and which you can configure with a Neptune database cluster ID and a polling interval. After it’s deployed, this function is named neptune-endpoint-info_<cluster-id>. In your own Lambda code, you can then create a ClusterEndpointsRefreshAgent using the lambdaProxy() factory method, supplying the type of endpoint information to be retrieved (All, Primary, or ReadReplicas), the Region in which the proxy Lambda function is located, and the name of the function.
The following code shows how to create a refresh agent that gets endpoint information from a proxy Lambda function named neptune-endpoint-info_my-cluster:
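A sketch using the lambdaProxy() factory method described above; the parameter order follows the demo code at the time of writing.

```java
// Poll the proxy Lambda function instead of the Management API directly.
ClusterEndpointsRefreshAgent refreshAgent = ClusterEndpointsRefreshAgent.lambdaProxy(
        EndpointsType.ReadReplicas,           // type of endpoint info to retrieve
        "us-east-1",                          // Region of the proxy Lambda function
        "neptune-endpoint-info_my-cluster");  // name of the proxy function
```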
The following diagram shows a Lambda function, my-lambda-function, querying Neptune, with endpoint information supplied by a proxy Lambda, neptune-endpoint-info_my-cluster.
Example AWS Lambda function
The following code shows a Lambda function that uses the GremlinClient and ClusterEndpointsRefreshAgent (configured to use a Lambda proxy for endpoint information requests) to issue Gremlin queries that add new vertices to a graph:
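The complete function is available in the GitHub repo. The following condensed sketch shows its overall structure: the ENDPOINTS_INFO_LAMBDA environment variable name is hypothetical, the Gremlin client class and method names follow the demo code at the time of writing, and a hand-rolled retry loop stands in for the Retry4j-based logic that the full example uses.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.T;

import java.util.UUID;
import java.util.concurrent.TimeUnit;

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.addV;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.unfold;

public class NeptuneGremlinClientExampleLambda implements RequestHandler<Object, String> {

    // Agent, cluster, client, and traversal source are created once per
    // Lambda container and reused across invocations.
    private final ClusterEndpointsRefreshAgent refreshAgent =
            ClusterEndpointsRefreshAgent.lambdaProxy(
                    EndpointsType.ClusterEndpoint,           // primary, for writes
                    System.getenv("AWS_REGION"),
                    System.getenv("ENDPOINTS_INFO_LAMBDA")); // hypothetical variable

    private final GremlinCluster cluster = NeptuneGremlinClusterBuilder.build()
            .enableSsl(true)
            .addContactPoints(refreshAgent.getAddresses().get(EndpointsType.ClusterEndpoint))
            .port(8182)
            .create();

    private final GremlinClient client = cluster.connect();

    private final GraphTraversalSource g = AnonymousTraversalSource.traversal()
            .withRemote(DriverRemoteConnection.using(client));

    public NeptuneGremlinClientExampleLambda() {
        // Keep the client's endpoint information current.
        refreshAgent.startPollingNeptuneAPI(
                addresses -> client.refreshEndpoints(addresses.get(EndpointsType.ClusterEndpoint)),
                60, TimeUnit.SECONDS);
    }

    @Override
    public String handleRequest(Object input, Context context) {
        String id = UUID.randomUUID().toString();
        int maxAttempts = 5; // illustrative retry bound

        for (int attempt = 1; ; attempt++) {
            try {
                // fold().coalesce(unfold(), addV(...)) makes the write an
                // upsert, so retrying it is safe.
                g.V(id).fold()
                        .coalesce(unfold(), addV("Person").property(T.id, id))
                        .iterate();
                return id;
            } catch (RuntimeException e) {
                // Retry connection issues, ConcurrentModificationException,
                // and ReadOnlyViolationException; reconnection itself is
                // handled by the TinkerPop connection management code.
                if (attempt == maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep((long) Math.pow(2, attempt) * 100L); // back off
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```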
Some things to note about this code:
- The GremlinClient, ClusterEndpointsRefreshAgent, and GraphTraversalSource are initialized in the function’s constructor and held as member variables, which allows them to be reused across function invocations.
- The ClusterEndpointsRefreshAgent is initialized with the name of an endpoints proxy Lambda function (supplied here via a Lambda environment variable), and configured to fetch the ClusterEndpoint (for write requests).
- The full function uses Retry4j to implement back-off-and-retry logic (the condensed sketch above inlines a simplified equivalent). The retryLogic() method checks for connection issues and indicates that the query should be retried after a wait interval if a connection issue occurs. There’s no explicit reconnect logic in the function, however, because reconnection is handled by the underlying TinkerPop connection management code.
- Because this function contains a Gremlin write query that may be retried multiple times, the query itself is implemented using a fold().coalesce(unfold(), ...) upsert idiom.
- Besides handling connection issues, the retry logic also handles instances of ConcurrentModificationException and ReadOnlyViolationException. The latter can occur if a write request is issued during a failover event.
Examples
The Gremlin client for Amazon Neptune GitHub repo includes three demos:
- Refresh agent demo – Uses a ClusterEndpointsRefreshAgent to query the Neptune APIs for the current cluster topology every 15 seconds. The GremlinClient adapts accordingly.
- Rolling subset of endpoints demo – Shows how the GremlinClient behaves when the cluster topology changes. The application periodically refreshes the client with a rolling subset of the endpoints in a cluster, mimicking the variations in the instance endpoints that occur when the cluster topology changes.
- NeptuneGremlinClientExample Lambda function – Shows how to implement a Lambda function that uses the GremlinClient to query Neptune.
We’ve provided a CloudFormation template that creates everything you need to run the sample applications. This template creates the following resources:
- A Neptune cluster comprising four r5.xlarge instances: a primary and three read replicas
- An Amazon SageMaker hosted Jupyter notebook instance that you can use to install a dataset and run the demos
The Neptune and Amazon SageMaker resources incur costs. Make sure you delete the CloudFormation stack when you’re finished with the demo.
Launching the stack
To launch the stack, complete the following steps:
- Choose Launch stack for your preferred Region: US East (Ohio), US East (N. Virginia), US West (Oregon), or EU (Ireland).
- Select the check boxes acknowledging that AWS CloudFormation will create IAM resources.
- Choose Create.
The stack takes about 30 minutes to complete, and creates the following output parameters:
- SageMakerNotebook – Link to an Amazon SageMaker hosted Jupyter notebook that you can use to load a dataset and run the demos
- CloudWatchMetrics – Link to a page showing the GremlinRequestsPerSec CloudWatch metric for the instances in the cluster
Viewing the GremlinRequestsPerSec CloudWatch metrics for the cluster
To view your CloudWatch metric, enter the CloudWatchMetrics link from the Outputs tab of the CloudFormation stack into a new browser window. This opens a CloudWatch page showing the GremlinRequestsPerSec CloudWatch metrics for each instance in the cluster. You can return to this view when you run the demos to see how requests are being distributed by the client across the instances in the cluster.
Setting up the demos
To set up the demos, choose the SageMakerNotebook link on the Outputs tab of the CloudFormation stack. In the Jupyter window, open the Neptune/neptune-gremlin-client-demo directory. This directory contains three notebooks:
- setup.ipynb – Installs an air routes dataset and downloads gremlin-client-demo.jar
- using-gremlin-client-in-a-lambda.ipynb – Allows you to trigger the NeptuneEndpointsInfoLambda, which polls the Neptune Management API on a periodic basis to fetch all the available instance endpoints, and the NeptuneGremlinClientLambda, which demonstrates how to use the GremlinClient in a Lambda function
- force-failover.ipynb – Allows you to force a failover in your Neptune cluster while you are running one of the demos
Open setup.ipynb and run the cells. This installs an air routes dataset and downloads the gremlin-client-demo.jar. The following screenshot shows the results of setting up the demos.
Running the ClusterEndpointsRefreshAgent demo
This demo uses a ClusterEndpointsRefreshAgent to query the Neptune APIs for the current cluster topology every 15 seconds. The GremlinClient adapts accordingly. The source for this demo is available on the GitHub repo.
To run this demo, you must meet the following prerequisites:
- Ensure you have installed the air routes dataset using setup.ipynb before running this demo.
- The command for this demo requires a --cluster-id parameter. If you’ve installed the demo via our CloudFormation template, the cluster ID is supplied to the demo as an environment variable.
- The identity under which you’re running the demo must be authorized to perform rds:DescribeDBClusters, rds:DescribeDBInstances, and rds:ListTagsForResource for your Neptune cluster. If you’ve installed the demo via our CloudFormation template, the necessary IAM permissions have already been granted to the Amazon SageMaker execution role.
To run the demo from Jupyter, choose New at the top right side of the page and choose Terminal.
In the new terminal, run the following command:
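The following invocation is representative; the subcommand name reflects the demo as it existed at the time of writing, so check the GitHub repo if it has changed:

```
java -jar gremlin-client-demo.jar refresh-agent-demo --cluster-id <your-cluster-id>
```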
When you run the demo, you see the GremlinClient connect to the read replicas in the cluster. Every 15 seconds, you see the ClusterEndpointsRefreshAgent fetching new endpoints from the Neptune Management API. Periodically, you see the demo output the number of queries run. The following screenshot shows the agent getting read replica details for the client, with approximately 10,000 Gremlin requests run in between the endpoints being refreshed.
While this demo is running, try triggering a failover in the cluster (using the force-failover.ipynb notebook). The demo doesn’t implement any retry strategies, so you see some requests failing during the failover event, but after approximately 15 seconds, you should see a new endpoint added to the client, and the old endpoint removed. Don’t forget to monitor which instances are handling requests by using the CloudWatchMetrics link provided by the CloudFormation template. The following screenshot shows requests being distributed across three read replicas before and after a failover event.
Running the rolling endpoints demo
This demo shows how the GremlinClient behaves when the cluster topology changes. The application periodically refreshes the client with a rolling subset of the endpoints in a cluster. This mimics situations in which the cluster topology changes. The source for this demo is available on the GitHub repo.
When the demo runs, the GremlinClient is initialized with approximately two-thirds of the endpoints in the cluster (one endpoint in a two-instance cluster, two endpoints in a three-instance cluster, three endpoints in a five-instance cluster, and so on). The demo repeatedly issues simple queries against the cluster via these endpoints. (It doesn’t do anything with the query results; you just see some feedback every 10,000 queries.) Every minute, the demo rolls the endpoints to use a different subset of endpoints.
You can view the distribution of queries across the cluster using the GremlinRequestsPerSec CloudWatch metric in the CloudWatch console. The CloudWatchMetrics stack output parameter provides a convenient link to the GremlinRequestsPerSec metrics for the instances in the cluster.
Before running this demo, ensure you installed the air routes dataset using setup.ipynb. Don’t forget to monitor which instances are handling requests using the CloudWatchMetrics link provided by the CloudFormation template.
To run the demo from Jupyter, choose New at the top right side of the page and choose Terminal.
In the new terminal, run the following command:
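As with the previous demo, this invocation is representative; the subcommand name reflects the demo at the time of writing, so check the GitHub repo if it has changed:

```
java -jar gremlin-client-demo.jar rolling-endpoints-demo --cluster-id <your-cluster-id>
```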
Running the Lambda function demo
Our CloudFormation template installs an endpoints proxy Lambda function and an example Lambda function that gets endpoint information from the proxy, configures a GremlinClient, and submits Gremlin requests to the Neptune cluster. You can invoke both of these functions from the using-gremlin-client-in-a-lambda.ipynb notebook.
Open the using-gremlin-client-in-a-lambda.ipynb notebook and run the cells to invoke each function in turn. The first invocation triggers the proxy function; you can configure this invocation to fetch all endpoints, or just the primary or read replica endpoints. The second function is the one that hosts a GremlinClient and all the query logic. We showed the code for this function earlier in the post; it’s also available on the GitHub repo. You can adapt this code when building your own applications using Lambda functions.
Conclusion
This post introduced the Java Gremlin client for Amazon Neptune, which helps you increase the performance of your graph applications by distributing connections and requests across instances of a Neptune cluster.
For links to documentation, blog posts, videos, and code repositories with samples and tools, see Amazon Neptune resources.
Before you begin designing your database, we also recommend that you consult the AWS Reference Architectures for Using Graph Databases GitHub repo, where you can inform your choices about graph data models and query languages, and browse examples of reference deployment architectures.
About the Authors
Ian Robinson is a Principal Graph Architect with Amazon Neptune. He is a co-author of ‘Graph Databases’ and ‘REST in Practice’ (both from O’Reilly) and a contributor to ‘REST: From Research to Practice’ (Springer) and ‘Service Design Patterns’ (Addison-Wesley).
Navtanay Sinha is a Senior Product Manager for Amazon Neptune at AWS.