AWS Big Data Blog
Ingesting Jira data into Amazon S3
Consolidating data from a work management tool like Jira and integrating this data with other data sources like ServiceNow, GitHub, Jenkins, and Time Entry Systems enables end-to-end visibility of different aspects of the software development lifecycle and helps keep your projects on schedule and within budget.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. Many of our customers choose to build their data lakes on Amazon S3. They find the flexible, pay-as-you-go, cloud model ideal when dealing with vast amounts of heterogeneous data.
This post discusses some of the use cases for ingesting Jira data into an Amazon S3 data lake, the ingestion data flow, and a conceptual approach to ingesting data. We also provide the relevant Python code.
Use cases
Business use cases for Jira data ingestion range from proactive project monitoring and detection and resolution of project effort and cost variances to identifying non-compliance with SDLC processes. In this section, we provide a representative but not exhaustive list of use cases and their benefits.
Cognitive project monitoring
Cognitive project monitoring use cases include the following:
- Automated analytics – You can prevent or reduce project schedule and budget variances by proactively monitoring metrics that combine data from Jira, GitHub, Jenkins, and Time Entry Systems.
- Automated status reporting – You can use Amazon SageMaker machine learning (ML) models to derive prescriptive metrics by looking at data across various sources. This could reduce a project manager’s time spent stitching data and generating reports, and provide a holistic view of project-tracking metrics.
Automated project compliance and governance
You can analyze user behavior to detect potentially suspicious patterns by building a baseline of user activity. You create this based on primary data from HR (such as role, location, and work hours) and IT infrastructure (such as an assigned asset’s IP address).
Possible business outcomes include the following:
- Proactively identify user IDs and passwords being shared with other users
- Detect insider threats, such as abnormal login times, unauthorized access to Jira, and incorrect access permissions in Jira, GitHub, Jenkins, or Time Entry Systems
- Identify compromised accounts based on frequent logins from unassigned assets or unusual successive authentications
- Identify theft of corporate intellectual property based on unusual printing volume, printing project-related documents, and emailing organization-related documents and code to external accounts
Accelerated migration from Jira to another project management product
You can also use AWS Glue and its Data Catalog metadata to map between the two products. This could accelerate your data migration.
Overview of solution
One of the most common approaches to ingest data from Jira into AWS is to create a Python module that is used in AWS Glue or AWS Lambda. The following diagram shows the high-level approach for an end-to-end solution. In this solution, an AWS Glue development endpoint and its associated SageMaker Jupyter notebook instance are used to create the Jira Python module, providing a notebook experience with interactive testing and debugging capability. The scope of this post is limited to the following steps:
- Setting up access for Jira
- Using the Python module with AWS Lambda or AWS Glue
- Incrementally pulling changed data from Jira with JQL (Jira Query Language)
- Ingesting data to the AWS serverless data lake
Ingesting data from Jira into Amazon S3
The Jira server exposes data using REST APIs and open authorization (OAuth) authentication methods. It uses a three-legged OAuth approach (also called the OAuth dance) to acquire access to the resources served by the APIs. For more information about the following steps, see OAuth for REST APIs.
Generating an RSA public/private key pair
Consumer key and consumer secret details are required for interacting with the API endpoints. You store the details inside encrypted AWS Systems Manager (SSM) parameters.
To use macOS or Linux, run the following OpenSSL commands in the terminal (anywhere in the file system):
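The following is the typical key-generation sequence adapted from the Jira OAuth documentation; the file names are illustrative:

```bash
# Generate a 1024-bit RSA private key (Jira's OAuth examples use 1024-bit keys)
openssl genrsa -out jira_privatekey.pem 1024

# Create a self-signed certificate from the private key
openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365

# Convert the private key to PKCS8 format
openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8

# Extract the public key from the certificate
openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem
```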
To use Windows, download OpenSSL and run it using the path to the bin folder. Create a new environment variable named OPENSSL_CONF with the value "path_to"\openssl.cnf. Then run the same OpenSSL commands shown above from a command prompt opened as administrator.
Configuring a REST API-based consumer in Jira
For full instructions on configuring your REST API-based consumer, see Step 2: Configure your client application as an OAuth consumer in OAuth for REST APIs. Be sure to complete the following steps:
- In the Link applications section, select Create incoming link.
- For Public key, enter the public key you created earlier.
Performing the OAuth dance
In this step, you go through the process of obtaining an access token so the consumer can access the resources served by the Jira APIs.
- Create the following parameters in the AWS Systems Manager Parameter Store:
  - jira_access_private_key – Stores the private key as a parameter.
  - jira_access_urls – Stores the URLs used to access Jira. These URLs are constructed from the display URLs defined in Jira by adding the following paths:
    - request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
    - access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
    - authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
    - data_url – https://jiratoawss3.atlassian.net/rest/api/2/search
  - jira_access_secrets – Stores the secrets used to access Jira. Initially, only two values are present in this SSM parameter; it's updated later with access_token. You need the following two values to start: consumer_key and consumer_secret.
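You can create these parameters on the Systems Manager console or programmatically. The following is a minimal boto3 sketch; the parameter names come from this post, but the JSON layout of the values and the Region are assumptions:

```python
# Minimal sketch: create the SSM parameters used by the rest of this post.
# Assumes multi-value parameters are stored as JSON strings.
import json

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

with open("jira_privatekey.pem") as f:
    ssm.put_parameter(Name="jira_access_private_key", Value=f.read(),
                      Type="SecureString", Overwrite=True)

ssm.put_parameter(
    Name="jira_access_urls",
    Value=json.dumps({
        "request_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token",
        "access_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token",
        "authorize_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize",
        "data_url": "https://jiratoawss3.atlassian.net/rest/api/2/search",
    }),
    Type="String", Overwrite=True)

ssm.put_parameter(
    Name="jira_access_secrets",
    Value=json.dumps({"consumer_key": "your-consumer-key",
                      "consumer_secret": "your-consumer-secret"}),
    Type="SecureString", Overwrite=True)
```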
- Download the notebook file and upload it to the SageMaker notebook instance of an AWS Glue development endpoint. To set up the development endpoint:
  - On the AWS Glue console, choose Dev endpoints, then choose Add endpoint.
  - Specify an endpoint name, such as demo-endpoint.
  - Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Create an IAM Role for AWS Glue. Choose Next.
  - In Networking, leave Skip networking information selected, and choose Next.
  - In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (don't use an Amazon EC2 key pair). The generated public key is imported into your development endpoint. Save the corresponding private key so you can connect to the development endpoint later using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.
  - When the development endpoint status is Ready, set up a SageMaker notebook within the development endpoint.
  - When the notebook status shows Ready, open the notebook and upload the downloaded notebook file.
You now run the following cells.
- Install and import the dependent modules with the following code (a consolidated sketch of these cells follows this list).
- Create an SSM client to read the parameters defined in Systems Manager (update the Region if it's different than us-east-1).
- Define the signature class to sign the Jira REST API requests.
- Get the consumer_key and consumer_secret from the SSM parameter that you defined earlier.
- Define the URLs for request_token_url, access_token_url, and authorize_url. These URLs were defined when you set up the REST API endpoint in Jira:
  - request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
  - access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
  - authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
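The exact code for these cells ships with the downloadable notebook. As a reference, the following is a minimal consolidated sketch of what they contain, assuming the oauth2 and tlslite-ng libraries and assuming the SSM parameters store their values as JSON (the parameter names match those defined earlier; the JSON layout is an assumption):

```python
# pip install boto3 oauth2 tlslite-ng
import binascii
import json
import urllib.parse

import boto3
import oauth2 as oauth
from tlslite.utils import keyfactory

# SSM client (update the Region if yours differs)
ssm = boto3.client("ssm", region_name="us-east-1")

private_key_pem = ssm.get_parameter(
    Name="jira_access_private_key", WithDecryption=True)["Parameter"]["Value"]

# Jira's OAuth 1.0a flow requires RSA-SHA1 request signing,
# hence the custom signature class.
class SignatureMethod_RSA_SHA1(oauth.SignatureMethod):
    name = "RSA-SHA1"

    def signing_base(self, request, consumer, token):
        # Build the OAuth signature base string from method, URL, and parameters
        sig = (
            oauth.escape(request.method),
            oauth.escape(request.normalized_url),
            oauth.escape(request.get_normalized_parameters()),
        )
        key = "%s&" % oauth.escape(consumer.secret)
        if token:
            key += oauth.escape(token.secret)
        raw = "&".join(sig)
        return key, raw

    def sign(self, request, consumer, token):
        # Sign the base string with the private key stored in SSM
        key, raw = self.signing_base(request, consumer, token)
        private_key = keyfactory.parsePrivateKey(private_key_pem)
        signature = private_key.hashAndSign(raw.encode("utf-8"))
        return binascii.b2a_base64(signature)[:-1]

# Consumer key and secret from the SSM parameter defined earlier
secrets = json.loads(ssm.get_parameter(
    Name="jira_access_secrets", WithDecryption=True)["Parameter"]["Value"])
consumer_key = secrets["consumer_key"]
consumer_secret = secrets["consumer_secret"]

# OAuth URLs from the SSM parameter defined earlier
urls = json.loads(ssm.get_parameter(Name="jira_access_urls")["Parameter"]["Value"])
request_token_url = urls["request_token_url"]
access_token_url = urls["access_token_url"]
authorize_url = urls["authorize_url"]
```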
- Generate your request token:
You only need to do this once every five years (the default setting in Jira).
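The following hedged sketch shows the shape of the request-token call, reusing the names from the cells sketched above:

```python
# Sketch: request an OAuth request token from Jira.
# consumer_key, consumer_secret, request_token_url, and
# SignatureMethod_RSA_SHA1 come from the earlier cells.
consumer = oauth.Consumer(consumer_key, consumer_secret)
client = oauth.Client(consumer)
client.set_signature_method(SignatureMethod_RSA_SHA1())

resp, content = client.request(request_token_url, "POST")
request_token = dict(urllib.parse.parse_qsl(content))
```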
The following are example request token values after the URL parse and conversion to a dict:
{b'oauth_token': b'oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV ',
b'oauth_token_secret': b'CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'}
- Manually approve the request token by opening the following URL in a browser:
An example of the final authorization URL is:
- https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize?oauth_token=wYLlIxmcsnZTHgTy2ZpUmBakqzmqSbww
When you go to the URL in your output, Jira prompts you to approve access for the consumer.
- Use an approved request token to generate an access token:
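A hedged sketch of this exchange, continuing the names used above:

```python
# Sketch: exchange the approved request token for an access token.
token = oauth.Token(request_token[b"oauth_token"].decode(),
                    request_token[b"oauth_token_secret"].decode())
client = oauth.Client(consumer, token)
client.set_signature_method(SignatureMethod_RSA_SHA1())

resp, content = client.request(access_token_url, "POST")
access_token_content = dict(urllib.parse.parse_qsl(content))
```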
The following are example access token values after the URL parse and conversion to a dict:
{b'oauth_token': b'Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj',
b'oauth_token_secret': b'FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN',
b'oauth_expires_in': b'157680000',
b'oauth_session_handle': b'BulouCOypjssDS3GzeY7Ldi30h0ERWDo',
b'oauth_authorization_expires_in': b'160272000'}
The oauth_authorization_expires_in value states when the authorization expires (in seconds), which is generally 5 years.
- Update the access_token key in the SSM parameter jira_access_secrets with the value of access_token_content.
This access token is valid for 5 years (the oauth_expires_in key of access_token_content states when the token expires, in seconds). Rotating the access key depends on your organization's security policy and is out of scope for this post.
Using the access token, querying data from Jira, and storing data in Amazon S3
The following are the important points in this step:
- The Jira REST API returns 50 records at a time, but gives a total record count, which is used to paginate through the result set.
- The Jira REST API endpoint needs to be updated with JQL filters. JQL allows you to pick only changed records.
- Data returned from the REST API endpoint is serialized to Amazon S3. The Python code batches the records from Jira pages and commits after every four pages (which is configurable) have been fetched from Jira (see the sketch after this list).
- JQL is appended to the data_url (defined earlier). When pulling data from Jira, it's good practice to do a one-time bulk load and then run incremental loads by maintaining the last data pull date in an Amazon DynamoDB table. The following screenshot shows an example of tracking dates in DynamoDB by project. Because it's an hourly batch, last_ingest_date is rounded up to the hour.
- The key JQL attributes used for constructing the query and pulling data from Jira are:
  - project – Loop through the projects from DynamoDB and pull data for one project at a time.
  - updated – The last update date for a Jira story or task. The data pull from Jira is based on this date:
    - For bulk loads, updated is less than or equal to the batch run date, rounded up to the hour.
    - For incremental loads, updated is greater than or equal to last_ingest_date from DynamoDB, and less than or equal to the batch run date, rounded up to the hour.
    - To use a date in JQL, you need to parse it before use.
  - startAt – Jira generally paginates the results every 50 records. This attribute is used to loop through the complete dataset. For example, if a project has 500 records and the page size is 50 records, this attribute is incremented by the page size in every iteration, and it takes 10 iterations to get the complete data.
  - maxResults – The page size set up in Jira (the maximum number of records Jira returns in every API call).
- Use the provided notebook to perform the OAuth dance. The sample code pulls data from Jira based on the approach described earlier. The purpose of this code is to accelerate implementing data ingestion from Jira.
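To make the preceding points concrete, the following is a hypothetical sketch that reads the last pull date from DynamoDB, builds the JQL, pages through the results, and commits batches to Amazon S3. The DynamoDB table name, S3 bucket, and function name are illustrative assumptions and not part of the sample notebook:

```python
# Hypothetical sketch tying the points above together; reuses the
# OAuth-signed client and data_url from the earlier steps.
import json
import urllib.parse
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
tracker = dynamodb.Table("jira_ingest_tracker")  # stores last_ingest_date per project

def pull_project(client, data_url, project, batch_run_date, bucket,
                 page_size=50, commit_every=4):
    # Incremental load: updated >= last_ingest_date AND updated <= batch run date
    last_ingest_date = tracker.get_item(
        Key={"project": project})["Item"]["last_ingest_date"]
    jql = (f'project = "{project}" AND updated >= "{last_ingest_date}" '
           f'AND updated <= "{batch_run_date}" ORDER BY updated ASC')

    start_at, batch, pages = 0, [], 0
    while True:
        # startAt/maxResults page through the full result set 50 records at a time
        url = (f"{data_url}?jql={urllib.parse.quote(jql)}"
               f"&startAt={start_at}&maxResults={page_size}")
        resp, content = client.request(url, "GET")
        page = json.loads(content)
        batch.extend(page["issues"])
        pages += 1
        done = start_at + page_size >= page["total"]
        # Commit to Amazon S3 after every four pages (configurable), as noted above
        if batch and (pages % commit_every == 0 or done):
            key = (f"jira/{project}/"
                   f"{datetime.now(timezone.utc):%Y%m%d%H%M%S}_{start_at}.json")
            s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(batch))
            batch = []
        if done:
            break
        start_at += page_size

    # Record the new high-water mark for the next incremental run
    tracker.update_item(
        Key={"project": project},
        UpdateExpression="SET last_ingest_date = :d",
        ExpressionAttributeValues={":d": batch_run_date},
    )
```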
Cleaning up
To avoid incurring future charges, delete the resources set up as part of this post:
- AWS Glue Development Endpoint
- DynamoDB table
- Systems Manager parameters
- S3 bucket
Next steps
To extend the usability scope of Jira data in S3 buckets, you can crawl the location to create AWS Glue Data Catalog database tables. Registering the locations with AWS Lake Formation helps simplify permission management and allows you to implement fine-grained access control. You can also use Amazon Athena, Amazon Redshift, Amazon SageMaker, and Amazon QuickSight for data analysis, ML, and reporting.
Conclusion
This post aims to simplify and accelerate the steps to ingest Jira data into Amazon S3. The solution includes Jira configuration, performing the three-legged OAuth dance, JQL-based attributes for data selection, and Python-based data extraction into Amazon S3.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Glue forum.
About the Authors
Vishwa Gupta is a Data and ML Engineer with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platforms and solutions. Outside of work, he enjoys spending time with family, traveling, and playing badminton.
Sreeram Thoom is a Data Architect at Amazon Web Services.