AWS Big Data Blog
Ingesting Jira data into Amazon S3
Consolidating data from a work management tool like Jira and integrating this data with other data sources like ServiceNow, GitHub, Jenkins, and Time Entry Systems enables end-to-end visibility of different aspects of the software development lifecycle and helps keep your projects on schedule and within budget.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. Many of our customers choose to build their data lakes on Amazon S3. They find the flexible, pay-as-you-go, cloud model ideal when dealing with vast amounts of heterogeneous data.
This post discusses some of the use cases for ingesting Jira data into an Amazon S3 data lake, the ingestion data flow, and a conceptual approach to ingesting data. We also provide the relevant Python code.
Use cases
Business use cases for Jira data ingestion range from proactive project monitoring and detection and resolution of project effort and cost variances to identifying non-compliance with SDLC processes. In this section, we provide a representative but not exhaustive list of use cases and their benefits.
Cognitive project monitoring
Cognitive project monitoring use cases include the following:
- Automated analytics – You can prevent or reduce project schedule and budget variances by proactively monitoring metrics that combine data from Jira, GitHub, Jenkins, and Time Entry Systems.
- Automated status reporting – You can use Amazon SageMaker machine learning (ML) models to derive prescriptive metrics by looking at data across various sources. This could reduce a project manager’s time spent stitching data and generating reports, and provide a holistic view of project-tracking metrics.
Automated project compliance and governance
You can analyze user behavior to detect potentially suspicious patterns by building a baseline of user activity. You create this based on primary data from HR (such as role, location, and work hours) and IT infrastructure (such as an assigned asset’s IP address).
Possible business outcomes include the following:
- Proactively identify user IDs and passwords being shared with other users
- Detect insider threats, such as abnormal login times, unauthorized access to Jira, and incorrect access permissions in Jira, GitHub, Jenkins, or Time Entry Systems
- Identify compromised accounts based on frequent logins from unassigned assets or unusual successive authentications
- Identify theft of corporate intellectual property based on unusual printing volume, printing project-related documents, and emailing organization-related documents and code to external accounts
Accelerated migration from Jira to another project management product
You can also use AWS Glue and its Data Catalog metadata to map between the two products. This could accelerate your data migration.
Overview of solution
One of the most common approaches to ingest data from Jira into AWS is to create a Python module that is used in AWS Glue or AWS Lambda. The following diagram shows the high-level approach for an end-to-end solution. In this solution, an AWS Glue development endpoint and its associated SageMaker Jupyter notebook instance are used to create the Jira Python module, providing a notebook experience with interactive testing and debugging capability. The scope of this post is limited to the following steps:
- Setting up access for Jira
- Using the Python module with AWS Lambda or AWS Glue
- Incrementally pulling changed data from Jira with JQL (Jira Query Language)
- Ingesting data to the AWS serverless data lake
Ingesting data from Jira into Amazon S3
The Jira server exposes data using REST APIs and open authorization (OAuth) authentication methods. It uses a three-legged OAuth approach (also called the OAuth dance) to acquire access to the resources served by the APIs. For more information about the following steps, see OAuth for REST APIs.
Generating an RSA public/private key pair
Consumer key and consumer secret details are required for interacting with the API endpoints. You store the details inside encrypted AWS Systems Manager (SSM) parameters.
To use macOS or Linux, run the following OpenSSL commands in the terminal (anywhere in the file system):
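The following is the typical key-generation sequence adapted from the Jira OAuth documentation; the file names are illustrative:

```bash
# Generate a 1024-bit RSA private key (Jira's OAuth examples use 1024-bit keys)
openssl genrsa -out jira_privatekey.pem 1024

# Create a self-signed certificate from the private key
openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365

# Convert the private key to PKCS8 format
openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8

# Extract the public key from the certificate
openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem
```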
To use Windows, download OpenSSL and run it using the path to the bin folder. Create a new environment variable named OPENSSL_CONF with the value "path_to"\openssl.cnf. Then run the same OpenSSL commands shown above from a command prompt opened as administrator.
Configuring a REST API-based consumer in Jira
For full instructions on configuring your REST API-based consumer, see Step 2: Configure your client application as an OAuth consumer in OAuth for REST APIs. Be sure to complete the following steps:
- In the Link applications section, select Create incoming link.
- For Public key, enter the public key you created earlier.
Performing the OAuth dance
In this step, you go through the process of obtaining an access token so the consumer can access the resources served by the Jira APIs.
- Create the following parameters in the AWS Systems Manager Parameter Store:
  - jira_access_private_key – Stores the private key as a parameter.
  - jira_access_urls – Stores the URLs used to access Jira. These URLs are constructed from the display URLs defined in Jira by adding the following paths:
    - request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
    - access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
    - authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
    - data_url – https://jiratoawss3.atlassian.net/rest/api/2/search
  - jira_access_secrets – Stores the secrets used to access Jira. Initially, only two values are present in this SSM parameter; it's updated later with access_token. You need the following two values to start: consumer_key and consumer_secret.
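You can create these parameters on the Systems Manager console or programmatically. The following is a minimal boto3 sketch; the parameter names come from this post, but the JSON layout of the values and the Region are assumptions:

```python
# Minimal sketch: create the SSM parameters used by the rest of this post.
# Assumes multi-value parameters are stored as JSON strings.
import json

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

with open("jira_privatekey.pem") as f:
    ssm.put_parameter(Name="jira_access_private_key", Value=f.read(),
                      Type="SecureString", Overwrite=True)

ssm.put_parameter(
    Name="jira_access_urls",
    Value=json.dumps({
        "request_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token",
        "access_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token",
        "authorize_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize",
        "data_url": "https://jiratoawss3.atlassian.net/rest/api/2/search",
    }),
    Type="String", Overwrite=True)

ssm.put_parameter(
    Name="jira_access_secrets",
    Value=json.dumps({"consumer_key": "your-consumer-key",
                      "consumer_secret": "your-consumer-secret"}),
    Type="SecureString", Overwrite=True)
```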
- Download the notebook file and upload it to the SageMaker notebook instance of an AWS Glue development endpoint. To set up the development endpoint:
  - On the AWS Glue console, choose Dev endpoints, then choose Add endpoint.
  - Specify an endpoint name, such as demo-endpoint.
  - Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Create an IAM Role for AWS Glue. Choose Next.
  - In Networking, leave Skip networking information selected, and choose Next.
  - In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (don't use an Amazon EC2 key pair). The generated public key is imported into your development endpoint. Save the corresponding private key so you can connect to the development endpoint later using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.
  - When the development endpoint status is Ready, set up a SageMaker notebook within the development endpoint.
  - When the notebook status shows Ready, open the notebook and upload the downloaded notebook file.
You now run the following cells.
- Install and import the dependent modules with the following code (a consolidated sketch of these cells follows this list).
- Create an SSM client to read the parameters defined in Systems Manager (update the Region if it's different than us-east-1).
- Define the signature class to sign the Jira REST API requests.
- Get the consumer_key and consumer_secret from the SSM parameter that you defined earlier.
- Define the URLs for request_token_url, access_token_url, and authorize_url. These URLs were defined when you set up the REST API endpoint in Jira:
  - request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
  - access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
  - authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
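The exact code for these cells ships with the downloadable notebook. As a reference, the following is a minimal consolidated sketch of what they contain, assuming the oauth2 and tlslite-ng libraries and assuming the SSM parameters store their values as JSON (the parameter names match those defined earlier; the JSON layout is an assumption):

```python
# pip install boto3 oauth2 tlslite-ng
import binascii
import json
import urllib.parse

import boto3
import oauth2 as oauth
from tlslite.utils import keyfactory

# SSM client (update the Region if yours differs)
ssm = boto3.client("ssm", region_name="us-east-1")

private_key_pem = ssm.get_parameter(
    Name="jira_access_private_key", WithDecryption=True)["Parameter"]["Value"]

# Jira's OAuth 1.0a flow requires RSA-SHA1 request signing,
# hence the custom signature class.
class SignatureMethod_RSA_SHA1(oauth.SignatureMethod):
    name = "RSA-SHA1"

    def signing_base(self, request, consumer, token):
        # Build the OAuth signature base string from method, URL, and parameters
        sig = (
            oauth.escape(request.method),
            oauth.escape(request.normalized_url),
            oauth.escape(request.get_normalized_parameters()),
        )
        key = "%s&" % oauth.escape(consumer.secret)
        if token:
            key += oauth.escape(token.secret)
        raw = "&".join(sig)
        return key, raw

    def sign(self, request, consumer, token):
        # Sign the base string with the private key stored in SSM
        key, raw = self.signing_base(request, consumer, token)
        private_key = keyfactory.parsePrivateKey(private_key_pem)
        signature = private_key.hashAndSign(raw.encode("utf-8"))
        return binascii.b2a_base64(signature)[:-1]

# Consumer key and secret from the SSM parameter defined earlier
secrets = json.loads(ssm.get_parameter(
    Name="jira_access_secrets", WithDecryption=True)["Parameter"]["Value"])
consumer_key = secrets["consumer_key"]
consumer_secret = secrets["consumer_secret"]

# OAuth URLs from the SSM parameter defined earlier
urls = json.loads(ssm.get_parameter(Name="jira_access_urls")["Parameter"]["Value"])
request_token_url = urls["request_token_url"]
access_token_url = urls["access_token_url"]
authorize_url = urls["authorize_url"]
```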
- Generate your request token:
You only need to do this once every five years (the default setting in Jira).
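The following hedged sketch shows the shape of the request-token call, reusing the names from the cells sketched above:

```python
# Sketch: request an OAuth request token from Jira.
# consumer_key, consumer_secret, request_token_url, and
# SignatureMethod_RSA_SHA1 come from the earlier cells.
consumer = oauth.Consumer(consumer_key, consumer_secret)
client = oauth.Client(consumer)
client.set_signature_method(SignatureMethod_RSA_SHA1())

resp, content = client.request(request_token_url, "POST")
request_token = dict(urllib.parse.parse_qsl(content))
```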
The following are example request token values after the URL parse and conversion to a dict:
{b'oauth_token': b'oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV ',
b'oauth_token_secret': b'CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'}
- Manually approve the request token by opening the following URL in a browser:
An example of the final authorization URL is:
- https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize?oauth_token=wYLlIxmcsnZTHgTy2ZpUmBakqzmqSbww
When you go to the URL in your output, Jira prompts you to approve access for the consumer.
- Use an approved request token to generate an access token:
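A hedged sketch of this exchange, continuing the names used above:

```python
# Sketch: exchange the approved request token for an access token.
token = oauth.Token(request_token[b"oauth_token"].decode(),
                    request_token[b"oauth_token_secret"].decode())
client = oauth.Client(consumer, token)
client.set_signature_method(SignatureMethod_RSA_SHA1())

resp, content = client.request(access_token_url, "POST")
access_token_content = dict(urllib.parse.parse_qsl(content))
```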
The following are example access token values after the URL parse and conversion to a dict:
{b'oauth_token': b'Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj',
b'oauth_token_secret': b'FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN',
b'oauth_expires_in': b'157680000',
b'oauth_session_handle': b'BulouCOypjssDS3GzeY7Ldi30h0ERWDo',
b'oauth_authorization_expires_in': b'160272000'}
The oauth_authorization_expires_in value states when the authorization expires (in seconds), which is generally 5 years.
- Update the access_token key in the SSM parameter jira_access_secrets with the value of access_token_content.
This access token is valid for 5 years (the oauth_expires_in key of access_token_content states when the token expires, in seconds). Rotating the access key depends on your organization's security policy and is out of scope for this post.
Using the access token, querying data from Jira, and storing data in Amazon S3
The following are the important points in this step:
- The Jira REST API returns 50 records at a time, but gives a total record count, which is used to paginate through the result set.
- The Jira REST API endpoint needs to be updated with JQL filters. JQL allows you to pick only changed records.
- Data returned from the REST API endpoint is serialized to Amazon S3. The Python code batches the records from Jira pages and commits after every four pages (which is configurable) have been fetched from Jira (see the sketch after this list).
- JQL is appended to the data_url (defined earlier). When pulling data from Jira, it's good practice to do a one-time bulk load and then run incremental loads by maintaining the last data pull date in an Amazon DynamoDB table. The following screenshot shows an example of tracking dates in DynamoDB by project. Because it's an hourly batch, last_ingest_date is rounded up to the hour.
- The key JQL attributes used for constructing the query and pulling data from Jira are:
  - project – Loop through the projects from DynamoDB and pull data for one project at a time.
  - updated – The last update date for a Jira story or task. The data pull from Jira is based on this date:
    - For bulk loads, updated is less than or equal to the batch run date, rounded up to the hour.
    - For incremental loads, updated is greater than or equal to last_ingest_date from DynamoDB, and less than or equal to the batch run date, rounded up to the hour.
    - To use a date in JQL, you need to parse it before use.
  - startAt – Jira generally paginates the results every 50 records. This attribute is used to loop through the complete dataset. For example, if a project has 500 records and the page size is 50 records, this attribute is incremented by the page size in every iteration, and it takes 10 iterations to get the complete data.
  - maxResults – The page size set up in Jira (the maximum number of records Jira returns in every API call).
- Use the provided notebook to perform the OAuth dance. The sample code pulls data from Jira based on the approach described earlier. The purpose of this code is to accelerate implementing data ingestion from Jira.
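To make the preceding points concrete, the following is a hypothetical sketch that reads the last pull date from DynamoDB, builds the JQL, pages through the results, and commits batches to Amazon S3. The DynamoDB table name, S3 bucket, and function name are illustrative assumptions and not part of the sample notebook:

```python
# Hypothetical sketch tying the points above together; reuses the
# OAuth-signed client and data_url from the earlier steps.
import json
import urllib.parse
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
tracker = dynamodb.Table("jira_ingest_tracker")  # stores last_ingest_date per project

def pull_project(client, data_url, project, batch_run_date, bucket,
                 page_size=50, commit_every=4):
    # Incremental load: updated >= last_ingest_date AND updated <= batch run date
    last_ingest_date = tracker.get_item(
        Key={"project": project})["Item"]["last_ingest_date"]
    jql = (f'project = "{project}" AND updated >= "{last_ingest_date}" '
           f'AND updated <= "{batch_run_date}" ORDER BY updated ASC')

    start_at, batch, pages = 0, [], 0
    while True:
        # startAt/maxResults page through the full result set 50 records at a time
        url = (f"{data_url}?jql={urllib.parse.quote(jql)}"
               f"&startAt={start_at}&maxResults={page_size}")
        resp, content = client.request(url, "GET")
        page = json.loads(content)
        batch.extend(page["issues"])
        pages += 1
        done = start_at + page_size >= page["total"]
        # Commit to Amazon S3 after every four pages (configurable), as noted above
        if batch and (pages % commit_every == 0 or done):
            key = (f"jira/{project}/"
                   f"{datetime.now(timezone.utc):%Y%m%d%H%M%S}_{start_at}.json")
            s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(batch))
            batch = []
        if done:
            break
        start_at += page_size

    # Record the new high-water mark for the next incremental run
    tracker.update_item(
        Key={"project": project},
        UpdateExpression="SET last_ingest_date = :d",
        ExpressionAttributeValues={":d": batch_run_date},
    )
```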
Cleaning up
To avoid incurring future charges, delete the resources set up as part of this post:
- AWS Glue Development Endpoint
- DynamoDB table
- Systems Manager parameters
- S3 bucket
Next steps
To extend the usability scope of Jira data in S3 buckets, you can crawl the location to create AWS Glue Data Catalog database tables. Registering the locations with AWS Lake Formation helps simplify permission management and allows you to implement fine-grained access control. You can also use Amazon Athena, Amazon Redshift, Amazon SageMaker, and Amazon QuickSight for data analysis, ML, and reporting.
Conclusion
This post aims to simplify and accelerate the steps to ingest Jira data into Amazon S3. The solution includes Jira configuration, performing the three-legged OAuth dance, JQL-based attributes for data selection, and Python-based data extraction into Amazon S3.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Glue forum.
About the Authors
Vishwa Gupta is a Data and ML Engineer with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platforms and solutions. Outside of work, he enjoys spending time with family, traveling, and playing badminton.
Sreeram Thoom is a Data Architect at Amazon Web Services.