Synthetic IoT Security Data using Amazon Bedrock

In the rapidly evolving landscape of the Internet of Things (IoT), security is paramount. One example that underscores this challenge is the prevalence of insecure network devices with open SSH ports, which the nonprofit Open Worldwide Application Security Project (OWASP) ranks among the top IoT security threats. Such vulnerabilities can allow unauthorized control over IoT devices, leading to severe security breaches. In environments where billions of connected devices generate vast amounts of data, ensuring the security and integrity of these devices and their communications becomes increasingly complex. Moreover, collecting comprehensive and diverse security data to prevent such threats can be daunting, because real-world examples are often limited or difficult to reproduce. This is where synthetic data generation using generative AI comes into play. By simulating scenarios such as unauthorized access attempts, telemetry anomalies, and abnormal traffic patterns, this technique helps bridge the gap, enabling the development and testing of more robust security measures for IoT devices on AWS.

What is Synthetic Data Generation?

Synthetic data is artificially generated data that mimics the characteristics and patterns of real-world data. It is created using sophisticated algorithms and machine learning models, rather than using data collected from physical sources. In the context of security, synthetic data can be used to simulate various attack scenarios, network traffic patterns, device telemetry, and other security-related events.

Generative AI models have emerged as powerful tools for synthetic data generation. These models are trained on real-world data and learn to generate new, realistic samples that resemble the training data while preserving its statistical properties and patterns.

The use of synthetic data for security purposes offers numerous benefits, particularly when embedded within a continuous improvement cycle for IoT security. This cycle begins with the assumption of ongoing threats within an IoT environment. By generating synthetic data that mimics these threats, organizations can simulate the application of security protections and observe their effectiveness in real time. This synthetic data allows for the creation of comprehensive and diverse datasets without compromising privacy or exposing sensitive information. As security tools are calibrated and refined based on these simulations, the process loops back, enabling further data generation and testing. This virtuous cycle ensures that security measures keep evolving to stay ahead of potential vulnerabilities. Moreover, synthetic data generation is both cost-effective and scalable, allowing for the production of large volumes of data tailored to specific use cases. Ultimately, this cycle provides a robust and controlled environment for the continuous testing, validation, and enhancement of IoT security measures.

Figure 1.0 – Continuous IoT Security Enhancement Cycle Using Synthetic Data

Benefits of Synthetic Data Generation

The application of synthetic security data generated by generative AI models spans various use cases in the IoT domain:

Security Testing and Validation: Synthetic data can be used to simulate various attack scenarios, stress-test security controls, and validate the effectiveness of intrusion detection and prevention systems in a controlled and safe environment.
Anomaly Detection and Threat Hunting: By generating synthetic data representing both normal and anomalous behavior, machine learning models can be trained to identify potential security threats and anomalies in IoT environments more effectively.
Incident Response and Forensics: Synthetic security data can be used to recreate and analyze past security incidents, enabling improved incident response and forensic investigation capabilities.
Security Awareness and Training: Synthetic data can be used to create realistic security training scenarios, helping to educate and prepare security professionals for various IoT security challenges.
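
As a rough illustration of the anomaly detection use case above, the sketch below builds a small labeled dataset of connection events and applies a simple threshold rule to it. The function names, the IP ranges, the failure probabilities, and the threshold are all assumptions made for this example; a real detector would more likely be a trained model evaluated against the labels.

```python
import random
from collections import Counter

def make_labeled_events(n_normal=200, n_attack=40, seed=7):
    """Return shuffled (event, label) pairs; label 1 marks the simulated attack."""
    rng = random.Random(seed)
    events = []
    # Normal traffic: many devices, mostly successful connections (assumed 95%).
    for _ in range(n_normal):
        events.append(({
            "sourceIp": f"10.0.0.{rng.randint(1, 50)}",
            "eventType": "Connect",
            "status": "Success" if rng.random() < 0.95 else "Failure",
        }, 0))
    # Simulated brute-force attempt: a single source with repeated failures.
    for _ in range(n_attack):
        events.append(({
            "sourceIp": "203.0.113.9",  # hypothetical attacker, documentation address range
            "eventType": "Connect",
            "status": "Failure",
        }, 1))
    rng.shuffle(events)
    return events

def flag_suspicious_ips(events, max_failures=10):
    """Flag any source IP with more than max_failures failed connection attempts."""
    failures = Counter(e["sourceIp"] for e, _ in events if e["status"] == "Failure")
    return {ip for ip, count in failures.items() if count > max_failures}

labeled = make_labeled_events()
print(flag_suspicious_ips(labeled))
```

Because every synthetic event carries a label, the output of a rule or model can be scored directly against known ground truth, which is rarely possible with production logs.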

How does Amazon Bedrock help?

Amazon Bedrock is a managed generative AI service with the capability to help organizations generate high-quality synthetic data across various domains, including security. With Amazon Bedrock, users can leverage advanced generative AI models to create synthetic datasets that mimic the characteristics of their real-world data. One of the key advantages of Amazon Bedrock is its ability to handle structured, semi-structured, and unstructured data formats, making it well-suited for generating synthetic security data from diverse sources, such as network logs, device telemetry, and intrusion detection alerts.

Generating Synthetic Security Data for IoT

In this blog post, we’re going to use Amazon Bedrock with Anthropic Claude 3 Sonnet to generate synthetic log data. Here is an example of a prompt to Amazon Bedrock:

Create a python function that generates synthetic security log entries for an AWS IoT environment consisting of various connected devices such as smart home appliances, industrial sensors, and wearable devices. The log entries should include different types of events, including: 
1. Device authentication and connection events (successful and failed attempts) 
2. Device telemetry and sensor data transmissions 
3. Network traffic patterns (normal and anomalous) 
4. Security incidents and potential attacks (e.g., unauthorized access attempts, malware detection, distributed denial-of-service (DDoS) attacks) 
5. System and application log messages related to security events 

Each log entry should have the following format: 
{ "timestamp": "2024-07-23 16:51:17.384", "logLevel": "INFO", "traceId": "e2893ea0-8c00-b560-5e71-9fb35a9654c2", "accountId": "123456789012", "status": "Success", "eventType": "Publish-Out", "protocol": "MQTT", "topicName": "/iot/test/device", "clientId": "virtualDevice1", "principalId": "ad4f9225b1753fc27feb79341bf13d17bedbd3f8d6514ba626bfb22d1851e472", "sourceIp": "1.2.3.4", "sourcePort": 36954 }

Here is another log example:
{ "timestamp": "2024-07-23 16:38:46.504", "logLevel": "ERROR", "traceId": "c9c54f40-5d9a-6693-5ddf-d52fb16e514f", "accountId": "123456789012", "status": "Failure", "eventType": "Connect", "protocol": "MQTT", "clientId": "virtualDevice1", "principalId": "ad4f9225b1753fc27feb79341bf13d17bedbd3f8d6514ba626bfb22d1851e472", "sourceIp": "2.3.4.5", "sourcePort": 46028, "reason": "AUTHORIZATION_FAILURE", "details": "Authorization Failure" }

Additionally, ensure that the generated log entries are diverse, realistic, and representative of various IoT security scenarios, including both benign and malicious activities. The synthetic data should also maintain the statistical properties and patterns found in real-world IoT security logs.

The results can be captured through the Amazon Bedrock chat playground, or accessed programmatically through the AWS APIs, such as those highlighted below:

import boto3
import json

bedrock = boto3.client(service_name='bedrock-runtime',region_name='us-west-2')

prompt = '''<abbreviated for brevity, refer to the prompt above>'''

body = json.dumps({
    "messages" : [
        { 
            "role" : "user",
            "content" : prompt
        }],
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2000,
    "temperature": 0.75,
    "top_k" : 250
})

response = bedrock.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',
    contentType='application/json',
    accept='application/json',
    body=body
)

response_body = json.loads(response['body'].read())
print(response_body['content'][0]['text'])

Given that generative AI models are not deterministic, your response may vary, but it will likely be similar to the code shown below:

Here's a Python function that generates synthetic security log entries for an AWS IoT environment based on your requirements:

```python
import random
import uuid
import datetime
import json
import ipaddress

def generate_iot_security_log():
    # Define possible values for various fields
    log_levels = ["INFO", "WARN", "ERROR"]
    event_types = ["Connect", "Disconnect", "Publish-In", "Publish-Out", "Subscribe", "Unsubscribe"]
    protocols = ["MQTT", "HTTP", "WebSocket"]
    topics = ["/iot/device/telemetry", "/iot/device/status", "/iot/device/command"]
    client_types = ["smartHome", "industrial", "wearable"]
    account_ids = ["123456789012", "234567890123", "345678901234"]
    status_options = ["Success", "Failure"]

    # Generate random values
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    log_level = random.choice(log_levels)
    trace_id = str(uuid.uuid4())
    account_id = random.choice(account_ids)
    status = random.choice(status_options)
    event_type = random.choice(event_types)
    protocol = random.choice(protocols)
    topic_name = random.choice(topics)
    client_type = random.choice(client_types)
    client_id = f"{client_type}Device{random.randint(1, 1000)}"
    principal_id = uuid.uuid4().hex
    source_ip = str(ipaddress.IPv4Address(random.randint(0, 2**32 - 1)))
    source_port = random.randint(1024, 65535)

    # Create the base log entry
    log_entry = {
        "timestamp": timestamp,
        "logLevel": log_level,
        "traceId": trace_id,
        "accountId": account_id,
        "status": status,
        "eventType": event_type,
        "protocol": protocol,
        "clientId": client_id,
        "principalId": principal_id,
        "sourceIp": source_ip,
        "sourcePort": source_port
    }

    # Add event-specific details
    if event_type in ["Publish-In", "Publish-Out", "Subscribe", "Unsubscribe"]:
        log_entry["topicName"] = topic_name

    if status == "Failure":
        log_entry["reason"] = "AUTHORIZATION_FAILURE"
        log_entry["details"] = "Authorization Failure"

    # Add security-specific events
    if random.random() < 0.1:  # 10% chance of security event
        security_events = [
            ("Malware", "Potential malware detected on device"),
            ("UnauthorizedAccess", "Unauthorized access attempt detected"),
            ("DDoS", "Potential DDoS attack detected"),
            ("AnomalousTraffic", "Anomalous network traffic pattern detected")
        ]
        security_event, details = random.choice(security_events)
        log_entry["securityEvent"] = security_event
        log_entry["securityDetails"] = details

    # Add telemetry data for some events
    if event_type == "Publish-In" and random.random() < 0.3:  # 30% chance of telemetry data
        telemetry = {
            "temperature": round(random.uniform(18, 28), 2),
            "humidity": round(random.uniform(30, 70), 2),
            "batteryLevel": random.randint(0, 100)
        }
        log_entry["telemetryData"] = telemetry

    return json.dumps(log_entry)

# Example usage
for _ in range(10):
    print(generate_iot_security_log())
```

This function generates synthetic security log entries for an AWS IoT environment with the following features:

1. Diverse event types, including device connections, telemetry data transmissions, and security incidents.
2. Realistic field values, such as randomly generated IP addresses, ports, and UUIDs.
3. Variety in device types (smart home, industrial, wearable).
4. Both successful and failed events.
5. Inclusion of security-specific events (malware detection, unauthorized access, DDoS attacks, anomalous traffic).
6. Telemetry data for some publish events.

The function maintains the required log format and includes additional fields when relevant. You can adjust the probabilities and ranges of values to fine-tune the generated data to match your specific requirements or to more closely mimic real-world patterns.

To use this function, simply call `generate_iot_security_log()` to get a single log entry as a JSON string. You can call it multiple times to generate a series of log entries.
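
A response like the one above mixes explanatory prose with a fenced Python block, so a script that consumes the API output may want to separate the two. The following is a minimal sketch of one way to do that with a regular expression; `extract_python_blocks` is a hypothetical helper, not part of any AWS SDK, and it assumes the model labels its fence with `python`. Generated code should still be reviewed before it is run.

```python
import re

# Three backticks, built programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3

def extract_python_blocks(response_text):
    """Return the bodies of all fenced Python blocks found in a model response."""
    pattern = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)
    return [match.strip() for match in pattern.findall(response_text)]

# A tiny stand-in for response_body['content'][0]['text'].
sample = (
    "Here is the function:\n\n"
    + FENCE + "python\nprint('hello')\n" + FENCE
    + "\n\nIt prints a greeting."
)
blocks = extract_python_blocks(sample)
print(blocks[0])
```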

This Python function generates IoT security logs that you can send to Amazon Simple Storage Service (Amazon S3) and query with Amazon Athena, visualize with Amazon QuickSight, or process with other AWS services as you see fit. This is just an example, and we encourage you to adapt the prompt to fit your organization's needs, as there are a variety of use cases. For example, you can add the following sentence to the end of the prompt: “Also, the Python function should write to an Amazon S3 bucket of the user’s choosing” to modify the Python function to write to Amazon S3.
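
If you prefer to handle the upload yourself, one possible approach is sketched below. It joins the JSON strings returned by `generate_iot_security_log()` into newline-delimited JSON, a layout that Athena's JSON SerDes can typically read with one record per line, and writes them as a single object through the S3 `put_object` call. The helper names, bucket, and key are placeholders chosen for this example, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import json

def to_json_lines(entries):
    """Join JSON-string log entries into newline-delimited JSON, one record per line."""
    # Round-trip each entry so malformed JSON fails here rather than at query time.
    return "\n".join(json.dumps(json.loads(entry)) for entry in entries) + "\n"

def upload_logs(s3_client, bucket, key, entries):
    """Write the entries as one S3 object and return the number of bytes sent."""
    body = to_json_lines(entries).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)

# Example usage (requires boto3 and AWS credentials; bucket and key are placeholders):
# import boto3
# entries = [generate_iot_security_log() for _ in range(1000)]
# upload_logs(boto3.client("s3"), "amzn-s3-demo-bucket", "iot-logs/synthetic.json", entries)
```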

Best Practices and Considerations

While synthetic data generation using generative AI offers numerous benefits, there are several best practices and considerations to keep in mind:

Model Validation: Thoroughly validate and test the generative AI models used for synthetic data generation to ensure they produce realistic and statistically accurate samples.
Domain Expertise: Collaborate with subject matter experts in IoT security and data scientists to ensure the synthetic data accurately represents real-world scenarios and meets the specific requirements of the use case.
Continuous Monitoring: Regularly monitor and update the generative AI models and synthetic data to reflect changes in the underlying real-world data distributions and emerging security threats.
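
As one small, concrete form of the validation described above, the sketch below checks that each generated entry carries the base fields from the example log format and compares the share of failed events with a rate you expect from real logs. The field list is taken from the sample entries earlier in this post; `validate_logs`, the expected rate, and the tolerance are assumptions for illustration, and a fuller validation would compare many more properties.

```python
import json

# Base fields shared by the example log entries shown earlier in this post.
REQUIRED_FIELDS = {
    "timestamp", "logLevel", "traceId", "accountId", "status", "eventType",
    "protocol", "clientId", "principalId", "sourceIp", "sourcePort",
}

def validate_logs(json_entries, expected_failure_rate, tolerance=0.05):
    """Check required fields and compare the observed failure rate to an expected one."""
    records = [json.loads(entry) for entry in json_entries]
    if not records:
        raise ValueError("no log entries to validate")
    # Indexes of entries that lack one or more required fields.
    missing = [i for i, r in enumerate(records) if not REQUIRED_FIELDS.issubset(r)]
    failure_rate = sum(r.get("status") == "Failure" for r in records) / len(records)
    return {
        "records": len(records),
        "missing_fields": missing,
        "failure_rate": failure_rate,
        "rate_ok": abs(failure_rate - expected_failure_rate) <= tolerance,
    }

# Example usage, assuming generate_iot_security_log() from earlier is available
# and that 0.5 is the failure rate you expect (a placeholder value):
# report = validate_logs([generate_iot_security_log() for _ in range(1000)], 0.5)
```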

Conclusion

As the IoT landscape continues to expand, the need for comprehensive and robust security measures becomes increasingly crucial. Synthetic data generation using generative AI offers a powerful solution to address the challenges of obtaining diverse and representative security data for IoT environments. By using services like Amazon Bedrock, organizations can generate high-quality synthetic security data, enabling rigorous testing, validation, and training of their security systems.

The benefits of synthetic data generation extend beyond just data availability; it also enables privacy preservation, cost-effectiveness, and scalability. By adhering to best practices and leveraging the expertise of data scientists and security professionals, organizations can harness the power of generative AI to fortify their IoT security posture and stay ahead of evolving threats.

About the authors

Syed Rehan

Syed is a Senior Cybersecurity Product Manager at Amazon Web Services (AWS), operating within the AWS IoT Security organization. As a published book author on AWS IoT, Machine Learning, and Cybersecurity, he brings extensive expertise to his global role. Syed serves a diverse customer base, collaborating with security specialists, CISOs, developers, and security decision-makers to promote the adoption of AWS Security services and solutions. With in-depth knowledge of cybersecurity, machine learning, artificial intelligence, IoT, and cloud technologies, Syed assists customers ranging from startups to large enterprises. He enables them to construct secure IoT, ML, and AI-based solutions within the AWS environment.

Anthony Harvey

Anthony is a Senior Security Specialist Solutions Architect for AWS in the worldwide public sector group. Prior to joining AWS, he was a chief information security officer in local government for half a decade. He has a passion for figuring out how to do more with less and using that mindset to enable customers in their security journey.

The Internet of Things on AWS – Official Blog
