AWS Machine Learning Blog

Code Llama code generation models from Meta are now available via Amazon SageMaker JumpStart

Today, we are excited to announce that Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Code Llama model via SageMaker JumpStart.

What is Code Llama

Code Llama is a state-of-the-art model released by Meta and built on top of Llama 2. It is designed to improve productivity on programming tasks by helping developers create high-quality, well-documented code. The models show state-of-the-art performance in Python, C++, Java, PHP, C#, TypeScript, and Bash, and have the potential to save developers’ time and make software workflows more efficient. Code Llama comes in three variants, engineered to cover a wide variety of applications: the foundational model (Code Llama), a Python specialized model (Code Llama-Python), and an instruction-following model for understanding natural language instructions (Code Llama-Instruct). All Code Llama variants come in three sizes: 7B, 13B, and 34B parameters. The 7B and 13B base and instruct variants support infilling based on surrounding content, making them ideal for code assistant applications.

The models were designed using Llama 2 as the base and then trained on 500 billion tokens of code data, with the Python specialized version trained on an incremental 100 billion tokens. The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens.

The model is made available under the same community license as Llama 2.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances within a network isolated environment and customize models using SageMaker for model training and deployment.

You can now discover and deploy Code Llama models with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to apply MLOps controls and monitor model performance with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and under your VPC controls, helping ensure data security. Code Llama models are discoverable and can be deployed in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions.

Customers must accept the EULA to deploy the model via the SageMaker SDK.

Discover models

You can access Code Llama foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

On the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Code Llama models in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Generation Models or searching for Code Llama.

You can choose the model card to view details about the model such as the license, the data used to train it, and how to use it. You will also find two buttons, Deploy and Open Notebook, which help you use the model.

Deploy

When you choose Deploy and acknowledge the terms, deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using the notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

# Reference the Code Llama 7B base model in SageMaker JumpStart
model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-7b")
# Deploy the model to a real-time SageMaker endpoint with default settings
predictor = model.deploy()

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
   "inputs": "<s>[INST] How do I deploy a model on Amazon SageMaker? [/INST]",
   "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}
}
predictor.predict(payload, custom_attributes="accept_eula=true")

Note that by default, accept_eula is set to false. You need to set accept_eula=true to invoke the endpoint successfully. By doing so, you accept the end-user license agreement (EULA) and acceptable use policy as mentioned earlier. You can also download the license agreement.

The custom_attributes used to pass the EULA are key/value pairs. The key and value are separated by = and pairs are separated by ;. If the same key is passed more than once, the last value is kept and passed to the script handler (in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true is passed to the server, then accept_eula=true is kept and passed to the script handler.
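
For example, the following call (a minimal illustration using the predictor created earlier) passes the key twice; the endpoint keeps only the last value, so the EULA is treated as accepted for this request:

# The duplicated accept_eula key resolves to its last value, true
predictor.predict(payload, custom_attributes="accept_eula=false;accept_eula=true")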

Inference parameters control the text generation process at the endpoint. The maximum new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.

The following table lists all the Code Llama models available in SageMaker JumpStart along with the model IDs, default instance types, and the maximum supported tokens (sum of the number of input tokens and number of generated tokens for all concurrent requests) supported for each of these models.

Model Name | Model ID | Default Instance Type | Max Supported Tokens
CodeLlama-7b | meta-textgeneration-llama-codellama-7b | ml.g5.2xlarge | 10,000
CodeLlama-7b-Instruct | meta-textgeneration-llama-codellama-7b-instruct | ml.g5.2xlarge | 10,000
CodeLlama-7b-Python | meta-textgeneration-llama-codellama-7b-python | ml.g5.2xlarge | 10,000
CodeLlama-13b | meta-textgeneration-llama-codellama-13b | ml.g5.12xlarge | 32,000
CodeLlama-13b-Instruct | meta-textgeneration-llama-codellama-13b-instruct | ml.g5.12xlarge | 32,000
CodeLlama-13b-Python | meta-textgeneration-llama-codellama-13b-python | ml.g5.12xlarge | 32,000
CodeLlama-34b | meta-textgeneration-llama-codellama-34b | ml.g5.48xlarge | 48,000
CodeLlama-34b-Instruct | meta-textgeneration-llama-codellama-34b-instruct | ml.g5.48xlarge | 48,000
CodeLlama-34b-Python | meta-textgeneration-llama-codellama-34b-python | ml.g5.48xlarge | 48,000

While the Code Llama models were trained on a context length of 16,000 tokens, the models have reported good performance on even larger context windows. The maximum supported tokens column in the preceding table is the upper limit on the supported context window on the default instance type. Since the Code Llama 7B model can only support 10,000 tokens on an ml.g5.2xlarge instance, we recommend deploying a 13B or 34B model version if larger contexts are required for your application.
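
If you need the larger context window, a minimal sketch for deploying the 13B Instruct variant looks like the following (the instance_type argument is optional and shown here with the default from the preceding table; instance availability and service quotas vary by account and Region):

from sagemaker.jumpstart.model import JumpStartModel

# CodeLlama-13b-Instruct supports up to 32,000 tokens on its default instance
model = JumpStartModel(
    model_id="meta-textgeneration-llama-codellama-13b-instruct",
    instance_type="ml.g5.12xlarge",  # override to another supported GPU instance if needed
)
predictor = model.deploy()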

By default, all models work for code generation tasks. The base and instruct models both respond to infilling tasks, though the base model had better quality output for the majority of sample queries. Finally, only instruct models work on instruct tasks. The following table illustrates which models had good performance (Good) and moderate performance (Moderate) on example queries in the demo notebooks.

Model | Code Generation | Code Infilling | Code Instructions
CodeLlama-7b | Good | Good | N/A
CodeLlama-7b-Instruct | Good | Moderate | Good
CodeLlama-7b-Python | Good | N/A | N/A
CodeLlama-13b | Good | Good | N/A
CodeLlama-13b-Instruct | Good | Moderate | Good
CodeLlama-13b-Python | Good | N/A | N/A
CodeLlama-34b | Good | N/A | N/A
CodeLlama-34b-Instruct | Good | N/A | Good
CodeLlama-34b-Python | Good | N/A | N/A

Code generation

The following examples were run on the CodeLlama-34b-Instruct model with payload parameters "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}:

Input
import socket

def ping_exponential_backoff(host: str):


Output
    """
    Ping a host using exponential backoff.
    """
    for i in range(1, 10):
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            time.sleep(2 ** i)
    return False

Input
import argparse

def main(string: str):
    print(string)
    print(string[::-1])

if __name__ == "__main__":


Output
    parser = argparse.ArgumentParser()
    parser.add_argument("string", help="String to be printed")
    args = parser.parse_args()
    main(args.string)
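
To send a code generation prompt like the preceding ones to your own endpoint, you can place the raw code prefix in the inputs field. The following is a minimal sketch using the predictor created earlier and the parameters noted above; the exact shape of the response depends on the serving container:

# The prompt is simply the code to be completed
prompt = """import argparse

def main(string: str):
    print(string)
    print(string[::-1])

if __name__ == "__main__":
"""

payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
}
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)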

Code infilling

Code infilling involves returning generated code given surrounding context. This differs from the code generation task because, in addition to a prefix code segment, the model is also provided with a code segment suffix. Special tokens were used during fine-tuning to mark the beginning of the prefix (<PRE>), the beginning of the suffix (<SUF>), and the beginning of the middle (<MID>). Input sequences to the model should be in one of the following formats:

  • prefix-suffix-middle: <PRE> {prefix} <SUF>{suffix} <MID>
  • suffix-prefix-middle: <PRE> <SUF>{suffix} <MID> {prefix}
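
As a sketch, the prefix-suffix-middle prompt for the first example that follows can be assembled and sent like this (using the prefix and suffix strings from that example):

# Build an infilling prompt in the prefix-suffix-middle format
prefix = 'def remove_non_ascii(s: str) -> str:\n    """'
suffix = "\n    return result\n"
prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}}
response = predictor.predict(payload, custom_attributes="accept_eula=true")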

The following examples use the prefix-suffix-middle format on the CodeLlama-7b model with payload parameters {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}:

Input
<PRE> def remove_non_ascii(s: str) -> str:
    """ <SUF>
    return result
 <MID>


Output
    Remove non-ASCII characters from a string.

    :param s: The string to remove non-ASCII characters from.
    :return: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c

Input
# Installation instructions:
    ```bash
<FILL>
    ```
This downloads the LLaMA inference code and installs the repository as a local pip package.


Output
    git clone https://github.com/LLaMA-AI/LLaMA-inference.git
    cd LLaMA-inference
    pip install -e .

Input
class InterfaceManagerFactory(AbstractManagerFactory):
    def __init__(<FILL>
def main():
    factory = InterfaceManagerFactory(start=datetime.now())
    managers = []
    for i in range(10):
        managers.append(factory.build(id=i))


Output
self, start=None):
        self.start = start

    def build(self, id):
        return InterfaceManager(id, self.start)

Input
/-- A quasi-prefunctoid is 1-connected iff all its etalisations are 1-connected. -/
theorem connected_iff_etalisation [C D : precategoroid] (P : quasi_prefunctoid C D) :
  π₁ P = 0 ↔ <FILL> = 0 :=
begin
  split,
  { intros h f,
    rw pi_1_etalisation at h,
    simp [h],
    refl
  },
  { intro h,
    have := @quasi_adjoint C D P,
    simp [←pi_1_etalisation, this, h],
    refl
  }
end


Output
∀ f : C ⟶ D, π₁ (P.etalise f)

Code instructions

Meta also provided an instruction-tuned variant of Code Llama. Example queries in this section can only be applied to these instruction-tuned Code Llama models, which are the models with an instruct suffix in their model ID. The Code Llama format for instructions is the same as the Llama-2-chat prompt format, which we detail in Llama 2 foundation models are now available in SageMaker JumpStart.

A simple user prompt may look like the following:

<s>[INST] {user_prompt} [/INST]

You may also add a system prompt with the following syntax:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt} [/INST]

Finally, you can have a conversational interaction with the model by including all previous user prompts and assistant responses in the input:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt_1} [/INST] {assistant_response_1} </s><s>[INST] {user_prompt_2} [/INST]
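
If you build these prompts programmatically, a small helper such as the following sketch (the helper is illustrative and not part of the SageMaker SDK) keeps the special tokens consistent across single-turn, system-prompt, and conversational inputs:

def build_instruct_prompt(user_prompt, system_prompt=None, history=None):
    """Format a prompt for Code Llama Instruct using the Llama-2-chat template.

    history is an optional list of (user_prompt, assistant_response) pairs from
    earlier turns. The system prompt, if any, appears only in the first turn.
    """
    first_turn = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    prompt = ""
    for past_user, past_assistant in history or []:
        prompt += f"<s>[INST] {first_turn}{past_user} [/INST] {past_assistant} </s>"
        first_turn = ""  # system prompt is only included in the first turn
    return prompt + f"<s>[INST] {first_turn}{user_prompt} [/INST]"

For example, build_instruct_prompt("Give an example in Python.", system_prompt="Provide concise answers.") produces the system-prompt format shown above.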

These examples were run on the CodeLlama-13b-Instruct model with payload parameters "parameters": {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9}:

Input
<s>[INST] In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month? [/INST] 

Output
 You can use the `find` command in Bash to list all text files in the current directory (excluding subdirectories) that have been modified in the last month. Here's an example command:
```
find . -type f -name "*.txt" -mtime -30
```
Here's a breakdown of the command:

* `find .`: This tells `find` to search the current directory and all its subdirectories.
* `-type f`: This tells `find` to only search for files (not directories).
* `-name "*.txt"`: This tells `find` to only search for files with the `.txt` extension.
* `-mtime -30`: This tells `find` to only search for files that have been modified in the last 30 days (2,592,000 seconds).

The output of the command will be a list of all text files in the current directory (excluding subdirectories) that have been modified in the last month.

Input
<s>[INST] What is the difference between inorder and preorder traversal? Give an example in Python. [/INST] 

Output
 Inorder and preorder traversal are two different ways of traversing a binary tree.

Inorder traversal is a way of visiting the nodes of a binary tree in a specific order. In an inorder traversal, the left subtree is visited first, then the current node, and finally the right subtree is visited. This means that the nodes are visited in a left-to-right, top-to-bottom order.

Preorder traversal, on the other hand, is a way of visiting the nodes of a binary tree in a different order. In a preorder traversal, the current node is visited first, followed by the left subtree, and finally the right subtree. This means that the nodes are visited in a top-down, left-to-right order.

Here is an example of an inorder traversal of a binary tree in Python:
```
def inorder_traversal(root):
    if root is None:
        return []
    return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)
```
And here is an example of a preorder traversal of a binary tree in Python:
```
def preorder_traversal(root):
    if root is None:
        return []
    return [root.val] + preorder_traversal(root.left) + preorder_traversal(root.right)
```
In this example, the `inorder_traversal` function returns a list of the values in the tree in the order of the keys, from smallest to largest. The `preorder_traversal` function returns a list of the values in the tree in the order of the keys, but with the current node visited before its children.

Input
<s>[INST] <<SYS>>
Provide answers in JavaScript
<</SYS>>

Write a function that computes the set of sums of all contiguous sublists of a given list. [/INST] 

Output
 function contiguousSublistSums(list) {
  let sums = [];
  for (let i = 0; i < list.length; i++) {
    let sum = 0;
    for (let j = i; j < list.length; j++) {
      sum += list[j];
      sums.push(sum);
    }
  }
  return sums;
}

console.log(contiguousSublistSums([1, 2, 3, 4, 5])); // [1, 3, 6, 10, 15]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6])); // [1, 3, 6, 10, 15, 21]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7])); // [1, 3, 6, 10, 15, 21, 28]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8])); // [1, 3, 6, 10, 15, 21, 28, 36]
console.log(contiguousSublistSums([1, 2, 3, 4, 5, 6, 7, 8, 9])); // [1, 3, 6, 10, 15, 21, 28, 36, 45]

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process so that you stop incurring charges. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Code Llama models in SageMaker Studio and deploy the model for generating code and natural language about code from both code and natural language prompts. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

About the authors


Gabriel Synnaeve is a Research Director on the Facebook AI Research (FAIR) team at Meta. Prior to Meta, Gabriel was a postdoctoral fellow in Emmanuel Dupoux’s team at École Normale Supérieure in Paris, working on reverse-engineering the acquisition of language in babies. Gabriel received his PhD in Bayesian modeling applied to real-time strategy games AI from the University of Grenoble.

Eissa Jamil is a Partner Engineer RL, Generative AI at Meta.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Vivek Singh is a product manager with SageMaker JumpStart. He focuses on enabling customers to onboard SageMaker JumpStart to simplify and accelerate their ML journey to build Generative AI applications.