Running Llama 3 with Elixir Bumblebee

April 21, 2024

Update 4/22/2024: Jonatan Klosko has added multiple eos token support to Bumblebee and fixed the special tokens map issue with this model. If you load Bumblebee from GitHub, the model works with the serving code at the top of the article.

  {:bumblebee, git: "https://github.com/elixir-nx/bumblebee", override: true}

Llama 3 was released this week, and it comes with a new tokenizer and chat template. We can get the model up and running, but it requires a few tweaks to the tokenizer and the generation config.

Let’s start with a basic serving modified from the Bumblebee Llama docs. You’ll need a Hugging Face token and access to the gated Meta-Llama-3-8B-Instruct model at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.

Mix.install([
  {:bumblebee, "~> 0.5.3"},
  {:nx, "~> 0.7.1"},
  {:exla, "~> 0.7.1"},
  {:kino, "~> 0.12.3"},
  {:kino_bumblebee, "~> 0.5.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :cuda})

hf_token = System.fetch_env!("HF_TOKEN")
repo = {:hf, "meta-llama/Meta-Llama-3-8B-Instruct", auth_token: hf_token}

{:ok, model_info} =
  Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :cuda}, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 500,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, name: Llama, serving: serving})

prompt = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

What do you know about elixir?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)

Running that code, we get the following error:

** (RuntimeError) conversion failed, expected "eos_token_id" to be a number, got: [128001, 128009]
    (bumblebee 0.5.3) lib/bumblebee/shared/converters.ex:20: anonymous fn/3 in Bumblebee.Shared.Converters.convert!/2
    (elixir 1.15.2) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
    (bumblebee 0.5.3) lib/bumblebee/shared/converters.ex:14: Bumblebee.Shared.Converters.convert!/2
    (bumblebee 0.5.3) lib/bumblebee/text/generation_config.ex:289: Bumblebee.HuggingFace.Transformers.Config.Bumblebee.Text.GenerationConfig.load/2
    (bumblebee 0.5.3) lib/bumblebee.ex:1039: Bumblebee.load_generation_config/2
    #cell:ii2zn47a6ekeiz5cqllhltah5oyittpn:9: (file)

Bumblebee 0.5.3 expects a single integer for eos_token_id, not a list. We can fix this by downloading the generation_config.json and changing the eos_token_id value.

If you’re running this model in Livebook, create a new folder “llama3” in whatever directory you launched Livebook from, then create a file named generation_config.json inside it with the following contents. Download the config.json from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/raw/main/config.json and add it to your llama3 folder as well.

{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "transformers_version": "4.40.0.dev0"
}
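
If you prefer to do this from code, here’s a minimal sketch that writes the override from inside Livebook. It assumes Livebook was launched from the current directory; config.json still needs to be downloaded by hand since it sits behind the gated repo.

# Create the local override folder and write the single-eos generation config.
File.mkdir_p!("llama3")

File.write!("llama3/generation_config.json", """
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "transformers_version": "4.40.0.dev0"
}
""")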

Then change the load_generation_config line in the serving to this.

{:ok, generation_config} = Bumblebee.load_generation_config({:local, "./llama3"})

Now we have a new error.

** (ErlangError) Erlang error: "Could not decode field on position 1"
    (tokenizers 0.4.0) Tokenizers.Native.encoding_pad(#Tokenizers.Encoding<[length: 47, ids: [128000, 198, 128006, 9125, 27, 91, 408, 8932, 851, 1363, 2675, 527, 264, 11190, 18328, 16134, 91, 68, 354, 851, 397, 128006, 882, 27, 91, 408, 8932, 851, 1363, 3923, 656, 499, 1440, 922, 658, 953, 404, 76514, 91, 68, 354, 851, 397, 128006, 78191, 128007, 271]]>, 1028, [pad_id: nil, pad_token: "</s>", direction: :left])
    (elixir 1.15.2) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (bumblebee 0.5.3) lib/bumblebee/text/pre_trained_tokenizer.ex:287: Bumblebee.Text.PreTrainedTokenizer.apply/2

We can start our investigation in pre_trained_tokenizer.ex, looking for something to do with padding. At https://github.com/elixir-nx/bumblebee/blob/50e846c2cd07266035d990137d65850f604f5374/lib/bumblebee/text/pre_trained_tokenizer.ex#L174 we can see the special token map for the llama models. Llama 2 used </s> as its eos token, and we can see from the comments that it was copied over as the padding token. This is the culprit: the old eos token doesn’t exist in the new tokenizer’s vocabulary, so encoding fails when it tries to pad with it.
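
We can confirm this by inspecting the special_tokens key on the tokenizer we loaded earlier. The exact contents may differ, but the stale </s> pad token from the error above shows up:

IO.inspect(tokenizer.special_tokens)
# Expect something along the lines of %{pad: "</s>", eos: "</s>", unk: "<unk>"};
# the "</s>" pad token is what the encoding_pad call chokes on.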

Fortunately, the loaded tokenizer is just a struct, so we can override its special_tokens key with tokens that exist in the Llama 3 vocabulary. In our serving code, right after the load_tokenizer call, add this:

tokenizer =
  tokenizer
  |> Map.put(:special_tokens, %{
    pad: "<|eot_id|>",
    bos: "<|begin_of_text|>",
    eos: "<|eot_id|>",
    unk: "<unk>"
  })

Re-run that and the model works:

Elixir! It's a modern, dynamic, and functional programming language built on top of the Erlang VM (BEAM). Here are some key features and facts about Elixir:
[The many benefits cut for brevity.]

<|eot_id|> versus <|end_of_text|>

There’s a bit of discussion in the community about what Llama 3’s eos token actually is. It looks like the Python ecosystem has settled on using both <|eot_id|> and <|end_of_text|>. In my testing, using <|eot_id|> alone has been fine.
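
If you want both stop tokens, the upstream generation_config.json already lists them as the pair [128001, 128009] (that list is exactly what tripped the first error above), and the GitHub build of Bumblebee from the update at the top can load it as-is. If you’re keeping the local override, the equivalent list form would look like this, assuming you’re running that build:

{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "transformers_version": "4.40.0.dev0"
}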

GitHub

The complete Livebook is available at https://github.com/bowyern/llama3-bumblebee.