Transformers_Large_Language_Models.ipynb

Large Language Models with the Transformers library

This notebook shows a few examples of openly available Large Language Models (LLMs) that can be run with the Transformers library from HuggingFace, see https://huggingface.co/docs/transformers/index
Only a small selection of models that can fit in the GPU memory of a T4 is presented. Larger models (with more parameters) need GPUs with larger amounts of memory (40 GB or more).
Note: Downloading the models from HuggingFace and loading the weights into the GPU can take several minutes.
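Before loading any of the models, it can be useful to check which GPU is available and how much memory it has. A minimal sketch using PyTorch (not part of the original examples):

In [ ]:
import torch

# Check for a GPU and report its total memory
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU available, models will run on the CPU (slow)")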

Dolly, an LLM

Dolly is an open-source LLM available for experimentation, see https://huggingface.co/databricks/dolly-v2-12b

In [1]:
import torch
from transformers import pipeline

# The dolly-v2-12b model is large and requires more than 16 GB of GPU memory; the dolly-v2-3b model is smaller
# generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto") 

generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16, trust_remote_code=True, device=0) # use GPU
In [2]:
# Query the LLM

res = generate_text("What is a particle accelerator?")
print(res[0]["generated_text"])
A particle accelerator is a machine that speeds up particles (such as electrons, photons or neutrons) to yield higher-energy particles. Particle accelerators can be divided into three general categories: synchrotron, linear and cyclotron.
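The pipeline call accepts the usual Transformers generation options. A hedged sketch; the parameter names below are standard generate() options, and it is an assumption that the custom Dolly pipeline forwards all of them unchanged:

In [ ]:
# Standard generation options, assumed to be forwarded to model.generate()
res = generate_text(
    "What is a particle accelerator?",
    max_new_tokens=200,  # cap the length of the answer
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # lower values give more deterministic output
    top_p=0.9,           # nucleus sampling
)
print(res[0]["generated_text"])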

Testing the Falcon 7B model

Falcon is an LLM, see https://huggingface.co/blog/falcon

In [2]:
# Install einops and accelerate if not yet done

# !pip install einops
# !pip install accelerate
Defaulting to user installation because normal site-packages is not writeable
Collecting einops
  Using cached einops-0.6.1-py3-none-any.whl (42 kB)
Installing collected packages: einops
Successfully installed einops-0.6.1
In [ ]:
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
In [2]:
sequences = pipeline(
   "What is a particle accelerator?",
    max_length=1000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
/cvmfs/sft-nightlies.cern.ch/lcg/views/dev4cuda/Mon/x86_64-centos7-gcc11-opt/lib/python3.9/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Result: What is a particle accelerator?
A particle accelerator is a device used to accelerate and collide particles, typically at high speeds. Particles are typically energized with high-energy beams of radiation and can be accelerated into high speeds, which can then be useful in a number of fields, such as medical imaging and physics research.
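Since each of these models occupies several GB of GPU memory, it can help to free one model before loading the next. A minimal sketch (assuming the Falcon pipeline created above is no longer needed):

In [ ]:
import gc

# Drop the reference to the previous pipeline and release cached CUDA memory
del pipeline
gc.collect()
torch.cuda.empty_cache()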

OpenLLaMA

OpenLLaMA is an open reproduction of the LLaMA LLM, see https://huggingface.co/openlm-research/open_llama_3b

In [1]:
from transformers import AutoTokenizer
import transformers
import torch
In [2]:
# This is an example using Transformers with OpenLLaMA
model = "openlm-research/open_llama_3b"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
Downloading tokenizer and model files (pytorch_model.bin: 6.85 GB)…
In [7]:
sequences = pipeline(
   "Question: What is a particle accelerator? \nAnswer:",
    max_length=60,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Result: Question: What is a particle accelerator? 
Answer: It's a machine that speeds things up. 
Question: Where do particle accelerators work? 
Answer: They do work at CERN in Geneva Switzerland.
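For finer control, the pipeline helper can be skipped and model.generate() called directly. A minimal sketch; using the generic AutoModelForCausalLM class here is an assumption (the OpenLLaMA model card also documents LLaMA-specific classes):

In [ ]:
from transformers import AutoModelForCausalLM

# Load the model weights directly (model is the checkpoint name defined above)
llm = AutoModelForCausalLM.from_pretrained(
    model, torch_dtype=torch.bfloat16, device_map="auto"
)

# Tokenize the prompt, move it to the model's device and generate
inputs = tokenizer("Question: What is a particle accelerator? \nAnswer:",
                   return_tensors="pt").to(llm.device)
output = llm.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))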
