Instructions to use elyza/Llama-3-ELYZA-JP-8B-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use elyza/Llama-3-ELYZA-JP-8B-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="elyza/Llama-3-ELYZA-JP-8B-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("elyza/Llama-3-ELYZA-JP-8B-AWQ")
model = AutoModelForMultimodalLM.from_pretrained("elyza/Llama-3-ELYZA-JP-8B-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use elyza/Llama-3-ELYZA-JP-8B-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "elyza/Llama-3-ELYZA-JP-8B-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elyza/Llama-3-ELYZA-JP-8B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/elyza/Llama-3-ELYZA-JP-8B-AWQ

SGLang

How to use elyza/Llama-3-ELYZA-JP-8B-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "elyza/Llama-3-ELYZA-JP-8B-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elyza/Llama-3-ELYZA-JP-8B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "elyza/Llama-3-ELYZA-JP-8B-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "elyza/Llama-3-ELYZA-JP-8B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use elyza/Llama-3-ELYZA-JP-8B-AWQ with Docker Model Runner:
```
docker model run hf.co/elyza/Llama-3-ELYZA-JP-8B-AWQ
```

Llama-3-ELYZA-JP-8B-AWQ

Model Description

Llama-3-ELYZA-JP-8B is a large language model trained by ELYZA, Inc. Based on meta-llama/Meta-Llama-3-8B-Instruct, it has been enhanced for Japanese usage through additional pre-training and instruction tuning. (Built with Meta Llama3)

For more details, please refer to our blog post.

Quantization

We have prepared two quantized model options, GGUF and AWQ. This is the AutoAWQ model.

The following table shows the performance degradation due to quantization:

Model	ELYZA-tasks-100 GPT4 score
Llama-3-ELYZA-JP-8B	3.655
Llama-3-ELYZA-JP-8B-GGUF (Q4_K_M)	3.57
Llama-3-ELYZA-JP-8B-AWQ	3.39

Use with vLLM

Install vLLM:

pip install vllm

vLLM Offline Batched Inference

from vllm import LLM, SamplingParams

llm = LLM(model="elyza/Llama-3-ELYZA-JP-8B-AWQ", quantization="awq")
tokenizer = llm.get_tokenizer()

DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=1000)
messages_batch = [
    [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？"}
    ],
    [
        {"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
        {"role": "user", "content": "クマが海辺に行ってアザラシと友達になり、最終的には家に帰るというプロットの短編小説を書いてください。"}
    ]
]

prompts = [
    tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    for messages in messages_batch
]

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    print(output.outputs[0].text)
    print("=" * 50)

vLLM OpenAI Compatible Server

Start the API server:

python -m vllm.entrypoints.openai.api_server \
--model elyza/Llama-3-ELYZA-JP-8B-AWQ \
--port 8000 \
--host localhost \
--quantization awq

Call the API using curl:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "elyza/Llama-3-ELYZA-JP-8B-AWQ",
  "messages": [
    { "role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。" },
    { "role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？" }
  ],
  "temperature": 0.6,
  "max_tokens": 1000,
  "stream": false
}'

Call the API using Python:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key = "dummy_api_key"
)

completion = client.chat.completions.create(
    model="elyza/Llama-3-ELYZA-JP-8B-AWQ",
    messages=[
        {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。"},
        {"role": "user", "content": "古代ギリシャを学ぶ上で知っておくべきポイントは？"}
    ]
)

Developers

Listed in alphabetical order.

License

Meta Llama 3 Community License

How to Cite

@misc{elyzallama2024,
      title={elyza/Llama-3-ELYZA-JP-8B},
      url={/elyza/Llama-3-ELYZA-JP-8B},
      author={Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura and Daisuke Oba and Sam Passaglia and Akira Sasaki},
      year={2024},
}

Citations

@article{llama3modelcard,
    title={Llama 3 Model Card},
    author={AI@Meta},
    year={2024},
    url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}

Downloads last month: 206

Safetensors

Model size

8B params

Tensor type

I32

F16

Collection including elyza/Llama-3-ELYZA-JP-8B-AWQ

Llama-3-ELYZA-JP

Collection

Llama-3 models augmented for Japanese usage • 7 items • Updated Dec 13, 2024 • 11