Running Meta Llama 2 on Windows with llama_cpp_python

This is a minimal getting-started example for Meta Llama 2. It uses llama-cpp to speed up inference and runs on the CPU under Windows.

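1. Download the model from Hugging Face. This example uses the smaller 7B (7 billion parameter) model in GGUF format: llama-2-7b-chat.Q4_0.gguf
(The older llama-cpp-python 0.1.78 cannot read GGUF and needs the GGML file instead: llama-2-7b-chat.ggmlv3.q4_0.bin)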
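
Because the two formats are easy to mix up, it can help to check which one a downloaded file really is. The sketch below relies on the fact that a GGUF file begins with the four ASCII bytes "GGUF"; the helper name is_gguf is made up for this example, and the path is the file name used in this post.

```python
from pathlib import Path

# Every GGUF file starts with these four bytes.
GGUF_MAGIC = b"GGUF"

def is_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

if __name__ == "__main__":
    # File name taken from step 1; adjust it to wherever the model was saved.
    model_file = Path("./llama-2-7b-chat.Q4_0.gguf")
    if model_file.exists():
        print(model_file.name, "is GGUF:", is_gguf(model_file))
    else:
        print("Model file not found:", model_file)
```

A file that fails this check is not necessarily a valid GGML file; the check only rules GGUF in or out.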

2. Install Visual Studio (with the C++ build tools, which pip typically needs in order to compile llama-cpp-python)

3. Install llama_cpp_python

pip install llama-cpp-python
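
After installing, the installed version determines which model format will load. As far as the project changelog indicates, 0.1.79 was the first release to switch from GGML to GGUF; that threshold is an assumption of this sketch, and the helper names parse_version and expected_model_format are invented for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption: 0.1.79 is the first llama-cpp-python release that expects GGUF.
FIRST_GGUF_RELEASE = (0, 1, 79)

def parse_version(text):
    """Turn '0.2.53' into (0, 2, 53), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def expected_model_format(installed):
    """Return 'GGUF' or 'GGML' depending on the installed version string."""
    return "GGUF" if parse_version(installed) >= FIRST_GGUF_RELEASE else "GGML"

if __name__ == "__main__":
    try:
        installed = version("llama-cpp-python")
        print(f"llama-cpp-python {installed} expects {expected_model_format(installed)} models")
    except PackageNotFoundError:
        print("llama-cpp-python is not installed")
```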

4. Run the Python code

from llama_cpp import Llama

def generate_text(model, message):
    # Wrap the user message in the Llama 2 chat instruction tags.
    prompt = f"[INST] {message.strip()} [/INST]"
    output = model(prompt)
    # The completion text is in the first entry of "choices".
    return output["choices"][0]["text"]

if __name__ == "__main__":
    # llama-cpp-python 0.1.79 and later (e.g. 0.2.53) load GGUF models.
    model = Llama(model_path="./llama-2-7b-chat.Q4_0.gguf")
    # For the older GGML file: pip install llama-cpp-python==0.1.78
    # model = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin")
    while True:
        print("\n------------------------------------------------\nPrompt: ", end="")
        user_input = input()  # named user_input to avoid shadowing the built-in str
        print("Processing...")
        print("\nAnswer: " + generate_text(model, user_input))
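
The prompt above uses only the [INST] tags. Llama 2 chat models were also trained with an optional system prompt wrapped in <<SYS>> tags, so a slightly fuller prompt builder might look like the sketch below. The function name build_prompt and the sample strings are made up for this example, and the leading <s> token is left out on the assumption that llama-cpp adds the beginning-of-sequence token itself.

```python
def build_prompt(message, system_prompt=None):
    """Format a single-turn Llama 2 chat prompt, optionally with a system prompt."""
    message = message.strip()
    if system_prompt:
        # The system prompt sits inside the first [INST] block, wrapped in <<SYS>> tags.
        return f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{message} [/INST]"
    return f"[INST] {message} [/INST]"

if __name__ == "__main__":
    # Hypothetical inputs, only to show the resulting prompt text.
    print(build_prompt("What is the capital of France?", "Answer in one sentence."))
```

The returned string can be passed to the model in place of the f-string in generate_text. If answers come back cut off, passing a larger max_tokens value in the call, for example model(prompt, max_tokens=256), may help, since the default completion length is fairly short.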

5. Output