Llama Cpp on Python
I have used a plethora of UIs to access LLM models, but most of them are badly designed or frustrating to work with. And as the saying goes: if you don't like it, do it yourself.
Now this is what I preach: do it yourself. Being dependent on others, their whims and their changes, is always bad. It's not like you don't have enough of your own.
So, being the CLI surfer I am, I took to the waves and soon wrote some Python code (while learning Python) that lets me access LLMs at the lowest level.
The great thing about this is not only the learning factor, but that you can do whatever you want with the output and the presentation thereof.
This article, which will probably see many updates, starts with the basic script I use. You need to change PATH_TO_YOUR_MODELS and probably the model you want to use. Depending on your GPU you should also adjust n_gpu_layers and/or n_ctx: n_gpu_layers controls how many model layers are offloaded to the GPU and n_ctx sets the context window, so together they largely determine how much VRAM is used.
TODO: Setting up the environment.
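In the meantime, here is a minimal setup sketch. It assumes Linux, a recent Python and an NVIDIA GPU; the CMAKE_ARGS flag follows the llama-cpp-python install instructions for CUDA builds and may differ for your version or backend.
python -m venv llm-env
source llm-env/bin/activate
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
pip install prompt_toolkit termcolor pillow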
#
# LLama CPP Python
#
# Local model call v.1.0.9 by Roelf Renkema
#
# Updates:
# - Added Timestamps
# - Added Colors
# - set max tokens to 512 to save time with large scripts
from llama_cpp import Llama
from PIL import Image  # not used in this script yet
from prompt_toolkit import PromptSession
from termcolor import colored
import datetime
llm = Llama(
    model_path="PATH_TO_YOUR_MODELS/gemma-2-27b-it-Q4_K_L.gguf",
    n_gpu_layers=25,      # number of layers offloaded to the GPU; lower this if you run out of VRAM
    n_ctx=4096,           # context window size; also affects VRAM usage
    # Note: several of the options below mirror text-generation-webui loader settings
    # and may simply be ignored by llama-cpp-python itself.
    cache_8bit=False,
    cache_4bit=False,
    threads=1,
    threads_batch=1,
    n_batch=512,
    no_mmap=False,
    mlock=True,
    no_mul_mat_q=False,
    tensor_split='',
    compress_pos_emb=1,
    rope_freq_base=5000000,
    numa=True,
    no_offload_kqv=False,
    row_split=False,
    tensorcores=True,
    flash_attn=True,
    streaming_llm=True,
    attention_sink_size=5,
    chat_format="chatml",
)
def get_input():
    session = PromptSession()
    print(colored("Enter your prompt (press LEFT-ALT+ENTER to finish):", 'white', attrs=["bold"]))
    multiline_input = session.prompt("> ", multiline=True)
    print(colored("Processing. Please wait a moment!", 'light_green', attrs=["bold"]))
    return multiline_input
def get_answer(prompt):
    output = llm(
        prompt,          # Prompt
        max_tokens=512,  # Generate up to 512 tokens, set to None to generate up to the end of the context window
        echo=False       # Do not echo the prompt back in the output
    )
    # Output the answer
    multiline_answer = output['choices'][0]['text']
    print(colored(multiline_answer, 'light_cyan'))
    return multiline_answer
def time_stamp():
    current_time = datetime.datetime.now()
    print(colored(current_time.strftime("%H:%M:%S"), 'black', 'on_yellow'))
if __name__ == "__main__":
    while True:
        multiline_input = get_input()
        if multiline_input == 'exit':
            break
        time_stamp()
        get_answer(multiline_input)
        time_stamp()
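One thing to be aware of: the constructor sets chat_format="chatml", but get_answer() calls llm(prompt) directly, which is a raw text completion and does not apply any chat template. If you want the model to see a properly formatted conversation instead, a minimal sketch using create_chat_completion could look like this (the system prompt text is just a placeholder):
def get_chat_answer(prompt):
    # Let llama-cpp-python wrap the messages in the template selected by chat_format
    output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=512,
    )
    answer = output['choices'][0]['message']['content']
    print(colored(answer, 'light_cyan'))
    return answer
Recent versions of llama-cpp-python can also read the chat template straight from the GGUF metadata, in which case you can leave chat_format out entirely.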
I hope this script is helpful and can teach you some stuff. You can contact me on Discord @geennaam or on Hugging Face at https://huggingface.co/roelfrenkema if you have any questions.
With thanks to ⚡straico.com⚡ for supporting my work. You can support me too by using this affiliate link when subscribing to ⚡Straico⚡.