An Easy Guide to Automated Prompt Engineering on Intel GPUs


Ramya Ravi, AI Software Marketing Engineer, Intel | LinkedIn 
Gururaj Deshpande, Graduate Intern Technical, Intel | LinkedIn  
Chandan Damannagari, Director, AI Software, Intel | LinkedIn

Prompt engineering is a crucial technique for instructing Large Language Models (LLMs) so that they generate task-specific responses. It is cheaper and faster than both fine-tuning and retrieval-augmented generation (RAG) and requires less data, but it has traditionally been a manual task. For LLMs deployed on-device, which tend to be smaller (usually fewer than 14 billion parameters), task-optimized prompts matter even more, because these models cannot generalize to the same extent as larger LLMs.

In this article, we show how to use Declarative Self-improving Python (DSPy), an automatic prompt engineering framework, along with the Intel® oneAPI Base Toolkit to create a pipeline for a specific task and optimize its prompts on the Intel® Core™ Ultra processors found in Intel® AI PCs.

What is Automated Prompt Engineering Optimization?

Automatic prompt engineering is a technique that takes an LLM and iteratively generates better and better prompts for it. Any automatic prompt engineering framework requires the following:
 

  • An LLM that needs to be prompt-engineered

  • A dataset of inputs and outputs for the task at hand

  • A metric that measures how well the LLM is doing on the task

The framework then handles the prompt updates needed to make the LLM perform better on the given task.
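To make these three ingredients concrete, here is a minimal, framework-agnostic sketch in Python. The names used (ask_llm, examples, exact_match) are illustrative placeholders, not part of DSPy or any other specific framework:

    # 1. An LLM to be prompt-engineered (placeholder callable).
    def ask_llm(prompt: str) -> str:
        ...  # call your local or hosted model here

    # 2. A small dataset of task inputs and expected outputs.
    examples = [
        {"question": "Which planet is called the Red Planet? A. Venus, B. Mars", "answer": "B"},
        {"question": "Which gas do plants absorb? A. Oxygen, B. Carbon dioxide", "answer": "B"},
    ]

    # 3. A metric that scores the LLM's output against the expected answer.
    def exact_match(prediction: str, expected: str) -> bool:
        return prediction.strip().upper() == expected.strip().upper()

Given these three pieces, the optimizer's job is to search for prompt text that maximizes the metric over the dataset.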

Get Started

DSPy and llama.cpp

DSPy is an open source Python framework for programming LLMs and optimizing their prompts and weights. Its philosophy centers on using code, organized into signatures, modules, and optimizers, to build pipelines that can then be optimized. Compared to raw string prompts, DSPy brings structure and modularity to LLM prompting, making changes easy while remaining robust.
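As a small illustration of that structure (assuming DSPy is installed and a language model has been configured, as shown later in the code sample), a signature plus a module can replace a hand-written prompt string:

    import dspy

    # A signature declares what goes into the LLM and what should come out.
    class Summarize(dspy.Signature):
        """Summarize the given text in one sentence."""
        text = dspy.InputField()
        summary = dspy.OutputField()

    # A module turns the signature into a callable prompting strategy.
    summarizer = dspy.Predict(Summarize)
    # result = summarizer(text="...")  # requires dspy.settings.configure(lm=...) first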

Llama.cpp is an inference engine that accelerates LLMs on local and edge devices. It combines state-of-the-art LLM inference techniques with native hardware acceleration, and it supports the SYCL backend, which means llama.cpp can run on Intel GPUs (integrated graphics, discrete graphics, or data center GPUs).

To install llama.cpp with SYCL support for use from Python, pass -DGGML_SYCL=ON through the CMAKE_ARGS environment variable when installing llama-cpp-python, as shown below.
 

  • Linux 
    CMAKE_ARGS="-DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON" pip install llama-cpp-python 

  • Windows 
    set CMAKE_GENERATOR=Ninja  
    set CMAKE_C_COMPILER=cl  
    set CMAKE_CXX_COMPILER=icx  
    set CXX=icx  
    set CC=cl  
    set CMAKE_ARGS="-DGGML_SYCL=ON -DCMAKE_CXX_COMPILER=icx -DCMAKE_C_COMPILER=cl -DGGML_SYCL_F16=ON"
    pip install llama-cpp-python
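
After installation, a quick optional check that the SYCL-enabled build works is to load a local GGUF model with layers offloaded to the GPU; the model path below is a placeholder for any model file you have on disk:

    from llama_cpp import Llama

    llm = Llama(
        model_path="path/to/model.gguf",  # placeholder: point this at any local GGUF model
        n_gpu_layers=-1,                  # offload all layers to the GPU if possible
        verbose=True,                     # the startup log lists the SYCL devices found
    )
    print(llm("Q: What is 2 + 2? A:", max_tokens=8)["choices"][0]["text"])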

Intel® oneAPI Base Toolkit

The Intel oneAPI Base Toolkit includes tools for profiling, design assistance, and debugging, along with domain-specific libraries for developing high-performance, data-centric applications across architectures such as Intel CPUs, GPUs, and FPGAs. It also lets you easily migrate CUDA* code to open, standards-based, multiarchitecture C++ with SYCL.

Download the Intel oneAPI Base Toolkit and check out the tools it provides.

AI PCs

AI PCs are the new generation of personal computers, providing power-efficient AI acceleration through an included central processing unit (CPU), graphics processing unit (GPU), and neural processing unit (NPU) so that they can handle a diverse range of AI workloads. AI PCs powered by Intel® Core™ Ultra processors balance power and performance for fast and efficient AI experiences. The NPU is specialized hardware designed for AI and lets the AI PC perform a variety of AI tasks efficiently while delivering enhanced privacy and security.

Code Sample

This code sample is available in the AI PC Notebooks GitHub Repository. First, the dataset is loaded, and then the DSPy framework is configured to optimize prompts for the LLM pipeline.

The following steps are implemented in the code sample. Make sure the Intel oneAPI Base Toolkit is installed before running the code sample.
 

  1. Load the Dataset: The dataset we will be using is the ARC dataset, which contains grade-level science questions paired with multiple-choice answers. The task for the LLM is to predict the correct multiple-choice answer for each question. In many cases, you may not have a dataset ready for your task; in that case, you would need to create a few examples yourself. DSPy can work with just a few examples and then optimize the prompts for the task.

    dataset = load_dataset("INK-USC/riddle_sense", split="validation")
  2. Create Question Signature: DSPy uses signatures to define the input and output for the LLM, represented as a Python class. Inside this class we define the input, which is the science question, and the output, which is the letter of the correct multiple-choice answer; Python typing lets us constrain the output to those letters. DSPy will use this signature to prompt the LLM and will also add prompts around this signature during optimization.

    class Question(dspy.Signature):
        """Answer science questions by selecting the correct answer from a list of choices. Respond with the letter of the correct answer."""  # noqa: E501

        science_question = dspy.InputField()
        answer: Literal["A", "B", "C", "D"] = dspy.OutputField()
  3. Process Dataset for DSPy: The next step is to convert the list of questions and answers into a format that DSPy can understand. DSPy takes in a list of dspy.Example objects that specify the science question and the correct answer, so we convert each question and its answer choices into this format.

    # Create dataset
    dspy_dataset = []
    for row in dataset.itertuples():
        # Extract data from row
        question = row.question
        answer = row.answer
        labels = row.choices["label"]
        context = row.choices["text"]

        # Create science question input based on the question and answer choices
        answer_choices = ""
        for label, choice in zip(labels, context):
            answer_choices += f"{label}. {choice}, "
        answer_choices = answer_choices[:-2]  # Remove trailing comma
        science_question = f"{question}: {answer_choices}"

        # Create example
        example = dspy.Example(science_question=science_question, answer=answer).with_inputs("science_question")

        # Append example to dataset
        dspy_dataset.append(example)
  4. Load LLM using llama.cpp and configure DSPy to use LLM: After selecting the LLM, we load it using llama-cpp-python, a Python wrapper for llama.cpp. The from_pretrained function downloads the model and tokenizer from Hugging Face and loads the model onto the machine. Once the LLM is loaded, we configure DSPy to use it: DSPy offers the LlamaCpp wrapper, which takes the llm object, and DSPy then uses llama-cpp-python to prompt the model with the questions. The code sample builds llama-cpp-python with the SYCL backend using the Intel® oneAPI DPC++/C++ Compiler, which allows LLMs to run on Intel GPUs.

    llm = Llama.from_pretrained(
        repo_id=model_to_repo[model_dropdown.value],
        filename=model_dropdown.value,
        # This tells Llama.cpp to put 5 layers of the model on the GPU.
        # The rest of the model will run on the CPU.
        n_gpu_layers=5,
        seed=SEED,
        # Increase context window size to 4096 so that the model can see the entire riddle.
        # Having a large enough window size is important for the prompt optimization part.
        n_ctx=4096,
        verbose=False,
    )

    llamalm = dspy.LlamaCpp(model="llama", llama_model=llm, model_type="chat", seed=SEED)
    dspy.settings.configure(lm=llamalm)
  5. Set metric to evaluate LLM performance on task: The metric we will use to evaluate the LLM is answer_exact_match, which returns True if the LLM's answer matches the correct answer exactly and False otherwise. We will use this metric to evaluate the LLM's performance on the validation and test sets.

    metric = dspy.evaluate.metrics.answer_exact_match
  6. Define LLM pipeline: Next, we create a module, using the Module class from dspy, that ties together the input, the output, and the prompting strategy the LLM should use; this module is the pipeline that DSPy will optimize. Here, we use our Question signature to specify the input and output we want from the LLM, and we wrap it in dspy.ChainOfThought to tell DSPy to use the chain-of-thought prompt-engineering strategy.

    class QuestionAnsweringAI(dspy.Module):
        def __init__(self):
            super().__init__()
            self.signature = Question
            self.respond = dspy.ChainOfThought(self.signature)

        def forward(self, science_question):
            return self.respond(science_question=science_question)
  7. Configure LLM evaluation: Now that we have defined the inputs, outputs, and the LLM pipeline, we need a way to evaluate the LLM's performance with new prompts. We use the dspy.Evaluate class, which accepts a dataset and a metric, to run the evaluation.

    train_evaluate = dspy.Evaluate(
        devset=train, metric=metric, num_threads=1, display_progress=True, display_table=10
    )
    val_evaluate = dspy.Evaluate(
        devset=val, metric=metric, num_threads=1, display_progress=True, display_table=10
    )
    test_evaluate = dspy.Evaluate(
        devset=test, metric=metric, num_threads=1, display_progress=True, display_table=10
    )
  8. Configure and execute DSPy optimizer: DSPy offers a variety of optimizers to find the best prompts. We will use MIPROv2, a prompt-engineering optimizer whose hyperparameters also control how long the search for prompts takes; we use the light setting for these hyperparameters.

    optm = dspy.MIPROv2(metric=metric, auto="light", num_threads=1, seed=SEED)

    optimized_riddle_answerer = optm.compile(
        QuestionAnsweringAI(),
        trainset=train,
        valset=val,
        # The number of examples that is generated and included in the prompt
        max_bootstrapped_demos=2,
        # The number of examples from the training set that is included in the prompt
        max_labeled_demos=2,
        requires_permission_to_run=False,
    )
  9. Compare accuracy before and after optimization: Finally, we display the accuracy of the LLM before and after prompt engineering, as sketched below. On the test set, the LLM without optimization scored 35% accuracy, whereas the optimized LLM scored 78% accuracy.
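
A minimal sketch of how this comparison could be run, reusing the test_evaluate helper from step 7 and the variable names from the earlier steps (the notebook's exact reporting code may differ):

    # Score the unoptimized pipeline and the MIPROv2-optimized pipeline on the test set.
    baseline_score = test_evaluate(QuestionAnsweringAI())
    optimized_score = test_evaluate(optimized_riddle_answerer)
    print(f"Accuracy before optimization: {baseline_score}")
    print(f"Accuracy after optimization: {optimized_score}")

    # Optionally, inspect the last prompt DSPy sent to the LLM to see the
    # instructions and demonstrations that MIPROv2 added.
    dspy.inspect_history(n=1)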

Try out the above code sample for yourself and learn how to optimize prompts with DSPy on Intel GPUs.

What’s Next 

We hope this article and code sample introduce developers to LLM evaluation and prompt optimization while showing how to run LLMs performantly on Intel GPUs with the Intel® oneAPI Base Toolkit. If your LLM customization needs exceed the capabilities of automated prompt engineering, we encourage you to dive into our RAG and fine-tuning resources.

We encourage you to also check out and incorporate Intel’s other AI/ML Framework optimizations and tools into your AI workflow and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio to help you prepare, build, deploy, and scale your AI solutions. 

Additional Resources