This article was originally published on VentureBeat.
Generative AI promises to greatly enhance human productivity, but only a handful of enterprises possess the skills and resources to develop and train foundation models from scratch. The challenges are twofold. First, collecting the data to train the models was already difficult and has become even more so as content owners assert their intellectual property rights. Second, the compute resources needed for training can be prohibitively expensive. Yet the societal value of broadening access to generative AI technologies remains high.
So, how can small enterprises or individual developers incorporate generative AI into their applications? By creating and deploying custom versions of the larger foundation models.
The large investment and effort required to develop new generative AI models mean that they must be general enough to address a wide range of uses—consider all the ways in which GPT*-based models have been used already. However, a general-purpose model often can't address the domain-specific needs of individual and enterprise use cases. Using a large general-purpose model for a narrow application also consumes excess computing resources, time, and energy.
Therefore, most enterprises and developers can find the best fit for their requirements and budget by starting with a large generative AI model as a foundation and adapting it to their own needs at a fraction of the development effort. This approach also provides infrastructure flexibility: existing CPUs or AI accelerators can be used instead of being limited by shortages of specific GPUs. The key is to focus on the specific use case and narrow the scope while maximizing project flexibility by using open, standards-based software and ubiquitous hardware.
Take the Use Case Approach for AI Application Development
In software development, a use case defines the characteristics of the target user, the problem to be solved, and how the application will be used to solve it. This defines product requirements, dictates the software architecture, and provides a roadmap for the product lifecycle. Most crucially, this scopes the project and defines what does not need to be included.
Similarly, in the case of a generative AI project, defining a use case can reduce the size, compute requirements, and energy consumption of the AI model. At the same time, it can improve model accuracy by focusing on a specific dataset. Along with these benefits come reduced development effort and costs.
The factors that define a use case for generative AI will vary by project, but some common helpful questions can guide the process:
- Data requirements: What, and how much, training data is necessary and available? Is the data structured (data warehouse) or unstructured (data lake)? What regulations or restrictions apply to it? How will the application process the data—in batches or as a stream? How often do you need to maintain or update the model? Training large language models (LLMs) from scratch takes so long that they lack awareness of recent knowledge—so if being up to date matters to your application, you will need a different approach. Or, if you are developing a healthcare application, privacy and security restrictions on patient data typically dictate unique approaches to training and inference.
- Model requirements: Model size, model performance, openness, and explainability of results are all important considerations when choosing the right model. Performant LLMs range in size from billions to trillions of parameters—Llama 2 from Meta* offers versions ranging from 7 billion to 70 billion parameters, while GPT-4* from OpenAI reportedly has 1.76 trillion parameters. While larger model sizes are typically associated with higher performance, smaller models may better fit your overall requirements. Open models offer more choices for customization, whereas closed models work well off the shelf but are limited to API access. Control over customization allows you to ground the model in your data with traceable results, which would be important in an application such as generating summaries of financial statements for investors. On the other hand, allowing an off-the-shelf model to extrapolate beyond its trained parameters ("hallucinate") may be perfectly fine for generating ideas for advertising copy.
- Application requirements: What are the accuracy, latency, privacy, and safety standards that must be met? How many simultaneous users does it need to handle? How will users interact with it? For example, your implementation decisions will depend on whether your model should run on a low-latency edge device owned by the end-user or in a high-capacity cloud environment where each inference call costs you money.
- Compute requirements: Once the previous requirements are understood, what compute resources are required to meet them? Do you need to parallelize your pandas data processing using Modin* (see the sketch after this list)? Do your fine-tuning and inference requirements differ enough to require a hybrid cloud-edge compute environment? While you may have the talent and data to train a generative AI model from scratch, consider whether you have the budget to overhaul your compute infrastructure.
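For instance, parallelizing existing pandas preprocessing with Modin is largely a drop-in change. The minimal sketch below assumes Modin is installed with a distributed engine such as Ray; the file and column names are hypothetical placeholders.

```python
# A minimal sketch of swapping pandas for Modin to parallelize preprocessing.
# Assumes Modin is installed with the Ray engine (pip install "modin[ray]");
# the file name and column names are hypothetical placeholders.
import modin.pandas as pd  # drop-in replacement for "import pandas as pd"

# Reads and aggregations are distributed across available CPU cores.
df = pd.read_csv("training_corpus.csv")
df = df.dropna(subset=["text"])
df["text_length"] = df["text"].str.len()
summary = df.groupby("label")["text_length"].mean()
print(summary)
```

Because Modin mirrors the pandas API, the rest of the preprocessing code typically does not need to change.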
The previous factors will help drive conversations to define and scope the project requirements. Economics also factor in: the budget for data engineering, the up-front development costs, and the inference costs that the ultimate business model can support all dictate the data, training, and deployment strategies.
How Generative AI Technologies from Intel Can Help
Intel provides heterogeneous AI hardware options for a wide variety of compute requirements. To get the most out of your hardware, Intel provides optimized versions of the data analysis and end-to-end AI tools most teams use today. More recently, Intel has begun providing optimized models, including the number-one ranked 7B parameter model on the Hugging Face* open LLM leaderboard (as of November 2023). These tools and models, together with those provided by its AI developer ecosystem, can satisfy your application's accuracy, latency, and security considerations. First, you can start with the hundreds of pretrained models on Hugging Face or GitHub* that are optimized for Intel® hardware. Next, you can preprocess your data using Intel-optimized tools such as Modin, fine-tune foundation models using application-specific optimization tools such as Intel® Extension for Transformers* or Hugging Face* Optimum, and automate model tuning with SigOpt®. All of this builds on the optimizations that Intel has already contributed to open source AI frameworks, including TensorFlow*, PyTorch*, and DeepSpeed*.
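Getting started with a pretrained foundation model from Hugging Face takes only a few lines of the standard Transformers API. In this minimal sketch, the model ID is an assumption (a plausible choice given the Intel 7B model mentioned above; verify the exact ID on the Hub), and any causal language model can be substituted.

```python
# A minimal sketch of loading a pretrained chat model from Hugging Face.
# The model ID is an assumption (verify on the Hub); any causal LM loads the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v3-1"  # hypothetical choice of an Intel-optimized 7B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize the key requirements for a customer-service chatbot."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```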
Let's illustrate with some generative AI use case examples for customer service, retail, and healthcare applications.
Generative AI for Customer Service: Chatbot Use Case
Chatbots based on LLMs can improve customer service efficiency by providing instant answers to common questions, freeing customer service representatives to focus on more complex cases.
Foundation models are already trained to converse in multiple languages on a broad range of topics but lack depth on the offerings of a given business. A general-purpose LLM may also hallucinate, confidently generating output even in the absence of trained knowledge.
Fine-tuning and retrieval are two of the more popular methods to customize a foundation model. Fine-tuning incrementally updates a foundation model with custom information. Retrieval-based methods, such as retrieval-augmented generation (RAG), fetch information from a database external to the model. This database is built using the offering-specific data and documents, vectorized for use by the AI model. Both methods deliver offering-specific results and can be updated using only CPUs (such as Intel® Xeon® Scalable processors), which are ubiquitous and more readily available than specific accelerators.
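To make the retrieval idea concrete, here is a minimal RAG sketch: documents are vectorized, the passage most similar to the user's question is retrieved, and that passage grounds the prompt sent to the generative model. It assumes the sentence-transformers library for embeddings; the documents and query are placeholders, and a production system would use a proper vector database.

```python
# A minimal retrieval-augmented generation (RAG) sketch: embed documents,
# retrieve the most similar one for a query, and prepend it to the LLM prompt.
# Assumes sentence-transformers is installed; documents and query are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our standard warranty covers parts and labor for 24 months.",
    "Returns are accepted within 30 days with a receipt.",
    "Premium support subscribers get a 4-hour response time.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "How long is the warranty?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = doc_vectors @ query_vector
best_doc = documents[int(np.argmax(scores))]

# The retrieved passage grounds the model's answer and is traceable to the knowledge base.
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {query}"
print(prompt)  # pass this prompt to a chat model such as the one loaded earlier
```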
The use case helps determine which method best fits the application’s requirements. Fine-tuning offers latency advantages since the knowledge is built into the generative AI model. Retrieval offers traceability from its answers directly to actual sources in the knowledge base, and updating this knowledge base does not require incremental training.
It’s also important to consider the compute requirements and costs for ongoing inference operations. The transformer architecture that powers most chatbots is usually limited more by memory bandwidth than raw compute power. Model optimization techniques such as quantization can reduce the memory bandwidth requirements, which reduces latency and inference compute costs.
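As one illustration of quantization, PyTorch's post-training dynamic quantization stores the weights of linear layers in int8, roughly quartering their memory footprint versus fp32. This is a minimal sketch, assuming `model` is a PyTorch model such as the one loaded earlier; other tools (for example, Intel's quantization utilities) offer more advanced schemes.

```python
# A minimal sketch of post-training dynamic quantization in PyTorch: Linear-layer
# weights are stored in int8, cutting memory traffic relative to fp32.
# Assumes "model" is a PyTorch model, e.g. the chat model loaded earlier.
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,              # the fp32 model to quantize
    {torch.nn.Linear},  # quantize only Linear layers (the bulk of LLM weights)
    dtype=torch.qint8,  # 8-bit integer weights
)
# Inference proceeds exactly as before, but with lower memory bandwidth requirements.
```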
There are plenty of foundation models to choose from. Many come in different parameter sizes. Starting with a clearly defined use case helps choose the right starting point and dictates how to customize it from there.
Customize a chatbot foundation model with RAG.
Generative AI for Retail: Virtual Try-on Use Case
Retailers can use generative AI to offer their customers a better, more immersive online experience. An example is the ability to try on clothes virtually so customers can see how items look and fit before buying. This improves customer satisfaction and retail supply chain efficiency by reducing returns and forecasting customers' wants more accurately.
This use case is based on image generation, but the foundation model must be focused on generating images of the retailer's clothing line. Fine-tuning image-based foundation models such as Stable Diffusion* may require only a small number of images and can run on CPU platforms. Techniques such as Low-Rank Adaptation (LoRA) can more surgically insert the retailer's offerings into the Stable Diffusion model.
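The core idea behind LoRA is to freeze the pretrained weights and learn only a small low-rank correction. The sketch below is purely conceptual, not production fine-tuning code: libraries such as Hugging Face PEFT apply this idea to the attention layers of models like Stable Diffusion for you.

```python
# A conceptual LoRA sketch: a frozen linear layer plus a trainable low-rank
# update (B @ A), so only a small number of parameters are learned during
# fine-tuning. This illustrates the technique; libraries such as Hugging Face
# PEFT apply it to real diffusion and language models.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output of the frozen layer plus the learned low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params only
```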
The other key input to this use case is the imagery or scan of the customer’s body. The use case implications start with how to preserve the customer's privacy. The images must stay on the local edge device, perhaps the customer's phone or a locally installed image capture device.
Does this mean the entire generative AI pipeline must run on the edge, or can this application be architected in a way that encodes the necessary information from the images to upload to the rest of the model running in a data center or cloud? This type of architecture decision is the domain of MLOps professionals, who are vital to the successful development of generative AI applications.
Now, given that some amount of AI inference needs to run efficiently on a variety of edge devices, it becomes vital to choose a framework that can optimize for deployment without rewriting code for each type of device.
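As one example of such a framework (an assumption for illustration, since any device-portable runtime could fill this role), OpenVINO lets the same application code target different hardware by changing only a device string. The sketch assumes a model already converted to OpenVINO IR format; "model.xml" is a placeholder.

```python
# A minimal sketch of device-portable inference, assuming OpenVINO as the runtime
# and a model already converted to OpenVINO IR ("model.xml" is a placeholder).
# Only the device string changes between targets; the application code does not.
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")

# "AUTO" lets the runtime pick the best available device; "CPU" or "GPU" can be
# specified explicitly when the deployment target is known.
compiled = core.compile_model(model, device_name="AUTO")
# result = compiled(input_tensor)  # the same inference call runs on any target device
```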
See a generative AI virtual try-on application in action.
Generative AI for Healthcare: Patient Monitoring Use Case
Pairing generative AI with real-time patient monitoring data can generate personalized reports, action plans, or interventions. Synthesizing data, imagery, and case notes into a summary or a recommendation can improve healthcare provider productivity while reducing the need for patients to travel to or stay in healthcare facilities.
This use case requires multimodal AI, which combines different types of models to process the heterogeneous input data, likely combined with an LLM to generate reports. Because this is a more complex use case, starting with a multimodal reference implementation for a similar use case may accelerate a project.
Training healthcare models typically raises patient data privacy questions. Often, patient data must remain with the provider, so collecting data from multiple providers to train or fine-tune a model becomes impossible. Federated learning addresses this by sending the model to the data locations for training locally and then combining the results from the various locally trained models.
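The combining step at the heart of federated learning is often just parameter averaging (FedAvg). The conceptual sketch below assumes plain PyTorch models; in practice a federated learning framework handles the orchestration, secure transport, and aggregation for you.

```python
# A conceptual federated-averaging (FedAvg) sketch: each provider trains a copy
# of the model on its own patient data, and only the model weights (never the
# data) are sent back and averaged. A real deployment would use a federated
# learning framework; this illustrates only the combining step.
import copy
import torch

def federated_average(global_model, local_models):
    """Average the parameters of locally trained models into the global model."""
    avg_state = copy.deepcopy(global_model.state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in local_models]
        ).mean(dim=0)
    global_model.load_state_dict(avg_state)
    return global_model
```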
Inference also needs to maintain patient privacy. The most straightforward approach would be to run inference locally to the patient. Given the size and complexity of a multimodal generative AI system, running entirely on edge devices may be challenging. It may be possible to architect the system to combine edge and data center processing, but model optimization techniques will likely still be required for the models running on edge devices.
Developing a hybrid MLOps architecture like this is much more efficient if the AI tools and frameworks run optimally on a variety of devices without having to rewrite low-level code to optimize for each type of device.
For a real-world example, AI Innovation Bridge hackathon winners and AI startup Ocuvera have shared how they use Intel hardware to assist nurses with AI in healthcare (Intel Software, November 2023).
Learn about the architecture behind a patient monitoring system.
How to Get Started
Start by doing your best to define your use case, using the previous questions as guidance to determine the data, compute, model, and application requirements for the problem you are trying to solve with generative AI.
Then, determine what relevant foundation models, reference implementations, and resources are available in the AI ecosystem. From there, identify and implement the fine-tuning and model optimization techniques most relevant to your use case.
Compute needs will likely not be apparent at the beginning of the project and typically evolve throughout it. Intel® Tiber™ AI Cloud offers access to a variety of CPUs, GPUs, and AI accelerators for trying out hardware or getting started with development.
Finally, to efficiently adapt to different compute platforms during development and then to deployment, use AI tools and frameworks that are open, standards-based, and run optimally on any of the above devices without having to rewrite low-level code for each type of device.