Optimizing AI Model Serving with MinIO and PyTorch Serve

Sidharth Rajaram (@sidharrrrrth)

AI/ML, July 18, 2023

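Make your AI model serving more lightweight by taking advantage of the simplicity of MinIO object storage.

TL;DR

MinIO object storage can act as the "single source of truth" for your machine learning models and, in turn, make serving with PyTorch Serve more efficient when you manage changes to a large language model (LLM). As always, the example code is in our GitHub repository.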

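PyTorch Serve and the Model Archive Problem

PyTorch Serve has become a relatively easy-to-use inference server for managing and scaling the serving of ML models. To use it, a given model's files, dependencies, and "handling" instructions (more on these "handlers" later) must be packaged into a portable archive format called a model archive. In other words, everything your PyTorch Serve instance needs to handle inference API calls for a model is contained in its model archive, or .MAR file.

So what's wrong with this picture?

Model archive (MAR) files are large and slow to generate, especially for large language models (LLMs) or embedding models, whose files can run to gigabytes. This has several downstream consequences.

First, it lengthens the turnaround time for updating a model that is being served, whether in production or in an experimental setting: after you tweak the model architecture, you have to generate a new MAR file, then reconfigure and restart the server to load it. It also adds organizational burden, because you have to keep track of your model files as well as multiple model archives.

Second, decompressing a large MAR file is memory-intensive and leads to a long wait while the server's first worker initializes. (Note: "workers" are at the core of PyTorch Serve. They are essentially identical processes, each holding its own copy of the trained model in memory.) See step 2 in the first diagram.

Ideally, we want the MAR file to be somewhat decoupled from the model architecture so that it adapts to change more easily. We also want the MAR file to be lighter, which makes serving more efficient.

In this post, we will learn to create a MAR file that is not bundled with the model files. Instead, we will use MinIO to store the model files and retrieve them during PyTorch Serve initialization. We will see how this leads to a lighter MAR file and a shorter turnaround time after a model change. We will also cover the specific code changes needed to take advantage of this, along with an end-to-end example.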

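Storing Model Files on MinIO

A given model's files typically look like this:

(This particular model is a popular large language model on HuggingFace.)

Traditionally, you would download these files to a local directory on your machine and then point the PyTorch model archiver tool at them to generate a MAR file.

The optimization to PyTorch Serve usage begins with performing this step slightly differently. Instead of storing the model files locally and generating a MAR file from them, you can store all the model files as objects in a MinIO bucket. For example, in our own MinIO projects, we use a bucket named "models" and store each model's files in it.

Each folder in the bucket contains all of a model's files, identical to the ones on HuggingFace. At this point, a natural question arises: why not download the model files directly from HuggingFace? The HuggingFace Transformers API does have simple constructors that load model files straight from a HuggingFace repository, but that approach has drawbacks to consider. Network latency or HuggingFace downtime can make a model unavailable or extremely slow to download. In production, it is best practice to avoid third-party dependencies wherever possible. Running MinIO on-premises avoids these problems entirely.

MinIO is a purpose-built, on-premises storage solution. Using it as the object store for model files therefore brings several immediate benefits:

Unified storage for all models and their files: the "single source of truth."
Out-of-the-box versioning for every file, and therefore versioning for every model.
You can fetch these files quickly, reliably, and easily with the MinIO client.

For a more detailed account of MinIO's object management advantages, see Object Management for AI/ML.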

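Putting It All Together in a New Handler

So how does this actually simplify using TorchServe? How do we use the model files in the MinIO bucket? What about the MAR file? To answer these questions, we first need to look closely at how PyTorch Serve actually starts up and serves.

Typically, PyTorch Serve allocates a temporary directory (we'll call it TEMP) to hold all the files related to your model. Once allocated, TEMP is populated with the decompressed contents of the MAR file: the model files, dependencies, and handler file. When a worker initializes, it loads a copy of the model into memory from the files in TEMP and is then ready to handle inference requests. This "original" process suffers from the model archive problem discussed above.

What if, instead, our MAR file contained only the handler file and the model dependencies? Generating and decompressing the MAR file would take far less time. More importantly, updates to the model architecture would no longer require regenerating the MAR file.

To achieve this, we need to add a few lines of code to the model handler file. The handler file acts as the instruction guide for PyTorch Serve workers. A handler is responsible for two things: (1) initializing the model object and (2) handling inference calls.

When a worker initializes, it expects TEMP to contain all the model files. This is what lets PyTorch Serve scale up the number of workers fairly seamlessly. By slightly modifying the handler's initialize() function so that it fetches the model files from MinIO at startup, we can solve the problems with the original PyTorch Serve usage while still meeting that expectation.

Here is what the modification looks like in the handler code: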

class MyHandler(BaseHandler):
    ...

    def load_model_files_from_bucket(self, context):
        """
        Fetch model files from MinIO if not present in Model Store.
        """
        # get_minio_client() is the module-level helper defined in the full example later in this post
        client = get_minio_client()
        properties = context.system_properties
        object_name_prefix_len = len(CURRENT_MODEL_NAME) + 1
        # model_dir is the temporary directory (TEMP) allocated in the Model Store for this model
        model_dir = properties.get("model_dir")
        try:
            for item in client.list_objects(MODEL_BUCKET, prefix=CURRENT_MODEL_NAME, recursive=True):
                # We don't include the object name's prefix in the destination file path because we
                # don't want the enclosing folder to be added to TEMP.
                destination_file_path = model_dir + "/" + item.object_name[object_name_prefix_len:]
                # only fetch the model file if it is not already in TEMP
                if not os.path.exists(destination_file_path):
                    client.fget_object(MODEL_BUCKET, item.object_name, destination_file_path)
            return True
        except S3Error:
            return False

    def initialize(self, context):
        """
        Worker initialization method.
        Loads up a copy of the trained model.
        """
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        success = self.load_model_files_from_bucket(context)
        if not success:
            print("Something went wrong while attempting to fetch model files.")
            return
        ...
        self.model = ...  # model specific implementation
        self.initialized = True

    ...
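The path handling inside load_model_files_from_bucket() can be pulled out into a small helper to make the mapping from object name to TEMP path explicit. The following is only a sketch: destination_path() is a hypothetical helper, not part of the handler above or of any library. It also rejects objects that are not directly under the model's folder, on the assumption that listing by a bare prefix such as "falcon-7b" could also match a sibling folder like "falcon-7b-instruct".

```python
import os


def destination_path(model_dir, object_name, model_name):
    """Map an object such as 'falcon-7b/config.json' to a file path inside model_dir.

    Hypothetical helper for illustration; the names here are assumptions.
    """
    prefix = model_name + "/"
    # A plain string prefix like "falcon-7b" also matches "falcon-7b-instruct/...",
    # so insist on the trailing "/" before stripping the enclosing folder.
    if not object_name.startswith(prefix):
        raise ValueError(f"{object_name!r} is not under {prefix!r}")
    relative = object_name[len(prefix):]
    # Object names always use "/"; rebuild the path with the local separator.
    return os.path.join(model_dir, *relative.split("/"))
```

In the handler, the string concatenation that builds destination_file_path could be replaced by a call to this helper, skipping any object for which it raises ValueError.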

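This simple change enables the benefits we have been discussing throughout this post. For a more complete implementation, see the example at the end of this post. Let's look at the modified flow end to end:

With this flow, you can now: (1) use a much lighter MAR file, without having to regenerate it every time you edit the model being served, (2) stop waiting on the tedious decompression step while PyTorch Serve initializes the model, and (3) rely on a single source of truth for your model files and their respective versions.

So What?

Now that you can use a lighter MAR file and take advantage of MinIO's versioning and storage performance for your model files, what can you do with it?

In our own MinIO projects, we use this paradigm to experiment with and deploy different models quickly, without worrying about long turnaround times between them. All of these benefits become especially apparent when working with large language models and embedding models, which are, frankly, huge.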

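An End-to-End Example

Let's run through an example, from choosing a model to serving it. For this example, we will try to serve Falcon-7B, a popular large language model. As always, the code for this example can be found on our GitHub.

First, follow the instructions in the PyTorch Serve README to make sure you have installed PyTorch Serve's dependencies.

Here is what the model files look like on Hugging Face:

I'll start by cloning the files to my machine with Git LFS (note: the files are large):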

$ git lfs install
$ git clone https://huggingface.co/tiiuae/falcon-7b

Then, I'll upload the downloaded folder (which contains all the model files) to the "models" bucket on my MinIO server.
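The upload can be done in the MinIO console or with the MinIO client; as one possible alternative, here is a minimal Python sketch using the minio SDK's fput_object(). The function names, the choice to skip the .git folder left by the clone, and the environment variable names in main() are assumptions made for illustration, not part of the original example.

```python
import os


def iter_uploads(local_dir, prefix):
    """Yield (object_name, file_path) for every file under local_dir."""
    for root, dirs, files in os.walk(local_dir):
        # Skip the Git metadata left behind by `git clone` (assumption: not needed for serving).
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in sorted(files):
            file_path = os.path.join(root, name)
            relative = os.path.relpath(file_path, local_dir)
            # Object names use "/" regardless of the local path separator.
            yield prefix + "/" + relative.replace(os.sep, "/"), file_path


def upload_model_dir(client, bucket, local_dir, prefix):
    """Upload a model folder so the bucket looks like: bucket -> prefix -> model files."""
    uploaded = []
    for object_name, file_path in iter_uploads(local_dir, prefix):
        client.fput_object(bucket, object_name, file_path)
        uploaded.append(object_name)
    return uploaded


def main():
    # Imported here so the helpers above stay usable without the SDK installed.
    from minio import Minio

    # Hypothetical environment variable names; substitute your own server details.
    client = Minio(
        os.environ["MINIO_ENDPOINT"],
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
    )
    if not client.bucket_exists("models"):
        client.make_bucket("models")
    upload_model_dir(client, "models", "./falcon-7b", "falcon-7b")
```

Calling main() from the directory that contains the cloned falcon-7b folder should produce the bucket layout the handler below expects.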

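Next, I'll write a handler that fetches the model files from MinIO in addition to the traditional responsibilities of loading a copy of the model and handling inference calls to the Falcon LLM. Here is the handler (named miniofied_handler.py), with some explanation of each key component.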

"""
PyTorch Serve model handler using MinIO
See the full blog post at:

For more information about custom handlers and the Handler class:
https://pytorch.org/serve/custom_service.html#custom-handler-with-class-level-entry-point
"""
from minio import Minio
from minio.error import S3Error
import os
import torch
from transformers import AutoTokenizer
import transformers
from ts.torch_handler.base_handler import BaseHandler

# In this example, we serve the Falcon-7B Large Language Model (https://huggingface.co/tiiuae/falcon-7b)
# However, you can use your model of choice. Just make sure to edit the implementations of
# initialize() and handle() according to your model!

# Make sure the following are populated with your MinIO server details
# (Best practice is to use environment variables!)
MINIO_ENDPOINT = ''
MINIO_ACCESS_KEY = ''
MINIO_SECRET_KEY = ''
MODEL_BUCKET = 'models'
CURRENT_MODEL_NAME = "falcon-7b"

def get_minio_client():
    """
    Initializes and returns a Minio client object
    """
    client = Minio(
        MINIO_ENDPOINT,
        access_key=MINIO_ACCESS_KEY,
        secret_key=MINIO_SECRET_KEY,
    )
    return client

class MinioModifiedHandler(BaseHandler):
    """
    Handler class that loads model files from MinIO.
    """
    def __init__(self):
        super().__init__()
        self.initialized = False
        self.model = None
        self.tokenizer = None

    def load_model_files_from_bucket(self, context):
        """
        Fetch model files from MinIO if not present in Model Store.
        """
        # get_minio_client() is a module-level function, so it is not called on self
        client = get_minio_client()
        properties = context.system_properties
        object_name_prefix_len = len(CURRENT_MODEL_NAME) + 1
        # model_dir is the temporary directory (TEMP) allocated in the Model Store for this model
        model_dir = properties.get("model_dir")
        try:
            # fetch all the model files and place them in TEMP
            # the following assumes a bucket organized like this:
            # MODEL_BUCKET -> CURRENT_MODEL_NAME -> all the model files
            for item in client.list_objects(MODEL_BUCKET, prefix=CURRENT_MODEL_NAME, recursive=True):
                # We don't include the object name's prefix in the destination file path because we
                # don't want the enclosing folder to be added to TEMP.
                destination_file_path = model_dir + "/" + item.object_name[object_name_prefix_len:]
                # only fetch the model file if it is not already in TEMP
                if not os.path.exists(destination_file_path):
                    client.fget_object(MODEL_BUCKET, item.object_name, destination_file_path)
            return True
        except S3Error:
            return False

    def initialize(self, context):
        """
        Worker initialization method.
        Loads up a copy of the trained model.

        See https://huggingface.co/tiiuae/falcon-7b for details about how
        the Falcon-7B LLM is loaded with the use of the Transformers library
        """
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        success = self.load_model_files_from_bucket(context)
        if not success:
            print("Something went wrong while attempting to fetch model files.")
            return
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        pipeline = transformers.pipeline(
            "text-generation",
            model=model_dir,
            tokenizer=tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )
        self.model = pipeline
        self.tokenizer = tokenizer
        self.initialized = True

    def handle(self, data, context):
        """
        Entrypoint for inference call to TorchServe.
        Note: This example assumes the request body looks like:
        {
            "input": "<input for inference>"
        }
        Note: Check the 'data' argument to see how your request body looks.
        """
        input_text = data[0].get("body").get("input")
        sequences = self.model(
            input_text,
            max_length=200,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
        )
        return [sequences]
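The handler's comments recommend environment variables over hard-coded MinIO details. A minimal sketch of that idea follows, assuming the variables are set in the environment that launches PyTorch Serve and are visible to its worker processes. The variable names (including MINIO_SECURE) and the minio_settings() helper are hypothetical choices for illustration.

```python
import os

REQUIRED_VARS = ("MINIO_ENDPOINT", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")


def minio_settings(environ=os.environ):
    """Read MinIO connection details from environment variables (hypothetical names)."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        # Fail early with a clear message instead of building a client with empty credentials.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint": environ["MINIO_ENDPOINT"],
        "access_key": environ["MINIO_ACCESS_KEY"],
        "secret_key": environ["MINIO_SECRET_KEY"],
        # MINIO_SECURE is an assumed switch; set it to "false" for a plain-HTTP server.
        "secure": environ.get("MINIO_SECURE", "true").lower() != "false",
    }


def get_minio_client():
    """Drop-in variant of the handler's get_minio_client() that uses the settings above."""
    from minio import Minio  # imported lazily so minio_settings() works without the SDK

    settings = minio_settings()
    return Minio(
        settings["endpoint"],
        access_key=settings["access_key"],
        secret_key=settings["secret_key"],
        secure=settings["secure"],
    )
```

With this variant, the three empty MINIO_* constants at the top of the handler would no longer be needed, and credentials stay out of the MAR file.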

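Now that we have a custom handler, we can create the MAR file. You could install the model-specific dependencies (i.e., transformers) directly in the host environment, but it is better to confine them to the environment PyTorch Serve creates for your model. To take advantage of this, create a requirements.txt file for the handler-specific dependencies: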

minio
torch
transformers

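Assuming you are in the same directory as the handler you just created (and, optionally, requirements.txt), you can now create the MAR file with the torch-model-archiver tool (I chose the name "falcon-llm").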

$ torch-model-archiver --model-name falcon-llm --version 1.0 --handler ./miniofied_handler.py --requirements-file ./requirements.txt

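Note: the --requirements-file flag is only needed if you use a requirements.txt for handler/model-specific dependencies instead of installing them directly in the host environment. For PyTorch Serve to use this requirements file, you must set install_py_dep_per_model=true in config.properties, the file PyTorch Serve uses for configuration.

Now, assuming you are still in the same directory as your handler and, by now, your MAR file, you are ready to start serving.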

$ torchserve --start --model-store ./ --models falcon-llm.mar

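Note: the server can be configured and run in many different ways (for example, the port it runs on). See the official PyTorch documentation for more details.

Congratulations! You can now start making inference calls at: http://127.0.0.1:8080/predictions/falcon-llm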

$ curl --header "Content-Type: application/json" \
    --request POST \
    --data '{"input": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"}' \
    http://127.0.0.1:8080/predictions/falcon-llm
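The same call can be made from Python with only the standard library. This is a sketch that assumes the server is running locally on port 8080 with the model registered as falcon-llm, as in the commands above; build_prediction_request() and predict() are hypothetical helper names, and the long timeout is a guess to allow for slow generation.

```python
import json
import urllib.request

PREDICTION_URL = "http://127.0.0.1:8080/predictions/falcon-llm"


def build_prediction_request(prompt, url=PREDICTION_URL):
    """Build a POST request whose JSON body matches what handle() expects: {"input": ...}."""
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(prompt, timeout=300):
    """Send the prompt to the running server and return the raw response text."""
    request = build_prediction_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, predict("Daniel: Hello, Girafatron!\nGirafatron:") sends a two-line prompt; json.dumps() takes care of escaping the newline in the request body.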

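Summary

With MinIO object storage, your serving infrastructure is now more lightweight and more resilient to changes in model architecture. As a result, the usually long archiving and decompression times for MAR files are shortened. In addition, your models and their corresponding MAR files now move around less, saving you time and sparing you much of the organizational overhead that is common in model serving.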