A Complete Workflow for Log File Anomaly Detection with R, H2O and MinIO

Brian Costa, in Operator's Guide, August 22, 2022

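Processing log files in an enterprise environment is no small task. Manual analysis requires expertise and is time-consuming, which makes it expensive and inefficient. Instead, many organizations apply machine learning (ML) techniques to process incoming logs automatically, effectively and efficiently.

This type of process consists of multiple processing steps. Each step can be triggered by a notification that an artifact to process (in this case, a log file chunk) has arrived. If auditing is required, the artifact is retained for a period of time after each step of the process before it is deleted. This architecture offers a significant advantage because it allows every step in the process to be stateless. The state is contained in the artifact itself as it advances through the processing pipeline.

With stateless transformations, each step can scale independently as needed by deploying more processing units (additional instances of the code for a given process step). Processes such as log file analysis are a good fit for the dynamic scaling (up and down) that Kubernetes provides, but Kubernetes is not required to take advantage of this approach.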
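As a rough illustration of the stateless-step idea, the sketch below chains two toy steps through an in-memory stand-in for object storage. The names `store` and `run_step` are hypothetical and are not part of this tutorial's code. In the real flow the store is MinIO and each step is triggered by a bucket notification.

```r
# An in-memory stand-in for object storage (hypothetical; MinIO in the real flow).
store <- new.env()

# Hypothetical generic step: read the input artifact, transform it, and write
# the result under a new key. The function keeps no state between calls.
run_step <- function(src_key, dest_key, transform) {
  artifact <- get(src_key, envir = store)
  assign(dest_key, transform(artifact), envir = store)
  dest_key
}

# Two toy transformations chained through the store: the destination of the
# first step becomes the source of the second.
assign("raw/chunk-1", c("a b", "c d"), envir = store)
step1_out <- run_step("raw/chunk-1", "frames/chunk-1", function(x) strsplit(x, " "))
step2_out <- run_step(step1_out, "counts/chunk-1", function(x) lengths(x))

print(get(step2_out, envir = store))
```

Because every intermediate artifact stays in the store until it is deleted, any step can be re-run or audited on its own, and extra copies of a step can run side by side on different chunks.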

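In this tutorial, we will first develop the necessary components and train a model, and then bring these pieces together into a production-capable log processing flow.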

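Anomaly detection is a powerful area of machine learning (ML) with applications in many domains. In a previous post (Anomaly Detection with R, H2O, and MinIO), I explained in depth how to perform anomaly detection using the MNIST data set. As in that post, we will use H2O, R and RStudio, and apply the same techniques to detect anomalies in Apache access log files. If you would like to learn more about the basics of anomaly detection with these tools, please see the previous post.

The Apache access log stores information about the requests that occur on your Apache web server. For example, when someone visits your website or makes an http or https request, a log entry is stored to give the Apache web server administrator information such as the visitor's IP address, the page they were viewing, the status code of the request, the browser used, or the size of the response.

There are many reasons to analyze these files. They are a record of the requests made to a web server, and they matter because they give insight into the usage patterns of the users making those requests. One aspect of interest is the number and pattern of requests that may be malicious in nature. Being able to identify anomalous request patterns can give insight into potential attacks.

MinIO is high-performance, software-defined, S3-compatible object storage, which makes it a powerful and flexible replacement for Amazon S3. The S3 API is the current standard for working with ML and related data sets. MinIO will be used to hold the artifacts in this flow. To follow along, install R and RStudio, make sure you have access to an H2O cluster, and download and install MinIO if you aren't already running it.

This blog post is meant to be a starting point for developing a custom AI/ML-based anomaly detection system for log files. Each organization typically customizes the types of scans and the detection methods it deploys. An Apache access log file contains the details of every incoming access to an Apache web server. The format is customizable, so if the sample log used here doesn't match your organization's log file format, adjust the code accordingly.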

Sample Log File

This tutorial uses a publicly available log file of about 1.5 GB.

Lines from the sample log file

157.48.153.185 - - [19/Dec/2020:14:08:06 +0100] "GET /apache-log/access.log HTTP/1.1" 200 233 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
157.48.153.185 - - [19/Dec/2020:14:08:08 +0100] "GET /favicon.ico HTTP/1.1" 404 217 "http://www.almhuette-raith.at/apache-log/access.log" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"
216.244.66.230 - - [19/Dec/2020:14:14:26 +0100] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)" "-"
54.36.148.92 - - [19/Dec/2020:14:16:44 +0100] "GET /index.php?option=com_phocagallery&view=category&id=2%3Awinterfotos&Itemid=53 HTTP/1.1" 200 30662 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" "-"
92.101.35.224 - - [19/Dec/2020:14:29:21 +0100] "GET /administrator/index.php HTTP/1.1" 200 4263 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" "-"
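To see how one of these lines maps onto columns, the sketch below parses a single sample line with base R's `read.table()`, the same reader that PreProcessLogFile.R uses. The variable names are illustrative only.

```r
# One line copied from the sample log (single quotes keep the inner
# double quotes intact).
log_line <- '157.48.153.185 - - [19/Dec/2020:14:08:06 +0100] "GET /apache-log/access.log HTTP/1.1" 200 233 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "-"'

# read.table() splits on whitespace and keeps each double-quoted string as one
# field, so the line becomes 11 columns named V1..V11.
fields <- read.table(text = log_line, stringsAsFactors = FALSE)

# The timestamp lands in two columns (V4 and V5) because it contains a space.
access_time <- paste(fields$V4, fields$V5)

print(ncol(fields))   # number of columns
print(fields$V1)      # client IP
print(access_time)    # re-joined timestamp
print(fields$V6)      # request line
print(fields$V7)      # status code
```

By default `read.table()` also treats `'` as a quote character and `#` as a comment marker, so logs that contain those characters may need `quote = "\""` and `comment.char = ""`.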

Creating the MinIO Buckets

We will create a few buckets in MinIO to hold the various artifacts as the flow executes.

You can see the created buckets in the MinIO console.
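The buckets can be created in the MinIO console or with the mc client. For readers who prefer to stay in R, here is a minimal sketch that uses `aws.s3::put_bucket()`. The helper `create_buckets` and its `create_fn` parameter are hypothetical; the default assumes the MinIO credentials and endpoint have already been set with `Sys.setenv()` as in the scripts in this post, and re-running it against buckets that already exist may raise an error.

```r
# The four buckets used in this tutorial.
pipeline_buckets <- c(
  "access-log-files",               # raw log chunks arrive here
  "access-log-dataframes",          # pre-processed data frames
  "access-log-anomaly-dataframes",  # rows flagged as anomalies
  "bin-models"                      # trained H2O models
)

# Hypothetical helper: call create_fn once per bucket name.
# The default talks to MinIO through aws.s3; passing another function allows
# a dry run without a server.
create_buckets <- function(buckets,
                           create_fn = function(b) {
                             aws.s3::put_bucket(b, region = "", use_https = FALSE)
                           }) {
  for (b in buckets) {
    create_fn(b)
  }
  invisible(buckets)
}

# Dry run: print the bucket names instead of contacting MinIO.
create_buckets(pipeline_buckets,
               create_fn = function(b) cat("would create:", b, "\n"))
```

Calling `create_buckets(pipeline_buckets)` without `create_fn` would attempt the real bucket creation.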

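The first bucket we use is access-log-files. This is where the log files land. They can be delivered through a variety of mechanisms; Apache Kafka is one we often see in production (see MinIO Integrations for more details). MinIO Lambda Compute Bucket Notifications are then used to trigger processing as each file arrives and is written to the MinIO bucket.

Underlying Functions

Before we can run the production log file processing flow, we need a trained anomaly detection model to apply. Training the model requires a training set: an R data frame containing the log request instances used for training.

In this tutorial, the required library calls have been collected into a single source file. The file packages.R loads the required libraries. The file PreProcessLogFile.R contains a function that converts an incoming log chunk into a data frame. The code reads the data, drops some columns, renames others, manipulates the time components, and enriches the log data with continent and country information looked up from the source IP address in an external data source (https://www.maxmind.com). After processing the raw data into a more usable data frame, we store the data frame in the access-log-dataframes bucket.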

==== packages.R
#load necessary libraries

if (!require("plumber")) {
  install.packages("plumber")
  library(plumber)
}

if (!require("jsonlite")) {
  install.packages("jsonlite")
  library(jsonlite)
}


if (!require("aws.s3")) {
  install.packages("aws.s3")
  library(aws.s3)
}

if (!require("rgeolocate")) {
  install.packages("rgeolocate")
  library(rgeolocate)
}

if (!require("lubridate")) {
  install.packages("lubridate")
  library(lubridate)
}



if (!require("h2o")) {
  install.packages("h2o")
  library(h2o)
}
==== end packages.R

==== PreProcessLogFile.R
PreProcessLogFile <- function (srcBucket,srcObject,destBucket,destObject) {
  
  # set the credentials this R instance uses to access MinIO
  Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
             "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
             "AWS_DEFAULT_REGION" = "",
             "AWS_S3_ENDPOINT" = "10.0.0.10:9000")
  
  
  
  b <- get_bucket(bucket = srcBucket, use_https = F, region = "")
  df_access <- aws.s3::s3read_using(FUN = read.table, object = srcObject, bucket = b, 
                                    opts = list(use_https = FALSE, region = ""))
  
  # join the time string back together
  df_access$V4 <- paste(df_access$V4, df_access$V5)
  
  # remove some noise
  drops <- c("V2", "V3", "V5", "V11") 
  df_access <-df_access[, !(names(df_access) %in% drops)]
  
  # rename the columns
  names(df_access)[1] <- "client_ip"
  names(df_access)[2] <- "access_time"
  names(df_access)[3] <- "client_request"
  names(df_access)[4] <- "status_code"
  names(df_access)[5] <- "response_size"
  names(df_access)[6] <- "referrer"
  names(df_access)[7] <- "user_agent"
  
  
  # load the Geolocation data
  # https://www.maxmind.com
  ips<- df_access$client_ip
  maxmind_file <- "data/Geo/Geolite2-Country_20220517/Geolite2-Country.mmdb"
  country_info <- maxmind(
    ips,
    maxmind_file,
    fields = c("continent_name", "country_name", "country_code")
  )
  
  # add some columns for location information
  df_access$continent_name <- as.factor(country_info$continent_name)
  df_access$country_name <- as.factor(country_info$country_name)
  df_access$country_code <- as.factor(country_info$country_code)
  
  # use lubridate to help coerce the access_time string into something usable
  # we might be interested in what day of week, hour, min, second an access occurred - as factors
  df_access$access_time <- dmy_hms(df_access$access_time)
  df_access$wday <- wday(df_access$access_time)
  df_access$access_hour <- as.factor(format(df_access$access_time, format = "%H"))
  df_access$access_min <- as.factor(format(df_access$access_time, format = "%M"))
  df_access$access_sec <- as.factor(format(df_access$access_time, format = "%S"))
  
  # since trying to train an identity function for anomaly detection using the
  # exact access_time is difficult, and doesn't really help, remove the access_time
  df_access <- subset(df_access, select = -c(access_time))
  
  # all the inputs need to be numeric or factor, so clean up the rest
  df_access$client_ip <- as.factor(df_access$client_ip)
  df_access$client_request <- as.factor(df_access$client_request)
  df_access$status_code <- as.factor(df_access$status_code)
  df_access$response_size <- as.numeric(df_access$response_size)
  df_access$referrer <- as.factor(df_access$referrer)
  df_access$user_agent <- as.factor(df_access$user_agent)
  
  # save off the munged data frame - don't want to do this pre-processing again
  b <- get_bucket(bucket = destBucket, region = "", use_https = F)
  s3write_using(df_access, FUN = saveRDS, object = destObject, bucket = b, 
                opts = list(use_https = FALSE, region = "", multipart = TRUE))
}


==== end PreProcessLogFile.R

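Training the Anomaly Detection Autoencoder

Once the data has been organized into a data frame, training a deep learning autoencoder with H2O is straightforward. The process is to read the data frame, split it into training and test sets, identify the predictors, and finally train the model. Once trained, the model is saved back to the bin-models MinIO bucket for use by the flow. How this is done, what an anomaly is, and how an autoencoder detects anomalies were covered previously in Anomaly Detection with R, H2O, and MinIO.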

==== TrainModel.R
source("packages.R")

source("PreProcessLogFile.R")

# set the credentials this R instance uses to access MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
           "AWS_DEFAULT_REGION" = "",
           "AWS_S3_ENDPOINT" = "10.0.0.10:9000") 




# initialize the h2o server
# h2o.init(ip="10.0.0.10", port=54321,startH2O=FALSE)
# h2o.set_s3_credentials("minioadmin", "minioadmin")
h2o.init(jvm_custom_args = "-Dsys.ai.h2o.persist.s3.endPoint=http://10.0.0.10:9000 -Dsys.ai.h2o.persist.s3.enable.path.style=true")
h2o.set_s3_credentials("minioadmin", "minioadmin")

# turn the log chunk into a dataframe
bucketName <- "access-log-files"
objectName <- "access_sample.log"
destBucketName <- "access-log-dataframes"
destObjectName <- "access-log-dataframe.rda"
PreProcessLogFile(bucketName, objectName, destBucketName, destObjectName)



# load the pre-processed data frame that PreProcessLogFile() just wrote
b <- get_bucket(bucket = destBucketName, use_https = FALSE, region = "")
df_access <- s3read_using(FUN = readRDS, object = destObjectName, bucket = b,
                          opts = list(use_https = FALSE, region = ""))

# load into h2o and convert into h2o binary format
df_access.hex = as.h2o(df_access, destination_frame= "df_access.hex")

# split this dataframe into a train and test set
splits <- h2o.splitFrame(data = df_access.hex, 
                         ratios = c(0.6),  #partition data into 60%, 40%
                         seed = 1)  #setting a seed will guarantee reproducibility
train_hex <- splits[[1]]
test_hex <- splits[[2]]

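# save the test set as an RDS object, going to use it in another step
# (convert the H2OFrame to an R data frame first so the rows themselves are saved,
# not just a handle to data living in the H2O cluster)
b <- get_bucket(bucket = 'access-log-dataframes', region = "", use_https = FALSE)
s3write_using(as.data.frame(test_hex), FUN = saveRDS, object = "access-log-test.rda", bucket = b,
              opts = list(use_https = FALSE, region = "", multipart = TRUE))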

predictors <- c(1:13)

# use the training data to create a deeplearning based autoencoder model
# about 3 million tunable parameters with the factorization of the fields
ae_model <- h2o.deeplearning(x=predictors,
                             training_frame=train_hex,
                             activation="Tanh",
                             autoencoder=TRUE,
                             hidden=c(50),
                             l1=1e-5,
                             ignore_const_cols=FALSE,
                             epochs=1)

# save the model as bin
model_path <- h2o.saveModel(ae_model, path = "s3://bin-models/apache-access-log-file-autoencoder-bin")


==== end TrainModel.R
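TrainModel.R selects its predictors by position (`c(1:13)`), which relies on PreProcessLogFile() always producing the same 13 columns in the same order. As a sketch, the hypothetical helper `check_columns` below spells out that layout by name, as derived from the pre-processing code above, and fails early if a data frame does not match it. H2O also accepts column names for `x`, so the same vector could be passed instead of the indices.

```r
# The 13 columns PreProcessLogFile() leaves in the data frame, in order
# (access_time is removed at the end of pre-processing).
expected_columns <- c(
  "client_ip", "client_request", "status_code", "response_size",
  "referrer", "user_agent",
  "continent_name", "country_name", "country_code",
  "wday", "access_hour", "access_min", "access_sec"
)

# Hypothetical helper: stop early if a data frame does not match that layout.
check_columns <- function(df, expected = expected_columns) {
  missing <- setdiff(expected, names(df))
  extra <- setdiff(names(df), expected)
  if (length(missing) > 0 || length(extra) > 0) {
    stop("unexpected columns; missing: ", paste(missing, collapse = ", "),
         "; extra: ", paste(extra, collapse = ", "))
  }
  invisible(TRUE)
}

# Example with an empty data frame that has the right column names.
toy <- data.frame(matrix(nrow = 0, ncol = length(expected_columns)))
names(toy) <- expected_columns
check_columns(toy)

# Names can be used as predictors in place of c(1:13).
predictors <- expected_columns
print(length(predictors))
```

A check like this is most useful at the boundary between steps, for example right after a data frame is read back from MinIO.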

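Applying the Anomaly Detection Autoencoder

Once we have a trained model, we can apply it to new log chunks as they arrive in order to identify anomalies. Below is a file named IdentifyAnomalies.R that contains the function that does this.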

==== IdentifyAnomalies.R

IdentifyAnomalies <- function(srcBucketName, srcObjectName, destBucketName, destObjectName, modelBucket, modelName) {
  
  # set the credentials this R instance uses to access MinIO
  Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
             "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
             "AWS_DEFAULT_REGION" = "",
             "AWS_S3_ENDPOINT" = "<MinIO-IP-Address>:9000") 
  
  
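  # connect to the already running H2O cluster
  # (in this setup H2O runs on the same host as MinIO, hence the shared placeholder)
  h2o.init(ip="<MinIO-IP-Address>", port=54321,startH2O=FALSE)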
  h2o.set_s3_credentials("minioadmin", "minioadmin")
  
  
  # load the model that was previously saved
  # the name of the model needs to have been saved off somewhere so this specific model can be loaded
  model_path <- paste0("s3://",modelBucket,"/",modelName)
  ae_model <- h2o.loadModel(model_path)
  
  
  #load the previously pre-processed dataframe
  b <- get_bucket(bucket = srcBucketName, use_https = F, region ="")
  test_access <-s3read_using(FUN = readRDS, object = srcObjectName, bucket = b, 
                             opts = list(use_https = FALSE, region = ""))
  
  # load into h2o and convert into h2o binary format
  test_hex = as.h2o(test_access, destination_frame= paste0(srcObjectName,".hex"))
  
  
  # h2o.anomaly computes the per-row reconstruction error for the test data set
  # (passing it through the autoencoder model and computing mean square error (MSE) for each row)
  test_rec_error <- as.data.frame(h2o.anomaly(ae_model, test_hex)) 
  
  # return the rows of `data` found at positions `rows` once all rows are
  # ranked by reconstruction error in ascending order
  listAccesses <- function(data, rec_error, rows) {
    row_idx <- order(rec_error[, 1], decreasing = FALSE)[rows]
    as.data.frame(data[row_idx, ])
  }
  
  
  # These are the biggest outliers: the (up to) 21 rows with the largest error
  num_rows <- nrow(test_hex)
  test_r <- as.data.frame(test_hex)
  
  beg <- max(1, num_rows - 20)
  end <- num_rows
  anomaly_df <- listAccesses(test_r, test_rec_error, beg:end)
  anomaly_df <- anomaly_df[, c("country_name", "wday", "access_hour")]
  
  
  # write this dataframe into the next bucket
  
  b <- get_bucket(bucket = destBucketName, region = "", use_https = F)
  s3write_using(anomaly_df, FUN = saveRDS, object = destObjectName, bucket = b, 
                opts = list(use_https = FALSE, region = "", multipart = TRUE))
}

==== end IdentifyAnomalies.R
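To make the idea of a per-row reconstruction error concrete, here is a small base R sketch with made-up numbers. It only illustrates the ranking that IdentifyAnomalies() performs; the real errors come from `h2o.anomaly()`, which works on H2O's own representation of the data, and the matrices below are assumptions for the sake of the example.

```r
# Toy data: 5 rows, 3 numeric features.
original <- matrix(rep(c(1, 2, 3), times = 5), nrow = 5, byrow = TRUE)

# Pretend reconstructions from an autoencoder: row 2 is slightly off,
# row 4 is reconstructed badly.
reconstructed <- original
reconstructed[2, ] <- c(1.1, 2.0, 3.0)
reconstructed[4, ] <- c(4.0, 0.0, 3.0)

# Per-row mean squared error between the input and its reconstruction.
rec_error <- rowMeans((original - reconstructed)^2)

# Rank rows from largest to smallest error and keep the top n.
top_n <- 2
worst_rows <- order(rec_error, decreasing = TRUE)[seq_len(top_n)]

print(round(rec_error, 4))
print(worst_rows)
```

Rows the model reconstructs well have an error near zero; the rows at the top of the ranking are the candidates for a closer look.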

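Building the Inference Flow with Webhooks

MinIO Lambda Compute Bucket Notifications are used to build event-driven flows. When a log file arrives, an event is created. These events drive the artifact through the inference process for each new log file chunk. MinIO Lambda Compute Bucket Notifications can be integrated with many notification mechanisms. For this tutorial, we will use the integration with webhooks. R supports creating RESTful web interfaces with the Plumber package. To use Plumber, a file needs to define the access method/endpoint tuples along with the code that each endpoint executes. Below is the Plumber.R file for this tutorial. Plumber.R uses the PreProcessLogFile() and IdentifyAnomalies() functions from the two other files described earlier. A sample JSON bucket notification event is included as a comment in this file.

Plumber takes this annotated R script and turns it into an executable web server. The endpoints are defined in the Plumber.R file below.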

==== Plumber.R

# Plumber.R

# Each of these has a function used below
source("PreProcessLogFile.R")
source("IdentifyAnomalies.R")



#* Log some information about the incoming request
#* @filter logger
function(req){
  cat(as.character(Sys.time()), "-",
      req$REQUEST_METHOD, req$PATH_INFO, "-",
      req$HTTP_USER_AGENT, "@", req$REMOTE_ADDR, "\n")
  plumber::forward()
}

#* endpoint handler for the MinIO bucket notifications
#* @post /
function(req){
  
  json_text <- req$postBody
  
  # {
  #   "EventName": "s3:ObjectCreated:Put",
  #   "Key": "access-log-files/access_sample_short.log",
  #   "Records": [
  #     {
  #       "eventVersion": "2.0",
  #       "eventSource": "minio:s3",
  #       "awsRegion": "",
  #       "eventTime": "2022-08-10T18:19:38.663Z",
  #       "eventName": "s3:ObjectCreated:Put",
  #       "userIdentity": {
  #         "principalId": "minioadmin"
  #       },
  #       "requestParameters": {
  #         "principalId": "minioadmin",
  #         "region": "",
  #         "sourceIPAddress": "<MinIO-IP-Address>"
  #       },
  #       "responseElements": {
  #         "content-length": "0",
  #         "x-amz-request-id": "170A0EB32509842A",
  #         "x-minio-deployment-id": "e88d6f13-657f-4641-b349-74ce2795d730",
  #         "x-minio-origin-endpoint": "<MinIO-IP-Address>:9000"
  #       },
  #       "s3": {
  #         "s3SchemaVersion": "1.0",
  #         "configurationId": "Config",
  #         "bucket": {
  #           "name": "access-log-files",
  #           "ownerIdentity": {
  #             "principalId": "minioadmin"
  #           },
  #           "arn": "arn:aws:s3:::access-log-files"
  #         },
  #         "object": {
  #           "key": "access_sample_short.log",
  #           "size": 1939,
  #           "eTag": "bb48fe358c017940ecc5fb7392357641",
  #           "contentType": "application/octet-stream",
  #           "userMetadata": {
  #             "content-type": "application/octet-stream"
  #           },
  #           "sequencer": "170A0EB3F219CCC4"
  #         }
  #       },
  #       "source": {
  #         "host": "<MinIO-IP-Address>",
  #         "port": "",
  #         "userAgent": "MinIO (linux; amd64) minio-go/v7.0.34"
  #       }
  #     }
  #   ]
  # }
  
  # extract the raw JSON into a data structure
  j <- fromJSON(json_text, flatten = TRUE)
  
  # get the eventName, the bucketName, and the objectName
  # for this example we know it's a put so we can ignore the eventName
  eventName <- j[["EventName"]]
  
  bucketName <- j[["Records"]]$s3.bucket.name
  objectName <- j[["Records"]]$s3.object.key
  
  
  destBucketName <- "access-log-dataframes"
  destObjectName <- "access-log-dataframe.rda"
  
  
  # turn the log chunk into a dataframe
  PreProcessLogFile(bucketName, objectName, destBucketName, destObjectName)
  
  # the destination for this step becomes the src for the next
  # rename them just to maintain sanity
  srcBucketName <- destBucketName
  srcObjectName <-destObjectName
  
  destBucketName <- "access-log-anomaly-dataframes"
  destObjectName <- "access-log-anomaly-dataframe.rda"
  
  modelBucketName <- "bin-models/apache-access-log-file-autoencoder-bin"
  modelName <- "<The-Name-Of-The-Built-Model>"  # placeholder: replace with the name of the model saved by TrainModel.R
  
  
  # use the trained model and identify anomalies in the log chunk that just arrived
  IdentifyAnomalies(srcBucketName, srcObjectName, destBucketName, destObjectName, modelBucketName, modelName)
  
}

==== end Plumber.R
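The sketch below isolates the JSON handling from Plumber.R so it can be tried without MinIO or a running web server. It assumes the jsonlite package is installed and uses a trimmed-down copy of the sample event. Plumber.R as shown writes its results under fixed object names, so each new chunk replaces the previous output; `derive_dest_name` is a hypothetical helper, not part of the original code, that shows one way to name outputs per chunk instead.

```r
library(jsonlite)

# A trimmed-down notification containing only the fields used here.
event_json <- '{
  "EventName": "s3:ObjectCreated:Put",
  "Key": "access-log-files/access_sample_short.log",
  "Records": [
    {
      "eventName": "s3:ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "access-log-files" },
        "object": { "key": "access_sample_short.log", "size": 1939 }
      }
    }
  ]
}'

event <- fromJSON(event_json, flatten = TRUE)

# flatten = TRUE turns the nested record into columns such as s3.bucket.name
bucket_name <- event[["Records"]]$s3.bucket.name
object_key  <- event[["Records"]]$s3.object.key

# Hypothetical helper: build a destination object name from the source key so
# that two chunks processed at the same time do not overwrite each other.
derive_dest_name <- function(key, suffix) {
  base <- tools::file_path_sans_ext(basename(key))
  paste0(base, "-", suffix, ".rda")
}

print(bucket_name)
print(object_key)
print(derive_dest_name(object_key, "dataframe"))
```

S3-style notifications generally URL-encode object keys, so keys containing spaces or special characters may need `utils::URLdecode()` before they are used to fetch the object.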

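Once we have the file that defines the endpoints of this RESTful web interface, we need a file that parses the annotated script and starts the server listening. We will create the file Server.R to do this.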

==== Server.R
# the REST endpoint server

# source the required package libraries
source("packages.R")

# process the Plumber.R file and show the valid endpoints
root <- pr("Plumber.R")
root

# make the endpoints active
root %>% pr_run(host = "<REST-endpoint-IP-Address>", port = 8806)


==== end Server.R

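We have configured R to act as a RESTful web server that is notified of bucket events through a webhook. We need to start the web server by running Server.R in RStudio. We do this first because MinIO verifies that the web server exists and is running when we try to configure the webhook. Once the server is running, we can configure MinIO to use the webhook.

Configuring Lambda Compute Bucket Notifications with Webhooks

There are two steps to configuring MinIO to emit notifications. The first step is to configure the endpoint in the MinIO cluster as a target for notifications, which can be done with the MinIO mc client. In my case, the webhook listener runs on my laptop at IP address 192.168.1.155, port 8806. Adjust this for your environment.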

mc admin config set myminio notify_webhook:preProcessLogFiles queue_limit="0"  queue_dir="" endpoint="http://192.168.1.155:8806"

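After setting this configuration, you need to restart the MinIO cluster.

The second step is to configure exactly when MinIO should emit a notification. Here, a notification is emitted whenever an object is put into the myminio/access-log-files bucket.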

mc event add myminio/access-log-files arn:minio:sqs::preProcessLogFiles:webhook --event put

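We have now laid the groundwork for the flow. When a log file is PUT into the access-log-files bucket, the code above is triggered: it converts the log file into a data frame and then applies the trained model to identify anomalies in the http requests.

The Event-Driven ML Anomaly Flow in Action

Next, we copy a log file into the access-log-files bucket to trigger the flow.

I created a smaller log file that contains only 300k lines from the full sample file. When I PUT it into the bucket, MinIO sends a notification event that starts the anomaly detection flow.

The IdentifyAnomalies() function creates and saves a data frame holding the anomalies: the instances with the largest reconstruction error. When I copied the smaller sample log file into the access-log-files bucket, it started the workflow. The result is that the anomaly_df data frame is written to the access-log-anomaly-dataframes bucket. In a production workflow, the arrival of this data frame could trigger another Lambda Compute Bucket Notification to process its contents further, for example by adding those rows to a system where the requests can be examined more closely.

Below are the 21 requests with the largest reconstruction error according to the trained deep learning autoencoder. I reduced the number of columns to make inspection easier. Keep in mind that **being an anomaly only means that the instance under consideration lies some distance, in the input vector space, from the training data used to train the anomaly detection autoencoder**. It is therefore critical that the autoencoder be trained on instances that represent the range of values considered normal. Instances with a high reconstruction error should be examined further to determine whether they are a problem.

Although this wouldn't be done in a production workflow, we can also inspect these instances visually with the script below to see whether any obvious patterns exist.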

==== PlotAnomalies.R
library(aws.s3)
library(ggplot2)


# set the credentials this R instance uses to access MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin", # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "minioadmin", # enter your credentials
           "AWS_DEFAULT_REGION" = "",
           "AWS_S3_ENDPOINT" = "<MinIO-IP-Address>:9000") 


b <- get_bucket(bucket = "access-log-anomaly-dataframes", region = "", use_https = FALSE)
df <- s3read_using(FUN = readRDS, object = "access-log-anomaly-dataframe.rda", bucket = b,
                   opts = list(use_https = FALSE, region = ""))


# now let's look at the results
# access_time is an hour-of-week index: day of week * 24 + hour of day
df$access_time <- (df$wday * 24) + as.numeric(as.character(df$access_hour))


# jitter the points so overlapping requests stay visible
jitter <- position_jitter(width = 0.2, height = 0.2)
# color is set outside aes() so the points are drawn in red without a legend
p <- ggplot(df, aes(x = country_name, y = access_time)) +
  geom_point(color = "red", position = jitter) +
  theme(axis.text.x = element_text(angle = 90))

print(p)


==== end PlotAnomalies.R

When the entries are plotted by country and access_time (here access_time is wday * 24 + access_hour), we start to see some clustering.
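As a small worked example of that arithmetic, the sketch below computes the index for three made-up requests and splits it back into its parts. It assumes `wday` follows the `lubridate::wday()` default numbering (1 = Sunday through 7 = Saturday) and that the hour arrives as a factor, as it does in the data frame produced by the pre-processing step.

```r
# Hours arrive as a factor with labels such as "00".."23".
access_hour <- factor(c("14", "03", "23"))
wday <- c(7, 1, 4)   # assumed numbering: 1 = Sunday ... 7 = Saturday

# as.numeric() on a factor returns the internal level codes, not the labels,
# so convert to character first.
hour_num <- as.numeric(as.character(access_hour))

# Hour-of-week index used on the y axis of the plot.
access_time <- wday * 24 + hour_num

# The index can be split back into its two parts.
wday_back <- access_time %/% 24
hour_back <- access_time %% 24

print(access_time)
print(wday_back)
print(hour_back)
```

With this numbering the index runs from 24 (Sunday, hour 0) to 191 (Saturday, hour 23), so each band of 24 values on the y axis corresponds to one day of the week.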

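This concludes the tutorial, but it is important to understand that this would not be the end of a log file analysis workflow. This blog post showed how to take a raw log file, convert it into a useful data frame, train an anomaly detection deep learning autoencoder, and save the trained model. We then used the log pre-processing code and the trained model to run inference on new log chunks, with MinIO Lambda Compute Bucket Notifications driving the workflow.

Get Started with the Log File Anomaly Detection Workflow

Starting with the right tools simplifies building ML data pipelines and reduces the time and effort needed to gain insight from raw data.

The technologies highlighted in this blog post, R, H2O and MinIO, make up a powerful, flexible and fast ML toolbox. This tutorial provided an example of using anomaly detection on a standardized log file format. Working with log file data necessarily involves pre-processing the raw data in a reasonable way, and the definition of "reasonable" keeps evolving because this is an area of ongoing research. However you define "reasonable" in your organization, MinIO can serve as a strong foundation for event-driven data pre-processing and ML data pipelines.

Download MinIO and start building your ML toolbox today. If you have any questions, send an email to hello@min.io, or join the MinIO Slack channel and ask away.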