What is the best way to bulk index (around 40K files of type .docx) using ingest-attachment?

I am fairly familiar with the ELK stack and am currently using Elasticsearch 6.6. Our use case is content search over about 40K .docx files (research reports uploaded by portfolio managers; the maximum allowed file size is 10 MB, but most files are only a few KB). I have used the ingest attachment plugin to index sample test files, and I can also search the content from Kibana, for example:

POST /attachment_test/my_type/_search?pretty=true
{
  "query": {
    "match": {
      "attachment.content": "JP Morgan"
    }
  }
}

This returns the expected results. My questions:

  1. With the ingest plugin, we need to push data to the plugin. I am using VS 2017 and the Elastic NEST DLL. Does this mean I have to read the 40K documents programmatically and push them to ES using NEST commands?

  2. I have gone through the FSCrawler project and know that it can achieve the purpose, but I am keeping it as my last resort.

  3. If I were to use approach 1 (code), is there a bulk upload API for posting a number of attachments to ES together (in batches)?

Finally, I uploaded the 40K files into the Elasticsearch index using C# code:

    // Requires: using System.IO; using System.Threading; using System.Threading.Tasks; using Nest;
    // CallReportPath and the Document class are defined elsewhere in the project.
    private static void PopulateIndex(ElasticClient client)
    {
        var directory = System.Configuration.ConfigurationManager.AppSettings["CallReportPath"];
        // On .NET Framework a three-character extension pattern also matches longer
        // extensions, so "*.doc" returns both .doc and .docx files.
        var callReportsCollection = Directory.GetFiles(directory, "*.doc");
        // Metadata for every report, loaded once and looked up by file name below.
        var callReportElasticDataSet = new DLCallReportSearch().GetCallReportDetailsForElastic();
        int i = 0;

        Parallel.ForEach(callReportsCollection, callReport =>
        {
            var fileSavedName = callReport.Replace(directory, "");
            // Double any single quote so the file name is safe inside the filter expression.
            var rows = callReportElasticDataSet.Select("CALL_SAVE_FILE like '%" + fileSavedName.Replace("'", "''") + "'");
            if (rows.Length == 0)
            {
                return; // no metadata row for this file, so skip it
            }
            var row = rows[0];
            var base64File = Convert.ToBase64String(File.ReadAllBytes(callReport));
            var response = client.Index(new Document
            {
                // i++ is not thread-safe inside Parallel.ForEach; Interlocked keeps the ids unique.
                Id = Interlocked.Increment(ref i),
                DocId = Convert.ToInt32(row["CALL_ID"].ToString()),
                Path = row["CALL_SAVE_FILE"].ToString().Replace(CallReportPath, ""),
                Title = row["CALL_FILE"].ToString().Replace(CallReportPath, ""),
                Author = row["USER_NAME"].ToString(),
                DateOfMeeting = string.IsNullOrEmpty(row["CALL_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_DT"].ToString()),
                Location = row["CALL_LOCATION"].ToString(),
                UploadDate = string.IsNullOrEmpty(row["CALL_REPORT_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_REPORT_DT"].ToString()),
                CompanyName = row["COMP_NAME"].ToString(),
                CompanyId = Convert.ToInt32(row["COMP_ID"].ToString()),
                Country = row["COU_NAME"].ToString(),
                CountryCode = row["COU_CD"].ToString(),
                RegionCode = row["REGION_CODE"].ToString(),
                RegionName = row["REGION_NAME"].ToString(),
                SectorCode = row["SECTOR_CD"].ToString(),
                SectorName = row["SECTOR_NAME"].ToString(),
                // Base64 payload that the "attachments" ingest pipeline extracts text from.
                Content = base64File
            }, p => p.Pipeline("attachments"));
            if (!response.IsValid)
            {
                // Surface failures instead of silently dropping the document.
                Console.Error.WriteLine(response.DebugInformation);
            }
        });
    }
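
On question 3: the code above sends one index request per file. Elasticsearch's bulk API also accepts a pipeline parameter, so the same documents could probably be sent in batches instead. The following is only a minimal sketch of that idea with NEST 6.x, not what I actually ran. ReportDocument, BulkIndexer and the batch size of 50 are hypothetical names and values chosen for illustration; the index, type and pipeline names are taken from the examples above, and the "attachments" pipeline is assumed to exist already.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Nest;

// Hypothetical minimal document type; the real one would also carry the metadata fields.
public class ReportDocument
{
    public int Id { get; set; }
    public string Path { get; set; }
    public string Content { get; set; } // base64 payload read by the attachment processor
}

public static class BulkIndexer
{
    // Reads files lazily so that only one batch of base64 payloads is in memory at a time.
    private static IEnumerable<ReportDocument> ReadDocuments(string directory)
    {
        var id = 0;
        foreach (var file in Directory.EnumerateFiles(directory, "*.docx"))
        {
            yield return new ReportDocument
            {
                Id = ++id,
                Path = System.IO.Path.GetFileName(file),
                Content = Convert.ToBase64String(File.ReadAllBytes(file))
            };
        }
    }

    // Splits a sequence into lists of at most batchSize items.
    private static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }

    // Sends one bulk request per batch, routing every document through the ingest pipeline.
    public static void BulkIndex(IElasticClient client, string directory, int batchSize = 50)
    {
        foreach (var batch in Batch(ReadDocuments(directory), batchSize))
        {
            var response = client.Bulk(b => b
                .Index("attachment_test")
                .Type("my_type")
                .Pipeline("attachments")
                .IndexMany(batch));

            if (!response.IsValid)
            {
                // Report failed batches rather than stopping the whole run.
                Console.Error.WriteLine(response.DebugInformation);
            }
        }
    }
}
```

Since files can be up to 10 MB and base64 adds roughly a third on top of that, the batch size probably needs tuning so that a single request stays under the cluster's HTTP request size limit. NEST also ships a BulkAll helper that handles batching and retries, which may be worth a look as an alternative to the hand-written Batch method.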


 