
What is the best way to bulk index (around 40K files of type .docx) using ingest-attachment?

I am fairly familiar with the ELK stack and am currently using Elasticsearch 6.6. Our use case is content search over about 40K .docx files, uploaded by portfolio managers as research reports. The maximum allowed file size is 10 MB, but most files are only a few KB. I have used the ingest attachment plugin to index sample test files, and I can also search their content from Kibana, for example:

POST /attachment_test/my_type/_search?pretty=true

{
  "query": {
    "match": {
      "attachment.content": "JP Morgan"
    }
  }
}

This returns the expected results. My questions:

  1. With the ingest plugin, the data has to be pushed to the plugin. I am using VS 2017 and the Elastic NEST DLL. Does that mean I have to read the 40K documents programmatically and push them to ES using NEST commands?

  2. I have gone through the FSCrawler project and know that it can serve this purpose, but I am keeping it as a last resort.

  3. If I go with approach 1 (code), is there a bulk upload API for posting a number of attachments to ES together (in batches)?

In the end, I uploaded the 40K files into the Elasticsearch index using C# code:

    private static void PopulateIndex(ElasticClient client)
    {
        var directory =System.Configuration.ConfigurationManager.AppSettings["CallReportPath"].ToString();
        // On .NET Framework, a three-character extension pattern such as "*.doc" also matches .docx files.
        var callReportsCollection = Directory.GetFiles(directory, "*.doc");
        //callReportsCollection.ToList().AddRange(Directory.GetFiles(directory, "*.doc"));
        ConcurrentBag<string> reportsBag = new ConcurrentBag<string>(callReportsCollection);
        int i = 0;
        var callReportElasticDataSet = new DLCallReportSearch().GetCallReportDetailsForElastic();//.AsEnumerable();//.Take(50).CopyToDataTable();
        try
        {
            Parallel.ForEach(reportsBag, callReport =>
            //Array.ForEach(callReportsCollection,callReport=>
            {
                var base64File = Convert.ToBase64String(File.ReadAllBytes(callReport));
                var fileSavedName = callReport.Replace(directory, "");
                // var dt = dLCallReportSearch.GetCallFileName(fileSavedName.Replace("'", "''"));//replace the ' in a file name with '';
                var rows = callReportElasticDataSet.Select("CALL_SAVE_FILE like '%" + fileSavedName.Replace("'", "''") + "'");
                if (rows != null && rows.Count() > 0)
                {
                    var row = rows.FirstOrDefault();
                    //foreach (DataRow row in rows)
                    //{
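                    // i++ is not atomic inside Parallel.ForEach; Interlocked (System.Threading) avoids duplicate Ids.
                    var id = Interlocked.Increment(ref i);
                    client.Index(new Document
                    {
                        Id = id,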
                        DocId = Convert.ToInt32(row["CALL_ID"].ToString()),
                        Path = row["CALL_SAVE_FILE"].ToString().Replace(CallReportPath, ""),
                        Title = row["CALL_FILE"].ToString().Replace(CallReportPath, ""),
                        Author = row["USER_NAME"].ToString(),
                        DateOfMeeting = string.IsNullOrEmpty(row["CALL_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_DT"].ToString()),
                        Location = row["CALL_LOCATION"].ToString(),
                        UploadDate = string.IsNullOrEmpty(row["CALL_REPORT_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_REPORT_DT"].ToString()),
                        CompanyName = row["COMP_NAME"].ToString(),
                        CompanyId = Convert.ToInt32(row["COMP_ID"].ToString()),
                        Country = row["COU_NAME"].ToString(),
                        CountryCode = row["COU_CD"].ToString(),
                        RegionCode = row["REGION_CODE"].ToString(),
                        RegionName = row["REGION_NAME"].ToString(),
                        SectorCode = row["SECTOR_CD"].ToString(),
                        SectorName = row["SECTOR_NAME"].ToString(),
                        Content = base64File
                    }, p => p.Pipeline("attachments"));
                    //}
                }
            });
        }
        catch (Exception)
        {
            // Parallel.ForEach wraps failures in an AggregateException; rethrow with the stack trace intact.
            throw;
        }
    }
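
On question 3: Elasticsearch does have a `_bulk` endpoint, and it accepts the same `pipeline` parameter as a single index request, so the one-document-per-call loop above could probably be batched. The sketch below covers only the batching step. It groups documents by count and by payload size, because each file is sent as base64 (4 characters per 3 bytes) and a bulk request has to stay under `http.max_content_length` (100 MB by default). `Batcher`, `Split` and `Base64Length` are hypothetical names, not part of NEST, and the NEST calls in the trailing comment are written from memory of the 6.x client, so they should be checked against the version in use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of NEST): groups items into bulk-sized batches.
public static class Batcher
{
    // Splits items into batches of at most maxItems entries and, where possible,
    // at most maxBytes of payload as reported by sizeOf. An item that alone
    // exceeds maxBytes is still emitted, in a batch of its own.
    public static IEnumerable<List<T>> Split<T>(
        IEnumerable<T> items, Func<T, long> sizeOf, int maxItems, long maxBytes)
    {
        var batch = new List<T>();
        long batchBytes = 0;

        foreach (var item in items)
        {
            long size = sizeOf(item);

            // Close the current batch if adding this item would break a limit.
            if (batch.Count > 0 &&
                (batch.Count >= maxItems || batchBytes + size > maxBytes))
            {
                yield return batch;
                batch = new List<T>();
                batchBytes = 0;
            }

            batch.Add(item);
            batchBytes += size;
        }

        if (batch.Count > 0)
        {
            yield return batch;
        }
    }

    // Length of the base64 text for rawBytes input bytes:
    // every 3 bytes become 4 characters, with padding on the last group.
    public static long Base64Length(long rawBytes)
    {
        return (rawBytes + 2) / 3 * 4;
    }
}

// Possible use with NEST 6.x (assumed API, verify before relying on it),
// where documents is an IEnumerable<Document> built as in PopulateIndex:
//
// foreach (var batch in Batcher.Split(documents, d => d.Content.Length, 500, 50L * 1024 * 1024))
// {
//     var response = client.Bulk(b => b.Pipeline("attachments").IndexMany(batch));
//     if (response.Errors)
//     {
//         // inspect response.ItemsWithErrors and retry or log the failures
//     }
// }
```

The limits of 500 documents and 50 MB per request are placeholders, not measured values. With files that are mostly a few KB, the item count will usually be the limit that closes a batch, and the byte limit only matters when several of the larger (up to 10 MB) reports land together.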
