How to index a pdf file using Elasticsearch ingest-attachment plugin?
What is the best way to bulk index around 40K .docx files using ingest-attachment?
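I am fairly familiar with the ELK stack and currently use Elasticsearch 6.6. Our use case is searching the content of about 40K .docx files (uploaded by portfolio managers as research reports; the maximum allowed file size is 10 MB, but most files are only a few KB). I used the ingest attachment plugin to index a sample test file, and I can search its content from Kibana, for example:
POST /attachment_test/my_type/_search?pretty=true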
{
  "query": {
    "match": {
      "attachment.content": "JP Morgan"
    }
  }
}
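This returns the results I expect. My questions:
1. With the ingest plugin, we need to push the data to the plugin. I am using VS 2017 and the Elastic NEST DLL. Does that mean I have to read the 40K documents programmatically and push them to ES with NEST commands?
2. I have looked at the FSCrawler project and know it can do the job, but I am keeping it as my last resort.
3. If I go with approach 1 (code), is there a bulk upload API for posting a number of attachments to ES together (in batches)?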
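On the bulk question: the Elasticsearch `_bulk` endpoint accepts a `pipeline` query parameter, so several attachments can go through the ingest pipeline in a single request. Below is a minimal sketch in Kibana Console syntax, not a definitive setup. It assumes the pipeline is named `attachments` (the name used in the C# code in this post) and that its attachment processor reads base64 data from a field called `content`; that field name is an assumption, so adjust it to your pipeline definition. The document IDs and the base64 payload (a tiny RTF sample) are illustrative only.

```
POST /_bulk?pipeline=attachments
{ "index": { "_index": "attachment_test", "_type": "my_type", "_id": "1" } }
{ "content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=" }
{ "index": { "_index": "attachment_test", "_type": "my_type", "_id": "2" } }
{ "content": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=" }
```

Base64 encoding makes each file roughly one third larger, and Elasticsearch rejects HTTP requests above `http.max_content_length` (100 MB by default), so with files of up to 10 MB it is probably safer to send a handful of documents per bulk request rather than hundreds. NEST exposes the same endpoint through `client.Bulk`.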
In the end, I used the following C# code to upload the 40K files to the Elasticsearch index:
// Assumes these using directives: System, System.Collections.Concurrent, System.Data,
// System.IO, System.Linq, System.Threading, System.Threading.Tasks, Nest.
private static void PopulateIndex(ElasticClient client)
{
    var directory = System.Configuration.ConfigurationManager.AppSettings["CallReportPath"].ToString();
    // On .NET Framework, a three-character extension pattern such as "*.doc" also matches ".docx".
    var callReportsCollection = Directory.GetFiles(directory, "*.doc");
    ConcurrentBag<string> reportsBag = new ConcurrentBag<string>(callReportsCollection);
    int i = 0; // shared document counter, incremented atomically below
    // Report metadata, loaded once and looked up by file name for each file.
    var callReportElasticDataSet = new DLCallReportSearch().GetCallReportDetailsForElastic();
    try
    {
        Parallel.ForEach(reportsBag, callReport =>
        {
            // The ingest pipeline receives the file content as a base64 string.
            var base64File = Convert.ToBase64String(File.ReadAllBytes(callReport));
            var fileSavedName = callReport.Replace(directory, "");
            // Single quotes in the file name are doubled to escape them in the filter expression.
            var rows = callReportElasticDataSet.Select("CALL_SAVE_FILE like '%" + fileSavedName.Replace("'", "''") + "'");
            if (rows != null && rows.Count() > 0)
            {
                var row = rows.FirstOrDefault();
                // i++ is not atomic inside Parallel.ForEach; Interlocked keeps the IDs unique.
                var id = Interlocked.Increment(ref i);
                client.Index(new Document
                {
                    Id = id,
                    DocId = Convert.ToInt32(row["CALL_ID"].ToString()),
                    Path = row["CALL_SAVE_FILE"].ToString().Replace(CallReportPath, ""),
                    Title = row["CALL_FILE"].ToString().Replace(CallReportPath, ""),
                    Author = row["USER_NAME"].ToString(),
                    DateOfMeeting = string.IsNullOrEmpty(row["CALL_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_DT"].ToString()),
                    Location = row["CALL_LOCATION"].ToString(),
                    UploadDate = string.IsNullOrEmpty(row["CALL_REPORT_DT"].ToString()) ? (DateTime?)null : Convert.ToDateTime(row["CALL_REPORT_DT"].ToString()),
                    CompanyName = row["COMP_NAME"].ToString(),
                    CompanyId = Convert.ToInt32(row["COMP_ID"].ToString()),
                    Country = row["COU_NAME"].ToString(),
                    CountryCode = row["COU_CD"].ToString(),
                    RegionCode = row["REGION_CODE"].ToString(),
                    RegionName = row["REGION_NAME"].ToString(),
                    SectorCode = row["SECTOR_CD"].ToString(),
                    SectorName = row["SECTOR_NAME"].ToString(),
                    Content = base64File
                }, p => p.Pipeline("attachments")); // run the document through the "attachments" ingest pipeline
            }
        });
    }
    catch (Exception)
    {
        throw; // rethrow as-is; "throw ex" would reset the stack trace
    }
}