Getting all documents from an ES index using Python

Question

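I am trying to fetch all documents from an ES index named news (44,908 documents) and store them in a DataFrame.

However, when I run the script I only get the first ten documents.

Here is my code: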

import numpy as np
import pandas as pd
from elasticsearch import Elasticsearch

esClient = Elasticsearch()

# Search the index with an empty request body
response = esClient.search(index = 'news',
                                body = {},
                                )

#scrollId = response["_scroll_id"]
#print(scrollId)

# Collect the wanted fields from each returned hit
esDocs = response["hits"]["hits"]
fields = {}
for num, doc in enumerate(esDocs):
    sourceData = doc["_source"]
    
    #response = esClient.scroll(scroll_id=scrollId, scroll = '1m')
    #scrollId = response['_scroll_id']
    #print(scrollId)
    
    for key, val in sourceData.items():
        
        # Keep only the three fields of interest
        if key in ('tags', 'text', 'title'):
            
            try:
                fields[key] = np.append(fields[key], val)
            except KeyError:
                fields[key] = np.array([val])

df = pd.DataFrame(fields)

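I tried using .scroll() but it did not work; I still get only the first 10 documents.

I also tried specifying size = number, but that is not what I am looking for...

Here is my output DataFrame

Note: I am using Jupyter Notebook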

Answer 1

You need to specify size, the number of documents to return:

esClient.search(index = 'news', body = {'size': 44908})

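But that is a lot of documents to return in one response, and it may crash. Note also that by default Elasticsearch rejects a search where from + size exceeds index.max_result_window (10,000), so a size of 44908 only works if that index setting has been raised.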
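
One alternative to a single oversized request is the scroll API that the question attempted. A plausible reason the attempt still returned ten documents is that the first search was sent without a scroll parameter (so the response carries no _scroll_id) and only the first page of hits was ever looped over. The following is a minimal sketch, not the answerer's code. It assumes a 7.x-style elasticsearch-py client like the one in the question; iter_all_sources and to_dataframe are hypothetical helper names, and the page size of 1000 is an arbitrary choice.

```python
from typing import Any, Dict, Iterator, Sequence


def iter_all_sources(client: Any, index: str, page_size: int = 1000,
                     keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield the _source of every document in `index`, one page at a time."""
    # Passing scroll= on the initial search is what makes Elasticsearch
    # return a _scroll_id that later scroll() calls can use.
    response = client.search(
        index=index,
        scroll=keep_alive,
        body={"size": page_size, "query": {"match_all": {}}},
    )
    scroll_id = response.get("_scroll_id")
    try:
        while True:
            hits = response["hits"]["hits"]
            if not hits:
                break  # an empty page means the scroll is exhausted
            for hit in hits:
                yield hit["_source"]
            # Fetch the next page; the scroll id may change between calls.
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context when done or on error.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)


def to_dataframe(client: Any, index: str, fields: Sequence[str]):
    """Build a DataFrame with one row per document (hypothetical helper)."""
    import pandas as pd  # imported lazily so the iterator works without pandas

    rows = [{f: src.get(f) for f in fields}
            for src in iter_all_sources(client, index)]
    return pd.DataFrame(rows)
```

With the question's names this would be called as to_dataframe(esClient, 'news', ['tags', 'text', 'title']). Building one dict per document also keeps rows aligned when a document lacks a field, which per-column np.append does not. The elasticsearch.helpers.scan function in elasticsearch-py wraps a similar scroll loop.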

Answer 2

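If you are trying to access an Elasticsearch index through a pandas DataFrame API, take a look at Eland. With it, you do not have to load all the documents into memory in order to perform operations on them.

<Disclosure: I am a maintainer of Eland and am employed by Elastic>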
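
As a rough illustration of the Eland approach, the sketch below opens the index as an Eland DataFrame and, only when needed, pulls it into pandas. It is not taken from the answer. It assumes the eland package is installed and that its DataFrame constructor and eland_to_pandas function behave as in the Eland documentation; open_index_as_frame and materialize are hypothetical wrapper names.

```python
from typing import Any, Optional, Sequence


def open_index_as_frame(client: Any, index: str,
                        columns: Optional[Sequence[str]] = None):
    """Return an Eland DataFrame backed by the index (no bulk download)."""
    import eland as ed  # imported lazily; requires `pip install eland`

    return ed.DataFrame(
        es_client=client,
        es_index_pattern=index,
        columns=list(columns) if columns else None,
    )


def materialize(ed_df):
    """Copy an Eland DataFrame into an in-memory pandas DataFrame."""
    import eland as ed

    return ed.eland_to_pandas(ed_df)
```

For the question's index this might look like open_index_as_frame(esClient, 'news', ['tags', 'text', 'title']). Operations on the returned object are translated into Elasticsearch queries, so materialize is only worth calling when a real pandas DataFrame is required and the data fits in memory.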

Answer 1
0 2020-07-11 12:47:30

Answer 2
0 (accepted) 2020-07-11 13:53:22
