
How do you use the Elasticsearch Ingest Attachment Processor Plugin with the Python package elasticsearch-dsl

I'm having trouble using the Ingest Attachment Processor Plugin with Elasticsearch (5.5 on AWS, 5.6 locally). I'm developing in Python (3.6) and am using the elasticsearch-dsl library.

I'm using Persistence and have my class set up like this:

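import base64
import logging

from elasticsearch.exceptions import NotFoundError
from elasticsearch_dsl.field import Attachment, Text
from elasticsearch_dsl import DocType, Index, analyzer

logger = logging.getLogger(__name__)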

lower_keyword = analyzer('keyword', tokenizer="keyword", filter=["lowercase"])

class ExampleIndex(DocType):
    class Meta:
        index = 'example'
        doc_type = 'Example'

    id = Text()
    name = Text(analyzer=lower_keyword)
    my_file = Attachment()

Then I have a function like this, which I can call to create the index and save the document.

def index_doc(a_file):
    # Ensure that the Index is created before any documents are saved
    try:
        i = Index('example')
        i.doc_type(ExampleIndex)
        i.create()

        # todo - Pipeline creation needs to go here - But how do you do it?

    except Exception:
        pass

    # Check for existing index
    indices = ExampleIndex()
    try:
        s = indices.search()
        r = s.query('match', name=a_file.name).execute()
        if r.success():
            for h in r:
                indices = ExampleIndex.get(id=h.meta.id)
                break
    except NotFoundError:
        pass
    except Exception:
        logger.exception("Something went wrong")
        raise

    # Populate the document   
    indices.name = a_file.name
    with open(a_file.path_to_file, 'rb') as f:
        contents = f.read()
    indices.my_file = base64.b64encode(contents).decode("ascii")

    indices.save(pipeline="attachment") if indices.my_file else indices.save()

I have a text file that contains the text "This is a test document". When its contents are base64-encoded, they become VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQK
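
To make the encoding step concrete, here is a minimal sketch that reproduces the value above with the standard library only. The trailing newline in `contents` is inferred from the encoded string (its final `K` decodes to a newline); it is not stated in the question.

```python
import base64

# File contents as bytes; the trailing newline is inferred from the
# base64 value quoted in the question, not stated there explicitly
contents = b"This is a test document\n"

# The attachment processor expects the field value as a base64 string
encoded = base64.b64encode(contents).decode("ascii")
print(encoded)

# Decoding the string gives back the original bytes
decoded = base64.b64decode(encoded)
```

This is the same `base64.b64encode(...).decode("ascii")` call that `index_doc` applies to the file it reads.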

If I use CURL directly, it works:

Create the pipeline:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "my_file"
      }
    }
  ]
}'

Put the data:

curl -XPUT 'localhost:9200/example/Example/AV9nkyJMZAQ2lQ3CtsLb?pipeline=attachment&pretty' \
-H 'Content-Type: application/json' \
-d '{"my_file": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQK"}'

Get the data: http://localhost:9200/example/Example/AV9nkyJMZAQ2lQ3CtsLb?pretty

{
    "_index" : "example",
    "_type" : "Example",
    "_id" : "AV9nkyJMZAQ2lQ3CtsLb",
    "_version" : 4,
    "found" : true,
    "_source" : {
        "my_file" : "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQK",
        "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "language" : "en",
            "content" : "This is a test document",
            "content_length" : 25
        }
    }
}
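
For illustration, this is roughly how the extracted text could be read back out of that response in Python. The response is pasted in as a string so the sketch runs without a cluster; the variable names are only for this example. The processor writes its output under the `attachment` key, which is its default target field.

```python
import json

# The GET response shown above, pasted as a string for illustration
response_text = """
{
    "_index": "example",
    "_type": "Example",
    "_id": "AV9nkyJMZAQ2lQ3CtsLb",
    "_version": 4,
    "found": true,
    "_source": {
        "my_file": "VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQK",
        "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "en",
            "content": "This is a test document",
            "content_length": 25
        }
    }
}
"""

doc = json.loads(response_text)

# The pipeline leaves the original base64 field in place and adds
# the extracted metadata and text alongside it
attachment = doc["_source"].get("attachment", {})
content = attachment.get("content")
print(content)
```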

The trouble is that I can't see how to recreate this using the elasticsearch-dsl Python library.

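UPDATE: I can now get everything working apart from the initial creation of the pipeline. If I create the pipeline using CURL, I can then use it by changing the .save() method call to .save(pipeline="attachment"). I have updated my earlier function to show this, and added a comment where the pipeline creation needs to go.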
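
The conditional save can also be written as a small helper that decides which keyword arguments to pass. This is only a sketch: `save_kwargs` is a hypothetical name, not part of elasticsearch-dsl, and it relies on `.save()` accepting the `pipeline` keyword as described above.

```python
def save_kwargs(encoded_file, pipeline_id="attachment"):
    """Return the extra keyword arguments to pass to .save()."""
    # Only route the document through the ingest pipeline when there is
    # base64 data for the attachment processor to work on
    if encoded_file:
        return {"pipeline": pipeline_id}
    return {}


print(save_kwargs("VGhpcyBpcyBhIHRlc3QgZG9jdW1lbnQK"))
print(save_kwargs(None))
```

In `index_doc` the last line would then become `indices.save(**save_kwargs(indices.my_file))`.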

Here is an example of the CURL implementation for creating the pipeline:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment?pretty' \
     -H 'Content-Type: application/json' \
     -d '{"description": "Extract attachment information","processors": [{"attachment": {"field": "my_file"}}]}'

The answer to the question is to use the IngestClient from the lower-level elasticsearch-py library to create the pipeline before using it.

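from elasticsearch.client.ingest import IngestClient

# es_connection is an existing Elasticsearch client (for example, the object
# returned by elasticsearch_dsl's connections.create_connection)
p = IngestClient(es_connection)
p.put_pipeline(id='attachment', body={
    'description': "Extract attachment information",
    'processors': [
        # "field" names the document field that holds the base64-encoded file
        {"attachment": {"field": "cv"}}
    ]
})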
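
If the same pipeline is needed for more than one field, the body can be built by a small function and handed to the client. The sketch below is one possible arrangement, not the library's own API: `attachment_pipeline_body` and `ensure_attachment_pipeline` are hypothetical helper names, and `ingest_client` is assumed to be an `IngestClient` instance like `p` above.

```python
def attachment_pipeline_body(field):
    """Build the pipeline definition for one base64-encoded field."""
    return {
        "description": "Extract attachment information",
        "processors": [
            {"attachment": {"field": field}}
        ],
    }


def ensure_attachment_pipeline(ingest_client, pipeline_id, field):
    """Create (or overwrite) the pipeline and return the body that was sent."""
    body = attachment_pipeline_body(field)
    # put_pipeline issues a PUT, so repeating the call replaces the
    # existing definition rather than failing
    ingest_client.put_pipeline(id=pipeline_id, body=body)
    return body


print(attachment_pipeline_body("my_file"))
```

With a live connection this would be called as `ensure_attachment_pipeline(IngestClient(es_connection), "attachment", "my_file")`.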

A complete working example that creates the pipeline, index, and document in Elasticsearch using the elasticsearch-dsl persistence flow (DocType) is:

import base64
from uuid import uuid4
from elasticsearch.client.ingest import IngestClient
from elasticsearch.exceptions import NotFoundError
from elasticsearch_dsl import analyzer, DocType, Index
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl.field import Attachment, Text


# Establish a connection
host = '127.0.0.1'
port = 9200
es = connections.create_connection(host=host, port=port)

# Some custom analyzers
html_strip = analyzer('html_strip', tokenizer="standard", filter=["standard", "lowercase", "stop", "snowball"],
                      char_filter=["html_strip"])
lower_keyword = analyzer('keyword', tokenizer="keyword", filter=["lowercase"])


class ExampleIndex(DocType):
    class Meta:
        index = 'example'
        doc_type = 'Example'

    id = Text()
    uuid = Text()
    name = Text()
    town = Text(analyzer=lower_keyword)
    my_file = Attachment(analyzer=html_strip)


def save_document(doc):
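    """
    Create the pipeline and index if needed, then create or update the document.

    :param obj doc: Example object containing values to save
    :return: the result of ExampleIndex.save()
    """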
    try:
        # Create the Pipeline BEFORE creating the index
        p = IngestClient(es)
        p.put_pipeline(id='myattachment', body={
            'description': "Extract attachment information",
            'processors': [
                {
                    "attachment": {
                        "field": "my_file"
                    }
                }
            ]
        })

        # Create the index. An exception will be raised if it already exists
        i = Index('example')
        i.doc_type(ExampleIndex)
        i.create()
    except Exception:
        # todo - should be restricted to the expected Exception subclasses
        pass

    indices = ExampleIndex()
    try:
        s = indices.search()
        r = s.query('match', uuid=doc.uuid).execute()
        if r.success():
            for h in r:
                indices = ExampleIndex.get(id=h.meta.id)
                break
    except NotFoundError:
        # New record
        pass
    except Exception:
        print("Unexpected error")
        raise

    # Now set the doc properties
    indices.uuid = doc.uuid
    indices.name = doc.name
    indices.town = doc.town
    if doc.my_file:
        with open(doc.my_file, 'rb') as f:
            contents = f.read()
        indices.my_file = base64.b64encode(contents).decode("ascii")

    # Save the document, using the attachment pipeline if a file was attached
    return indices.save(pipeline="myattachment") if indices.my_file else indices.save()


class MyObj(object):
    def __init__(self, name, town, file):
        # Generate the UUID per instance; a class-level uuid4() would be
        # evaluated once and shared by every MyObj
        self.uuid = str(uuid4())
        self.name = name
        self.town = town
        self.my_file = file


me = MyObj("Steve", "London", '/home/steve/Documents/test.txt')

res = save_document(me)
