繁体   English   中英

在 Elasticsearch 中插入多个文档 - 批量文档格式化程序

[英]Insert multiple documents in Elasticsearch - bulk doc formatter

TLDR; 如何批量格式化我的 JSON 文件以摄取到 Elasticsearch?

我正在尝试将一些 NOAA 数据摄取到 Elasticsearch 并且一直在使用NOAA Python SDK

我编写了以下 Python 脚本来加载数据并将其存储为 JSON 格式。

from noaa_sdk import noaa
import json

n = noaa.NOAA()
alerts = n.alerts()
f = open('nhc_alerts.json', 'w')
json.dump(alerts, f)
f.write('\n')

JSON Output:

{"@context": ["https://raw.githubusercontent.com/geojson/geojson-ld/master/contexts/geojson-base.jsonld", {"wx": "https://api.weather.gov/ontology#", "@vocab": "https://api.weather.gov/ontology#"}], "type": "FeatureCollection", "features": [{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-5246", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-5246", "@type": "wx:Alert", "id": "NWS-IDP-PROD-KEEPALIVE-5246", "areaDesc": "Montgomery", "geocode": {"UGC": ["MDC031"], "SAME": ["024031"]}, "affectedZones": ["https://api.weather.gov/zones/county/MDC031"], "references": [], "sent": "2020-04-25T19:21:03+00:00", "effective": "2020-04-25T19:21:03+00:00", "onset": null, "expires": "2020-04-25T19:31:03+00:00", "ends": null, "status": "Test", "messageType": "Alert", "category": "Met", "severity": "Unknown", "certainty": "Unknown", "urgency": "Unknown", "event": "Test Message", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS", "headline": null, "description": "Monitoring message only. Please disregard.", "instruction": "Monitoring message only. Please disregard.", "response": "None", "parameters": {"PIL": ["NWSKEPWBC"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"]}}}, {"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179499-3536427", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179499-3536427", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4179499-3536427", "areaDesc": "La Salle; Livingston", "geocode": {"UGC": ["ILZ019", "ILZ032"], "SAME": ["017099", "017105"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ILZ019", "https://api.weather.gov/zones/forecast/ILZ032"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179245-3536278", "identifier": "NWS-IDP-PROD-4179245-3536278", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-25T10:02:00-05:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4178935-3536074", "identifier": "NWS-IDP-PROD-4178935-3536074", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-25T03:09:00-05:00"}], "sent": "2020-04-25T14:21:00-05:00", "effective": "2020-04-25T14:21:00-05:00", "onset": "2020-04-25T14:21:00-05:00", "expires": "2020-04-25T22:30:00-05:00", "ends": "2020-04-26T01:00:00-05:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Severe", "certainty": "Possible", "urgency": "Future", "event": "Flood Watch", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Chicago IL", "headline": "Flood Watch issued April 25 at 2:21PM CDT until April 26 at 1:00AM CDT by NWS Chicago IL", "description": "The Flood Watch is now in effect for\n\n* Livingston and La Salle counties in north central Illinois\n\n* Until 1 AM CDT Sunday\n\n* WHAT...Steady rain. One to two inches of rain has already\nfallen. Additional rainfall amounts of one inch or locally more\nare possible which may lead to total rainfall amounts in excess\nof three inches.\n\n* IMPACTS...Rises in rivers and small streams will occur with\nflooding possible. This especially includes the Vermilion River\nand its tributary streams, and the Illinois River. Roadways,\nviaducts, ditches, agricultural land, and other poor drainage\nareas may become flooded.", "instruction": "A Flood Watch means there is a potential for flooding based on\ncurrent forecasts.\n\nYou should monitor later forecasts and be alert for possible\nFlood Warnings. Those living in areas prone to flooding should be\nprepared to take action should flooding develop.", "response": "Prepare", "parameters": {"NWSheadline": ["FLOOD WATCH NOW IN EFFECT UNTIL 1 AM CDT SUNDAY"], "VTEC": ["/O.EXT.KLOT.FA.A.0002.000000T0000Z-200426T0600Z/"], "EAS-ORG": ["WXR"], "PIL": ["LOTFFALOT"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-04-26T01:00:00-05:00"]}}}, {"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179497-3536425", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179497-3536425", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4179497-3536425", "areaDesc": "San Luis Obispo County Central Coast; Santa Barbara County Central Coast; Santa Ynez Valley", "geocode": {"UGC": ["CAZ034", "CAZ035", "CAZ036"], "SAME": ["006079", "006083"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/CAZ034", "https://api.weather.gov/zones/forecast/CAZ035", "https://api.weather.gov/zones/forecast/CAZ036"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4177692-3535278", "identifier": "NWS-IDP-PROD-4177692-3535278", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-24T08:54:00-07:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4178774-3535999", "identifier": "NWS-IDP-PROD-4178774-3535999", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-24T21:37:00-07:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179040-3536147", "identifier": "NWS-IDP-PROD-4179040-3536147", "sender": "w-nws.webmaster@noaa.gov", "sent": 

这个脚本处理了我遇到的一些格式化问题,我的下一个障碍是尝试格式化它,以便我可以在 elasticsearch 中使用批量导入 function。 我偶然发现了一个在一定程度上有效的答案,我遇到的问题是它会插入适当的索引字符串,但它是在每个字符之后执行的。

批量转换脚本:

import json


JSON_FILE_IN = "nhc_alerts.json"
JSON_FILE_OUT = "nhc_bulk.json"


out = open(JSON_FILE_OUT, 'w')
with open(JSON_FILE_IN, 'r') as json_in:
    docs = json.dumps(json_in.read())
    for doc in docs:
        out.write('%s\n' % json.dumps({'index': {}}));
        out.write('%s\n' % json.dumps(doc, indent=0).replace('\n', ''))

Output 来自批量脚本:

{"index": {}}
"\""
{"index": {}}
"{"
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
"@"
{"index": {}}
"c"
{"index": {}}
"o"
{"index": {}}
"n"
{"index": {}}
"t"
{"index": {}}
"e"
{"index": {}}
"x"
{"index": {}}
"t"
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
":"
{"index": {}}
" "
{"index": {}}
"["
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
"h"
{"index": {}}
"t"
{"index": {}}
"t"
{"index": {}}
"p"
{"index": {}}
"s"
{"index": {}}
":"
{"index": {}}
"/"
{"index": {}}
"/"
{"index": {}}
"r"
{"index": {}}
"a"
{"index": {}}
"w"
{"index": {}}
"."
{"index": {}}
"g"
{"index": {}}
"i"
{"index": {}}
"t"
{"index": {}}
"h"
{"index": {}}
"u"
{"index": {}}
"b"
{"index": {}}
"u"
{"index": {}}
"s"
{"index": {}}
"e"
{"index": {}}
"r"
{"index": {}}
"c"
{"index": {}}
"o"
{"index": {}}
"n"
{"index": {}}

理想情况下,我想将这两个脚本合二为一,但此时,如果能完成工作,我将运行两个单独的脚本。

您可以使用官方 python package 的bulk方法:

import json

from noaa_sdk import noaa
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()['features']

es = Elasticsearch()


def save_alerts():
    with open('nhc_alerts.json', 'w') as f:
        f.write(json.dumps(alerts))


def bulk_sync():
    actions = [
        {
            "_index": "my_noaa_index",
            "_source": alert
        } for alert in alerts
    ]

    bulk(es, actions)


save_alerts()
bulk_sync()

问题是alerts JSON 转储都在一行上,所以它不会像现在这样工作。 您需要迭代所有警报(我怀疑alerts.features数组中的任何内容)并一次性完成所有操作,而无需通过中间文件,如下所示:

n = noaa.NOAA()
alerts = n.alerts()
f = open('nhc_alerts.json', 'w')
for alert in alerts['features']:
  f.write('%s\n' % json.dumps({'index': {}}));
  f.write('%s\n' % json.dumps(alert, indent=0).replace('\n', ''))
f.write('\n')

我怀疑这条线稍后会在json.dumps(json_in.read())上导致错误。 json.dumps返回一个字符串。 当你迭代一个字符串时,就像你在下一行中所做的那样,然后你迭代字符。

我认为您真正想要的是以下内容。 它将alert["features“]的每个feature保存为 json 格式的新行。

from noaa_sdk import noaa
import json
from pathlib import Path


noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()

save_path = Path('.') / "alert.json"
with save_path.open("a") as f:
    for feature in alerts["features"]:
        json.dump(feature, f)
        f.write("\n")

结果:

{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "@type": "wx:Alert", "id": "NWS-IDP-PROD-KEEPALIVE-16211", "areaDesc": "Montgomery", "geocode": {"UGC": ["MDC031"], "SAME": ["024031"]}, "affectedZones": ["https://api.weather.gov/zones/county/MDC031"], "references": [], "sent": "2020-05-06T16:55:56+00:00", "effective": "2020-05-06T16:55:56+00:00", "onset": null, "expires": "2020-05-06T17:05:56+00:00", "ends": null, "status": "Test", "messageType": "Alert", "category": "Met", "severity": "Unknown", "certainty": "Unknown", "urgency": "Unknown", "event": "Test Message", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS", "headline": null, "description": "Monitoring message only. Please disregard.", "instruction": "Monitoring message only. Please disregard.", "response": "None", "parameters": {"PIL": ["NWSKEPWBC"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197938-3548807", "areaDesc": "Coastal waters from NC VA border to Currituck Beach Light NC out 20 nm; Coastal Waters from Cape Charles Light to Virginia-North Carolina border out to 20 nm", "geocode": {"UGC": ["ANZ658", "ANZ656"], "SAME": ["073658", "073656"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ658", "https://api.weather.gov/zones/forecast/ANZ656"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197751-3548667", "identifier": "NWS-IDP-PROD-4197751-3548667", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197640-3548624", "identifier": "NWS-IDP-PROD-4197640-3548624", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197422-3548452", "identifier": "NWS-IDP-PROD-4197422-3548452", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-07T04:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", "description": "* WHAT...Northwest winds 15 to 20 kt with gusts up to 25 kt and\nseas 3 to 5 ft expected.\n\n* WHERE...Coastal Waters from Cape Charles Light to Virginia-\nNorth Carolina border out to 20 nm and Coastal waters from NC\nVA border to Currituck Beach Light NC out 20 nm.\n\n* WHEN...From 4 AM to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 4 AM TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0800Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197936-3548805", "areaDesc": "Chesapeake Bay from Smith Point to Windmill Point VA; Chesapeake Bay from New Point Comfort to Little Creek VA; Chesapeake Bay from Windmill Point to New Point Comfort VA; Chesapeake Bay from Little Creek VA to Cape Henry VA including the Chesapeake Bay Bridge Tunnel", "geocode": {"UGC": ["ANZ630", "ANZ632", "ANZ631", "ANZ634"], "SAME": ["073630", "073632", "073631", "073634"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ630", "https://api.weather.gov/zones/forecast/ANZ632", "https://api.weather.gov/zones/forecast/ANZ631", "https://api.weather.gov/zones/forecast/ANZ634"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197423-3548453", "identifier": "NWS-IDP-PROD-4197423-3548453", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197750-3548666", "identifier": "NWS-IDP-PROD-4197750-3548666", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197641-3548625", "identifier": "NWS-IDP-PROD-4197641-3548625", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-06T22:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", "description": "* WHAT...North winds 10 to 20 kt with gusts up to 25 kt and\nwaves 2 to 3 ft expected.\n\n* WHERE...Chesapeake Bay from Little Creek VA to Cape Henry VA\nincluding the Chesapeake Bay Bridge Tunnel, Chesapeake Bay\nfrom New Point Comfort to Little Creek VA, Chesapeake Bay from\nSmith Point to Windmill Point VA and Chesapeake Bay from\nWindmill Point to New Point Comfort VA.\n\n* WHEN...From 10 PM this evening to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 10 PM THIS EVENING TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0200Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
...

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM