
Insert multiple documents in Elasticsearch - bulk doc formatter

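TL;DR: How can I format my JSON file for bulk ingestion into Elasticsearch?

I am trying to ingest some NOAA data into Elasticsearch and have been using the NOAA Python SDK.

I wrote the following Python script to load the data and store it in JSON format.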

from noaa_sdk import noaa
import json

n = noaa.NOAA()
alerts = n.alerts()
f = open('nhc_alerts.json', 'w')
json.dump(alerts, f)
f.write('\n')

JSON Output:

{"@context": ["https://raw.githubusercontent.com/geojson/geojson-ld/master/contexts/geojson-base.jsonld", {"wx": "https://api.weather.gov/ontology#", "@vocab": "https://api.weather.gov/ontology#"}], "type": "FeatureCollection", "features": [{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-5246", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-5246", "@type": "wx:Alert", "id": "NWS-IDP-PROD-KEEPALIVE-5246", "areaDesc": "Montgomery", "geocode": {"UGC": ["MDC031"], "SAME": ["024031"]}, "affectedZones": ["https://api.weather.gov/zones/county/MDC031"], "references": [], "sent": "2020-04-25T19:21:03+00:00", "effective": "2020-04-25T19:21:03+00:00", "onset": null, "expires": "2020-04-25T19:31:03+00:00", "ends": null, "status": "Test", "messageType": "Alert", "category": "Met", "severity": "Unknown", "certainty": "Unknown", "urgency": "Unknown", "event": "Test Message", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS", "headline": null, "description": "Monitoring message only. Please disregard.", "instruction": "Monitoring message only. 
Please disregard.", "response": "None", "parameters": {"PIL": ["NWSKEPWBC"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"]}}}, {"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179499-3536427", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179499-3536427", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4179499-3536427", "areaDesc": "La Salle; Livingston", "geocode": {"UGC": ["ILZ019", "ILZ032"], "SAME": ["017099", "017105"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ILZ019", "https://api.weather.gov/zones/forecast/ILZ032"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179245-3536278", "identifier": "NWS-IDP-PROD-4179245-3536278", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-25T10:02:00-05:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4178935-3536074", "identifier": "NWS-IDP-PROD-4178935-3536074", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-25T03:09:00-05:00"}], "sent": "2020-04-25T14:21:00-05:00", "effective": "2020-04-25T14:21:00-05:00", "onset": "2020-04-25T14:21:00-05:00", "expires": "2020-04-25T22:30:00-05:00", "ends": "2020-04-26T01:00:00-05:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Severe", "certainty": "Possible", "urgency": "Future", "event": "Flood Watch", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Chicago IL", "headline": "Flood Watch issued April 25 at 2:21PM CDT until April 26 at 1:00AM CDT by NWS Chicago IL", "description": "The Flood Watch is now in effect for\n\n* Livingston and La Salle counties in north central Illinois\n\n* Until 1 AM CDT Sunday\n\n* WHAT...Steady rain. One to two inches of rain has already\nfallen. Additional rainfall amounts of one inch or locally more\nare possible which may lead to total rainfall amounts in excess\nof three inches.\n\n* IMPACTS...Rises in rivers and small streams will occur with\nflooding possible. 
This especially includes the Vermilion River\nand its tributary streams, and the Illinois River. Roadways,\nviaducts, ditches, agricultural land, and other poor drainage\nareas may become flooded.", "instruction": "A Flood Watch means there is a potential for flooding based on\ncurrent forecasts.\n\nYou should monitor later forecasts and be alert for possible\nFlood Warnings. Those living in areas prone to flooding should be\nprepared to take action should flooding develop.", "response": "Prepare", "parameters": {"NWSheadline": ["FLOOD WATCH NOW IN EFFECT UNTIL 1 AM CDT SUNDAY"], "VTEC": ["/O.EXT.KLOT.FA.A.0002.000000T0000Z-200426T0600Z/"], "EAS-ORG": ["WXR"], "PIL": ["LOTFFALOT"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-04-26T01:00:00-05:00"]}}}, {"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179497-3536425", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179497-3536425", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4179497-3536425", "areaDesc": "San Luis Obispo County Central Coast; Santa Barbara County Central Coast; Santa Ynez Valley", "geocode": {"UGC": ["CAZ034", "CAZ035", "CAZ036"], "SAME": ["006079", "006083"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/CAZ034", "https://api.weather.gov/zones/forecast/CAZ035", "https://api.weather.gov/zones/forecast/CAZ036"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4177692-3535278", "identifier": "NWS-IDP-PROD-4177692-3535278", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-24T08:54:00-07:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4178774-3535999", "identifier": "NWS-IDP-PROD-4178774-3535999", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-04-24T21:37:00-07:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4179040-3536147", "identifier": "NWS-IDP-PROD-4179040-3536147", "sender": "w-nws.webmaster@noaa.gov", "sent": 

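This script took care of some formatting problems I was having. My next hurdle is formatting the output so that I can use the bulk import function in Elasticsearch. I stumbled upon an answer that works to a degree: it inserts the appropriate index line, but it does so after every single character.

Bulk conversion script: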

import json


JSON_FILE_IN = "nhc_alerts.json"
JSON_FILE_OUT = "nhc_bulk.json"


out = open(JSON_FILE_OUT, 'w')
with open(JSON_FILE_IN, 'r') as json_in:
    docs = json.dumps(json_in.read())
    for doc in docs:
        out.write('%s\n' % json.dumps({'index': {}}));
        out.write('%s\n' % json.dumps(doc, indent=0).replace('\n', ''))

Output from the bulk script:

{"index": {}}
"\""
{"index": {}}
"{"
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
"@"
{"index": {}}
"c"
{"index": {}}
"o"
{"index": {}}
"n"
{"index": {}}
"t"
{"index": {}}
"e"
{"index": {}}
"x"
{"index": {}}
"t"
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
":"
{"index": {}}
" "
{"index": {}}
"["
{"index": {}}
"\\"
{"index": {}}
"\""
{"index": {}}
"h"
{"index": {}}
"t"
{"index": {}}
"t"
{"index": {}}
"p"
{"index": {}}
"s"
{"index": {}}
":"
{"index": {}}
"/"
{"index": {}}
"/"
{"index": {}}
"r"
{"index": {}}
"a"
{"index": {}}
"w"
{"index": {}}
"."
{"index": {}}
"g"
{"index": {}}
"i"
{"index": {}}
"t"
{"index": {}}
"h"
{"index": {}}
"u"
{"index": {}}
"b"
{"index": {}}
"u"
{"index": {}}
"s"
{"index": {}}
"e"
{"index": {}}
"r"
{"index": {}}
"c"
{"index": {}}
"o"
{"index": {}}
"n"
{"index": {}}

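Ideally I would like to combine these two scripts into one, but at this point I will run two separate scripts if that gets the job done.

You can use the bulk helper of the official Python package: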

import json

from noaa_sdk import noaa
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()['features']

es = Elasticsearch()


def save_alerts():
    with open('nhc_alerts.json', 'w') as f:
        f.write(json.dumps(alerts))


def bulk_sync():
    actions = [
        {
            "_index": "my_noaa_index",
            "_source": alert
        } for alert in alerts
    ]

    bulk(es, actions)


save_alerts()
bulk_sync()
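
One possible refinement of the snippet above, offered as a sketch rather than as part of the original answer: build the actions lazily with a generator and reuse each alert's own identifier as the document _id, so that re-running the script should overwrite existing documents instead of adding duplicates. The helper name alert_actions and the choice of properties.id as the identifier are assumptions made for illustration; the sample data is a trimmed-down stand-in modeled on the output shown in the question.

```python
def alert_actions(features, index_name="my_noaa_index"):
    """Yield one bulk action per alert feature (hypothetical helper)."""
    for feature in features:
        action = {"_index": index_name, "_source": feature}
        # Assumption: properties.id identifies an alert uniquely.
        alert_id = (feature.get("properties") or {}).get("id")
        if alert_id is not None:
            action["_id"] = alert_id
        yield action


# Trimmed-down stand-in for noaa_client.alerts()['features'].
sample_features = [
    {"type": "Feature", "properties": {"id": "NWS-IDP-PROD-KEEPALIVE-5246"}},
    {"type": "Feature", "properties": {}},
]
actions = list(alert_actions(sample_features))

# With a reachable cluster, the generator could be handed to the helper as-is:
#   bulk(es, alert_actions(alerts))
```

Features without an id simply get no _id, in which case Elasticsearch generates one, as in the original snippet.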

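The problem is that the alerts JSON dump is all on one line, so it won't work as is. You need to iterate over all the alerts (I suspect whatever is in the alerts.features array) and do it all in one go, without going through an intermediate file, like this: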

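import json
from noaa_sdk import noaa

n = noaa.NOAA()
alerts = n.alerts()
# The with block closes the file even if an error occurs
with open('nhc_alerts.json', 'w') as f:
    for alert in alerts['features']:
        # One action line, then one single-line document, per alert
        f.write('%s\n' % json.dumps({'index': {}}))
        f.write('%s\n' % json.dumps(alert))
    f.write('\n')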
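
If the single-line nhc_alerts.json from the question already exists, a similar loop could convert it after the fact. The following is a minimal sketch, not the answerer's code: the helper name to_bulk_lines is hypothetical, and in-memory text buffers stand in for the two files so the example has no side effects. The key difference from the question's conversion script is json.load (parse) in place of json.dumps (serialize).

```python
import io
import json


def to_bulk_lines(json_in, out):
    """Read one FeatureCollection and write action/document line pairs."""
    collection = json.load(json_in)  # parse into a dict instead of re-serializing
    count = 0
    for feature in collection["features"]:
        out.write(json.dumps({"index": {}}) + "\n")
        out.write(json.dumps(feature) + "\n")
        count += 1
    return count


# Stand-in for the contents of nhc_alerts.json.
source = io.StringIO(json.dumps({
    "type": "FeatureCollection",
    "features": [{"id": "a"}, {"id": "b"}],
}))
bulk_out = io.StringIO()
written = to_bulk_lines(source, bulk_out)
bulk_lines = bulk_out.getvalue().splitlines()
```

With real files, the same function would be called with open("nhc_alerts.json") and open("nhc_bulk.json", "w"). Because the action lines carry no index name, the target index would then have to be given in the bulk request URL.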

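I suspect the line json.dumps(json_in.read()) is what causes the error later on. json.dumps returns a string. When you iterate over a string, as you do on the next line, you iterate over its characters.

I think what you really want is the following. It saves each feature of alerts["features"] on a new line in JSON format.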
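
The behavior described above can be reproduced in a few lines. This is only an illustrative sketch: the short string raw is a made-up stand-in for what json_in.read() returns.

```python
import json

raw = '{"features": [{"id": "a"}, {"id": "b"}]}'  # stand-in for json_in.read()

# json.dumps serializes: a str goes in, a quoted and escaped str comes out.
dumped = json.dumps(raw)
first_chars = list(dumped)[:3]  # iterating a str yields single characters

# json.loads parses: the same str becomes a dict whose "features" is a list.
parsed = json.loads(raw)
docs = parsed["features"]
```

Here first_chars holds a double quote, an opening brace and a backslash, which matches the first three document lines of the bulk output in the question, whereas docs is a list of two dictionaries that can be looped over one document at a time.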

from noaa_sdk import noaa
import json
from pathlib import Path


noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()

save_path = Path('.') / "alert.json"
with save_path.open("a") as f:
    for feature in alerts["features"]:
        json.dump(feature, f)
        f.write("\n")

Result:

{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "@type": "wx:Alert", "id": "NWS-IDP-PROD-KEEPALIVE-16211", "areaDesc": "Montgomery", "geocode": {"UGC": ["MDC031"], "SAME": ["024031"]}, "affectedZones": ["https://api.weather.gov/zones/county/MDC031"], "references": [], "sent": "2020-05-06T16:55:56+00:00", "effective": "2020-05-06T16:55:56+00:00", "onset": null, "expires": "2020-05-06T17:05:56+00:00", "ends": null, "status": "Test", "messageType": "Alert", "category": "Met", "severity": "Unknown", "certainty": "Unknown", "urgency": "Unknown", "event": "Test Message", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS", "headline": null, "description": "Monitoring message only. Please disregard.", "instruction": "Monitoring message only. Please disregard.", "response": "None", "parameters": {"PIL": ["NWSKEPWBC"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197938-3548807", "areaDesc": "Coastal waters from NC VA border to Currituck Beach Light NC out 20 nm; Coastal Waters from Cape Charles Light to Virginia-North Carolina border out to 20 nm", "geocode": {"UGC": ["ANZ658", "ANZ656"], "SAME": ["073658", "073656"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ658", "https://api.weather.gov/zones/forecast/ANZ656"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197751-3548667", "identifier": "NWS-IDP-PROD-4197751-3548667", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197640-3548624", "identifier": "NWS-IDP-PROD-4197640-3548624", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197422-3548452", "identifier": "NWS-IDP-PROD-4197422-3548452", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-07T04:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", "description": "* WHAT...Northwest winds 15 to 20 kt with gusts up to 25 kt and\nseas 3 to 5 ft expected.\n\n* WHERE...Coastal Waters from Cape Charles Light to Virginia-\nNorth Carolina border out to 20 nm and Coastal waters from NC\nVA border to 
Currituck Beach Light NC out 20 nm.\n\n* WHEN...From 4 AM to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 4 AM TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0800Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197936-3548805", "areaDesc": "Chesapeake Bay from Smith Point to Windmill Point VA; Chesapeake Bay from New Point Comfort to Little Creek VA; Chesapeake Bay from Windmill Point to New Point Comfort VA; Chesapeake Bay from Little Creek VA to Cape Henry VA including the Chesapeake Bay Bridge Tunnel", "geocode": {"UGC": ["ANZ630", "ANZ632", "ANZ631", "ANZ634"], "SAME": ["073630", "073632", "073631", "073634"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ630", "https://api.weather.gov/zones/forecast/ANZ632", "https://api.weather.gov/zones/forecast/ANZ631", "https://api.weather.gov/zones/forecast/ANZ634"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197423-3548453", "identifier": "NWS-IDP-PROD-4197423-3548453", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197750-3548666", "identifier": "NWS-IDP-PROD-4197750-3548666", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197641-3548625", "identifier": "NWS-IDP-PROD-4197641-3548625", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-06T22:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", 
"description": "* WHAT...North winds 10 to 20 kt with gusts up to 25 kt and\nwaves 2 to 3 ft expected.\n\n* WHERE...Chesapeake Bay from Little Creek VA to Cape Henry VA\nincluding the Chesapeake Bay Bridge Tunnel, Chesapeake Bay\nfrom New Point Comfort to Little Creek VA, Chesapeake Bay from\nSmith Point to Windmill Point VA and Chesapeake Bay from\nWindmill Point to New Point Comfort VA.\n\n* WHEN...From 10 PM this evening to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 10 PM THIS EVENING TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0200Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
...
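
A file in this one-document-per-line layout can be read back line by line. The sketch below assumes made-up sample features and uses an in-memory buffer in place of alert.json. Note also that the script above opens the file in "a" (append) mode, so running it twice would write every feature twice; "w" would start from an empty file each time.

```python
import io
import json

features = [{"id": "a", "geometry": None}, {"id": "b", "geometry": None}]

buffer = io.StringIO()  # stand-in for alert.json
for feature in features:
    json.dump(feature, buffer)
    buffer.write("\n")

# Each non-empty line is a complete JSON document on its own.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
```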

