Pandas semi structured JSON data frame to simple Pandas dataframe
I have a chunk of data pulled from a Redshift cluster. The first 4 columns are separated by "|", and the next 2 columns are JSON.
XXX|ABANDONED|1197|11|"{""currency"":""EUR"" item_id"":""143"" type"":""FLIGHT"" name"":""PAR-FEZ"" price"":1111 origin"":""PAR"" destination"":""FEZ"" merchant"":""GOV"" flight_type"":""OW"" flight_segment"":[{ origin"":""ORY"" destination"":""FEZ"" departure_date_time"":""2015-08-02T07:20"" arrival_date_time"":""2015-08-02T09:05"" carrier"":""AT"" f_class"":""ECONOMY""}]}"|"{""type"":""FLIGHT"" name"":""FI_ORY-OUD"" item_id"":""FLIGHT"" currency"":""EUR"" price"":111 origin"":""ORY"" destination"":""OUD"" flight_type"":""OW"" flight_segment"":[{""origin"":""ORY"" destination"":""OUD"" departure_date_time"":""2015-08-02T13:55"" arrival_date_time"":""2015-08-02T15:30"" flight_number"":""AT625"" carrier"":""AT"" f_class"":""ECONOMIC_DISCOUNTED""}]}"
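Working in Python 2.7, I want to separate out the JSON values and convert them into a Pandas dataframe, but I am inexperienced with pyparsing.
My approach was to read the file in as a Pandas dataframe with "|" as the separator, then take the columns holding the JSON and flatten them with `json_normalize`. However, `json_normalize` won't index the columns of the Pandas dataframe.
I found solutions here and here, but one does not fit my "mixed data" and the other is too simplistic for a fairly large JSON file.
Any tips on how to deploy pyparsing on this data would be very helpful. Thanks.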
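The approach described in the question can be reproduced with the standard library alone, which also shows where it breaks. The sketch below is an illustration and not part of the original post: `line` is a shortened, hypothetical stand-in for one row of the sample (same quoting style, same missing delimiters between keys), and `csv.reader` is used only to split on `|` and collapse the doubled quotes.

```python
import csv
import json

# Shortened stand-in for one row of the sample data (hypothetical, but with
# the same quoting style and the same missing ',"' between members).
line = 'XXX|ABANDONED|1197|11|"{""currency"":""EUR"" item_id"":""143"" price"":1111}"'

# csv.reader splits on '|' and turns each doubled quote ("") back into one quote.
fields = next(csv.reader([line], delimiter='|', quotechar='"'))
payload = fields[4]
print(payload)

# Even after un-doubling the quotes, the payload is not valid JSON.
try:
    json.loads(payload)
    is_valid_json = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    is_valid_json = False
print(is_valid_json)
```

Any JSON-based flattening step would fail at `json.loads` on such a row, before the question of indexing columns even arises.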
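Taking your input string above as a variable named 'data', this Python+pyparsing code will make some sense of it. Unfortunately, the stuff to the right of the fourth '|' is not really JSON. Fortunately, it is well enough formatted that it can be parsed without undue discomfort. See the embedded comments in the program below: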
from pyparsing import *
from datetime import datetime
# for the most part, we suppress punctuation - it's important at parse time
# but just gets in the way afterwards
LBRACE,RBRACE,COLON,DBLQ,LBRACK,RBRACK = map(Suppress, '{}:"[]')
DBLQ2 = DBLQ + DBLQ
# define some scalar value expressions, including parse-time conversion parse actions
realnum = Regex(r'[+-]?\d+\.\d*').setParseAction(lambda t:float(t[0]))
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
timestamp = Regex(r'""\d{4}-\d{2}-\d{2}T\d{2}:\d{2}""')
timestamp.setParseAction(lambda t: datetime.strptime(t[0][2:-2],'%Y-%m-%dT%H:%M'))
string_value = QuotedString('""')
# define our base key ':' value expression; use a Forward() placeholder
# for now for value, since these things can be recursive
key = Optional(DBLQ2) + Word(alphas, alphanums+'_') + DBLQ2
value = Forward()
key_value = Group(key + COLON + value)
# objects can be values too - use the Dict class to capture keys as field names
obj = Group(Dict(LBRACE + OneOrMore(key_value) + RBRACE))
objlist = (LBRACK + ZeroOrMore(obj) + RBRACK)
# define expression for previously-declared value, using <<= operator
value <<= timestamp | string_value | realnum | integer | obj | Group(objlist)
# the outermost objects are enclosed in "s, and list of them can be given with '|' delims
quotedObj = DBLQ + obj + DBLQ
obsList = delimitedList(quotedObj, delim='|')
Now apply that parser to your 'data':
fields = data.split('|',4)
result = obsList.parseString(fields[-1])
# we get back a list of objects, dump them out
for r in result:
    print(r.dump())
    print('')
Gives:
[['currency', 'EUR'], ['item_id', '143'], ['type', 'FLIGHT'], ['name', 'PAR-FEZ'], ['price', 1111], ['origin', 'PAR'], ['destination', 'FEZ'], ['merchant', 'GOV'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]]]]
- currency: EUR
- destination: FEZ
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]
    - arrival_date_time: 2015-08-02 09:05:00
    - carrier: AT
    - departure_date_time: 2015-08-02 07:20:00
    - destination: FEZ
    - f_class: ECONOMY
    - origin: ORY
- flight_type: OW
- item_id: 143
- merchant: GOV
- name: PAR-FEZ
- origin: PAR
- price: 1111
- type: FLIGHT
[['type', 'FLIGHT'], ['name', 'FI_ORY-OUD'], ['item_id', 'FLIGHT'], ['currency', 'EUR'], ['price', 111], ['origin', 'ORY'], ['destination', 'OUD'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]]]]
- currency: EUR
- destination: OUD
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]
    - arrival_date_time: 2015-08-02 15:30:00
    - carrier: AT
    - departure_date_time: 2015-08-02 13:55:00
    - destination: OUD
    - f_class: ECONOMIC_DISCOUNTED
    - flight_number: AT625
    - origin: ORY
- flight_type: OW
- item_id: FLIGHT
- name: FI_ORY-OUD
- origin: ORY
- price: 111
- type: FLIGHT
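The `datetime` values in this output come from the parse action attached to the `timestamp` expression in the grammar. As a small sketch of just that step, the same slicing and `strptime` call can be tried on its own; `token` here is an illustrative name for the matched text, doubled quotes included.

```python
from datetime import datetime

# A timestamp token as the `timestamp` Regex would match it,
# including the doubled quotes on both sides.
token = '""2015-08-02T07:20""'

# Drop the two leading and two trailing quote characters, then parse.
parsed = datetime.strptime(token[2:-2], '%Y-%m-%dT%H:%M')
print(repr(parsed))  # datetime.datetime(2015, 8, 2, 7, 20)
print(parsed)        # 2015-08-02 07:20:00
```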
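Note that the values that are not strings (integers, timestamps, etc.) have already been converted to Python types. Since the field names have been saved as dict keys, you can access the fields by name, like this:
result[0].currency
result[0].price
result[0].destination
result[0].flight_segment[0].origin
len(result[0].flight_segment) # gives how many segments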
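Since the stated goal is a flat Pandas dataframe, one possible last step is to turn each parsed object into one flat record per flight segment. The following is a minimal sketch under assumptions, not part of the original answer: it assumes each parsed object has first been converted to a plain dict holding a list of segment dicts (how best to get that from a `ParseResults` is left open here), and `flatten_item`, the sample `item`, and the `segment_` prefix are all hypothetical names chosen for illustration.

```python
from datetime import datetime


def flatten_item(item):
    """Return one flat dict per flight segment, repeating the item-level fields."""
    base = {k: v for k, v in item.items() if k != 'flight_segment'}
    rows = []
    for seg in item.get('flight_segment', []):
        row = dict(base)
        # Prefix segment keys so they do not overwrite item-level keys
        # such as 'origin' and 'destination'.
        row.update(('segment_' + k, v) for k, v in seg.items())
        rows.append(row)
    # An item without segments still yields a single row.
    return rows or [base]


# Hypothetical plain-dict version of the first parsed object (trimmed).
item = {
    'currency': 'EUR', 'item_id': '143', 'price': 1111,
    'origin': 'PAR', 'destination': 'FEZ',
    'flight_segment': [{
        'origin': 'ORY', 'destination': 'FEZ',
        'departure_date_time': datetime(2015, 8, 2, 7, 20),
        'carrier': 'AT',
    }],
}

rows = flatten_item(item)
print('%s %s' % (rows[0]['origin'], rows[0]['segment_origin']))
```

A list of flat dicts like `rows` can be passed to `pandas.DataFrame(rows)`, which gives one dataframe row per segment with the dict keys as columns.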