Exporting complex JSON to CSV

Question

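I have a JSON file that I download from an API. So far I have been able to export it as JSON and parse the data correctly with Excel Power Query.

The data is split by campaign ID (only two in this case), and each day in the selected period has several metrics associated with it. For example, here are a few (incomplete) rows that show how the result should look.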

campaignId  metadata.id  metrics.impressions   metrics.clicks
s00821idk   2019-05-19   12000293121           100
s00821idk   2019-05-18   12300223151           103

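I tried parsing this data with Excel, but that defeats the purpose of using the API, since I export from Python, run the file through Excel, and then put it into Google Sheets.

I would like to do all of the transformation in Python so that I can use the Google Sheets API and put the data there directly.

I have provided the exported JSON file at the following link: File

I would be glad if you could help me structure the data this way. Thank you very much.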

Answer 1

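As mentioned above, you need to fully flatten the nested values and then iterate over them to get what you want. It can be done, but the result is very wide (24,000+ columns per campaign ID), so it takes about 2 minutes to iterate over the whole file you provided.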

import json
import pandas as pd
import re


with open('C:/data.json') as f:
    jsonObj = json.load(f)


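def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}
    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # List positions become part of the key, e.g. results_0_metadata_id
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated key
            out[name[:-1]] = x
    flatten(y)
    return out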


flat = flatten_json(jsonObj)

# Rebuild a table from the flattened keys: the first "_<n>_" in a key is the
# row index (the campaign), and everything after it becomes the column name.
results = pd.DataFrame()
special_cols = []  # top-level keys without a list index, e.g. totalCampaigns

for item, value in flat.items():
    match = re.search(r'_(\d+)_(.*)', item)
    if match is None:
        special_cols.append(item)
        continue
    row_idx = int(match.group(1))
    column = match.group(2).replace('_', '')

    results.loc[row_idx, column] = value


# Copy the top-level scalar values onto every row.
for item in special_cols:
    results[item] = flat[item]

results.to_csv('file.csv', index=False)

Output:

print (results)
                           campaignId  ... totalCampaigns
0  0081da282b2dbe8140508074366cac91ba  ...              2
1  00c03d801da285767a093d0b4d5188fb34  ...              2

[2 rows x 24533 columns]
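If the goal is one row per campaign and day rather than one very wide row per campaign, `pandas.json_normalize` may be a shorter route. The following is only a sketch under assumptions: the `sample` dictionary is made-up data that mimics the structure described in this thread (a `campaignResults` list whose items hold `campaignId` and a `results` list), and the real file may contain fields that need extra handling.

```python
import pandas as pd

# Made-up sample that imitates the structure discussed in this thread.
sample = {
    "totalCampaigns": 1,
    "campaignResults": [
        {
            "campaignId": "c1",
            "totalResults": 2,
            "results": [
                {"metadata": {"id": "2019-05-19"},
                 "metrics": {"impressions": 10.0, "clicks": 1.0}},
                {"metadata": {"id": "2019-05-18"},
                 "metrics": {"impressions": 20.0, "clicks": 2.0}},
            ],
        }
    ],
}

# One row per entry of "results"; campaignId is repeated on each row.
df = pd.json_normalize(
    sample["campaignResults"],
    record_path="results",
    meta=["campaignId"],
)

# Keep only the columns asked for in the question, in that order.
wanted = df[["campaignId", "metadata.id", "metrics.impressions", "metrics.clicks"]]
csv_text = wanted.to_csv(index=False)
print(csv_text)
```

With the real file, `sample` would be replaced by the object returned from `json.load`, and `to_csv` could be given a file name instead of returning a string.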

Answer 2

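IIUC, how about the following approach?
It iterates over all campaignResults, loops over every entry of each campaign's results, and writes campaignId, metadata.id and (as an example) metrics.impressions and metrics.clicks on each row: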

import json
sep = '\t'
jsonfile = 'data.json'  # path to the downloaded JSON file; adjust as needed
with open(jsonfile) as jsonin, open('j2c.csv', 'w') as f:
    j = json.load(jsonin)
    f.write(f'campaignId{sep}metadata.id{sep}metrics.impressions{sep}metrics.clicks\n')
    for cR in range(j['totalCampaigns']):
        for r in range(j['campaignResults'][cR]['totalResults']):
            f.write(j['campaignResults'][cR]['campaignId']+ sep)
            f.write(j['campaignResults'][cR]['results'][r]['metadata']['id']+ sep)
            f.write(str(j['campaignResults'][cR]['results'][r]['metrics']['impressions']) + sep)
            f.write(str(j['campaignResults'][cR]['results'][r]['metrics']['clicks']) + '\n')

Result:

# campaignId    metadata.id metrics.impressions metrics.clicks
# 0081da282b2dbe8140508074366cac91ba    2019-05-20  176430.0    59.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-19  169031.0    59.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-18  108777.0    62.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-17  272088.0    60.0
# 0081da282b2dbe8140508074366cac91ba    2019-05-16  198100.0    62.0
# ...
# 00c03d801da285767a093d0b4d5188fb34    2018-01-10  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-09  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-08  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-07  0.0 0.0
# 00c03d801da285767a093d0b4d5188fb34    2018-01-06  0.0 0.0
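The same loop can also be written with the standard `csv` module, which takes care of delimiters and quoting. This is a minimal sketch, not a drop-in replacement: `HEADER`, `iter_rows`, `write_rows` and the `sample` data are names invented here for illustration, and it assumes that a missing key should simply produce an empty cell.

```python
import csv
import io

HEADER = ["campaignId", "metadata.id", "metrics.impressions", "metrics.clicks"]

def iter_rows(data):
    """Yield one flat row per (campaign, daily result) pair."""
    for campaign in data.get("campaignResults", []):
        for result in campaign.get("results", []):
            metadata = result.get("metadata", {})
            metrics = result.get("metrics", {})
            # .get() returns None for absent keys; csv writes None as ''.
            yield [
                campaign.get("campaignId"),
                metadata.get("id"),
                metrics.get("impressions"),
                metrics.get("clicks"),
            ]

def write_rows(data, out, delimiter="\t"):
    """Write the header and all rows to an open text file object."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(iter_rows(data))

# Made-up sample; the second result deliberately lacks "impressions".
sample = {
    "campaignResults": [
        {
            "campaignId": "c1",
            "results": [
                {"metadata": {"id": "2019-05-19"},
                 "metrics": {"impressions": 10.0, "clicks": 1.0}},
                {"metadata": {"id": "2019-05-18"},
                 "metrics": {"clicks": 2.0}},
            ],
        }
    ]
}

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

To write to disk, an object opened with `open('j2c.csv', 'w', newline='')` could be passed in place of `buffer`. Iterating over the lists directly also avoids relying on totalCampaigns and totalResults matching the actual list lengths.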

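I still don't understand exactly which data you want to extract: only entries whose id looks like a date, or only the values for specific dates?
Beyond that, I didn't really grasp the structure of your JSON file, so I wrote a tree printout, which may help to get a clearer view and to state the question more precisely: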

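import json

file = 'data.json'  # path to the downloaded JSON file; adjust as needed
with open(file) as f:
    j = json.load(f)

def getStructure(dct, ind=''):
    """Print the key tree; a list is described by its first element only."""
    indsym = '.\t'
    for k, v in dct.items():
        if type(v) is list:
            print(f'{ind}{k}[{len(v)}]')
            # Descend only into non-empty lists of dicts.
            if v and type(v[0]) is dict:
                getStructure(v[0], ind + indsym)
        elif type(v) is dict:
            print(f'{ind}{k}')
            getStructure(v, ind + indsym)
        else:
            print(f'{ind}{k}')

getStructure(j)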

Result:

# campaignResults[2]
# .       campaignId
# .       results[500]
# .       .       metadata
# .       .       .       id
# .       .       .       fromDate
# .       .       .       toDate
# .       .       .       lastCappingTime
# .       .       metrics
# .       .       .       impressions
# .       .       .       clicks
# .       .       .       conversions
# .       .       .       spend
# .       .       .       ecpc
# .       .       .       ctr
# .       .       .       conversionRate
# .       .       .       cpa
# .       .       .       totalValue
# .       .       .       averageValue
# .       .       .       conversionMetrics[6]
# .       .       .       .       name
# .       .       .       .       conversions
# .       .       .       .       conversionRate
# .       .       .       .       cpa
# .       .       .       .       totalValue
# .       .       .       .       averageValue
# .       totalResults
# totalCampaigns

There is one small catch: similar list elements do not always seem to have the same keys:

j['campaignResults'][0]['results'][0]['metadata'].keys()
# dict_keys(['id', 'fromDate', 'toDate', 'lastCappingTime'])

j['campaignResults'][1]['results'][0]['metadata'].keys()
# dict_keys(['id', 'fromDate', 'toDate'])

Note that the getStructure function above only looks at the first element of each list to determine that list's structure.
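To see which keys are missing in some elements, every list element can be visited and the key paths counted. The following is a rough sketch; `count_paths` and the `sample` data are hypothetical names and values made up for illustration, and `[]` in a path just marks "any element of this list".

```python
from collections import Counter

def count_paths(node, prefix="", counts=None):
    """Count how often each key path occurs, visiting every list element."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            counts[path] += 1
            count_paths(value, path, counts)
    elif isinstance(node, list):
        for item in node:
            count_paths(item, prefix + "[]", counts)
    return counts

# Made-up sample: only the first campaign has "lastCappingTime".
sample = {
    "campaignResults": [
        {"results": [{"metadata": {"id": "a", "lastCappingTime": "t"}}]},
        {"results": [{"metadata": {"id": "b"}}]},
    ]
}

counts = count_paths(sample)
for path, n in sorted(counts.items()):
    print(n, path)
```

A path whose count is lower than that of its sibling keys (here `lastCappingTime` compared with `id`) is present in only some elements, so the export code should not assume that it exists.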
