python - read file into dictionary - delimited with curly brackets, no comma delimiter
I'm new to Python (pandas, NumPy, etc.). I'd like to know the best and most performant way to solve this task.
I have a huge file in the following format, except that everything is on one line:
{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}
What is the easiest way to read this file into a Python dictionary or a pandas DataFrame? The goal is to join the billing_address_zip_code to a big JSON file to get more insight into the order_reference.
Is there a pandas function for that? I guess this would be the fastest way, but as it isn't standard JSON, I don't know how to do it.
I'm sorry for the beginner questions, but I searched quite a bit on the internet and couldn't find the right answer. It would really help me to figure out the right approach to this kind of task.
For any help or links, I'm very thankful.
Simon
PS: Which cloud environment do you use for this kind of task? Which works best with Python and the data science libraries?
UPDATE
I used the following code to turn the file into valid JSON and loaded it with json.loads() successfully:
# syntax: Python 3
import json

# small test file
with open("orders_play_around.json") as f:
    # insert commas between adjacent objects and wrap them in a list
    my_list = "[" + f.read().replace("}{", "},\n{") + "]"
d = json.loads(my_list)
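The string replacement above assumes that the sequence `}{` never occurs inside a string value. A minimal sketch of an alternative that avoids this assumption uses `json.JSONDecoder.raw_decode` from the standard library, which parses one value and reports where it ended. The helper name `iter_json_objects` and the `sample` string are made up for illustration, and the whole text is still held in memory.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it here.
        if text[pos].isspace():
            pos += 1
            continue
        # Returns the parsed value and the index just past its end.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Illustrative input: two objects with nothing between them.
sample = (
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}'
)
d = list(iter_json_objects(sample))
print(d)
```

For the real file, `sample` would be replaced by the text read from orders_play_around.json.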
So far so good. Now the next challenge: how do I join this JSON data with another JSON file on billing_address_zip_code? The other JSON looks like this:
{
"data": [
{
"BFS-Nr": 1,
"Raum mit städtischem Charakter 2012": 4,
"Typologie der MS-Regionen 2000 (2)": 3,
"E": 679435,
"Zusatzziffer": 0,
"Agglomerationsgrössenklasse 2012": 1,
"Gemeinde-typen (9 Typen) 2000 (1)": 4,
"N": 235653,
"Stadt/Land-Typologie 2012": 3,
"Städte 2012": 0,
"Gemeinde-Grössenklasse 2015": 7,
"BFS Nr.": 1,
"Sprachgebiete 2016": 1,
"Europäsiche Berggebietsregionen (2)": 1,
"Gemeindename_1": "Aeugst am Albis",
"Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
"Kantonskürzel": "ZH",
"Kanton": 1,
"Metropolräume 2000 (2)": 1,
"PLZ": 8914,
"Bezirk": 101,
"Gemeindetypologie 2012\n(25 Typen)": 237,
"Raumplanungs-regionen": 105,
"Gemeindetypologie 2012\n(9 Typen)": 23,
"Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
"Ortschaftsname": "Aeugst am Albis",
"Arbeitsmarktregionen 2000 (2)": 10,
"Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
"Städtische / Ländliche Gebiete 2000 (1)": 2,
"Grossregionen": 4,
"Gemeindename": "Aeugst am Albis",
"MS-Regionen (2)": 4,
"Touris-mus Regionen 2017": 3,
"DEGURBA 2011 eurostat": 3
},
{....}
]
}
What is the easiest way to join them on the key PLZ from plz.js and billing_address_zip_code from orders_play_around.json? I could load plz.js as JSON without any problems:
plz_data=open('plz.js').read()
plz = json.loads(plz_data)
Sorry for the long message, but hopefully someone can help me with this easy problem. The goal is to plot it on a map or a graph, where I can see which PLZ (zip code) has the most orders.
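One possible way to do the join is a pandas merge. The sketch below is only an illustration under stated assumptions: the `orders` list and `plz` dict are small invented stand-ins for the two loaded files (the place names are placeholders), and it assumes the zip code is a string in the orders but an integer in the PLZ data, as in the samples shown.

```python
import pandas as pd

# Invented stand-ins for the parsed files (values are illustrative only).
orders = [
    {"order_reference": "A-1", "billing_address_zip_code": "8800"},
    {"order_reference": "A-2", "billing_address_zip_code": "8400"},
    {"order_reference": "A-3", "billing_address_zip_code": "8800"},
]
plz = {"data": [
    {"PLZ": 8800, "Ortschaftsname": "Example A"},
    {"PLZ": 8400, "Ortschaftsname": "Example B"},
]}

orders_df = pd.DataFrame(orders)
plz_df = pd.DataFrame(plz["data"])

# The key is a string on one side and an integer on the other,
# so convert one side before joining.
orders_df["billing_address_zip_code"] = orders_df["billing_address_zip_code"].astype(int)

# Left join: keep every order, attach the PLZ attributes where they match.
merged = orders_df.merge(
    plz_df, how="left", left_on="billing_address_zip_code", right_on="PLZ"
)

# Orders per zip code, most frequent first.
orders_per_plz = orders_df["billing_address_zip_code"].value_counts()
print(merged)
print(orders_per_plz)
```

The counts are taken from `orders_df` rather than `merged` on purpose: if the PLZ data lists a zip code more than once, the merge would repeat the matching orders and inflate the counts.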
Since you mention that turning your file into proper JSON is your initial goal, and you don't mind sed, try:
sed 's|}{|}\n{|g' originalfile > result
Note I added in newlines, not commas. Probably better for your future editing. You can use the -i flag so sed edits in place, but this is safer. If you really want to use Python, it's not a big deal with standard Python. Safest is to read character by character:
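with open("originalfile") as fd:
    first = True
    while True:
        ch = fd.read(1)  # read one character at a time
        if not ch:  # an empty string means end of file
            break
        if ch == "{" and not first:
            print()  # start every object after the first on a new line
        first = False
        print(ch, end="")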
or just replace and print (I have never tested the limits of Python, but I'm guessing this will work):
print(open("originalfile").read().replace("}{","}\n{"))
There is no need for regex for this; it's a bit of overkill. Once the file has one JSON object per line it will be easier to use, including loading it through pandas.read_json (with lines=True).
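As a rough sketch of that last step, assuming the file now holds one JSON object per line: `pandas.read_json` accepts `lines=True` for this layout. The `text` variable below stands in for the contents of the `result` file so the example is self-contained; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for the contents of the newline-separated file.
text = (
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}\n'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}\n'
)

# lines=True parses each line as a separate JSON object.
# dtype=False asks pandas not to infer column types from the values.
df = pd.read_json(io.StringIO(text), lines=True, dtype=False)
print(df)
```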
Here's one way.
data = []
with open("originalfile") as fp:
    for l in fp:
        # strip braces and quotes, leaving a list of "key:value" strings per line
        clean_line = [x.replace("{", "").replace("}", "").replace("\"", "") for x in l.strip().split(",")]
        data.append(clean_line)
Then you can convert the data list into a pandas DataFrame and export it to JSON.
import pandas
df = pandas.DataFrame(data)
df.to_json()
If you want to remove the key text, e.g. "billing_address_zip_code", and keep only the data, then you can do
data = []
with open(filepath) as fp:
    for l in fp:
        # keep only the part after the colon in each "key":"value" pair
        splitted = [x.split(":")[1] for x in l.strip().split(",")]
        data.append([x.replace("}", "").replace("\"", "") for x in splitted])
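Splitting on "," and ":" works for the sample rows but would break if a value ever contained one of those characters. A minimal alternative sketch, assuming each line of the file is a complete JSON object, lets `json.loads` do the parsing. The `lines` list stands in for iterating over the open file.

```python
import json

# Stand-in for the lines of the file; in practice iterate over open(filepath).
lines = [
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}\n',
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}\n',
]

# Parse each non-empty line into a dict.
records = [json.loads(line) for line in lines if line.strip()]

# Keep only the values, in the order the keys appear on each line.
data = [list(record.values()) for record in records]
print(data)
```

The `records` list of dicts can also be passed straight to `pandas.DataFrame`, which then uses the keys as column names.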