python - read file into dictionary - delimited with curly brackets, no comma delimiter

I'm new to Python (pandas, NumPy, etc.). I'd like to know the best and most performant way to solve this task.

I have a huge file in the following format, except that everything is on one line:

{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}

What is the easiest way to read this file into a Python dictionary or a pandas DataFrame? The goal is to join the billing_address_zip_code to a big JSON file to get more insight into each order_reference.

  • I was thinking of solving it with a regex, but since the file is huge and needs to be joined to another file, I think I should use pandas, shouldn't I?
  • Or, since all records have the same length, I could also insert the separators by length.

Is there a pandas function for that? I guess this would be the fastest way, but as the file isn't standard JSON, I don't know how to do it.

I'm sorry for the beginner question, but I searched quite a bit on the internet and couldn't find the right answer. It would really help me to figure out the right approach to this kind of task. I'm very thankful for any help or links. Simon

PS: Which cloud environment do you use for this kind of task? Which works best with Python and the data science libraries?

UPDATE

I used the following code to turn the file into valid JSON and loaded it successfully with json.loads():

# syntax: Python 3
import json

# small test file
with open("orders_play_around.json") as f:
    my_list = "[" + f.read().replace("}{", "},\n{") + "]"

d = json.loads(my_list)

So far so good. Now the next challenge: how do I join this list of dictionaries with another JSON file on billing_address_zip_code? The other JSON looks like this:

{
"data": [
{
  "BFS-Nr": 1,
  "Raum mit städtischem Charakter 2012": 4,
  "Typologie der MS-Regionen 2000 (2)": 3,
  "E": 679435,
  "Zusatzziffer": 0,
  "Agglomerationsgrössenklasse 2012": 1,
  "Gemeinde-typen (9 Typen) 2000 (1)": 4,
  "N": 235653,
  "Stadt/Land-Typologie 2012": 3,
  "Städte 2012": 0,
  "Gemeinde-Grössenklasse 2015": 7,
  "BFS Nr.": 1,
  "Sprachgebiete 2016": 1,
  "Europäsiche Berggebietsregionen (2)": 1,
  "Gemeindename_1": "Aeugst am Albis",
  "Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
  "Kantonskürzel": "ZH",
  "Kanton": 1,
  "Metropolräume 2000 (2)": 1,
  "PLZ": 8914,
  "Bezirk": 101,
  "Gemeindetypologie 2012\n(25 Typen)": 237,
  "Raumplanungs-regionen": 105,
  "Gemeindetypologie 2012\n(9 Typen)": 23,
  "Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
  "Ortschaftsname": "Aeugst am Albis",
  "Arbeitsmarktregionen 2000 (2)": 10,
  "Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
  "Städtische / Ländliche Gebiete 2000 (1)": 2,
  "Grossregionen": 4,
  "Gemeindename": "Aeugst am Albis",
  "MS-Regionen (2)": 4,
  "Touris-mus Regionen 2017": 3,
  "DEGURBA 2011 eurostat": 3
},
{....}
]
}

What is the easiest way to join them on the key PLZ from plz.js and billing_address_zip_code from orders_play_around.json? I could load plz.js as JSON without any problems:

plz_data=open('plz.js').read()
plz = json.loads(plz_data)

Sorry for the long message, but hopefully someone can help me with this easy problem. The goal is to plot the result on a map or a graph where I can see which PLZ (zip code) has the most orders.

Since you mention that turning your file into proper JSON is your initial goal, and you don't mind sed, try:

sed 's|}{|}\n{|g' originalfile > result

Note that I added newlines, not commas, which is probably better for your future editing. You can use the -i flag so that sed edits in place, but writing to a separate file is safer. If you really want to use Python, it's not a big deal with standard Python. Safest is to read character by character:

with open("originalfile") as fd:
    first = True
    while True:
        ch = fd.read(1)
        if not ch:
            break
        # start a new line before every object except the first
        if ch == "{" and not first:
            print()
        first = False
        print(ch, end="")

or just replace and print (I've never tested the limits of Python, but I'm guessing this will work):

print(open("originalfile").read().replace("}{","}\n{"))

There's no need for a regex here; it's a bit of overkill. Once this is a proper JSON file it will be easier to use, including loading it through pandas.read_json (with one object per line, pass lines=True).

Here's one way.
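As an alternative to rewriting the text at all, the standard library's json.JSONDecoder.raw_decode can parse one object and report where it stopped, so the concatenated objects can be read back to back. The following is only a sketch under assumptions: the helper name iter_json_objects and the two-record sample string are made up for illustration, and for a truly huge file the whole text still has to fit in memory.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample mimicking the one-line file from the question.
sample = (
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}'
)
orders = list(iter_json_objects(sample))
print(orders[1]["billing_address_zip_code"])
```

Unlike replacing "}{", this does not misfire if a string value ever happens to contain those two characters. The resulting list of dictionaries can be passed straight to pandas.DataFrame.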

data = []
with open("originalfile") as fp:
    for line in fp:
        # strip braces and quotes, then split the line into key:value fields
        clean_line = [x.replace("{", "").replace("}", "").replace('"', "") for x in line.strip().split(",")]
        data.append(clean_line)

Then you can convert the data list into a pandas DataFrame and export it to JSON.

import pandas

df = pandas.DataFrame(data)
df.to_json()
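If each line of the file really holds exactly one JSON object, as the loop above assumes, it may be simpler to let json.loads parse the line instead of stripping characters by hand. This is a minimal sketch, not a tested solution for the real file: the io.StringIO object stands in for the opened file, and the name records is chosen here for illustration.

```python
import io
import json

# Hypothetical stand-in for the opened file, one JSON object per line.
fp = io.StringIO(
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}\n'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}\n'
)

records = []
for line in fp:
    line = line.strip()
    if line:  # ignore blank lines
        records.append(json.loads(line))  # each line becomes a dict

print(records[0])
```

Because every element is a dictionary keyed by field name, pandas.DataFrame(records) would produce named columns rather than columns of "key:value" strings.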

If you want to remove the field names, e.g. "billing_address_zip_code", and keep only the values, then you can do

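data = []
with open(filepath) as fp:  # filepath: path to the input file
    for line in fp:
        # keep only the value after the colon in each "key":"value" field
        values = [x.split(":")[1] for x in line.strip().split(",")]
        data.append([x.replace("}", "").replace('"', "") for x in values])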
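The snippets above stop at loading the orders; the join on PLZ asked about in the question could be sketched roughly as follows. Everything here is an assumption for illustration: the three orders are made-up sample data, the PLZ list is cut down to a single record with three of the fields shown in the question, and keeping only the first record per PLZ is a simplification. Note that PLZ is a number in plz.js while billing_address_zip_code is a string, so one side has to be converted.

```python
import json
from collections import Counter

# Hypothetical miniature versions of the two inputs from the question.
orders = [
    {"order_reference": "0000-0001", "billing_address_zip_code": "8914"},
    {"order_reference": "0000-0002", "billing_address_zip_code": "8914"},
    {"order_reference": "0658-2147", "billing_address_zip_code": "8800"},
]
plz = json.loads(
    '{"data": [{"PLZ": 8914, "Ortschaftsname": "Aeugst am Albis", "Kantonskürzel": "ZH"}]}'
)

# Index the PLZ records by zip code; keep the first record seen per PLZ.
plz_by_zip = {}
for rec in plz["data"]:
    plz_by_zip.setdefault(rec["PLZ"], rec)

joined = []
for order in orders:
    zip_code = int(order["billing_address_zip_code"])  # "8914" -> 8914 to match PLZ
    match = plz_by_zip.get(zip_code)  # None when the zip code has no PLZ record
    joined.append({**order, "Ortschaftsname": match["Ortschaftsname"] if match else None})

# Number of orders per zip code, e.g. as input for a plot.
orders_per_zip = Counter(int(o["billing_address_zip_code"]) for o in orders)
print(orders_per_zip.most_common(1))
```

With pandas, the same idea would be a DataFrame.merge using left_on="billing_address_zip_code" and right_on="PLZ" after converting both columns to a common dtype.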
