
python - read file into dictionary - delimited with curly brackets, no comma delimiter

I'm new to Python (pandas, NumPy, etc.) and would like to know the best and most performant way to solve this task.

I have a huge file in the following format, except that everything is on one line:

{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}

What is the easiest way to read this file into a Python dictionary or a pandas DataFrame? The goal is to join billing_address_zip_code against a big JSON file to get more insight into each order_reference.

  • I was thinking of solving it with a regular expression, but since the file is huge and needs to be joined to another file, I think I should use pandas, shouldn't I?
  • Or, as all records have the same length, I could also split by length.

Is there a pandas function for this? I guess that would be the fastest way, but since the file isn't standard JSON, I don't know how to do it.

I'm sorry for the beginner question, but I searched quite a bit on the internet and couldn't find the right answer. It would really help me to figure out the right approach to this kind of task. I'm very thankful for any help or links. Simon

PS: Which cloud environment do you use for this kind of task? Which works best with Python and the data science libraries?

UPDATE

I used the following code to turn the file into valid JSON and loaded it successfully with json.loads():

# syntax: Python 3
import json

# small test file; wrap the objects in a JSON array and separate them with commas
with open("orders_play_around.json") as f:
    my_list = "[" + f.read().replace("}{", "},\n{") + "]"

d = json.loads(my_list)

So far so good. Now the next challenge: how do I join this JSON data with another JSON file on billing_address_zip_code? The other JSON looks like this:

{
"data": [
{
  "BFS-Nr": 1,
  "Raum mit städtischem Charakter 2012": 4,
  "Typologie der MS-Regionen 2000 (2)": 3,
  "E": 679435,
  "Zusatzziffer": 0,
  "Agglomerationsgrössenklasse 2012": 1,
  "Gemeinde-typen (9 Typen) 2000 (1)": 4,
  "N": 235653,
  "Stadt/Land-Typologie 2012": 3,
  "Städte 2012": 0,
  "Gemeinde-Grössenklasse 2015": 7,
  "BFS Nr.": 1,
  "Sprachgebiete 2016": 1,
  "Europäsiche Berggebietsregionen (2)": 1,
  "Gemeindename_1": "Aeugst am Albis",
  "Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
  "Kantonskürzel": "ZH",
  "Kanton": 1,
  "Metropolräume 2000 (2)": 1,
  "PLZ": 8914,
  "Bezirk": 101,
  "Gemeindetypologie 2012\n(25 Typen)": 237,
  "Raumplanungs-regionen": 105,
  "Gemeindetypologie 2012\n(9 Typen)": 23,
  "Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
  "Ortschaftsname": "Aeugst am Albis",
  "Arbeitsmarktregionen 2000 (2)": 10,
  "Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
  "Städtische / Ländliche Gebiete 2000 (1)": 2,
  "Grossregionen": 4,
  "Gemeindename": "Aeugst am Albis",
  "MS-Regionen (2)": 4,
  "Touris-mus Regionen 2017": 3,
  "DEGURBA 2011 eurostat": 3
},
{....}
]
}

What is the easiest way to join them on the key PLZ from plz.js and billing_address_zip_code from orders_play_around.json? I could load the second JSON file without any problems:

plz_data=open('plz.js').read()
plz = json.loads(plz_data)

Sorry for the long message, but hopefully someone can help me with this problem. The goal is to plot the result on a map or a graph where I can see which PLZ (zip code) has the most orders.

Since you mention that turning your file into proper JSON is your initial goal, and you don't mind sed, try:

sed 's|}{|}\n{|g' originalfile > result

Note that I added newlines, not commas, which is probably better for future editing. You can use the -i flag so that sed edits in place, but writing to a new file is safer. If you really want to use Python, it's not a big deal with standard Python. The safest way is to read character by character:

with open("originalfile") as fd:
    first = True
    while True:
        ch = fd.read(1)
        if not ch:
            break
        # start a new line before every object except the first
        if ch == "{" and not first:
            print()
        first = False
        print(ch, end="")

or just replace and print (I've never tested the limits of Python, but I'm guessing this will work):

print(open("originalfile").read().replace("}{","}\n{"))

There's no need for regex here; it's a bit of overkill. Once the file has one JSON object per line it will be easier to use, including loading it through pandas.read_json (pass lines=True for this one-object-per-line layout).
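If you would rather not rewrite the file at all, the standard library can also decode back-to-back objects one after another. Here is a minimal sketch using json.JSONDecoder.raw_decode; the helper name iter_objects and the inline sample string are made up for illustration, and for a really huge file you would still need to think about memory, since this reads the whole text at once:

```python
import json

def iter_objects(text):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # returns the decoded object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# two records from the question, concatenated without any separator
sample = (
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}'
)
orders = list(iter_objects(sample))
```

This works the same whether the objects are separated by nothing, by newlines, or by other whitespace.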

Here's one way.

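data = []
# assumes one object per line
with open("originalfile") as fp:
    for line in fp:
        # drop braces, quotes and the trailing newline, leaving "key:value" strings
        clean_line = [x.replace("{", "").replace("}", "").replace('"', "") for x in line.strip().split(",")]
        data.append(clean_line)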

Then you can convert the data list into a pandas DataFrame and export it to JSON.

import pandas

df = pandas.DataFrame(data)
df.to_json()  # returns the JSON as a string when no path is given

If you want to drop the key names, e.g. "billing_address_zip_code", and keep only the values, you can do:
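As an alternative to stripping braces and quotes by hand, note that each line is itself a valid JSON object, so json.loads can parse it directly and keep the key names. A minimal sketch, assuming one object per line; io.StringIO is only a stand-in for the real file here:

```python
import io
import json

# stand-in for open("originalfile"); assumes one JSON object per line
fp = io.StringIO(
    '{"order_reference":"0658-2147","billing_address_zip_code":"8800"}\n'
    '{"order_reference":"0453-2200","billing_address_zip_code":"8400"}\n'
)

# parse every non-empty line into a dict
records = [json.loads(line) for line in fp if line.strip()]
```

Passing a list of dicts like records to pandas.DataFrame gives one column per key instead of unnamed columns.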

data = []
with open("originalfile") as fp:
    for line in fp:
        # keep only the part after the first ":" of each "key":"value" pair
        values = [x.split(":", 1)[1] for x in line.strip().split(",")]
        data.append([x.replace("}", "").replace('"', "") for x in values])
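For the join on PLZ that the question asks about, here is one possible sketch in plain Python. It assumes the orders are already parsed into a list of dicts and plz.js into a dict with a "data" list, as in the question. The third order is hypothetical, added only so that one row matches the single PLZ record shown in the question. Note that PLZ is an integer in plz.js while billing_address_zip_code is a string, so one side has to be converted:

```python
from collections import Counter

orders = [
    {"order_reference": "0658-2147", "billing_address_zip_code": "8800"},
    {"order_reference": "0453-2200", "billing_address_zip_code": "8400"},
    # hypothetical order, not from the question's data
    {"order_reference": "0000-0001", "billing_address_zip_code": "8914"},
]
# trimmed-down version of the record shown in the question
plz = {"data": [{"PLZ": 8914, "Ortschaftsname": "Aeugst am Albis"}]}

# index the PLZ records by zip code; if several records share a PLZ,
# the last one wins in this simple version
plz_by_code = {rec["PLZ"]: rec for rec in plz["data"]}

joined = []
for order in orders:
    # convert the string zip code to int before the lookup
    match = plz_by_code.get(int(order["billing_address_zip_code"]))
    # left join: keep the order even when no PLZ record matches
    joined.append({**order, **(match or {})})

# number of orders per zip code, e.g. as input for a bar chart
orders_per_plz = Counter(o["billing_address_zip_code"] for o in orders)
```

With pandas the same idea would be a merge of two DataFrames after casting both key columns to the same type; the dict index above is just the dependency-free version.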
