简体   繁体   中英

Formatting a string in required format in Python

I have a data in format:

id1 id2 value Something like

1   234  0.2
1   235  0.1

and so on. I want to convert it in json format:

{
  "nodes": [ {"name":"1"},  #first element
             {"name":"234"}, #second element
             {"name":"235"} #third element
             ] ,
   "links":[{"source":1,"target":2,"value":0.2},
             {"source":1,"target":3,"value":0.1}
           ]
}

So, from the original data to above format.. the nodes contain all the set of (distinct) names present in the original data and the links are basically the line number of source and target in the values list returned by nodes. For example:

   1 234 0.2

1 is in the first element in the list of values holded by the key "nodes" 234 is the second element in the list of values holded by the key "nodes"

Hence the link dictionary is {"source":1,"target":2,"value":0.2}

How do i do this efficiently in python.. I am sure there should be better way than what I am doing which is so messy :( Here is what I am doing from collections import defaultdict

def open_file(filename,output=None):
    f = open(filename,"r")
    offset = 3429
    data_dict = {}
    node_list = []
    node_dict = {}
    link_list = []
    num_lines = 0
    line_ids = []
    for line in f:
        line = line.strip()
        tokens = line.split()
        mod_wid  = int(tokens[1]) + offset


        if not node_dict.has_key(tokens[0]):
            d = {"name": tokens[0],"group":1}
            node_list.append(d)
            node_dict[tokens[0]] = True
            line_ids.append(tokens[0])
        if not node_dict.has_key(mod_wid):
            d = {"name": str(mod_wid),"group":1}
            node_list.append(d)
            node_dict[mod_wid] = True
            line_ids.append(mod_wid)


        link_d = {"source": line_ids.index(tokens[0]),"target":line_ids.index(mod_wid),"value":tokens[2]}
        link_list.append(link_d)
        if num_lines > 10000:
            break
        num_lines +=1


    data_dict = {"nodes":node_list, "links":link_list}

    print "{\n"
    for k,v in data_dict.items():
        print  '"'+k +'"' +":\n [ \n " 
        for each_v in v:
            print each_v ,","
        print "\n],"
    print "}"

open_file("lda_input.tsv")

I'm assuming by "efficiently" you're talking about programmer efficiency—how easy it is to read, maintain, and code the logic—rather than runtime speed efficiency. If you're worried about the latter, you're probably worried for no reason. (But the code below will probably be faster anyway.)

The key to coming up with a better solution is to think more abstractly. Think about rows in a CSV file, not lines in a text file; create a dict that can be rendered in JSON rather than trying to generate JSON via string processing; wrap things up in functions if you want to do them repeatedly; etc. Something like this:

import csv
import json
import sys

def parse(inpath, namedict):
    lastname = [0]
    def lookup_name(name):
        try:
            print('Looking up {} in {}'.format(name, names))
            return namedict[name]
        except KeyError:
            lastname[0] += 1
            print('Adding {} as {}'.format(name, lastname[0]))
            namedict[name] = lastname[0]
            return lastname[0]
    with open(inpath) as f:
        reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
        for id1, id2, value in reader:
            yield {'source': lookup_name(id1),
                   'target': lookup_name(id2),
                   'value': value}

for inpath in sys.argv[1:]:
    names = {}
    links = list(parse(inpath, names))
    nodes = [{'name': name} for name in names]
    outpath = inpath + '.json'
    with open(outpath, 'w') as f:
        json.dump({'nodes': nodes, 'links': links}, f, indent=4)

Don't construct the JSON manually. Make it out of an existing Python object with the json module:

def parse(data):
    nodes = set()
    links = set()

    for line in data.split('\n'):
        fields = line.split()

        id1, id2 = map(int, fields[:2])
        value = float(fields[2])

        nodes.update((id1, id2))
        links.add((id1, id2, value))

    return {
        'nodes': [{
            'name': node
        } for node in nodes],
        'links': [{
            'source': link[0],
            'target': link[1],
            'value': link[2]
        } for link in links]
    }

Now, you can use json.dumps to get a string:

>>> import json
>>> data = '1   234  0.2\n1   235  0.1'
>>> parsed = parse(data)
>>> parsed
    {'links': [{'source': 1, 'target': 235, 'value': 0.1},
  {'source': 1, 'target': 234, 'value': 0.2}],
 'nodes': [{'name': 1}, {'name': 234}, {'name': 235}]}
>>> json.dumps(parsed)
    '{"nodes": [{"name": 1}, {"name": 234}, {"name": 235}], "links": [{"source": 1, "target": 235, "value": 0.1}, {"source": 1, "target": 234, "value": 0.2}]}'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM