
Convert Tabular Data with “X” columns into dictionary without pandas or csv packages

I am preparing for a test, and one of the topics is parsing tabular data without using the csv or pandas packages.

The task is to take data with an arbitrary number of columns and convert it into a dictionary. The delimiter can be a space, colon, or comma. For instance, here is some data with a comma as the delimiter:

person,age,nationality,language, education
Jack,18,Canadian,English, bs
Rahul,25,Indian,Hindi, ms
Mark,50,American,English, phd
Kyou, 21, Japanese, English, bs

This should be converted into a dictionary like this:

{'person': ['Jack', 'Rahul', 'Mark', 'Kyou'], 'age': ['18', '25', '50', '21'], 'education': ['bs', 'ms', 'phd', 'bs'], 'language': ['English', 'Hindi', 'English', 'English'], 'nationality': ['Canadian', 'Indian', 'American', 'Japanese']}

Columns can vary between files, and my program should be flexible enough to handle this. For instance, the next file might have another column titled "gender".

I was able to get this working, but my code feels very "clunky". I would like to do something more "pythonic".

from collections import OrderedDict


def parse_data(myfile):
    # initialize myd as an ordered dictionary
    myd = OrderedDict()
    # open file with data
    with open(myfile, "r") as f:
        # use readlines to store tabular data in list format
        data = f.readlines()
        # use the first row to initialize the ordered dictionary keys
        for item in data[0].split(','):
            myd[item.strip()] = [] # initializing dict keys with column names
        # variable used to access different column values in remaining rows
        i = 0
        # access each key in the ordered dict
        for key in myd:
            '''Tabular data starting from line # 1 is accessed and
            split on the "," delimiter. The variable "i" is used to access 
            each column incrementally. Ordered dict format of myd ensures 
            columns are paired appropriately'''
            myd[key] = [ item.split(',')[i].strip() for item in data[1:]]
            i += 1
    print(dict(myd))

# my-input.txt 
parse_data("my-input.txt")

Can you please suggest how I can make my code "cleaner"?

Here is a more pythonic way to approach this.

def parse(file):
    with open(file, 'r') as f:
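        # first line holds the column names; strip whitespace around each one
        headings = [h.strip() for h in f.readline().split(',')]
        # every remaining non-blank line becomes a list of stripped strings
        values = [[v.strip() for v in l.split(',')] for l in f if l.strip()]
    # transpose rows into columns and pair each column with its heading
    output_dict = {h: list(v) for h, v in zip(headings, [*zip(*values)])}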
    return output_dict

print(parse('test.csv'))

First, the first line of the file is taken as the headings, which become the keys of the dictionary. (Duplicate headings will collide: a later column with the same heading overwrites the earlier one.)

Then, all the remaining values are read into a list of lists of strings using a list comprehension.

Finally, the dictionary is compiled by zipping the list of headings with a transpose of the values (that's what [*zip(*values)] represents; if you are willing to use numpy, you can replace this with numpy.array(values).T, for example).
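To make the transpose step concrete, here is a small sketch on an in-memory list of rows. The sample rows and the names `columns` and `table` are made up for illustration and are not part of the answer above.

```python
# Rows as they would look after splitting each line on ','
# (sample values chosen for illustration).
headings = ['person', 'age']
values = [['Jack', '18'], ['Rahul', '25'], ['Mark', '50']]

# zip(*values) passes each row to zip as a separate argument; zip then
# groups the first items together, the second items together, and so on.
columns = [list(col) for col in zip(*values)]
print(columns)  # [['Jack', 'Rahul', 'Mark'], ['18', '25', '50']]

# Pairing each heading with its column gives the final dictionary.
table = dict(zip(headings, columns))
print(table)  # {'person': ['Jack', 'Rahul', 'Mark'], 'age': ['18', '25', '50']}
```

Note that zip stops at the shortest row, so a row with a missing field would silently shorten the result rather than raise an error.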

Slightly better version

def parse_data(myfile):
  # read lines, strip surrounding whitespace and newlines, skip blank lines
  with open(myfile, "r") as f:
    lines = [line.strip() for line in f if line.strip()]

  # column names come from the first line; split it only once
  columns = [col.strip() for col in lines[0].split(",")]

  result = {}  # named "result" so the built-in dict is not shadowed

  # start loop from second line
  for x in range(1, len(lines)):
    row = lines[x].split(",")

    # for each line store every value under its column name
    for y in range(len(columns)):

      # if col is not present in result, create it with an empty list
      if columns[y] not in result:
        result[columns[y]] = []

      # store the corresponding column value
      result[columns[y]].append(row[y].strip())

  return result

print(parse_data("my-input.txt"))
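Both answers above split only on commas, while the question also allows a space or a colon as the delimiter. One possible way to cover all three is to split with a regular expression. This is only a sketch: it assumes fields never contain a space, comma, or colon themselves (no quoting), and the names `DELIMS` and `parse_lines` are hypothetical.

```python
import re

# Assumption: a separator is a comma or colon with optional spaces around it,
# or a run of whitespace. Fields themselves contain none of these characters.
DELIMS = re.compile(r'\s*[,:]\s*|\s+')

def parse_lines(lines):
    """Turn an iterable of delimited text lines into {column: [values]}."""
    # split every non-blank line into its fields
    rows = [DELIMS.split(line.strip()) for line in lines if line.strip()]
    headings, records = rows[0], rows[1:]
    # build one list per heading by picking the i-th field of every record
    return {h: [r[i] for r in records] for i, h in enumerate(headings)}

sample = [
    "person,age,nationality,language, education",
    "Jack,18,Canadian,English, bs",
    "Kyou, 21, Japanese, English, bs",
]
print(parse_lines(sample))
```

Because the function accepts any iterable of lines, an open file can be passed directly, for example `with open("my-input.txt") as f: parse_lines(f)`. A row with fewer fields than the header raises an IndexError here instead of being silently truncated.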


Hope it helps!
