I am preparing for a test and one of the topics is to parse tabular data without using the csv/pandas packages.
The task is to take data with an arbitrary number of columns and convert it into a dictionary. The delimiter can be a space, colon, or comma. For instance, here is some data with a comma as the delimiter -
person,age,nationality,language, education
Jack,18,Canadian,English, bs
Rahul,25,Indian,Hindi, ms
Mark,50,American,English, phd
Kyou, 21, Japanese, English, bs
This should be converted into a dictionary format like this -
{'person': ['Jack', 'Rahul', 'Mark', 'Kyou'], 'age': ['18', '25', '50', '21'], 'education': ['bs', 'ms', 'phd', 'bs'], 'language': ['English', 'Hindi', 'English', 'English'], 'nationality': ['Canadian', 'Indian', 'American', 'Japanese']}
Columns can vary among different files. My program should be flexible to handle this variety. For instance, in the next file there might be another column titled "gender".
I was able to get this working but feel my code is very "clunky". It works but I would like to do something more "pythonic".
from collections import OrderedDict

def parse_data(myfile):
    # initialize myd as an ordered dictionary
    myd = OrderedDict()
    # open file with data
    with open(myfile, "r") as f:
        # use readlines to store tabular data in list format
        data = f.readlines()
    # use the first row to initialize the ordered dictionary keys
    for item in data[0].split(','):
        myd[item.strip()] = []  # initializing dict keys with column names
    # variable used to access different column values in remaining rows
    i = 0
    # access each key in the ordered dict
    for key in myd:
        '''Tabular data starting from line 1 is accessed and
        split on the "," delimiter. The variable "i" is used to access
        each column incrementally. The ordered dict ensures
        columns are paired appropriately.'''
        myd[key] = [item.split(',')[i].strip() for item in data[1:]]
        i += 1
    print(dict(myd))

parse_data("my-input.txt")
Can you please suggest how I can make my code cleaner?
Here is a more pythonic way to approach this.
def parse(file):
    with open(file, 'r') as f:
        headings = f.readline().strip().split(',')
        values = [l.strip().split(',') for l in f]
    output_dict = {h: list(v) for h, v in zip(headings, [*zip(*values)])}
    return output_dict

print(parse('test.csv'))
First, the first line of the file is taken as the headings to use for the keys in the dictionary (this will break with duplicate headings).
Then, all the remaining values are read into a list of lists of strings using a list comprehension.
Finally, the dictionary is compiled by zipping the list of headings with a transpose of the values - that is what the [*zip(*values)] expression does. If you are willing to use numpy, you could replace it with numpy.array(values).T, for example. Note that zip produces tuples, so each column is wrapped in list() to match the desired output.
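The question also mentions that the delimiter can be a space or a colon, not just a comma. As a sketch (parse_any is a made-up name, not part of the answer above), the same transpose idea works if you swap str.split for re.split with a small character class:

```python
import re

def parse_any(path):
    # split each non-empty line on commas, colons, or runs of whitespace;
    # this also swallows the stray spaces after delimiters in the sample data
    with open(path) as f:
        rows = [re.split(r'[,:\s]+', line.strip()) for line in f if line.strip()]
    headings, *values = rows
    # zip(*values) transposes rows into columns, exactly as in parse() above
    return {h: list(col) for h, col in zip(headings, zip(*values))}
```

This assumes no field itself contains a delimiter character, which holds for the sample data but not for CSV in general.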
Slightly better version
def parse_data(myfile):
    # read lines and strip out extra whitespace and newline characters
    with open(myfile, "r") as f:
        lines = [line.strip() for line in f]
    # split the header row once instead of on every iteration
    columns = lines[0].split(",")
    result = {}  # initialize our result dict (avoid shadowing the dict builtin)
    # start the loop from the second line
    for x in range(1, len(lines)):
        # for each line, split the values and store them under their column
        for y in range(len(columns)):
            # if the column is not yet present, initialize it with a list
            if columns[y] not in result:
                result[columns[y]] = []
            # store the corresponding column value
            result[columns[y]].append(lines[x].split(",")[y])
    return result

print(parse_data("my-input.txt"))
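The membership check inside the loop can also be folded away with dict.setdefault. A sketch under the same assumptions (comma-delimited file with a header row; the function name is hypothetical):

```python
def parse_data_setdefault(myfile):
    with open(myfile) as f:
        headers = [h.strip() for h in f.readline().split(",")]
        columns = {}
        for line in f:
            # pair each value with its header; setdefault creates the
            # empty list the first time a column is seen
            for header, value in zip(headers, line.split(",")):
                columns.setdefault(header, []).append(value.strip())
    return columns
```

Iterating over the file object directly also avoids holding all lines in memory at once.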
Hope it helps!