Convert tabular data with "X" columns into a dictionary without the pandas or csv packages

Question

I am preparing for a test and one of the topics is parsing tabular data without using the csv or pandas packages.

The task is to take data with an arbitrary number of columns and convert it into a dictionary. The delimiter can be a space, colon or comma. For instance, here is some data with a comma as the delimiter:

person,age,nationality,language, education
Jack,18,Canadian,English, bs
Rahul,25,Indian,Hindi, ms
Mark,50,American,English, phd
Kyou, 21, Japanese, English, bs

This should be converted into a dictionary like this:

{'person': ['Jack', 'Rahul', 'Mark', 'Kyou'], 'age': ['18', '25', '50', '21'], 'education': ['bs', 'ms', 'phd', 'bs'], 'language': ['English', 'Hindi', 'English', 'English'], 'nationality': ['Canadian', 'Indian', 'American', 'Japanese']}

Columns can vary among different files, and my program should be flexible enough to handle this. For instance, the next file might have another column titled "gender".

I was able to get this working, but my code feels very "clunky". It works, but I would like to do something more "pythonic".

from collections import OrderedDict


def parse_data(myfile):
    # initialize myd as an ordered dictionary
    myd = OrderedDict()
    # open file with data
    with open (myfile, "r") as f:
        # use readlines to store tabular data in list format
        data = f.readlines()
        # use the first row to initialize the ordered dictionary keys
        for item in data[0].split(','):
            myd[item.strip()] = [] # initializing dict keys with column names
        # variable use to access different column values in remaining rows
        i = 0  
        # access each key in the ordered dict
        for key in myd:
            '''Tabular data starting from line # 1 is accessed and
            split on the "," delimiter. The variable "i" is used to access 
            each column incrementally. Ordered dict format of myd ensures 
            columns are paired appropriately'''
            myd[key] = [ item.split(',')[i].strip() for item in data[1:]]
            i += 1
    print(dict(myd))

# my-input.txt 
parse_data("my-input.txt")

Can you please suggest how I can make my code "cleaner"?

Answer 1

Here is a more pythonic way to approach this.

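def parse(file):
    with open(file, 'r') as f:
        # the first line holds the column names; strip stray spaces around each
        headings = [h.strip() for h in f.readline().split(',')]
        # each remaining non-blank line becomes a row of stripped strings
        values = [[v.strip() for v in l.split(',')] for l in f if l.strip()]
    # [*zip(*values)] transposes the rows into columns
    output_dict = {h: list(v) for h, v in zip(headings, [*zip(*values)])}
    return output_dict

print(parse('test.csv'))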

First, take the first line in the file as the headings to use for the keys in the dictionary (this will break with duplicate headings, since a later column overwrites an earlier one with the same name).

Then, all the remaining values are read into a list of lists of strings using a list comprehension.

Finally, the dictionary is compiled by zipping the list of headings with a transpose of the values (that's what [*zip(*values)] represents; if you are willing to use numpy, you can replace this with numpy.array(values).T, for example).
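
The transpose step can be seen in isolation with a few hard-coded rows. This is only a minimal sketch: the values are taken from the question's sample data, and the names rows, columns and table are illustrative rather than part of the answer's code.

```python
# Rows as they would look after splitting each line on the delimiter.
headings = ['person', 'age', 'nationality']
rows = [
    ['Jack', '18', 'Canadian'],
    ['Rahul', '25', 'Indian'],
    ['Mark', '50', 'American'],
]

# zip(*rows) groups the first item of every row, then the second item,
# and so on, which turns the rows into columns (a transpose).
columns = [list(col) for col in zip(*rows)]

# Pair each heading with its column to build the dictionary.
table = dict(zip(headings, columns))
print(table)
```

Note that zip stops at the shortest row, so a row with missing fields would silently shorten the result instead of raising an error.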

Answer 2

Slightly better version

def parse_data(myfile):
  # read lines and strip out extra whitespace and newline characters
  with open(myfile, "r") as f:
    lines = [line.strip() for line in f]

  result = {} # initialize our dict variable (named result to avoid shadowing the built-in dict)

  # column names come from the first line
  columns = [col.strip() for col in lines[0].split(",")]

  # start loop from second line
  for x in range(1, len(lines)):

    # split the current line into its values
    values = lines[x].split(",")

    # for each column, store the corresponding value in result[col]
    for y in range(len(columns)):

      # if col is not present in result, create it and initialize it with a list
      if columns[y] not in result:
        result[columns[y]] = []

      # store the corresponding column value in the dict
      result[columns[y]].append(values[y].strip())

  return result

print(parse_data("my-input.txt"))

See it in action here.

Hope it helps!
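
The question also mentions that the delimiter may be a space, a colon or a comma, while the code above splits on commas only. One possible way to cover all three, sketched here under the assumptions that each file uses a single delimiter and that no value contains one of those characters, is to guess the delimiter from the header line. The helper names guess_delimiter, split_fields and parse_lines are hypothetical.

```python
def guess_delimiter(header, candidates=(',', ':', ' ')):
    """Return the first candidate delimiter that occurs in the header line.

    Assumption: comma and colon are checked before space, because a
    comma-separated header may also contain stray spaces.
    """
    for delim in candidates:
        if delim in header:
            return delim
    return ','  # single-column header: any delimiter gives the same result


def split_fields(line, delim):
    """Split one line into stripped fields."""
    if delim == ' ':
        # split() with no argument collapses runs of whitespace
        return line.split()
    return [field.strip() for field in line.split(delim)]


def parse_lines(lines):
    """Convert an iterable of text lines into {column name: list of values}."""
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return {}
    delim = guess_delimiter(lines[0])
    headings = split_fields(lines[0], delim)
    rows = [split_fields(line, delim) for line in lines[1:]]
    # Column i of every row belongs to heading i.
    return {h: [row[i] for row in rows] for i, h in enumerate(headings)}


sample = ["person:age:language", "Jack:18:English", "Rahul:25:Hindi"]
print(parse_lines(sample))
```

Because parse_lines accepts any iterable of lines, an open file object can be passed to it directly. A row with fewer fields than the header raises an IndexError in this sketch rather than being padded.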

Answer 1
2 2019-07-25 06:33:59

Answer 2
0 2019-07-25 06:13:14
