Converting tabular data with "X" columns into a dictionary without the pandas or csv packages

Question

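I am preparing for a test, and one of the topics is parsing tabular data without using the csv or pandas packages.

The problem is to take data with an arbitrary number of columns and convert it into a dictionary. The delimiter can be a space, a colon, or a comma. For example, here is some data with commas as the delimiter: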

person,age,nationality,language, education
Jack,18,Canadian,English, bs
Rahul,25,Indian,Hindi, ms
Mark,50,American,English, phd
Kyou, 21, Japanese, English, bs

This should be converted into a dictionary like this:

{'person': ['Jack', 'Rahul', 'Mark', 'Kyou'], 'age': ['18', '25', '50', '21'], 'education': ['bs', 'ms', 'phd', 'bs'], 'language': ['English', 'Hindi', 'English', 'English'], 'nationality': ['Canadian', 'Indian', 'American', 'Japanese']}

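The columns can vary between files, and my program should be flexible enough to handle that. For example, the next file might have another column titled "gender".

I was able to get this working, but my code feels very "clunky". It works, but I would like to do something more "pythonic".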

from collections import OrderedDict


def parse_data(myfile):
    # initialize myd as an ordered dictionary
    myd = OrderedDict()
    # open file with data
    with open (myfile, "r") as f:
        # use readlines to store tabular data in list format
        data = f.readlines()
        # use the first row to initialize the ordered dictionary keys
        for item in data[0].split(','):
            myd[item.strip()] = [] # initializing dict keys with column names
        # variable use to access different column values in remaining rows
        i = 0  
        # access each key in the ordered dict
        for key in myd:
            '''Tabular data starting from line # 1 is accessed and
            split on the "," delimiter. The variable "i" is used to access 
            each column incrementally. Ordered dict format of myd ensures 
            columns are paired appropriately'''
            myd[key] = [ item.split(',')[i].strip() for item in data[1:]]
            i += 1
    print(dict(myd))

# my-input.txt 
parse_data("my-input.txt")

Could you suggest how I can make my code "cleaner"?

Answer 1

Here is a more pythonic way to solve this problem.

def parse(file):
    with open(file, 'r') as f:
        headings = f.readline().strip().split(',')
        values = [l.strip().split(',') for l in f]
    output_dict = {h: v for h, v in zip(headings, [*zip(*values)])}
    return output_dict

print(parse('test.csv'))

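First, read the first line of the file as the headings to use as keys in the dictionary (this will break if there are duplicate headings).

Then, use a list comprehension to read all the remaining values into a list of lists of strings.

Finally, build the dictionary by zipping the list of headings with the transpose of the values (given by [*zip(*values)]; if you are willing to use numpy, you could replace this with numpy.array(values).T, for example).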
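To make the transpose step concrete, here is a minimal sketch that works on in-memory rows rather than a file. The helper name columns_from_rows is made up for illustration, and stripping each cell is an added assumption (the sample data has spaces after some commas, which a plain split(',') would keep).

```python
def columns_from_rows(headings, rows):
    """Pair each heading with its column by transposing the rows."""
    # Assumption: padding around a value is not significant, so " bs" -> "bs".
    headings = [h.strip() for h in headings]
    rows = [[cell.strip() for cell in row] for row in rows]
    # zip(*rows) turns a list of rows into an iterator of columns (tuples).
    return {h: list(col) for h, col in zip(headings, zip(*rows))}


headings = "person,age, education".split(",")
rows = [line.split(",") for line in ["Jack,18, bs", "Kyou, 21, bs"]]
table = columns_from_rows(headings, rows)
print(table)
```

Note that zip stops at the shortest input, so a row with missing cells would silently shorten every later column; checking that each row has len(headings) cells first is probably worth it for untrusted input.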

Answer 2

A slightly better version

def parse_data(myfile):
  # read lines and strip out extra whitespace and newline characters
  with open(myfile, "r") as f:
    lines = [line.strip() for line in f]

  headers = lines[0].split(",")  # column names from the first line
  result = {}  # initialize our dict variable (named to avoid shadowing the built-in dict)

  # start loop from second line
  for x in range(1, len(lines)):
    values = lines[x].split(",")

    # for each line, store its values in result[col]
    for y in range(len(headers)):

      # if col is not present in result, create the column and initialize it with a list
      if headers[y] not in result:
        result[headers[y]] = []

      # store the corresponding column value in the dict
      result[headers[y]].append(values[y])

  return result

print(parse_data("my-input.txt"))


Hope this helps!
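Neither answer covers the other delimiters mentioned in the question (space or colon) or guards against duplicate column names. Below is a rough sketch of one way to handle both. It is not taken from either answer: parse_lines and DELIMITER are hypothetical names, and it assumes no field contains a comma, colon, or whitespace itself.

```python
import re

# Assumption: fields are separated by a comma, a colon, or whitespace, and no
# field contains any of those characters itself.
DELIMITER = re.compile(r"\s*[,:]\s*|\s+")


def parse_lines(lines):
    """Build a {column name: list of values} dict from an iterable of text lines.

    Assumes at least one non-blank line (the header) is present.
    """
    rows = [DELIMITER.split(line.strip()) for line in lines if line.strip()]
    headings, body = rows[0], rows[1:]
    if len(set(headings)) != len(headings):
        raise ValueError("duplicate column names would overwrite each other")
    result = {h: [] for h in headings}
    for row in body:
        for h, value in zip(headings, row):
            result[h].append(value)
    return result


sample = ["person:age:language", "Jack:18:English", "Rahul 25 Hindi", "Kyou, 21, English"]
print(parse_lines(sample))
```

Because a file object is already an iterable of lines, the same function could be called as parse_lines(f) inside a with open(...) block. Splitting on whitespace means a value such as "New Zealand" would be broken in two, so this only suits data like the sample in the question.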

2 solutions

Solution 1
2 2019-07-25 06:33:59

Solution 2
0 2019-07-25 06:13:14
