Reading a complex txt file (including arrays) into Python

Question

I have a txt file in which one column contains arrays. Every row has a different-sized array as that column's value (the array holds strings, separated by commas and enclosed in quotes). Here's a sample:

   Color  Count People
0   Red   2     ["Ben", "Mike"]
1  Blue   0     []
2  Yellow 3     ["Ben", "Mike", "Tom"]

I want to work on this file with Pandas as a data frame. I have tried many ways of reading the file, including regex, but without success.

I think I may have to read it line by line and build a table iteratively, but I am not sure how.

I would be grateful for any suggestions.

Answer 1

Your sample looks a lot like the display of a pandas dataframe. Saving such a frame to csv raises some of the same problems, though in a simpler form. I recently answered a similar case, which differed in that the lists were numpy arrays.

Anyway, let's try a simpler recreation.

Define a dictionary:

In [77]: {'Color':'Red', 'Count':2, 'People':['Ben','Mike']}                                              
Out[77]: {'Color': 'Red', 'Count': 2, 'People': ['Ben', 'Mike']}
In [78]: pd.DataFrame([_])                                                                                
Out[78]: 
  Color  Count       People
0   Red      2  [Ben, Mike]

Or with 3 rows:

In [79]: pd.DataFrame([_77,_77,_77])                                                                      
Out[79]: 
  Color  Count       People
0   Red      2  [Ben, Mike]
1   Red      2  [Ben, Mike]
2   Red      2  [Ben, Mike]

If we write that to a csv:

In [80]: df = pd.DataFrame([_77,_77,_77])                                                                 
In [81]: df.to_csv('stack59675154.txt')                                                                   
In [82]: cat stack59675154.txt                                                                            
,Color,Count,People
0,Red,2,"['Ben', 'Mike']"
1,Red,2,"['Ben', 'Mike']"
2,Red,2,"['Ben', 'Mike']"

This is like your file, except that it uses a comma delimiter and puts quotes around the lists. It actually saves the print string of the list, str(['Ben', 'Mike']).

pandas can read this:

In [83]: data = pd.read_csv('stack59675154.txt')                                                          
In [84]: data                                                                                             
Out[84]: 
   Unnamed: 0 Color  Count           People
0           0   Red      2  ['Ben', 'Mike']
1           1   Red      2  ['Ben', 'Mike']
2           2   Red      2  ['Ben', 'Mike']
In [85]: data['People'][0]                                                                                
Out[85]: "['Ben', 'Mike']"

But the People column actually contains strings, not lists. Fortunately, the strings can be evaluated with eval:

In [86]: eval(_)                                                                                          
Out[86]: ['Ben', 'Mike']

There's a safer string evaluator, but I won't get into that.
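For readers who do want the safer route, one candidate is ast.literal_eval from the standard library. This is an assumption about which evaluator is meant, not something the answer names. The sketch below applies it to the same cell string shown above.

```python
import ast

# The string that read_csv returned for one People cell.
cell = "['Ben', 'Mike']"

# literal_eval parses Python literals only; it never calls functions.
people = ast.literal_eval(cell)
print(people)
print(type(people))

# Input that is not a plain literal is rejected instead of executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError) as exc:
    print("rejected:", type(exc).__name__)
```

With pandas, the same function can be applied to the whole column, for example data['People'] = data['People'].apply(ast.literal_eval), or passed to read_csv through its converters argument.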

Without the quotes around the list in a line such as 0,Red,2,"['Ben', 'Mike']", the reader would have problems, since it would see the delimiter inside the brackets. The simple numpy.genfromtxt reader cannot even handle those quotes. Simple readers just use line_list = line.split(delimiter), getting a list of strings, one for each 'column'.
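The difference between a plain split and a quote-aware reader can be seen on that one line. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import csv
import io

line = "0,Red,2,\"['Ben', 'Mike']\""

# A plain split cuts at every comma, including the one inside the list.
naive = line.split(',')
print(len(naive), naive)

# csv.reader honours the double quotes and keeps the list in one field.
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed), parsed)
```

The split produces five pieces, while the csv reader returns four fields with the list string intact.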

I believe the reader(s) in the csv module can be instructed to handle nesting quotes, and maybe even nesting brackets.
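For the whitespace-separated layout in the question, a line-by-line approach may be simpler than configuring a csv reader. The sketch below is one possibility under stated assumptions: the Color value contains no spaces, and each list is valid JSON (double-quoted strings), as in the sample. The names sample and records are made up for illustration.

```python
import json

sample = """\
   Color  Count People
0   Red   2     ["Ben", "Mike"]
1  Blue   0     []
2  Yellow 3     ["Ben", "Mike", "Tom"]
"""

records = []
for line in sample.splitlines()[1:]:  # skip the header row
    if not line.strip():
        continue
    # Split on whitespace at most 3 times so the list stays in one piece.
    _index, color, count, people = line.split(None, 3)
    records.append(
        {"Color": color, "Count": int(count), "People": json.loads(people)}
    )

print(records)
```

Passing records to pd.DataFrame should then give a frame whose People column holds real lists rather than strings. For a file on disk, the loop would iterate over the open file object instead of sample.splitlines().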
