
Read a complex txt file (includes arrays) to Python

I have a txt file in which one column holds arrays: every row has an array of a different size as that column's value (the array contains strings, each in quotes and separated by commas). Here's a sample:

   Color  Count People
0   Red   2     ["Ben", "Mike"]
1  Blue   0     []
2  Yellow 3     ["Ben", "Mike", "Tom"]

I want to work with this file as a pandas data frame. I have tried many approaches to reading the file, including regex, but without success.

I think I may have to read it line by line and build a table iteratively but I am not sure how.

I would be grateful for suggestions.

Your sample looks a lot like the display of a pandas dataframe. Saving such a frame to csv has some of the same problems, though maybe simpler. I recently responded to a similar case, different in that the lists were replaced with numpy arrays.

Anyways, let's try a simpler recreation.

Define a dictionary:

In [76]: import pandas as pd
In [77]: {'Color':'Red', 'Count':2, 'People':['Ben','Mike']}
Out[77]: {'Color': 'Red', 'Count': 2, 'People': ['Ben', 'Mike']}
In [78]: pd.DataFrame([_])                                                                                
Out[78]: 
  Color  Count       People
0   Red      2  [Ben, Mike]

Or with 3 rows:

In [79]: pd.DataFrame([_77,_77,_77])                                                                      
Out[79]: 
  Color  Count       People
0   Red      2  [Ben, Mike]
1   Red      2  [Ben, Mike]
2   Red      2  [Ben, Mike]

If we write that to a csv:

In [80]: df = pd.DataFrame([_77,_77,_77])                                                                 
In [81]: df.to_csv('stack59675154.txt')                                                                   
In [82]: cat stack59675154.txt                                                                            
,Color,Count,People
0,Red,2,"['Ben', 'Mike']"
1,Red,2,"['Ben', 'Mike']"
2,Red,2,"['Ben', 'Mike']"

This is like your file, except that it uses a comma delimiter and quotes the lists. It is actually saving the print string of the list, str(['Ben', 'Mike']).

pandas can read this:

In [83]: data = pd.read_csv('stack59675154.txt')                                                          
In [84]: data                                                                                             
Out[84]: 
   Unnamed: 0 Color  Count           People
0           0   Red      2  ['Ben', 'Mike']
1           1   Red      2  ['Ben', 'Mike']
2           2   Red      2  ['Ben', 'Mike']
In [85]: data['People'][0]                                                                                
Out[85]: "['Ben', 'Mike']"

But the People column actually contains strings, not lists. Fortunately the strings can be passed to eval:

In [86]: eval(_)                                                                                          
Out[86]: ['Ben', 'Mike']

There's a safer string evaluator, but I won't get into that.
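For reference, the safer evaluator I have in mind is ast.literal_eval from the standard library, which only accepts Python literals (lists, strings, numbers and so on) rather than arbitrary code. A minimal sketch, assuming the file looks like the one written above; the text is inlined through io.StringIO here only so the example is self-contained, and index_col=0 is my addition to avoid the "Unnamed: 0" column:

```python
import ast
import io

import pandas as pd

# Same text that df.to_csv() wrote in the session above (inlined for the example).
csv_text = ''',Color,Count,People
0,Red,2,"['Ben', 'Mike']"
1,Red,2,"['Ben', 'Mike']"
2,Red,2,"['Ben', 'Mike']"
'''

# converters applies ast.literal_eval to every People cell while reading;
# index_col=0 reuses the saved index instead of creating "Unnamed: 0".
data = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    converters={"People": ast.literal_eval},
)

first = data["People"].iloc[0]
print(type(first).__name__, first)  # list ['Ben', 'Mike']
```

With a real file, the path would go where io.StringIO(csv_text) is. The same function could also be applied after the fact with data['People'].apply(ast.literal_eval).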

Consider the line 0,Red,2,"['Ben', 'Mike']". Without the quotes, the reader would have problems, since it would see the delimiter inside the brackets. The simple numpy.genfromtxt reader can't even handle those quotes. Simple readers just use line_list = line.split(delimiter), getting a list of strings, one for each 'column'.
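To make the delimiter problem concrete, here is a small sketch comparing a plain split with the standard csv reader on that one line. The variable names are just for illustration:

```python
import csv

line = '''0,Red,2,"['Ben', 'Mike']"'''

# A plain split sees the comma inside the brackets as another delimiter,
# so the list is cut in two and the line yields five pieces.
naive = line.split(',')
print(len(naive))

# csv.reader honours the double quotes, so the list stays in one field.
quoted = next(csv.reader([line]))
print(len(quoted), quoted[-1])
```

The plain split gives five pieces, while csv.reader gives four fields, the last being the string "['Ben', 'Mike']", which still has to be evaluated to become a list.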

I believe the reader(s) in the csv module can be instructed to handle nesting quotes, and maybe even nesting brackets.
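As far as I know, the csv module has options for quoting but none for brackets, so for the whitespace-separated sample in the question a line-by-line parse is probably simpler. A rough sketch, under the assumptions that the list is always the last column, that the first three fields contain no spaces, and that the list is valid JSON (double-quoted strings, as in the sample):

```python
import json

# The sample from the question, inlined so the example is self-contained.
sample = '''   Color  Count People
0   Red   2     ["Ben", "Mike"]
1  Blue   0     []
2  Yellow 3     ["Ben", "Mike", "Tom"]
'''

lines = sample.strip().splitlines()
rows = []
for line in lines[1:]:  # skip the header line
    # maxsplit=3 splits off index, Color and Count, and leaves the
    # rest of the line (the bracketed list) as a single string.
    _index, color, count, people = line.split(maxsplit=3)
    rows.append({
        "Color": color,
        "Count": int(count),
        "People": json.loads(people),  # '["Ben", "Mike"]' -> ['Ben', 'Mike']
    })

print(rows[1])  # {'Color': 'Blue', 'Count': 0, 'People': []}
```

Passing rows to pd.DataFrame(rows) should then give a frame whose People column holds real lists, like the one built from dictionaries at the start. For a file on disk, the lines would come from open(...) instead of the inlined string.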


 