
Python read text file with newline- and paragraph-separated elements

I am trying to read a text file to a nested list in Python. That is, I would like to have the output as:

[[$5.79, Breyers Ice Cream, Homemade Vanilla, 48 oz], [$6.39, Haagen-dazs, Vanilla Bean Ice Cream, 1 pt], etc...]]

The ultimate goal is to read the information into a pandas DataFrame for some exploratory analysis.

The Data (in a .txt file)

$5.79  
Breyers Ice Cream  
Homemade Vanilla  
48 oz

$6.39  
Haagen-dazs  
Vanilla Bean Ice Cream  
1 pt

$6.89  
So Delicious  
Dairy Free Coconutmilk No Sugar Added Dipped Vanilla Bars  
4 x 2.3 oz

$5.79  
Popsicle Fruit Pops Mango  
12 ct

What I've Tried

with open('sample.txt') as f:
    creams = f.read()

creams = creams.split("\n\n")

However, this returns:

['$5.79\nBreyers Ice Cream\nHomemade Vanilla\n48 oz', '$6.39\nHaagen-dazs\nVanilla Bean Ice Cream\n1 pt',

I have also tried list comprehensions, which look cleaner than the code above, but these attempts only split on newlines and do not group the lines by paragraph. For example:

[x for x in open('<file_name>.txt').read().splitlines()]  
#Gives
['$5.79', 'Breyers Ice Cream', 'Homemade Vanilla', '48 oz', '', '$6.39', 'Haagen-dazs', 'Vanilla Bean Ice Cream', '1 pt', '', '

I know I would need to nest a list within the list comprehension, but I'm unsure how to perform the split.

Note: This is my first posted question, sorry for the length or lack of brevity. Seeking help because there are similar questions but not with the outcome I desire.

You are nearly there once you have the four-line groups separated. All that's left is to split the groups again by a single newline.

with open('creams.txt','r') as f:
    creams = f.read()

creams = creams.split("\n\n")
creams = [lines.split('\n') for lines in creams]
print(creams)
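
Most lines in the sample data end with trailing spaces, and a "blank" separator line in a real file may itself contain whitespace. A minimal sketch that tolerates both, assuming the whole file fits in memory; the helper name `parse_records` is hypothetical and the sample string stands in for the contents of the file:

```python
import re

def parse_records(text):
    """Split text into records on blank lines, then into stripped fields."""
    # A separator line may still hold spaces or tabs, so split on a newline,
    # optional whitespace, and another newline instead of a literal "\n\n".
    blocks = re.split(r"\n\s*\n", text.strip())
    # Strip each line so trailing spaces do not end up in the fields.
    return [[line.strip() for line in block.splitlines()] for block in blocks]

sample = (
    "$5.79  \nBreyers Ice Cream  \nHomemade Vanilla  \n48 oz\n"
    "\n"
    "$6.39  \nHaagen-dazs  \nVanilla Bean Ice Cream  \n1 pt\n"
)
records = parse_records(sample)
print(records)
```

With a real file, the same function could be called as `parse_records(f.read())` inside the `with open(...)` block.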

You just have to split it again.

with open('sample.txt','r') as file:
    creams = file.read()

creams = creams.split("\n\n")
creams = [lines.split('\n') for lines in creams]

print(creams)
#[['$5.79  ', 'Breyers Ice Cream  ', 'Homemade Vanilla  ', '48 oz'], ['$6.39  ', 'Haagen-dazs  ', 'Vanilla Bean Ice Cream  ', '1 pt'], ['$6.89  ', 'So Delicious  ', 'Dairy Free Coconutmilk No Sugar Added Dipped Vanilla Bars  ', '4 x 2.3 oz'], ['$5.79  ', 'Popsicle Fruit Pops Mango', '-', '12 ct']]

# Convert to a DataFrame
import pandas as pd
df = pd.DataFrame(creams, columns=['Amnt', 'Brand', 'Flavor', 'Qty'])
print(df)

      Amnt                      Brand  \
0  $5.79          Breyers Ice Cream     
1  $6.39                Haagen-dazs     
2  $6.89               So Delicious     
3  $5.79    Popsicle Fruit Pops Mango   

                                              Flavor         Qty  
0                                 Homemade Vanilla         48 oz  
1                           Vanilla Bean Ice Cream          1 pt  
2  Dairy Free Coconutmilk No Sugar Added Dipped V...  4 x 2.3 oz  
3                                                  -       12 ct  

Note: I added - in the Flavor column of the last row because that record has no flavor line. In your original dataset, you must take such missing fields into consideration before performing any analysis.
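
One way to handle such short records without editing the data by hand is to pad them in code before building the DataFrame. The sketch below assumes, as in the note above, that the last line of every record is the quantity and that any missing line is the flavor, so it fills the gap with None. The helper name `pad_record` and that positional assumption are illustrative only; other datasets may be missing a different field.

```python
COLUMNS = ['Amnt', 'Brand', 'Flavor', 'Qty']

def pad_record(fields, width=len(COLUMNS)):
    """Return stripped fields padded to `width`, inserting None before the last field."""
    fields = [f.strip() for f in fields]
    missing = width - len(fields)
    if missing < 0:
        raise ValueError(f"too many fields: {fields}")
    # Assumption: the final line is always the quantity, so gaps go before it.
    return fields[:-1] + [None] * missing + fields[-1:]

creams = [
    ['$5.79  ', 'Breyers Ice Cream  ', 'Homemade Vanilla  ', '48 oz'],
    ['$5.79  ', 'Popsicle Fruit Pops Mango  ', '12 ct'],
]
rows = [pad_record(r) for r in creams]
print(rows)
```

The padded `rows` can then be passed to `pd.DataFrame(rows, columns=COLUMNS)` in place of the raw nested list; pandas shows None as a missing value rather than a placeholder string.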

