
How to create a DataFrame from a CSV file with a colon separator

I am parsing an Outlook message with the following code:

email_content = str(message.Body)
lines_stripped = [line.strip() for line in email_content.split('\r\n') if line.strip() != '']
writer = csv.writer(write_file, delimiter=" ")  # create the writer once, not once per line
for line in lines_stripped:
    writer.writerow(line.split())

The CSV file looks like this:

Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car

I want to transform this into something like this:

Car     Color       Comment
Mazda   Green       A very nice Car
Toyota  Black       Okay car

I would do most of this in pure Python, using this split_at pattern:

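In [11]: def split_at(lst, f):
    ...:     # indices of the items that start a new group
    ...:     inds = [i for i, x in enumerate(lst) if f(x)]
    ...:     for i, j in zip(inds, inds[1:]):
    ...:         yield lst[i:j]
    ...:     # last group; inds[-1] avoids a NameError when there is only one group
    ...:     if inds:
    ...:         yield lst[inds[-1]:]
    ...: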

This lets you split the list of properties into one group per car (here cars initially holds the file contents as a single string):

In [12]: cars = [c.split(": ", 1) for c in cars.splitlines() if c]

In [13]: cars
Out[13]:
[['Car', 'Mazda'],
 ['Color', 'Green'],
 ['Comment', 'A very nice Car'],
 ['Car', 'Toyota'],
 ['Color', 'Black'],
 ['Comment', 'Okay car']]

In [14]: pd.DataFrame([dict(c) for c in split_at(cars, lambda x: x[0] == "Car")])
Out[14]:
      Car  Color          Comment
0   Mazda  Green  A very nice Car
1  Toyota  Black         Okay car
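
For reference, the same idea can be written as a standalone script. This is only a minimal sketch: the variable name raw and the inline sample text are assumptions standing in for the real file contents, and split_at is written slightly differently here so that the last group needs no special case.

```python
import pandas as pd

# Assumed sample input; in practice this would be the text read from the file.
raw = """Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car"""


def split_at(lst, f):
    """Yield consecutive slices of lst, each starting at an item where f is true."""
    inds = [i for i, x in enumerate(lst) if f(x)]
    # Pair every start index with the next start (or the end of the list).
    for i, j in zip(inds, inds[1:] + [len(lst)]):
        yield lst[i:j]


# Split each non-empty line once on ": " so a value may itself contain a colon.
pairs = [line.split(": ", 1) for line in raw.splitlines() if line.strip()]

# Every "Car" key starts a new record; dict() turns one record's pairs into a row.
records = [dict(chunk) for chunk in split_at(pairs, lambda p: p[0] == "Car")]
df = pd.DataFrame(records)
print(df)
```

Because records are built as dicts, a record that lacks one of the keys simply ends up with NaN in that column.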

## data
import pandas as pd
from io import StringIO

temp = StringIO("""
Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car""")

df = pd.read_csv(temp, sep=':', engine='python', header=None)
df.columns = ['A','B']

##print(df)

         A                 B
0      Car             Mazda
1    Color             Green
2  Comment   A very nice Car
3      Car            Toyota
4    Color             Black
5  Comment          Okay car

Pivot on column A, then sort each column so that the nulls come last and drop the leftover null rows:

df.pivot(columns='A', values='B').apply(sorted, key=pd.isnull).dropna()

Output

A      Car   Color           Comment
0    Mazda   Green   A very nice Car
1   Toyota   Black          Okay car
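
A variant of the pivot approach avoids the sort-and-dropna step by giving each row an explicit record number first. The following is a sketch under two assumptions: every record starts with a Car line, and no value contains a colon. The names text, key, value, record and wide are illustrative only.

```python
from io import StringIO

import pandas as pd

# Assumed sample input in the same layout as the question.
text = """Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car
"""

# skipinitialspace drops the blank that follows each colon;
# blank lines are skipped by default.
df = pd.read_csv(StringIO(text), sep=":", header=None,
                 names=["key", "value"], skipinitialspace=True)

# A new record starts at every "Car" row, so a running count of
# those rows labels each row with its record number (0, 1, ...).
df["record"] = (df["key"] == "Car").cumsum() - 1

# One row per record, one column per key.
wide = df.pivot(index="record", columns="key", values="value")
print(wide)
```

With an explicit record index, a record that is missing a field keeps its other values in the right row instead of having later values shift up.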

This should work:

import numpy as np
import pandas as pd
import io

temp = '''
Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car

'''
input_csv = io.StringIO(temp)
#input_csv = 'hello.csv'
df = pd.read_csv(input_csv, sep=":", skip_blank_lines=True,header=None)
data = np.array_split(df[1].to_numpy(), len(df) // 3)  # one chunk of 3 values per car
df2 = pd.DataFrame(data, columns=df[0].unique())
print(df2)

       Car   Color           Comment
0    Mazda   Green   A very nice Car
1   Toyota   Black          Okay car
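
The chunking above relies on every record having the same fields in the same order. As a hedged sketch of the same idea, the snippet below makes that assumption explicit with a length check and uses reshape instead of np.array_split; the names text, keys and df2 are illustrative.

```python
from io import StringIO

import pandas as pd

# Assumed sample input in the same layout as the question.
text = """Car: Mazda

Color: Green

Comment: A very nice Car

Car: Toyota

Color: Black

Comment: Okay car
"""

df = pd.read_csv(StringIO(text), sep=":", header=None, skipinitialspace=True)

# Distinct keys in order of first appearance: Car, Color, Comment.
keys = list(df[0].unique())
n = len(keys)

# Fail loudly if some record is incomplete, rather than misaligning values.
if len(df) % n != 0:
    raise ValueError("number of lines is not a multiple of the number of fields")

# Each consecutive run of n values becomes one row.
df2 = pd.DataFrame(df[1].to_numpy().reshape(-1, n), columns=keys)
print(df2)
```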

Using pure Python + pandas (temp is the string defined in the previous answer):

cars = []
colors = []
comments = []

lines = io.StringIO(temp).readlines()
for line in lines:
  # split only on the first colon so a value may itself contain colons
  if line.startswith('Car'):
    cars.append(line.split(':', 1)[1].strip())
  elif line.startswith('Color'):
    colors.append(line.split(':', 1)[1].strip())
  elif line.startswith('Comment'):
    comments.append(line.split(':', 1)[1].strip())

df = pd.DataFrame({'car': cars, 'color': colors, 'comment': comments})
df
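
The loop above hardcodes the three field names. A more general sketch is shown below; the function parse_records and its first_key parameter are hypothetical names introduced for illustration, and the sample text is invented to show a value containing a colon and a record with a missing field.

```python
import pandas as pd


def parse_records(text, first_key="Car"):
    """Collect 'Key: value' lines into one dict per record.

    A new record starts whenever first_key is seen (an assumption
    about the layout of the input).
    """
    records = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and anything without a key
        key, value = line.split(":", 1)  # split on the first colon only
        key, value = key.strip(), value.strip()
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records


# Invented sample: the first comment contains a colon, the second car has no comment.
sample = "Car: Mazda\n\nColor: Green\n\nComment: Price: fair\n\nCar: Toyota\n\nColor: Black\n"
df = pd.DataFrame(parse_records(sample))
print(df)
```

Fields that are absent from a record show up as NaN, so the lists-of-equal-length requirement of the loop above no longer applies.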
