
Reading multiple tables from one CSV file in pandas

Suppose I have a CSV file like this:

Name: Jack
Place: Binghampton
Age:27
Month,Sales,Revenue
Jan,51,$1000
Feb,20,$1050
Mar,100,$10000
### Blank File Space
### Blank File Space
Name: Jill
Place: Hamptonshire
Age: 49
Month,Sales,Revenue
Apr,11,$1000
May,55,$3000
Jun,23,$4600
### Blank File Space
### Blank File Space
...

The contents of the file are evenly spaced as shown. I want to read each Month,Sales,Revenue portion in as its own DataFrame. I know I can do this manually with:

df_Jack = pd.read_csv('./sales.csv', skiprows=3, nrows=3)
df_Jill = pd.read_csv('./sales.csv', skiprows=12, nrows=3)

I'm not too worried about naming the DataFrames, since I think I can handle that on my own. I just don't know how to iterate through the evenly spaced file to find the sales records and store each one as its own DataFrame.

Thanks for any help in advance!

How about creating a list of DataFrames?

from io import StringIO

import pandas as pd

csvfile = StringIO("""Name: Jack
Place: Binghampton
Age:27
Month,Sales,Revenue
Jan,51,$1000
Feb,20,$1050
Mar,100,$10000
### Blank File Space
### Blank File Space
Name: Jill
Place: Hamptonshire
Age: 49
Month,Sales,Revenue
Apr,11,$1000
May,55,$3000
Jun,23,$4600
### Blank File Space
### Blank File Space""")

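# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv(csvfile, sep=',', on_bad_lines='skip', names=['Month','Sales','Revenue'])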

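# Drop the metadata and separator rows (NaN in Sales and Revenue) and the repeated header rows
df1 = df.dropna().loc[df.Month != 'Month']

# Slice the remaining rows into consecutive chunks of three, one chunk per table
listofdf = [df1[i:i+3] for i in range(0, df1.shape[0], 3)]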

print(listofdf[0])

Output:

  Month Sales Revenue
4   Jan    51   $1000
5   Feb    20   $1050
6   Mar   100  $10000

print(listofdf[1])

Output:

   Month Sales Revenue
13   Apr    11   $1000
14   May    55   $3000
15   Jun    23   $4600
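
The fixed step of 3 in the slicing above relies on every table having exactly three rows. If the tables might differ in length, one possible variation is to treat each repeated header row as the start of a new table and group on a running count of those rows. This is only a sketch: the shortened sample below (Jill has two rows, and the separators are truly blank lines) and the names `is_header`, `table_id`, `is_data` and `tables` are my own assumptions, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical variant of the sample file with unequal table lengths.
raw = StringIO("""Name: Jack
Place: Binghampton
Age:27
Month,Sales,Revenue
Jan,51,$1000
Feb,20,$1050
Mar,100,$10000


Name: Jill
Place: Hamptonshire
Age: 49
Month,Sales,Revenue
Apr,11,$1000
May,55,$3000
""")

# Metadata lines contain a single field, so Sales and Revenue come back as NaN.
df = pd.read_csv(raw, names=['Month', 'Sales', 'Revenue'])

# Every repeated header row marks the start of a new table.
is_header = df['Month'] == 'Month'
table_id = is_header.cumsum()

# Data rows are the ones that are not headers and have all three fields.
is_data = ~is_header & df.notna().all(axis=1)

tables = [
    group.reset_index(drop=True)
    for _, group in df[is_data].groupby(table_id[is_data])
]

print(tables[1])
```

As in the answer above, Sales and Revenue stay as strings here because the header text shares a column with the numbers.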

Obviously you could do this:

dfs = [pd.read_csv('./sales.csv', skiprows=i, nrows=3) for i in range(3, n, 9)]
# where n is your expected end line...
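
If you'd rather not work out n by hand, something along these lines could find the header offsets first and then reuse the same skiprows/nrows call. Treat it as a sketch: `read_tables` and its parameters are names I made up, the temporary file only stands in for ./sales.csv, and it still assumes three data rows per table.

```python
import os
import tempfile

import pandas as pd

# Sample content copied from the question, used here in place of ./sales.csv.
sample = """Name: Jack
Place: Binghampton
Age:27
Month,Sales,Revenue
Jan,51,$1000
Feb,20,$1050
Mar,100,$10000
### Blank File Space
### Blank File Space
Name: Jill
Place: Hamptonshire
Age: 49
Month,Sales,Revenue
Apr,11,$1000
May,55,$3000
Jun,23,$4600
### Blank File Space
### Blank File Space
"""


def read_tables(path, header_line='Month,Sales,Revenue', nrows=3):
    """Read `nrows` data rows after every line that equals `header_line`."""
    # First pass: record the 0-based line number of each header line.
    with open(path) as f:
        starts = [i for i, line in enumerate(f) if line.strip() == header_line]
    # Second pass: one read_csv call per table, skipping everything before its header.
    return [pd.read_csv(path, skiprows=i, nrows=nrows) for i in starts]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sales.csv')
    with open(path, 'w') as f:
        f.write(sample)
    tables = read_tables(path)

print(tables[0])
```

This reopens the file once per table, which is fine for small files but wasteful for large ones.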

But another way is to read the file yourself and pass the data to pandas:

import pandas as pd

dfs = {}  # maps each person's name to their DataFrame
with open('./sales.csv', 'r') as file:
    while True:
        line = file.readline()
        if not line:  # an empty string means end of file
            break
        name = line.rstrip().replace('Name: ', '')
        for _ in range(2):  # skip the Place and Age lines
            file.readline()
        headers = file.readline().rstrip().split(',')
        data = [file.readline().rstrip().split(',') for _ in range(3)]
        dfs[name] = pd.DataFrame.from_records(data, columns=headers)
        for _ in range(2):  # skip the two separator lines
            file.readline()

I'll concede it's quite brutish and inelegant compared to the other answer... but it works. And it actually gives you the DataFrame by name within a dictionary:

>>> dfs['Jack']

  Month Sales Revenue
0   Jan    51   $1000
1   Feb    20   $1050
2   Mar   100  $10000
>>> dfs['Jill']

  Month Sales Revenue
0   Apr    11   $1000
1   May    55   $3000
2   Jun    23   $4600
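
Since the frames end up in a dict keyed by name, they can also be stacked into a single table if that turns out to be more convenient. A minimal sketch, assuming a dict shaped like the `dfs` above (rebuilt by hand here so the snippet runs on its own); the `combined` name and the numeric conversion at the end are my additions.

```python
import pandas as pd

# Stand-in for the `dfs` dict built by the manual parser; every field is a string.
dfs = {
    'Jack': pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'],
                          'Sales': ['51', '20', '100'],
                          'Revenue': ['$1000', '$1050', '$10000']}),
    'Jill': pd.DataFrame({'Month': ['Apr', 'May', 'Jun'],
                          'Sales': ['11', '55', '23'],
                          'Revenue': ['$1000', '$3000', '$4600']}),
}

# pd.concat accepts a dict; its keys become an outer index level, named 'Name' here.
combined = (
    pd.concat(dfs, names=['Name', 'row'])
    .reset_index(level='Name')   # turn the name into an ordinary column
    .reset_index(drop=True)      # discard the per-table row numbers
)

# Values parsed by hand are strings, so convert them before doing arithmetic.
combined['Sales'] = combined['Sales'].astype(int)
combined['Revenue'] = combined['Revenue'].str.lstrip('$').astype(int)

print(combined)
```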

