
Create 5 new columns from multiple CSV files

I have 4 CSV files that have 2 columns each.

User   Number  |  User1   Number1  |  User2    Number2  |  User3   Number3  
Sam         3  |  Tim           4  |  Mark          11  |  Jane          3
Tim         6  |  Gab           2  |  Jane          12  |  Moll          5
Ale         8  |  Jane          9  |  Moll           3  |  Mary          5
Jane        2  |  Tj            7  |  Gab            8  |  Kim           3

Process

  1. Create 2 new columns holding the User and Number of every user who appears only once across all the CSVs.
  2. Create 2 other columns for the names that exist in more than one CSV.
  3. For those who appear more than once, the new number is the sum of their numbers across the different CSVs.
  4. Add a column that says which CSVs each duplicate name came from.

Desired Output

User   Number  |  User1    Number1  |  Which CSV
Sam         3  |  Tim           10  |  User, User1
Ale         8  |  Jane          26  |  User, User1, User2, User3
Tj          7  |  Gab           10  |  User1, User2
Mark       11  |  Moll           8  |  User2, User3
Mary        5  |         
Kim         3  |

Attempt

usernameandlikes = pd.Series(dict(functools.reduce(operator.add, map(collections.Counter, [dict(zip(df["username"], df["likes"])), dict(zip(df["username2"], df["likes2"]))])))).reset_index()
usernameandlikes.columns = ["lcnames", "lcagg"]
username3_likes3 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).intersection(set(df["username2"]))))].reset_index(drop=True) 
username3_likes4 = usernameandlikes.loc[usernameandlikes['lcnames'].isin(list(set(df["username"]).symmetric_difference(set(df["username2"]))))].reset_index(drop=True) 

First I would combine all the data into one DataFrame with three columns: User, Number, and File.

In the code, I use the io module only to simulate files.

csv0 = '''User   Number
Sam         3
Tim         6
Ale         8
Jane        2'''

csv1 = '''User1   Number1
Tim           4
Gab           2
Jane          9
Tj            7'''

csv2 = '''User2    Number2
Mark          11
Jane          12
Moll           3
Gab            8'''

csv3 = '''User3   Number3
Jane          3
Moll          5
Mary          5
Kim           3'''

import pandas as pd
import io

# raw strings (r'\s+') avoid the invalid-escape warning in newer Python versions
df0 = pd.read_csv(io.StringIO(csv0), sep=r'\s+')
df0['File'] = 'User'
#print(df0)

df1 = pd.read_csv(io.StringIO(csv1), sep=r'\s+')
df1.columns = ['User', 'Number']
df1['File'] = 'User1'
#print(df1)

df2 = pd.read_csv(io.StringIO(csv2), sep=r'\s+')
df2.columns = ['User', 'Number']
df2['File'] = 'User2'
#print(df2)

df3 = pd.read_csv(io.StringIO(csv3), sep=r'\s+')
df3.columns = ['User', 'Number']
df3['File'] = 'User3'
#print(df3)

# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
df = pd.concat([df0, df1, df2, df3]).reset_index(drop=True)
print(df)

Results:

    User  Number   File
0    Sam       3   User
1    Tim       6   User
2    Ale       8   User
3   Jane       2   User
4    Tim       4  User1
5    Gab       2  User1
6   Jane       9  User1
7     Tj       7  User1
8   Mark      11  User2
9   Jane      12  User2
10  Moll       3  User2
11   Gab       8  User2
12  Jane       3  User3
13  Moll       5  User3
14  Mary       5  User3
15   Kim       3  User3
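
The four nearly identical loading blocks can also be written as a loop. This is only a sketch: the `sources` dictionary and the names `frames` and `combined` are made up for illustration, and the strings stand in for real files (with real files you would pass each path to `read_csv` instead of `io.StringIO(...)`).

```python
import io
import pandas as pd

# Hypothetical stand-ins for the four CSV files, keyed by the label
# that should end up in the File column.
sources = {
    'User':  'User Number\nSam 3\nTim 6\nAle 8\nJane 2',
    'User1': 'User1 Number1\nTim 4\nGab 2\nJane 9\nTj 7',
    'User2': 'User2 Number2\nMark 11\nJane 12\nMoll 3\nGab 8',
    'User3': 'User3 Number3\nJane 3\nMoll 5\nMary 5\nKim 3',
}

frames = []
for label, text in sources.items():
    # header=0 together with names= replaces each file's own header
    # with the common column names User and Number.
    frame = pd.read_csv(io.StringIO(text), sep=r'\s+', header=0,
                        names=['User', 'Number'])
    frame['File'] = label
    frames.append(frame)

# ignore_index=True renumbers the rows 0..15, like reset_index(drop=True).
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Passing `names=` at read time removes the need to reassign `columns` for each frame separately.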

And now I can use groupby('User') to select the users that appear only once in all the data:

print('--- single ---')
df_single = df.groupby('User').filter(lambda x: len(x) == 1)
print(df_single)

Result:

--- single ---
    User  Number   File
0    Sam       3   User
2    Ale       8   User
7     Tj       7  User1
8   Mark      11  User2
14  Mary       5  User3
15   Kim       3  User3

The same works for users that appear more than once in the data:

print('--- multi ---')
df_multi = df.groupby('User').filter(lambda x: len(x) > 1)
print(df_multi)

Result:

--- multi ---
    User  Number   File
1    Tim       6   User
3   Jane       2   User
4    Tim       4  User1
5    Gab       2  User1
6   Jane       9  User1
9   Jane      12  User2
10  Moll       3  User2
11   Gab       8  User2
12  Jane       3  User3
13  Moll       5  User3
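
As an alternative to the two `filter()` calls, the same split can be sketched with a single boolean mask from `Series.duplicated(keep=False)`, which marks every row whose User value occurs more than once. The DataFrame below just rebuilds the combined data so the snippet runs on its own, and `is_repeated` is a name chosen here for illustration.

```python
import pandas as pd

# The combined data from above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    'User': ['Sam', 'Tim', 'Ale', 'Jane', 'Tim', 'Gab', 'Jane', 'Tj',
             'Mark', 'Jane', 'Moll', 'Gab', 'Jane', 'Moll', 'Mary', 'Kim'],
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

# keep=False flags ALL occurrences of a repeated name, not just the later ones.
is_repeated = df['User'].duplicated(keep=False)

df_single = df[~is_repeated]   # names that occur exactly once
df_multi = df[is_repeated]     # names that occur two or more times

print(df_single)
print(df_multi)
```

Like the `filter()` version, this counts rows, so a name listed twice inside the same file would also be treated as repeated.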

And I can use groupby().sum() to sum the numbers:

print('--- multi sum ---')
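# select only Number; in pandas 2.x a plain .sum() would also concatenate the File strings
df_multi_sum = df_multi.groupby('User')['Number'].sum().reset_index()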
print(df_multi_sum)

Result:

--- multi sum ---
   User  Number
0   Gab      10
1  Jane      26
2  Moll       8
3   Tim      10

And groupby().apply() to create the Which CSV column:

print('--- multi sum file ---')
df_multi_sum['Which CSV'] = df_multi.groupby('User')['File'].apply(','.join).reset_index(drop=True)
print(df_multi_sum)

Result:

--- multi sum file ---
   User  Number               Which CSV
0   Gab      10             User1,User2
1  Jane      26  User,User1,User2,User3
2  Moll       8             User2,User3
3   Tim      10              User,User1
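
To get closer to the side-by-side layout in the desired output, the single users and the summed duplicates can be placed next to each other with `pd.concat(..., axis=1)`. This is a sketch rather than the only way to do it: the names `singles`, `multi`, and `result` are made up here, and `sort=False` is used on the assumption that the duplicates should appear in order of first occurrence, as in the question.

```python
import pandas as pd

# The combined data from above, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    'User': ['Sam', 'Tim', 'Ale', 'Jane', 'Tim', 'Gab', 'Jane', 'Tj',
             'Mark', 'Jane', 'Moll', 'Gab', 'Jane', 'Moll', 'Mary', 'Kim'],
    'Number': [3, 6, 8, 2, 4, 2, 9, 7, 11, 12, 3, 8, 3, 5, 5, 3],
    'File': ['User'] * 4 + ['User1'] * 4 + ['User2'] * 4 + ['User3'] * 4,
})

is_repeated = df['User'].duplicated(keep=False)

# Left half: users that occur once, renumbered from 0.
singles = df.loc[~is_repeated, ['User', 'Number']].reset_index(drop=True)

# Right half: one row per repeated user with the summed number and the
# list of source files. The dict unpacking is needed only because the
# column name 'Which CSV' contains a space.
multi = (
    df[is_repeated]
    .groupby('User', sort=False)
    .agg(Number1=('Number', 'sum'), **{'Which CSV': ('File', ', '.join)})
    .reset_index()
    .rename(columns={'User': 'User1'})
)

# axis=1 aligns on the row index; the shorter half is padded with NaN.
result = pd.concat([singles, multi], axis=1)
print(result)
```

Because the right half has fewer rows, its last rows are filled with NaN and the summed numbers are shown as floats. The two halves are unrelated row by row, so this layout is mainly useful for display or export.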
