
Combine three columns into one in CSV file with python and pandas

Hi, I'm trying to combine three existing columns into one new column and then delete the three original ones in a CSV file. I have been trying to do this with pandas but am not having much luck. I'm fairly new to Python.

My code first combines several CSV files in the same directory and then attempts to manipulate the columns. The first step works and I get an output.csv with the combined data, but combining the columns does not.

import glob
import pandas as pd

interesting_files = glob.glob("*.csv")

header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

df = pd.read_csv("output.csv")
df['HostAffected']=df['Host'] + "/" + df['Protocol'] + "/" + df['Port']
df.to_csv("newoutput.csv")

Effectively turning this:

Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670

into something like this:

HostsAffected
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.11/tcp/445
10.0.0.11/tcp/49707
10.0.0.11/tcp/49672
10.0.0.11/tcp/49670
10.0.0.11/tcp/49668
10.0.0.11/tcp/49667

There are other columns in the CSV, however.

I'm not a coder, I'm just trying to solve a problem. Any help is much appreciated.

The way I see it we have three alternatives:

%timeit df['Host'] + "/" + df['Protocol'] + "/" + df['Port'].map(str)
%timeit ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
%timeit ['/'.join(i) for i in df[['Host','Protocol','Port']].astype(str).values]

Timings :

10 loops, best of 3: 39.7 ms per loop  
10 loops, best of 3: 35.9 ms per loop  
10 loops, best of 3: 162 ms per loop

Although it is the slowest, I think the third would be your most readable approach:

from io import StringIO

import pandas as pd

data = '''\
ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,49707
1,10.0.0.10,tcp,49672
1,10.0.0.10,tcp,49670'''

# Recreate a sample dataframe
# (io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide)
df = pd.read_csv(StringIO(data))

cols = ['Host','Protocol','Port']
# Cast every value to str, then join each row with "/"
newcol = ['/'.join(i) for i in df[cols].astype(str).values]
# Add the combined column and drop the three originals
df = df.assign(HostAffected=newcol).drop(columns=cols)
print(df)

Returns:

   ID         HostAffected
0   1    10.0.0.10/tcp/445
1   1    10.0.0.10/tcp/445
2   1    10.0.0.10/tcp/445
3   1    10.0.0.10/tcp/445
4   1    10.0.0.10/tcp/445
5   1    10.0.0.10/tcp/445
6   1    10.0.0.10/tcp/445
7   1  10.0.0.10/tcp/49707
8   1  10.0.0.10/tcp/49672
9   1  10.0.0.10/tcp/49670
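
The %timeit lines above are IPython magics, so they will not run in a plain Python script. A rough way to reproduce the comparison with the standard timeit module is sketched below. The function names and the 1000-row sample are assumptions made for illustration (the size of the dataframe behind the original timings is not stated), so the absolute numbers will differ.

```python
import timeit
from io import StringIO

import pandas as pd

# Assumed sample: 1000 identical rows (the original dataframe size is unknown)
data = "Host,Protocol,Port\n" + "10.0.0.10,tcp,445\n" * 1000
df = pd.read_csv(StringIO(data))

def concat_series():
    # Vectorised string concatenation; Port is numeric, so cast it first
    return df['Host'] + "/" + df['Protocol'] + "/" + df['Port'].map(str)

def join_zip():
    # Join the three columns row by row via zip
    return ['/'.join(i) for i in zip(df['Host'], df['Protocol'], df['Port'].map(str))]

def join_values():
    # Cast the whole sub-frame to str, then join each row
    return ['/'.join(i) for i in df[['Host', 'Protocol', 'Port']].astype(str).values]

for func in (concat_series, join_zip, join_values):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds / 10 * 1000:.2f} ms per call")
```

All three functions build the same strings, so the choice between them is mostly about speed and readability.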

There are a couple of ways you can do this: either use vectorised operations to combine the series, or use a lambda function with pd.DataFrame.apply.

Vectorised solution

Don't forget to cast non-string columns (here the numeric Port column) to str.

df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)

Performance note: see the related question "Converting a series of ints to strings - Why is apply much faster than astype?"

Apply lambda function

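# Select the three columns first so that any other columns are not joined in
df['HostsAffected'] = df[['Host', 'Protocol', 'Port']].apply(lambda x: '/'.join(map(str, x)), axis=1)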

With both solutions, you can simply filter by this column to remove all others:

df = df[['HostsAffected']]
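
If the CSV has other columns that should be kept, as the question mentions, an alternative is to drop only the three source columns instead of filtering down to the new one. A minimal sketch, assuming a hypothetical extra column named ID stands in for the other columns in the real file:

```python
from io import StringIO

import pandas as pd

# "ID" is a hypothetical extra column, used only for illustration
data = StringIO("""ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
2,10.0.0.11,tcp,49707""")
df = pd.read_csv(data)

cols = ['Host', 'Protocol', 'Port']
df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)

# Drop only the three source columns; every other column is kept
df = df.drop(columns=cols)
print(df)
```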

Complete example

from io import StringIO
import pandas as pd

mystr = StringIO("""Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr)

# combine columns
df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)

# keep only the new column
df = df[['HostsAffected']]

Result:

print(df)

         HostsAffected
0    10.0.0.10/tcp/445
1    10.0.0.10/tcp/445
2    10.0.0.10/tcp/445
3    10.0.0.10/tcp/445
4    10.0.0.10/tcp/445
5    10.0.0.10/tcp/445
6    10.0.0.10/tcp/445
7  10.0.0.10/tcp/49707
8  10.0.0.10/tcp/49672
9  10.0.0.10/tcp/49670
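
The question also merges several CSV files by hand before combining the columns. That step could probably be done in pandas as well. The sketch below is one possible way, not the asker's method: it assumes every file has the same Host, Protocol and Port header, and it creates two sample files with hypothetical names (scan1.csv, scan2.csv) in a temporary directory so that it runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small sample files (hypothetical names) in a temporary directory
workdir = tempfile.mkdtemp()
samples = {
    'scan1.csv': "Host,Protocol,Port\n10.0.0.10,tcp,445\n",
    'scan2.csv': "Host,Protocol,Port\n10.0.0.11,tcp,49707\n",
}
for name, text in samples.items():
    with open(os.path.join(workdir, name), 'w') as fout:
        fout.write(text)

# Read every CSV and stack the rows; sorted() keeps the file order predictable
files = sorted(glob.glob(os.path.join(workdir, '*.csv')))
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Combine the three columns, then drop the originals
df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)
df = df.drop(columns=['Host', 'Protocol', 'Port'])

# Write the result outside the input directory so a rerun does not read it back in
out_path = os.path.join(tempfile.mkdtemp(), 'newoutput.csv')
df.to_csv(out_path, index=False)
print(df)
```

Writing the output to a different directory matters here: in the question, output.csv lands in the same folder that glob("*.csv") scans, so a second run would merge the previous output as well.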

This is how you can do it:

from io import StringIO

import pandas as pd

dt = """Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670"""

# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
tdf = pd.read_csv(StringIO(dt))
# Build the combined string row by row
tdf['HostsAffected'] = tdf.apply(lambda x: '{}/{}/{}'.format(x['Host'], x['Protocol'], x['Port']), axis=1)
# Keep only the new column
tdf = tdf[['HostsAffected']]
tdf.to_csv(<path-to-save-csv-file>)

This will be the output:

    HostsAffected
0   10.0.0.10/tcp/445
1   10.0.0.10/tcp/445
2   10.0.0.10/tcp/445
3   10.0.0.10/tcp/445
4   10.0.0.10/tcp/445
5   10.0.0.10/tcp/445
6   10.0.0.10/tcp/445
7   10.0.0.10/tcp/49707
8   10.0.0.10/tcp/49672
9   10.0.0.10/tcp/49670

If you are reading the CSV from a file, edit the read_csv line as follows:

tdf = pd.read_csv(<path-to-the-file>)
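
One detail when saving: by default to_csv also writes the dataframe index (the 0 to 9 shown in the output above) as an unnamed first column. If only HostsAffected is wanted in the file, passing index=False should avoid that. A small sketch, using an in-memory buffer in place of a real file path:

```python
from io import StringIO

import pandas as pd

tdf = pd.DataFrame({'HostsAffected': ['10.0.0.10/tcp/445', '10.0.0.10/tcp/49707']})

# The buffer stands in for a file path, purely for illustration
buffer = StringIO()
tdf.to_csv(buffer, index=False)  # index=False leaves out the row numbers
print(buffer.getvalue())
```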
