Split one column into multiple columns in Python

I have a pandas DataFrame with a single column:

index  Train_station

0      Adenauerplatz 52° 29′ 59″ N, 13° 18′ 26″ O
1      Afrikanische Straße 52° 33′ 38″ N, 13° 20′ 3″ O
2      Alexanderplatz 52° 31′ 17″ N, 13° 24′ 48″ O

I want to split it into three columns: Train station, Latitude, and Longitude. The DataFrame should look like this:

index  Train_station         Latitude       Longitude

0      Adenauerplatz         52° 29′ 59″ N  13° 18′ 26″ O
1      Afrikanische Straße   52° 33′ 38″ N  13° 20′ 3″ O
2      Alexanderplatz        52° 31′ 17″ N  13° 24′ 48″ O

I tried df[['Latitude', 'Longitude']] = df.Train_station.str.split(',', expand=True), but that only splits at the comma between the latitude and the longitude. How can I split a column on more than one condition that I define?

I thought about scanning the string from the left and splitting where it first meets a digit or a defined substring, but so far I have not found a way to do this.

# A regex split keeps capture groups in its output, so groups 1-3 land in columns 1-3
df = df.Train_station.str.split(r'(.*?)(\d+°[^,]+),(.*)', expand=True)
print(df.loc[:, 1:3].rename(columns={1:'Train_station', 2:'Latitude', 3:'Longitude'}))

Prints:

          Train_station       Latitude       Longitude
0        Adenauerplatz   52° 29′ 59″ N   13° 18′ 26″ O
1  Afrikanische Straße   52° 33′ 38″ N    13° 20′ 3″ O
2       Alexanderplatz   52° 31′ 17″ N   13° 24′ 48″ O

EDIT: Thanks @ALollz, you can use str.extract():

df = df.Train_station.str.extract(r'(?P<Train_station>.*?)(?P<Latitude>\d+°[^,]+),(?P<Longitude>.*)', expand=True)
print(df)
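
For reference, here is a self-contained sketch of the same str.extract() idea that can be run as-is. It is a variation, not the exact pattern above: the \s+ and \s* pieces are my addition and sit outside the named groups, on the assumption that you do not want a trailing space in the station name or a leading space in the longitude.

```python
import pandas as pd

df = pd.DataFrame({
    "Train_station": [
        "Adenauerplatz 52° 29′ 59″ N, 13° 18′ 26″ O",
        "Afrikanische Straße 52° 33′ 38″ N, 13° 20′ 3″ O",
        "Alexanderplatz 52° 31′ 17″ N, 13° 24′ 48″ O",
    ]
})

# Named groups become the column names of the result.
# Whitespace is matched outside the groups, so it is not captured.
pattern = (
    r"(?P<Train_station>.*?)\s+"   # station name, as short as possible
    r"(?P<Latitude>\d+°[^,]+),"    # digits followed by °, up to the comma
    r"\s*(?P<Longitude>.*)"        # everything after the comma
)

result = df["Train_station"].str.extract(pattern)
print(result)
```

Rows that do not match the pattern come back as NaN in all three columns, so it may be worth checking result.isna().any() on real data.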

You can use the .split() method to separate the values in the strings.

Use .apply() to create a new DataFrame column for each desired column name.

import pandas as pd

data = ["Adenauerplatz 52° 29′ 59″ N, 13° 18′ 26″ O",
        "Afrikanische Straße 52° 33′ 38″ N, 13° 20′ 3″ O",
        "Alexanderplatz 52° 31′ 17″ N, 13° 24′ 48″ O"]

df = pd.DataFrame(data, columns=['Train_station'])


def train_station(x):
    # Drop the longitude, then peel the four latitude tokens off the right,
    # so station names that contain spaces stay in one piece
    left = x.split(', ', 1)[0]
    return left.rsplit(' ', 4)[0]


def latitude(x):
    # The last four space-separated tokens before the comma are the latitude
    left = x.split(', ', 1)[0]
    return ' '.join(left.rsplit(' ', 4)[1:])


def longitude(x):
    # Everything after the first ', ' is the longitude
    return x.split(', ', 1)[1]


df['Latitude'] = df['Train_station'].apply(latitude)
df['Longitude'] = df['Train_station'].apply(longitude)
df['Train_station'] = df['Train_station'].apply(train_station)

print(df)

The code above recreates your original DataFrame and then modifies it with .split() and .apply(). Splitting on the first space alone would cut "Afrikanische Straße" in two, so the latitude tokens are peeled off from the right with .rsplit() instead.

Output:

         Train_station       Latitude      Longitude
0        Adenauerplatz  52° 29′ 59″ N  13° 18′ 26″ O
1  Afrikanische Straße  52° 33′ 38″ N   13° 20′ 3″ O
2       Alexanderplatz  52° 31′ 17″ N  13° 24′ 48″ O
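
The idea from the question (scan from the left and split at the first digit) can also be sketched with plain string handling. This is only an illustration: split_station is a hypothetical helper name, and it assumes that station names never contain digits and that latitude and longitude are separated by a single comma.

```python
import re

import pandas as pd


def split_station(text):
    """Split 'Name 52° ... N, 13° ... O' into (name, latitude, longitude)."""
    first_digit = re.search(r"\d", text)
    if first_digit is None:
        # No coordinates found: keep the text as the name only
        return text.strip(), None, None
    name = text[:first_digit.start()].strip()
    coords = text[first_digit.start():]
    # partition splits at the first comma only
    latitude, _, longitude = coords.partition(",")
    return name, latitude.strip(), longitude.strip()


df = pd.DataFrame({
    "Train_station": [
        "Adenauerplatz 52° 29′ 59″ N, 13° 18′ 26″ O",
        "Afrikanische Straße 52° 33′ 38″ N, 13° 20′ 3″ O",
        "Alexanderplatz 52° 31′ 17″ N, 13° 24′ 48″ O",
    ]
})

parts = df["Train_station"].apply(split_station).tolist()
result = pd.DataFrame(
    parts,
    columns=["Train_station", "Latitude", "Longitude"],
    index=df.index,
)
print(result)
```

Because the split point is the first digit rather than the first space, multi-word names such as "Afrikanische Straße" stay intact.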

You can try something like this:

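# A token is a coordinate part if it contains °, ′ or ″; "N," also passes because ','.replace(',','') is '' and '' is in every string
df['Latitude']=df['Train_station'].apply(lambda x: ' '.join([i for i in x.split(' ') if any((lett.replace(',','') in '°′″') for lett in i)]).split(',')[0])
df['Longitude']=df['Train_station'].apply(lambda x: ' '.join([i for i in x.split(' ') if any((lett.replace(',','') in '°′″O') for lett in i)]).split(',')[1].strip())
# Filter on '°′″O' so the trailing "O" is not kept in the name, and join with ' ' so multi-word names keep their spaces
# (a station name containing a capital O would be dropped by this filter)
df['Train_station']=df['Train_station'].apply(lambda x: ' '.join([i for i in x.split(' ') if not any((lett.replace(',','') in '°′″O') for lett in i) ]))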

Output:

               Train_station       Latitude       Longitude
0          Adenauerplatz          52° 29′ 59″ N   13° 18′ 26″ O
1    Afrikanische Straße          52° 33′ 38″ N    13° 20′ 3″ O
2         Alexanderplatz          52° 31′ 17″ N   13° 24′ 48″ O

Similar to what @Andrej Kesely does.

import numpy as np
import pandas as pd

df2=df.Train_station.str.split(r'(?<=[a-z])(\s)(?![A-Z])|(?<=[A-Z]\,)(\s)|(?<=[A-Z])(\s)', expand=True).replace(' ', np.nan).dropna(axis='columns')
df2.columns=['Train_station', 'Latitude', 'Longitude']
print(df2)

     Train_station          Latitude      Longitude
0        Adenauerplatz    52° 29′ 59″ N,  13° 18′ 26″ O
1  Afrikanische Straße    52° 33′ 38″ N,   13° 20′ 3″ O
2       Alexanderplatz    52° 31′ 17″ N,  13° 24′ 48″ O

Explanation:

(?<=[a-z])(\s)(?![A-Z]) - Split at a space that follows a lowercase letter and is not followed by an uppercase letter.

OR

(?<=[A-Z]\,)(\s) - Split at a space that follows an uppercase letter and a comma.

OR

(?<=[A-Z])(\s) - Split at a space that follows an uppercase letter.

Because each alternative wraps the space in a capturing group, the split result also contains the matched spaces and empty entries. That is why the code replaces ' ' with NaN and drops those columns.
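
As a rough sketch of the same lookaround idea without capturing groups, the split can be done with re.split directly, so there are no extra columns to clean up. This is a variation on the pattern above, not the same pattern: it keeps only the first two alternatives and also consumes the comma, so the latitude comes out without a trailing ",". The SEPARATOR name is mine, and the rule assumes that every word in a station name starts with an uppercase letter.

```python
import re

import pandas as pd

data = [
    "Adenauerplatz 52° 29′ 59″ N, 13° 18′ 26″ O",
    "Afrikanische Straße 52° 33′ 38″ N, 13° 20′ 3″ O",
    "Alexanderplatz 52° 31′ 17″ N, 13° 24′ 48″ O",
]

# Split at a space after a lowercase letter unless an uppercase letter follows,
# or at the ", " that comes right after an uppercase letter.
# No capturing groups, so re.split returns only the pieces between separators.
SEPARATOR = re.compile(r"(?<=[a-z])\s(?![A-Z])|(?<=[A-Z]),\s")

rows = [SEPARATOR.split(text) for text in data]
df2 = pd.DataFrame(rows, columns=["Train_station", "Latitude", "Longitude"])
print(df2)
```

A name with a lowercase second word, such as "Platz der Luftbrücke", would be split inside the name by this rule, so check the number of pieces per row before relying on it.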
