Pandas: extract the numbers in a column into new columns

Question

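I currently have this df, where the rect column contains only strings. I need to extract x, y, w and h from it into separate columns. The dataset is very large, so I need an efficient approach.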

df['rect'].head()
0    <Rect (120,168),260 by 120>
1    <Rect (120,168),260 by 120>
2    <Rect (120,168),260 by 120>
3    <Rect (120,168),260 by 120>
4    <Rect (120,168),260 by 120>

So far this solution works, but as you can see it is very messy:

df[['x', 'y', 'w', 'h']] = df['rect'].str.replace(r'<Rect \(', '', regex=True).str.replace(r'\),', ',', regex=True).str.replace(' by ', ',').str.replace('>', '').str.split(',', n=3, expand=True)

Is there a better way? Perhaps a regular-expression approach?

Answer 1

Use extractall:

df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall(r'(\d+)').unstack().loc[:, 0]
Out[267]: 
match    0    1    2    3
0      120  168  260  120
1      120  168  260  120
2      120  168  260  120
3      120  168  260  120
4      120  168  260  120
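
Note that the extracted values are strings, not numbers. Here is a minimal sketch of the same extractall idea with an explicit cast to integers. The two-row sample frame and the name nums are made up for illustration, and it assumes every row contains exactly four numbers.

```python
import pandas as pd

# Hypothetical sample data in the same format as the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

# extractall yields one row per match, indexed by (original row, match number).
matches = df["rect"].str.extractall(r"(\d+)")

# Select the single capture group, then pivot the match number into columns.
nums = matches[0].unstack()
nums.columns = ["x", "y", "w", "h"]

# The extracted values are strings, so cast them before doing arithmetic.
df = df.join(nums.astype(int))
print(df)
```

Joining on the index rather than assigning positionally keeps the rows aligned even if the index of df is not the default 0..n-1.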

Answer 2

Inline

Produce a copy

df.assign(**dict(zip('xywh', df.rect.str.findall(r'\d+').str)))

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Or simply reassign to df

df = df.assign(**dict(zip('xywh', df.rect.str.findall(r'\d+').str)))

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

In place

Modify the existing df

df[[*'xywh']] = pd.DataFrame(df.rect.str.findall(r'\d+').tolist(), index=df.index)

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120
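
As far as I am aware, newer pandas releases no longer allow iterating over the .str accessor, which the copy-producing variant relies on. Below is a sketch of a copy-producing alternative that avoids it. The sample frame, its non-default index, and the names parts and result are assumptions for illustration, and it assumes each string contains exactly four numbers.

```python
import pandas as pd

# Hypothetical sample data with a non-default index.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]},
                  index=[10, 20])

# findall returns one list of digit strings per row.
found = df["rect"].str.findall(r"\d+")

# Build a frame from those lists, reusing the original index so rows line up.
parts = pd.DataFrame(found.tolist(), columns=list("xywh"), index=df.index)

# assign returns a new frame and leaves df itself untouched.
result = df.assign(**{col: parts[col].astype(int) for col in parts.columns})
print(result)
```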

Answer 3

If the strings follow the specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with the str.extract method:

df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')

df
#                          rect    x    y    w    h
#0  <Rect (120,168),260 by 120>  120  168  260  120
#1  <Rect (120,168),260 by 120>  120  168  260  120
#2  <Rect (120,168),260 by 120>  120  168  260  120
#3  <Rect (120,168),260 by 120>  120  168  260  120
#4  <Rect (120,168),260 by 120>  120  168  260  120
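
One thing to watch with a strict pattern like this: rows that do not match produce NaN in every extracted column. The sketch below shows one way to spot such rows before casting to integers. The malformed second row and the names parts, bad_rows and clean are invented for the example.

```python
import pandas as pd

# Hypothetical data: the second row deliberately does not match the format.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>", "not a rect"]})

pattern = r"<Rect \((\d+),(\d+)\),(\d+) by (\d+)>"
parts = df["rect"].str.extract(pattern)
parts.columns = ["x", "y", "w", "h"]

# Non-matching rows come back as NaN in every column.
bad_rows = parts["x"].isna()
print(df[bad_rows])

# Only the matching rows can be cast to int safely.
clean = parts[~bad_rows].astype(int)
print(clean)
```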

Answer 4

Use str.extract, which extracts the groups in the regular expression into columns:

df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)

Result:

     x    y    w    h
0  120  168  260  120
1  120  168  260  120
2  120  168  260  120
3  120  168  260  120
4  120  168  260  120
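
Because the named groups already become the column names, the result can be joined straight back onto the original frame. A small sketch, assuming every row matches the pattern. The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data in the same format as the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

pattern = r"\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)"

# The named groups x, y, w, h become the column names of the extracted frame.
extracted = df["rect"].str.extract(pattern, expand=True)

# Cast to int (this would fail on non-matching rows) and attach to df by index.
df = df.join(extracted.astype(int))
print(df)
```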

Answer 5

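This is one of those cases where it makes sense to "optimize" the data itself rather than trying to morph it into what a consumer wants. It is much easier to turn clean data into a specialized format than to turn a specialized format into portable data.

That said, if you really need to parse it, you can do something like this: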

>>> import re
>>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
['120', '168', '260', '120']
>>>
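
To apply the same re.findall idea to a whole column, one option is a precompiled pattern in a list comprehension. This is only a sketch: the sample frame and the names number, rows and parts are assumptions, it assumes four numbers per string, and whether it beats the .str methods on a large dataset is something to time on your own data.

```python
import re

import pandas as pd

# Hypothetical sample data in the same format as the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

# Compile once and reuse the pattern for every row.
number = re.compile(r"\d+")

# One list of ints per input string.
rows = [[int(n) for n in number.findall(s)] for s in df["rect"]]

# Reuse the original index so the join lines up row by row.
parts = pd.DataFrame(rows, columns=["x", "y", "w", "h"], index=df.index)
df = df.join(parts)
print(df)
```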
