Pandas: make a new column from a substring slice based on the number in a substring of another column
I have a DataFrame named 'table', loaded like this:
import pandas as pd
import numpy as np
table = pd.read_csv(main_data, sep='\t')
It produces this:
NAME SYMBOL STRING
A blah A34SA
B foo BS2812D
...
How can I create a new column in pandas so that I end up with the following:
NAME SYMBOL STRING NUMBER
A blah A34SA 34
B foo BS2812D 2812
So far I have this: table['NUMBER'] = table.STRING.str[int(filter(str.isdigit, table.STRING))]
But this does not work in this context.
Thanks!
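You can try using a regular expression to extract the number from the string:
import re
def extNumber(row):
    # Store the first run of digits found in STRING as the row's NUMBER.
    row['NUMBER'] = re.search(r"(\d+)", row.STRING).group(1)
    return row
# apply returns a new DataFrame, so assign the result back.
table = table.apply(extNumber, axis=1)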
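Since only one column is read, the regex can also be applied to that column alone instead of to every row. A minimal sketch, assuming the two sample rows from the question; the helper name `first_number` is hypothetical, and returning `None` for strings without digits is an added assumption, not part of the answer above:

```python
import re

import pandas as pd

# Sample data copied from the question's table.
table = pd.DataFrame({
    "NAME": ["A", "B"],
    "SYMBOL": ["blah", "foo"],
    "STRING": ["A34SA", "BS2812D"],
})


def first_number(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Apply the helper to the STRING column only and store the result.
table["NUMBER"] = table["STRING"].apply(first_number)
print(table)
```

Guarding against a missing match avoids the `AttributeError` that `re.search(...).group(1)` raises when a string contains no digits.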
The following should work:
table['NUMBER'] = table.STRING.apply(lambda x: int(''.join(filter(str.isdigit, x))))
You can use a regular expression.
import re
table['NUMBER'] = table['STRING'].apply(lambda x: re.sub(r'[^0-9]','',x))
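Note that removing every non-digit character is not the same as taking the first run of digits; the two only agree when each string holds a single number, as in the question's sample. A small sketch of the difference, where the third value `X1Y22` is a hypothetical extra case not present in the question:

```python
import re

import pandas as pd

# First two values come from the question; "X1Y22" is a made-up edge case.
s = pd.Series(["A34SA", "BS2812D", "X1Y22"])

# Strip all non-digits: digits from separate runs are glued together.
all_digits = s.apply(lambda x: re.sub(r"[^0-9]", "", x)).astype(int)

# Keep only the first contiguous run of digits.
first_run = s.str.extract(r"(\d+)", expand=False).astype(int)

print(all_digits.tolist())  # [34, 2812, 122]
print(first_run.tolist())   # [34, 2812, 1]
```

The `re.sub` approach also returns strings, so a cast such as `.astype(int)` is needed if a numeric column is wanted.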
I would do it this way:
In [22]: df['NUMBER'] = df.STRING.str.extract(r'(?P<NUMBER>\d+)', expand=True).astype(int)
In [23]: df
Out[23]:
NAME SYMBOL STRING NUMBER
0 A blah A34SA 34
1 B foo BS2812D 2812
In [24]: df.dtypes
Out[24]:
NAME object
SYMBOL object
STRING object
NUMBER int32
dtype: object
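`str.extract` yields a missing value for rows where the pattern does not match, and `.astype(int)` then fails. A hedged sketch of one way to handle that, assuming a hypothetical third row `NODIGITS` and using the nullable `Int64` dtype (an added choice, not something the answer above uses):

```python
import pandas as pd

# First two values come from the question; "NODIGITS" is a made-up row.
df = pd.DataFrame({"STRING": ["A34SA", "BS2812D", "NODIGITS"]})

# expand=False returns a Series for a single capture group.
digits = df["STRING"].str.extract(r"(\d+)", expand=False)

# to_numeric keeps the missing entry; Int64 stores integers alongside <NA>.
df["NUMBER"] = pd.to_numeric(digits).astype("Int64")
print(df)
```

With this dtype the matched rows stay integers instead of being upcast to floats by the missing entry.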
Timing against a 20M-row DataFrame:
In [71]: df = pd.concat([df] * 10**7, ignore_index=True)
In [72]: df.shape
Out[72]: (20000000, 3)
In [73]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 3 columns):
NAME object
SYMBOL object
STRING object
dtypes: object(3)
memory usage: 457.8+ MB
In [74]: %timeit df.STRING.str.replace(r'\D+', '', regex=True).astype(int)
1 loop, best of 3: 507 ms per loop
In [75]: %timeit df.STRING.str.extract(r'(?P<NUMBER>\d+)', expand=True).astype(int)
1 loop, best of 3: 434 ms per loop
In [76]: %timeit df.STRING.apply(lambda x: int(''.join(filter(str.isdigit, x))))
1 loop, best of 3: 562 ms per loop
In [77]: %timeit df['STRING'].apply(lambda x: re.sub(r'[^0-9]','',x))
1 loop, best of 3: 552 ms per loop
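To repeat the comparison outside IPython, something like the following could be used. It is only a sketch: the frame size (20,000 rows) and the `candidates` dictionary are arbitrary choices for illustration, and the numbers it prints will depend on the machine and the pandas version. `regex=True` is passed explicitly because `Series.str.replace` treats the pattern as a literal string by default since pandas 2.0.

```python
import re
import timeit

import pandas as pd

# Two sample values from the question, repeated to get a larger frame.
base = pd.DataFrame({"STRING": ["A34SA", "BS2812D"]})
df = pd.concat([base] * 10_000, ignore_index=True)

# The four approaches from the answers, each producing an integer Series.
candidates = {
    "str.replace": lambda: df["STRING"].str.replace(r"\D+", "", regex=True).astype(int),
    "str.extract": lambda: df["STRING"].str.extract(r"(\d+)", expand=False).astype(int),
    "filter/isdigit": lambda: df["STRING"].apply(lambda x: int("".join(filter(str.isdigit, x)))),
    "re.sub": lambda: df["STRING"].apply(lambda x: re.sub(r"[^0-9]", "", x)).astype(int),
}

# Run each approach once to check that they agree on this data.
results = {name: fn() for name, fn in candidates.items()}

# Report the best of three single runs per approach.
for name, fn in candidates.items():
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name:>15}: {seconds * 1000:.1f} ms")
```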