Python pandas: how to scan rows for a substring?
How do you check whether a pandas DataFrame row contains a certain substring?
For example, I have a DataFrame with 11 columns, and all of the columns contain names:
ID name1 name2 name3 ... name10
-------------------------------------------------------
AA AA_balls AA_cakee1 AA_lavender ... AA_purple
AD AD_cakee AD_cats AD_webss ... AD_ballss
CS CS_cakee CS_cats CS_webss ... CS_purble
.
.
.
I would like to get the rows that contain, say, "ball" anywhere in the DataFrame, and get their IDs. So the result would be ID 'AA' and ID 'AD', since AA_balls and AD_ballss appear in those rows.
I have searched on Google, but there seems to be no specific answer for this. People usually ask about searching for a substring in a specific column, not across all columns of a single row:
df[df["col_name"].str.contains("ball")]
The methods I have thought of are as follows (you can skip this if you have little time):
(1) Loop through the columns:
result = pd.DataFrame()
for col_name in col_names:
    result = result.append(df[df[col_name].str.contains('ball')])
and then drop the duplicate rows that share the same ID value, but this method would be very slow.
(2) Turn the DataFrame into a two-column frame by stacking the name2-name10 columns into one column, then use df[df["concat_col"].str.contains("ball")]["ID"] to get the IDs and drop duplicates:
ID concat_col
AA AA_balls
AA AA_cakeee
AA AA_lavender
AA AA_purple
.
.
.
CS CS_purble
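Idea (2) can be sketched with pandas' own reshaping instead of manual appends. A minimal, hedged sketch (the small frame and column names name1/name2/name10 below are stand-ins for the real 11-column data), using DataFrame.stack to build the two-column frame:

```python
import pandas as pd

# toy stand-in for the 11-column frame in the question
df = pd.DataFrame({
    'ID': ['AA', 'AD', 'CS'],
    'name1': ['AA_balls', 'AD_cakee', 'CS_cakee'],
    'name2': ['AA_cakee1', 'AD_cats', 'CS_cats'],
    'name10': ['AA_purple', 'AD_ballss', 'CS_purble'],
})

# stack all name columns into one long 'concat_col' Series keyed by ID
long = df.set_index('ID').stack().rename('concat_col').reset_index()
ids = long.loc[long['concat_col'].str.contains('ball'), 'ID'].drop_duplicates()
print(ids.tolist())  # ['AA', 'AD']
```

This avoids the per-column Python loop; the duplicate-dropping step from the question maps directly to drop_duplicates().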
(3) Use the two-column DataFrame from (2) to build a dictionary where
d[df["concat_col"].value] = df["ID"]
then get the IDs with
[value for key, value in programs.items() if 'ball' in key]
but with this method I need to loop through the dictionary, which is also slow.
If there is a faster method that avoids these steps, I would prefer it. If anyone knows one, I would appreciate it a lot if you kindly let me know :) Thanks!
One idea is to use melt:
df = df.melt('ID')
a = df.loc[df['value'].str.contains('ball'), 'ID']
print (a)
0 AA
10 AD
Name: ID, dtype: object
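For reference, a self-contained version of the melt approach above (the small sample frame is an assumption mirroring the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['AA', 'AD', 'CS'],
    'name1': ['AA_balls', 'AD_cakee', 'CS_cakee'],
    'name10': ['AA_purple', 'AD_ballss', 'CS_purble'],
})

# melt keeps 'ID' and stacks every other column into (variable, value) pairs
long = df.melt('ID')
a = long.loc[long['value'].str.contains('ball'), 'ID']
print(a.tolist())  # ['AA', 'AD']
```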
Another option:
df = df.set_index('ID')
a = df.index[df.applymap(lambda x: 'ball' in x).any(axis=1)]
Or:
mask = np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns])
a = df.loc[mask, 'ID']
Timings:
np.random.seed(145)
L = list('abcdefgh')
df = pd.DataFrame(np.random.choice(L, size=(4000, 10)))
df.insert(0, 'ID', np.arange(4000).astype(str))
a = np.random.randint(4000, size=15)
b = np.random.randint(1, 10, size=15)
for i, j in zip(a,b):
df.iloc[i, j] = 'AB_ball_DE'
#print (df)
In [85]: %%timeit
...: df1 = df.melt('ID')
...: a = df1.loc[df1['value'].str.contains('ball'), 'ID']
...:
10 loops, best of 3: 24.3 ms per loop
In [86]: %%timeit
...: df.loc[np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns]), 'ID']
...:
100 loops, best of 3: 12.8 ms per loop
In [87]: %%timeit
...: df1 = df.set_index('ID')
...: df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
...:
100 loops, best of 3: 11.1 ms per loop
Maybe this might work?
mask = df.apply(lambda row: row.map(str).str.contains('word').any(), axis=1)
df.loc[mask]
Disclaimer: I haven't tested this. Perhaps the .map(str) isn't necessary.
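A quick self-contained check suggests the mask does work (the sample frame below is an assumption based on the question's data); .map(str) only matters if some cells are not already strings:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['AA', 'AD', 'CS'],
    'name1': ['AA_balls', 'AD_cakee', 'CS_cakee'],
    'name10': ['AA_purple', 'AD_ballss', 'CS_purble'],
})

# True for rows where any cell (coerced to str) contains the word -- here 'ball'
mask = df.apply(lambda row: row.map(str).str.contains('ball').any(), axis=1)
print(df.loc[mask, 'ID'].tolist())  # ['AA', 'AD']
```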