Pandas: check which substring is in a column of strings

I'm trying to write a function that creates a new column in a pandas DataFrame: for each row, it figures out which substring from a list occurs in a column of strings and uses that substring as the value of the new column.

The problem is that the text to find does not appear at the same position in every value of x.

import pandas as pd

df = pd.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                         "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"], 'x1': [4, 5, 6, 8]})

finds = ["m500_0","0_500","m150_0"]

The goal is to determine which element of finds occurs in each row of df["x"].

I've written a function that works, but it is terribly slow on large datasets:

def pd_create_substring_var(df,new_var_name = "new_var",substring_list=["1"],var_ori="x"):
    import re
    df[new_var_name] = "na"
    cols =  list(df.columns)
    for ix in range(len(df)):
        for find in substring_list:
            for m in re.finditer(find, df.iloc[ix][var_ori]):
                df.iat[ix, cols.index(new_var_name)] = df.iloc[ix][var_ori][m.start():m.end()]
    return df


df = pd_create_substring_var(df,"t",finds,var_ori="x")

df 
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Does this accomplish what you need?

finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
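One caveat with joining the substrings directly: str.extract treats the pattern as a regular expression, so a substring containing characters such as "." or "+" would not be matched literally. The following is a minimal sketch, assuming the substrings are meant as literal text; it escapes each one with re.escape and passes expand=False so the result is a Series rather than a one-column DataFrame. The pattern variable name is only illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# Escape each substring so regex metacharacters are matched literally,
# then join them into one alternation inside a single capture group.
pattern = "(" + "|".join(re.escape(f) for f in finds) + ")"

# expand=False returns a Series instead of a one-column DataFrame.
df["t"] = df["x"].str.extract(pattern, expand=False)
print(df)
```

Rows in which none of the substrings occur end up with a missing value in the new column. If one substring is a prefix of another, the order of finds matters, because the regex alternation tries the alternatives from left to right.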

Probably not the best way:

df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))

Printing the result with:

print(df)

gives:

                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Building on @pythonjokeun's answer, you can also build the same pattern with other string-formatting styles:

df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))

Or:

df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))

Or:

df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")

I don't know how large your dataset is, but you can use the map function as shown below:

import pandas

def subset_df_test():
  df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                         "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"], 'x1': [4, 5, 6, 8]})

  finds = ["m500_0", "0_500", "m150_0"]
  df['t'] = df['x'].map(lambda x: compare(x, finds))
  print(df)

def compare(x, finds):
  # Return the first substring in finds that occurs in x (None if nothing matches).
  for f in finds:
    if f in x:
        return f

Use pandas.Series.str.findall:

df['x'].str.findall("|".join(finds))

0    [m500_0]
1    [m500_0]
2     [0_500]
3    [m150_0]
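Note that findall returns a list per row rather than a plain string. As a sketch of one way to turn that into the column the question asks for, assuming only the first match per row is wanted, the .str[0] accessor can pick the first element of each list. The matches variable name is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# One list of matches per row.
matches = df["x"].str.findall("|".join(finds))

# Take the first element of each list; rows with an empty list
# should come out as missing values rather than raising an error.
df["t"] = matches.str[0]
print(df)
```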

Try this:

df["t"] = df["x"].apply(lambda x: [i for i in finds if i in x][0])
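Indexing the list with [0] raises an IndexError for any row in which none of the substrings occur. A possible variant, sketched here under the assumption that unmatched rows should simply get None, uses next with a default value. The helper name first_match is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]


def first_match(text, candidates):
    """Return the first candidate contained in text, or None if there is none."""
    # The generator stops at the first hit, so later candidates are not checked.
    return next((c for c in candidates if c in text), None)


df["t"] = df["x"].apply(lambda x: first_match(x, finds))
print(df)
```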
