Pandas check which substring is in column of strings

Question

I'm trying to write a function that adds a new column to a pandas DataFrame: for each row, it should work out which of several substrings occurs in a column of strings and use that substring as the value of the new column.

The problem is that the text to find does not appear at the same position in every value of column x.

import pandas as pd

df = pd.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                         "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                   'x1': [4, 5, 6, 8]})

finds = ["m500_0", "0_500", "m150_0"]

For each row of df["x"], I want to know which element of finds it contains.

I've written a function that works, but it is terribly slow on large datasets:

import re

def pd_create_substring_var(df, new_var_name="new_var", substring_list=("1",), var_ori="x"):
    # Slow: loops over every row and every substring in Python.
    df[new_var_name] = "na"
    col_ix = df.columns.get_loc(new_var_name)
    for ix in range(len(df)):
        value = df.iloc[ix][var_ori]
        for find in substring_list:
            # Each find is treated as a regular expression; the last match wins.
            for m in re.finditer(find, value):
                df.iat[ix, col_ix] = value[m.start():m.end()]
    return df


df = pd_create_substring_var(df,"t",finds,var_ori="x")

df 
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Answer 1

Does this accomplish what you need?

finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
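If the substrings could ever contain regex metacharacters (".", "+", "(" and so on), it is probably safer to escape them before joining. Below is a minimal sketch of that variant, using the data from the question; passing expand=False is my own choice so that the result is a Series rather than a one-column DataFrame.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# Escape each substring so it is matched literally, then build one
# alternation wrapped in a single capture group.
pattern = "(" + "|".join(re.escape(f) for f in finds) + ")"

# expand=False returns a Series; rows with no match become NaN.
df["t"] = df["x"].str.extract(pattern, expand=False)
print(df)
```

Note that the alternation is tried left to right at each position, so if one substring is a prefix of another, the order of finds decides which one is captured.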

Answer 2

Probably not the best way:

df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))

Now:

print(df)

gives:

                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Adding to @pythonjokeun's answer, you can also build the same pattern without an f-string:

df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))

Or:

df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))

Or:

df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")
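One difference between the two approaches in this answer is worth keeping in mind: the ''.join version concatenates every substring that is found, while str.extract keeps only the first match. A small sketch, using a made-up row (not from the question) that contains two of the substrings:

```python
import pandas as pd

finds = ["m500_0", "0_500", "m150_0"]

# Hypothetical value containing both "m500_0" and "0_500".
s = pd.Series(["var_m500_0_and_0_500"])

# List-comprehension version: joins every substring that occurs in the value.
joined = s.apply(lambda x: "".join([i for i in finds if i in x]))

# Regex version: captures only the first match, scanning left to right.
first = s.str.extract("(" + "|".join(finds) + ")", expand=False)

print(joined[0])  # m500_00_500
print(first[0])   # m500_0
```

With the data from the question each row contains exactly one substring, so both versions give the same column.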

Answer 3

I don't know how large your dataset is, but you can use the map function, as below:

import pandas

def compare(x, finds):
    # Return the first substring from finds that occurs in x
    # (implicitly returns None if none of them is found).
    for f in finds:
        if f in x:
            return f

def subset_df_test():
    df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                                 "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                           'x1': [4, 5, 6, 8]})

    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

Answer 4

Use pandas.Series.str.findall:

df['x'].str.findall("|".join(finds))

0    [m500_0]
1    [m500_0]
2     [0_500]
3    [m150_0]
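As the output shows, findall gives a list per row. If a plain string column is wanted, as in the question, one possible follow-up is to take the first element of each list with .str[0]. This is only a sketch and assumes the first match is the one you want:

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# findall returns a list of all matches for every row.
matches = df["x"].str.findall("|".join(finds))

# .str[0] picks the first element of each list.
df["t"] = matches.str[0]
print(df)
```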

Answer 5

Try this:

df["t"] = df["x"].apply(lambda x: [i for i in finds if i in x][0])
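One caveat: indexing with [0] raises an IndexError for any row that contains none of the substrings. A possible variant uses next() with a default instead. The "na" default mirrors the placeholder in the question's function, and the extra non-matching row is made up for illustration:

```python
import pandas as pd

finds = ["m500_0", "0_500", "m150_0"]

# The last value is a hypothetical row with no matching substring.
df = pd.DataFrame({"x": ["var_m500_0_somevartext",
                         "varwithsomeothertext_0_500",
                         "no_substring_here"]})

# next() returns the first substring found in x, or "na" if there is none.
df["t"] = df["x"].apply(lambda x: next((i for i in finds if i in x), "na"))
print(df)
```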

5 answers

solution1
3 2019-04-18 09:35:04

solution2
1 ACCEPTED 2019-04-18 09:36:02

solution3
1 2019-04-18 09:50:40

solution4
1 2019-04-18 10:12:50

solution5
0 2019-04-18 10:46:12
