
Fill pd.Series from list of values if value exists in another df.Series description

I need to solve a tricky problem while keeping the computational complexity (big O) as low as possible.

I have two pandas dataframes.

The first df looks like this:

| source | searchTermsList |
|:---- |:------:|
| A  | [t1, t2, t3,...tn] |
| B  | [t4, t5, t6,...tn] |
| C  | [t7, t8, t9,...tn] |

The first column is a string; the second is a list of strings with no duplicates, only unique values.

For the second dataframe, I need to create a new pd.Series that holds the matching value from the first column (df1.source) whenever a term from a searchTermsList appears in the description column of df2 (dataDescr in the example below).

Example:

| objID | dataDescr |
|:---- |:------:|
| 1  | The first description name has t2 |
| 2  | The second description name has t6 and t7 |
| 3  | The third description name has t8, t1, t9 |

Expected results

| objID | dataDescr | source |
|:---- |:------:| -----:|
| 1  | The first description name    | A |
| 2  | The second description name    | B |
| 3  | The third description name    | C |

Explanation

  • The first description contains t2, so the column is filled with A, because t2 appears in A's term list.

  • The second description contains two terms, t6 and t7. In that case only the first one is matched; t6 is in the second list, so source is filled with B.

  • The third description contains three terms. As above, only the first one (t8) is matched against the lists, so source is filled with C.
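For reference, the example tables above can be reproduced with a small setup like the following. This is only a sketch: the term lists are cut to three items each (the real lists run to tn), and the variable names `df1` and `df2` follow the wording of the question.

```python
import pandas as pd

# Toy data mirroring the tables above; the real term lists are longer.
df1 = pd.DataFrame({
    "source": ["A", "B", "C"],
    "searchTermsList": [
        ["t1", "t2", "t3"],
        ["t4", "t5", "t6"],
        ["t7", "t8", "t9"],
    ],
})

df2 = pd.DataFrame({
    "objID": [1, 2, 3],
    "dataDescr": [
        "The first description name has t2",
        "The second description name has t6 and t7",
        "The third description name has t8, t1, t9",
    ],
})
```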

My approach

If I split the description into words and then search for each word in all the lists, the computational cost may be very high. The idea of using map doesn't work, because the dataframes are not ordered: the first has only 10-20 rows with unique values, while in the second each description would have to be matched against every term n times.

Any suggestions, please?
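On the cost concern: one way to keep the word-splitting idea cheap is to invert the lists into a single term-to-source dictionary, so each word costs one average O(1) lookup instead of a scan over every list. The sketch below is plain Python and rests on assumptions not stated in the question: every term is a single word, each term belongs to exactly one source, and the helper name `first_source` is made up for illustration.

```python
import re

def first_source(description, term_to_source):
    """Return the source of the first known term in `description`, else None."""
    # Walk the words left to right so the earliest term in the text wins.
    for word in re.findall(r"\w+", description):
        source = term_to_source.get(word)
        if source is not None:
            return source
    return None

# Shortened term lists (assumption: the real ones are longer).
term_lists = {
    "A": ["t1", "t2", "t3"],
    "B": ["t4", "t5", "t6"],
    "C": ["t7", "t8", "t9"],
}

# Invert once: term -> source. Built in time linear in the number of terms.
term_to_source = {
    term: source for source, terms in term_lists.items() for term in terms
}

descriptions = [
    "The first description name has t2",
    "The second description name has t6 and t7",
    "The third description name has t8, t1, t9",
]

result = [first_source(d, term_to_source) for d in descriptions]
print(result)  # ['A', 'B', 'C']
```

With this layout the total work grows with the number of words across all descriptions plus the number of terms, rather than with their product.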

Use:

    # Build a term -> source lookup: one row per search term
    s = df1.explode('searchTermsList').set_index('searchTermsList')['source']
    print(s)
    # t1    A
    # t2    A
    # t3    A
    # t4    B
    # t5    B
    # t6    B
    # t7    C
    # t8    C
    # t9    C
    # Name: source, dtype: object

    # One regex alternation over all terms, anchored on word boundaries
    pat = r"\b({})\b".format("|".join(s.index))

    # Extract the first matching term in each description and map it to its source
    df2['searchTermsList'] = df2['dataDescr'].str.extract(pat, expand=False).map(s)
    print(df2)
    #    objID                                  dataDescr searchTermsList
    # 0      1          The first description name has t2               A
    # 1      2  The second description name has t6 and t7               B
    # 2      3  The third description name has t8, t1, t9               C
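
The same approach as a self-contained sketch, with two small changes that are assumptions rather than part of the original answer: each term is passed through `re.escape` in case real terms contain regex metacharacters, and the result is written to a column named `source` to match the expected output in the question. It also assumes each term belongs to exactly one source, since the lookup Series needs a unique index for `map`.

```python
import re

import pandas as pd

# Toy data from the question, with shortened term lists.
df1 = pd.DataFrame({
    "source": ["A", "B", "C"],
    "searchTermsList": [
        ["t1", "t2", "t3"],
        ["t4", "t5", "t6"],
        ["t7", "t8", "t9"],
    ],
})
df2 = pd.DataFrame({
    "objID": [1, 2, 3],
    "dataDescr": [
        "The first description name has t2",
        "The second description name has t6 and t7",
        "The third description name has t8, t1, t9",
    ],
})

# One row per term, indexed by term, so the Series works as a lookup table.
s = df1.explode("searchTermsList").set_index("searchTermsList")["source"]

# Escape every term so it is matched literally, then join into one alternation.
pat = r"\b({})\b".format("|".join(re.escape(term) for term in s.index))

# str.extract returns the first match per row (NaN if no term is present);
# map then translates the matched term into its source.
df2["source"] = df2["dataDescr"].str.extract(pat, expand=False).map(s)
```

Each description is scanned once by the regex engine, and rows without any known term end up with NaN in `source`.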
