
Merge two data frames after using str.contains?

I have two data frames. I want to match partial strings using the str.contains function and then merge them.

Here is an example:

data1

    email           is_mane   name                    id
    hi@amal.com     1         there is rain           10
    hi2@amal.com    1         here is the food        9
    hi3@amal.com    1         let's go together       8
    hi4@amal.com    1         today is my birthday    6


data2

    id  name
    1   the rain is beautiful
    1   the food
    2   together
    4   my birthday
    3   your birthday

And here is the code I wrote:

data1.loc[data1.name.str.contains('|'.join(data2.name)), :]

And the output:

    email           is_mane   name                    id
    hi2@amal.com    1         here is the food        9
    hi3@amal.com    1         let's go together       8
    hi4@amal.com    1         today is my birthday    6

As you can see, it did not return "there is rain", even though the word "rain" is contained in data2. Could it be because of the space?

I also want to merge data1 with data2, so that I can see which email has a match.

I would like to have the following output:


    email           is_mane   name                    id   id2   name2
    hi2@amal.com    1         here is the food        9    1     the food
    hi3@amal.com    1         let's go together       8    2     together
    hi4@amal.com    1         today is my birthday    6    4     my birthday
    hi4@amal.com    1         today is my birthday    6    3     your birthday

Is there any way to do it?

If matching only whole words is acceptable (so, e.g., dog and dogs won't match), you can do:

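# Split each name into words (\W+ matches runs of non-word characters)
data1["key"] = data1["name"].str.split(r"\W+")
data2["key"] = data2["name"].str.split(r"\W+")

# Explode to one row per word, join on the word, then drop the helper column and duplicate pairs
data3 = data1.explode("key").merge(data2.explode("key"), on="key", suffixes=("", "2")).drop("key", axis=1).drop_duplicates()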
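
For anyone who wants to try the word-based merge end to end, here is a minimal sketch that rebuilds the two frames from the question and wraps the steps in a helper. The helper name `word_merge` is hypothetical, and the split pattern `\W+` is an assumption about what "whole words" should mean here.

```python
import pandas as pd

# Sample frames rebuilt from the question.
data1 = pd.DataFrame({
    "email": ["hi@amal.com", "hi2@amal.com", "hi3@amal.com", "hi4@amal.com"],
    "is_mane": [1, 1, 1, 1],
    "name": ["there is rain", "here is the food",
             "let's go together", "today is my birthday"],
    "id": [10, 9, 8, 6],
})
data2 = pd.DataFrame({
    "id": [1, 1, 2, 4, 3],
    "name": ["the rain is beautiful", "the food", "together",
             "my birthday", "your birthday"],
})


def word_merge(left, right):
    """Hypothetical helper: pair rows whose "name" values share a whole word."""
    # One row per word; a multi-character pattern is treated as a regex.
    l = left.assign(key=left["name"].str.split(r"\W+")).explode("key")
    r = right.assign(key=right["name"].str.split(r"\W+")).explode("key")
    # Join on the shared word; right-hand duplicates become id2 / name2.
    merged = l.merge(r, on="key", suffixes=("", "2"))
    # A pair sharing several words appears several times, so de-duplicate.
    return merged.drop(columns="key").drop_duplicates().reset_index(drop=True)


result = word_merge(data1, data2)
print(result)
```

Note that with this sample data, common words such as "is" and "the" also count as shared words, so the result contains more pairs than the output requested in the question (for example, "there is rain" is paired with "the rain is beautiful").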

Otherwise, it's a matter of doing a cross join and applying str.contains(...) to filter out the pairs that don't match.
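
A rough sketch of the cross-join route, under a couple of assumptions: it relies on `how="cross"` (available in pandas 1.2 and later), and it uses Python's `in` operator row by row in place of `str.contains`, since the string to look for differs for every pair. The frames are rebuilt from the question.

```python
import pandas as pd

# Sample frames rebuilt from the question.
data1 = pd.DataFrame({
    "email": ["hi@amal.com", "hi2@amal.com", "hi3@amal.com", "hi4@amal.com"],
    "is_mane": [1, 1, 1, 1],
    "name": ["there is rain", "here is the food",
             "let's go together", "today is my birthday"],
    "id": [10, 9, 8, 6],
})
data2 = pd.DataFrame({
    "id": [1, 1, 2, 4, 3],
    "name": ["the rain is beautiful", "the food", "together",
             "my birthday", "your birthday"],
})

# Cross join: every data1 row paired with every data2 row.
# Overlapping columns from data2 are renamed to id2 / name2.
pairs = data1.merge(data2, how="cross", suffixes=("", "2"))

# Keep a pair only when the data2 phrase occurs inside the data1 name.
mask = [n2 in n1 for n1, n2 in zip(pairs["name"], pairs["name2"])]
result = pairs[mask].reset_index(drop=True)
print(result)
```

This also shows why "there is rain" was not returned in the question: the test is whether a whole data2 phrase occurs inside the data1 name, and "the rain is beautiful" does not occur inside "there is rain". For the same reason, "your birthday" is not matched to "today is my birthday" here; getting that pair would need the word-level matching instead.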
