
How to check if a list of words is contained in another list in a pandas dataframe?

I am trying to compare two lists of words stored in separate columns of a dataframe and print the words they have in common. After that, I want to compute a column common_count, which is the number of common words divided by the total number of words in the first list. The final output would look like this:

[Image: expected output]

Snippet to create the dataframe:

import pandas as pd

raw_data = [{'id': 1, 'name': '[corporation, fluor]', 'name_ref': '[constructors, fluor, incorporated, intl]'},
            {'id': 2, 'name': '[community, foundation]', 'name_ref': '[community, county, foundation, of, the, westmoreland]'},
            {'id': 3, 'name': '[fo, minnesota, vikings]', 'name_ref': '[development, inc, minnesota, vikings]'}]

df = pd.DataFrame.from_dict(raw_data)

Please suggest how I can derive the common and common_count columns, using either pandas or PySpark.

You can strip the surrounding brackets, split each string on ", ", and use array_intersect to find the common elements:

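import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists

df = pd.DataFrame.from_dict(raw_data)  # raw_data as defined in the question
sdf = spark.createDataFrame(df)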

result = sdf.selectExpr(
    'id',
    "split(trim('][', name), ', ') name",
    "split(trim('][', name_ref), ', ') name_ref"
).withColumn(
    'common',
    F.array_intersect('name', 'name_ref')
).withColumn(
    'common_count',
    F.size('common') / F.size('name')
)

result.show(truncate=False)
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
|id |name                    |name_ref                                              |common                 |common_count      |
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
|1  |[corporation, fluor]    |[constructors, fluor, incorporated, intl]             |[fluor]                |0.5               |
|2  |[community, foundation] |[community, county, foundation, of, the, westmoreland]|[community, foundation]|1.0               |
|3  |[fo, minnesota, vikings]|[development, inc, minnesota, vikings]                |[minnesota, vikings]   |0.6666666666666666|
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
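To sanity-check the Spark output without a cluster, the same per-row computation can be sketched in plain Python. The helper names `intersect_in_order` and `common_ratio` are hypothetical, and the sketch assumes each common word should appear once, in the order of the first list, which is consistent with the table above.

```python
def intersect_in_order(first, second):
    """Words of `first` that also occur in `second`, each listed once,
    in the order they appear in `first` (assumed semantics)."""
    lookup = set(second)
    seen = set()
    common = []
    for word in first:
        if word in lookup and word not in seen:
            seen.add(word)
            common.append(word)
    return common


def common_ratio(first, second):
    """Number of common words divided by the number of words in `first`."""
    return len(intersect_in_order(first, second)) / len(first)


# Row with id 3 from the question, already split into lists
name = ["fo", "minnesota", "vikings"]
name_ref = ["development", "inc", "minnesota", "vikings"]

print(intersect_in_order(name, name_ref))  # ['minnesota', 'vikings']
print(common_ratio(name, name_ref))        # 0.6666666666666666
```

Running the helpers on the other two rows should give 0.5 and 1.0, matching the common_count column.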

Here's one way using pandas:

def string_to_array(s):
    return [x.strip() for x in s.strip("[]").split(",")]


df['name'] = df['name'].apply(string_to_array)
df['name_ref'] = df['name_ref'].apply(string_to_array)
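# Pair the two list columns by name instead of by position in df.values;
# sorted() keeps the result order deterministic, since set order is not.
df['common'] = [sorted(set(a) & set(b)) for a, b in zip(df['name'], df['name_ref'])]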
df['common_count'] = df['common'].str.len() / df['name'].str.len()

print(df)

#    id                      name                                           name_ref                   common  common_count
# 0   1      [corporation, fluor]          [constructors, fluor, incorporated, intl]                  [fluor]      0.500000
# 1   2   [community, foundation]  [community, county, foundation, of, the, westm...  [community, foundation]      1.000000
# 2   3  [fo, minnesota, vikings]             [development, inc, minnesota, vikings]     [minnesota, vikings]      0.666667
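One edge case worth noting: with the string_to_array helper above, an empty list string such as '[]' parses to [''] (a list holding one empty string), so its length is 1 rather than 0. Below is a minimal sketch of a stricter parser and a ratio that avoids dividing by zero. `parse_word_list` and `safe_common_ratio` are hypothetical names, and returning `None` for an empty first list is an assumption about the desired behavior, not something the question specifies.

```python
def parse_word_list(s):
    """Turn '[a, b]' into ['a', 'b']; '[]' becomes [] instead of ['']."""
    parts = s.strip().strip("[]").split(",")
    return [p.strip() for p in parts if p.strip()]


def safe_common_ratio(name, name_ref):
    """Number of distinct common words divided by len(name).
    Returns None when `name` is empty (assumed convention)."""
    if not name:
        return None
    return len(set(name) & set(name_ref)) / len(name)


print(parse_word_list("[]"))  # []
print(safe_common_ratio(
    parse_word_list("[corporation, fluor]"),
    parse_word_list("[constructors, fluor, incorporated, intl]"),
))  # 0.5
```

With the dataframe from the question, the parser could be used in place of string_to_array, for example df['name'].apply(parse_word_list).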
