I am trying to compare two lists of words stored in separate columns of a dataframe and extract the words they have in common. After that I want to calculate a column common_count,
which is the number of common words divided by the total number of words in the first list.
Snippet to create the dataframe:
import pandas as pd

# Each list is stored as a plain string such as '[corporation, fluor]'
raw_data = [
    {'id': 1, 'name': '[corporation, fluor]', 'name_ref': '[constructors, fluor, incorporated, intl]'},
    {'id': 2, 'name': '[community, foundation]', 'name_ref': '[community, county, foundation, of, the, westmoreland]'},
    {'id': 3, 'name': '[fo, minnesota, vikings]', 'name_ref': '[development, inc, minnesota, vikings]'},
]
df = pd.DataFrame.from_dict(raw_data)
Please suggest how I can derive the common and common_count columns using either pandas or PySpark.
In PySpark, you can trim the brackets, split by comma, and use array_intersect to find the common elements:
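import pandas as pd
import pyspark.sql.functions as F

# Assumes an active SparkSession named `spark` and `raw_data` as defined in the question
df = pd.DataFrame.from_dict(raw_data)
sdf = spark.createDataFrame(df)

result = sdf.selectExpr(
    'id',
    # Strip the surrounding brackets, then split the string into an array of words
    "split(trim('][', name), ', ') name",
    "split(trim('][', name_ref), ', ') name_ref"
).withColumn(
    'common',
    F.array_intersect('name', 'name_ref')
).withColumn(
    # Share of the words in `name` that also appear in `name_ref`
    'common_count',
    F.size('common') / F.size('name')
)

result.show(truncate=False)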
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
|id |name |name_ref |common |common_count |
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
|1 |[corporation, fluor] |[constructors, fluor, incorporated, intl] |[fluor] |0.5 |
|2 |[community, foundation] |[community, county, foundation, of, the, westmoreland]|[community, foundation]|1.0 |
|3 |[fo, minnesota, vikings]|[development, inc, minnesota, vikings] |[minnesota, vikings] |0.6666666666666666|
+---+------------------------+------------------------------------------------------+-----------------------+------------------+
Here's one way using pandas:
def string_to_array(s):
    # '[a, b]' -> ['a', 'b']
    return [x.strip() for x in s.strip("[]").split(",")]

df['name'] = df['name'].apply(string_to_array)
df['name_ref'] = df['name_ref'].apply(string_to_array)

# Set intersection per row; note that a set does not guarantee the order of the result
df['common'] = [list(set(a) & set(b)) for a, b in zip(df['name'], df['name_ref'])]
df['common_count'] = df['common'].str.len() / df['name'].str.len()

print(df)
# id name name_ref common common_count
# 0 1 [corporation, fluor] [constructors, fluor, incorporated, intl] [fluor] 0.500000
# 1 2 [community, foundation] [community, county, foundation, of, the, westm... [community, foundation] 1.000000
# 2 3 [fo, minnesota, vikings] [development, inc, minnesota, vikings] [minnesota, vikings] 0.666667
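If the order of the words in common matters, or some rows might have an empty name list, a variant along these lines may help. This is only a sketch on the question's sample data: the helper names parse_words and common_in_order are made up for illustration, and returning NaN for an empty name list is an assumption about what you would want in that case.

```python
import pandas as pd

def parse_words(s):
    """Turn a string such as '[a, b]' into ['a', 'b']; '[]' gives []."""
    inner = s.strip().strip("[]")
    return [w.strip() for w in inner.split(",") if w.strip()]

def common_in_order(name, name_ref):
    """Words of `name` that also occur in `name_ref`, in `name` order, without repeats."""
    ref = set(name_ref)
    seen = set()
    out = []
    for w in name:
        if w in ref and w not in seen:
            seen.add(w)
            out.append(w)
    return out

raw_data = [
    {'id': 1, 'name': '[corporation, fluor]', 'name_ref': '[constructors, fluor, incorporated, intl]'},
    {'id': 2, 'name': '[community, foundation]', 'name_ref': '[community, county, foundation, of, the, westmoreland]'},
    {'id': 3, 'name': '[fo, minnesota, vikings]', 'name_ref': '[development, inc, minnesota, vikings]'},
]
df = pd.DataFrame(raw_data)

df['name'] = df['name'].apply(parse_words)
df['name_ref'] = df['name_ref'].apply(parse_words)
df['common'] = [common_in_order(a, b) for a, b in zip(df['name'], df['name_ref'])]

# Assumption: rows with an empty `name` list get NaN rather than a misleading ratio
n_name = df['name'].str.len()
df['common_count'] = (df['common'].str.len() / n_name).where(n_name > 0)

print(df)
```

Unlike the set-based version, the words in common always come out in the order they have in name, so the result is reproducible from run to run.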