How do I map a dictionary with a set of strings to the column of a data frame?

Question

I have a data frame with a column named text, and I want to fill a new column named category depending on whether the text contains one or more substrings from a dictionary. If the text column contains one of the substrings, the corresponding dictionary key should be assigned to category.

This is what my code looks like:

import pandas as pd

some_strings = ['Apples and pears and cherries and bananas', 
                'VW and Ford and Lamborghini and Chrysler and Hyundai', 
                'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']

test_df = pd.DataFrame(some_strings, columns = ['text'])

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'}, 
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'}, 
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

The dictionary cat_map contains sets of strings as values. If the text column in test_df contains any of those words, I want the corresponding dictionary key to be assigned as the value in the new category column. The output data frame should look like this:

output_frame = pd.DataFrame({'text': some_strings, 
                            'category': categories})

Any help on this would be appreciated.

Answer 1

You can try

d = {v:k for k, s in cat_map.items() for v in s}

test_df['category'] = (test_df['text'].str.extractall('('+'|'.join(d)+')')
                       [0].map(d)
                       .groupby(level=0).agg(set))

print(d)

{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}


print(test_df)

                                                   text    category
0             Apples and pears and cherries and bananas    {fruits}
1  VW and Ford and Lamborghini and Chrysler and Hyundai      {cars}
2                Berlin and Paris and Athens and London  {capitals}

Answer 2

I'm not exactly sure what you're trying to achieve, but if I understood correctly, you could check whether any word in the string is present in your cat_map:

import pandas as pd

results = {"text": [], "category": []}

for element in some_strings:
    for key, value in cat_map.items():
        # Check whether any word of the current string is in the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)

df = pd.DataFrame.from_dict(results)
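
The loop above only keeps texts that match something, and it compares words case-sensitively, so 'Apples' does not match 'apples'. As a rough sketch of the same set-intersection idea (the helper name matching_categories and the lower-casing step are my own assumptions, not part of the answer), you could collect every matching key per text like this:

```python
def matching_categories(text, cat_map):
    """Return the sorted keys of cat_map whose word set shares a word with text."""
    # Assumption: matching should ignore case, so both sides are lower-cased.
    words = {word.lower() for word in text.split()}
    return sorted(
        key
        for key, values in cat_map.items()
        if words & {value.lower() for value in values}
    )


some_strings = ['Apples and pears and cherries and bananas',
                'VW and Ford and Lamborghini and Chrysler and Hyundai',
                'Berlin and Paris and Athens and London']

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

# One entry per input text; an empty list means no category matched.
results = {"text": some_strings,
           "category": [matching_categories(s, cat_map) for s in some_strings]}
```

The results dictionary has one entry per text, so it can be passed to pd.DataFrame in the same way as above. A text that matches several categories gets all of them in one list instead of producing several rows.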

Answer 3

One approach:

lookup = { word : label for label, words in cat_map.items() for word in words }
pattern = fr"\b({'|'.join(lookup)})\b"

test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)
print(test_df)

Output

                                                text  category
0          Apples and pears and cherries and bananas    fruits
1  VW and Ford and Lamborghini and Chrysler and H...      cars
2             Berlin and Paris and Athens and London  capitals
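
Note that the pattern above is case-sensitive and inserts the words into the regex unescaped. If that matters for your data, a possible variant with plain re is sketched below. The function name first_category, the lower-cased lookup keys and the use of re.escape are my assumptions for illustration, not part of the original approach:

```python
import re

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

# Invert the mapping; keys are lower-cased so matches can be looked up
# regardless of how they are capitalised in the text.
lookup = {word.lower(): label for label, words in cat_map.items() for word in words}

# re.escape guards against words that contain regex metacharacters.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in lookup) + r")\b",
    flags=re.IGNORECASE,
)


def first_category(text):
    """Return the category of the first matching word, or None if nothing matches."""
    match = pattern.search(text)
    return lookup[match.group(1).lower()] if match else None
```

You should then be able to apply it with test_df["text"].map(first_category). As in the answer above, only the first matching word decides the category.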

Answer 4

You can try this one

results = {"text": [], "category": []}
for text in some_strings:
    for key in cat_map.keys():
        for word in set(text.split(" ")):
            if word in cat_map[key]:
                results["text"].append(text)
                results["category"].append(key)
df = pd.DataFrame.from_dict(results)
# drop_duplicates returns a new data frame, so assign the result back
df = df.drop_duplicates()
