Frequency of comma-separated strings in Python

Question

I'm trying to find the frequency of the strings in the "Select Investors" field on this website: https://www.cbinsights.com/research-unicorn-companies

Is there a way to pull out the frequency of each of the comma-separated strings?

For example, how often does the term "Sequoia Capital China" show up?

Answer 1

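import pandas as pd

# Extract data: read_html returns a list of DataFrames, one per HTML table
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]

column = "Select Investors"
# Split every row on commas and collect the lower-cased investor names
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency: count the rows whose text contains the name (substring match)
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)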

Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
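Note that the loop above iterates over every entry in all_investor, so a name that appears in many rows is printed once per appearance. A minimal sketch of one way to report each name only once, keeping the same substring test. The `rows` list is a hypothetical stand-in for first_df[column], not data from the site:

```python
# Hypothetical sample standing in for first_df["Select Investors"]
rows = [
    "Sequoia Capital, SoftBank",
    "Sequoia Capital China, SoftBank",
    "Index Ventures, Sequoia Capital",
]

# Collect each distinct investor name once, preserving first-seen order
unique_investors = list(dict.fromkeys(
    name.strip().lower() for row in rows for name in row.split(",")
))

# Same substring test as the loop above: count rows whose text contains the name
row_counts = {
    name: sum(name in row.lower() for row in rows)
    for name in unique_investors
}

for name, count in row_counts.items():
    print(name, count)
```

Because the test is still a substring match, "sequoia capital" is counted in the "Sequoia Capital China" row as well.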

Answer 2

The solution provided by @Mazhar checks whether a given term is a substring of each comma-delimited string rather than an exact match for one of its items. As a consequence, the number of occurrences of 'Sequoia Capital' returned by that approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue by splitting each row on commas and counting exact names:

import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]

# Map each exact investor name to the number of times it is listed
freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip non-string cells (e.g. NaN for a missing value)
    if hasattr(group, 'lower'):
        for investor in group.lower().split(','):
            freqs[investor.strip()] += 1

Demo

In [57]: freqs['sequoia capital']
Out[57]: 41

In [58]: freqs['sequoia capital china']
Out[58]: 46

In [59]: freqs['sequoia capital india']
Out[59]: 25

In [60]: freqs['sequoia capital israel']
Out[60]: 2

In [61]: freqs['and sequoia capital china']
Out[61]: 1

These five counts sum to 115 (41 + 46 + 25 + 2 + 1), which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
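The same exact-match counting can also be written with pandas string methods alone. This is a sketch, not a replacement for the code above: the `investors` Series is a hypothetical sample standing in for df['Select Investors'], and the counts it produces are for that sample only.

```python
import pandas as pd

# Hypothetical sample standing in for df['Select Investors'];
# None mimics a row where the field is missing
investors = pd.Series([
    "Sequoia Capital, SoftBank",
    "Sequoia Capital China, SoftBank",
    None,
    "Index Ventures, Sequoia Capital",
])

counts = (
    investors.dropna()      # drop missing cells
    .str.lower()            # normalise case
    .str.split(",")         # one list of names per row
    .explode()              # one name per row
    .str.strip()            # remove whitespace around each name
    .value_counts()         # exact-match frequency of each name
)

print(counts["sequoia capital"])
print(counts["sequoia capital china"])
```

Here `counts` is a Series indexed by investor name, so a single lookup such as counts["sequoia capital china"] answers the original question, and counts.head(10) would list the most frequent names.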

2 answers

solution1
1 ACCEPTED 2022-01-28 04:38:02

solution2
0 2022-01-28 05:56:46
