
Frequency of comma-separated strings in Python

I'm trying to find the frequency of the strings in the "Select Investors" field on this website: https://www.cbinsights.com/research-unicorn-companies

Is there a way to pull out the frequency of each of the comma-separated strings?

For example, how often does the term "Sequoia Capital China" show up?

import pandas as pd

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)  # returns a list of DataFrames, one per HTML table
first_df = df[0]


column = "Select Investors"
# Collect every comma-separated name, lowercased
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency: count the rows whose text contains the name
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)

Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
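One thing to note about the loop above: `all_investor` keeps every occurrence of every name, so a name that appears in many rows is printed (and the whole column rescanned) once per occurrence. A minimal sketch of the same substring count run once per distinct name is below. The `rows` list is made-up sample data standing in for `first_df[column]`, and `unique_investors` and `substring_counts` are names introduced here for illustration only.

```python
# Hypothetical rows standing in for first_df["Select Investors"].
rows = [
    "Sequoia Capital China, SoftBank",
    "Sequoia Capital, SoftBank",
    "Sequoia Capital India, Tencent Holdings",
]

# Lowercase each cell once instead of on every comparison.
lowered = [str(row).lower() for row in rows]

# Collect each distinct investor name a single time.
unique_investors = sorted({
    name.strip()
    for cell in lowered
    for name in cell.split(",")
})

# Same substring test as the original loop, one pass per distinct name.
substring_counts = {
    name: sum(name in cell for cell in lowered)
    for name in unique_investors
}

for name, count in substring_counts.items():
    print(name, count)
```

This only removes the repeated lines. It is still a substring test, so a short name is also counted inside any longer name that contains it.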

The solution provided by @Mazhar checks whether a certain term is a substring of a string delimited by commas. As a consequence, the number of occurrences of 'Sequoia Capital' returned by this approach is the sum of the occurrences of all the strings that contain 'Sequoia Capital', namely 'Sequoia Capital', 'Sequoia Capital China', 'Sequoia Capital India', 'Sequoia Capital Israel' and 'and Sequoia Capital China'. The following code avoids that issue:

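import pandas as pd
from collections import defaultdict

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]  # first table on the page

# Count each comma-separated name as a whole token, not as a substring
freqs = defaultdict(int)
for group in df['Select Investors']:
    # Skip missing values: NaN is a float and has no .lower()
    if hasattr(group, 'lower'):
        for investor in group.lower().split(','):
            freqs[investor.strip()] += 1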

Demo

In [57]: freqs['sequoia capital']
Out[57]: 41

In [58]: freqs['sequoia capital china']
Out[58]: 46

In [59]: freqs['sequoia capital india']
Out[59]: 25

In [60]: freqs['sequoia capital israel']
Out[60]: 2

In [61]: freqs['and sequoia capital china']
Out[61]: 1

The sum of occurrences is 115, which coincides with the frequency returned for 'sequoia capital' by the currently accepted solution.
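The demo also shows that 'and sequoia capital china' is counted separately from 'sequoia capital china'. If those should be merged, one option is to normalize each name before counting. The sketch below uses `collections.Counter` on made-up sample rows rather than the live page. The `normalize` helper and its rule of dropping a leading "and " are assumptions for illustration, not part of either answer, and the rule would mangle a name that genuinely starts with "and ".

```python
from collections import Counter

# Hypothetical rows standing in for df['Select Investors'].
rows = [
    "Sequoia Capital China, Tencent Holdings, GGV Capital",
    "Sequoia Capital, Thoma Bravo, SoftBank",
    "DST Global, and Sequoia Capital China",
    None,  # stands in for a missing value
]


def normalize(name):
    """Lowercase, trim, and drop a leading 'and ' (an assumed cleanup rule)."""
    name = name.strip().lower()
    if name.startswith("and "):
        name = name[len("and "):].strip()
    return name


# Count whole comma-separated tokens, skipping non-string (missing) rows
# and empty tokens such as the one left by a trailing comma.
exact = Counter(
    normalize(name)
    for row in rows
    if isinstance(row, str)
    for name in row.split(",")
    if name.strip()
)

print(exact["sequoia capital china"])
print(exact["sequoia capital"])
```

With this sample, 'sequoia capital china' is counted twice (once from the "and ..." entry) and 'sequoia capital' once, so the exact-token counts stay separate from each other while the "and" variant is folded in.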

