
Counting Occurrences Within a List Within a Dataframe

I have a pandas dataframe that contains a list of articles: the outlet, publish date, link, etc. One of the columns in this dataframe is a list of keywords. For example, in the keyword column each cell contains a list like [drop, right, states, laws].

My ultimate goal is to count the number of occurrences of each unique word on each day. The challenge I'm having is breaking the keywords out of their lists and then matching them to the date on which they occurred... assuming this is even the most logical first step.
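To make the goal concrete, here is a hypothetical miniature of the input and the per-day counts I'm after (the column names are the ones used in the code below; the dates and keywords are made up):

import pandas as pd

# Made-up stand-in for the real dataframe of articles
sample = pd.DataFrame({
    'Captured_Date': ['2017-04-01', '2017-04-01', '2017-04-02'],
    'Keywords': [['drop', 'right'], ['drop', 'laws'], ['states']],
})

# Desired result: occurrences of each unique keyword on each day, e.g.
# Captured_Date  Keyword  Count
# 2017-04-01     drop         2
# 2017-04-01     laws         1
# 2017-04-01     right        1
# 2017-04-02     states       1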

At present I have a solution in the code below, but I'm new to Python, and when thinking through these things I still think in an Excel mindset. The code below works, but it's very slow. Is there a faster way to do this?

import time
import pandas as pd

# Create a list of the keywords for articles in the last 30 days to determine their quantity
keyword_list = stories_full_recent_df['Keywords'].tolist()
keyword_list = [item for sublist in keyword_list for item in sublist]

# Create a blank dataframe and a row counter to write the keyword appearances to
wordtrends_df = pd.DataFrame(columns=['Captured_Date', 'Brand', 'Coverage', 'Keyword'])
r = 0

print("Creating table on keywords: {:,}".format(len(keyword_list)))
print(time.strftime("%H:%M:%S"))

# Write each keyword out into its own row, carrying the date and origin on which it occurred
for i in stories_full_recent_df.index:
    words = stories_full_recent_df.loc[i]['Keywords']
    for word in words:
        wordtrends_df.loc[r] = [stories_full_recent_df.loc[i]['Captured_Date'],
                                stories_full_recent_df.loc[i]['Brand'],
                                stories_full_recent_df.loc[i]['Coverage'],
                                word]
        r += 1

print(time.strftime("%H:%M:%S"))
print("Keyword compilation complete.")

Once I have each word on its own row, I'm simply using .groupby() to figure out the number of occurrences each day.

# Group and count the keywords and days to find the day with the least of each word
test_min = wordtrends_df.groupby(['Keyword', 'Captured_Date'], as_index=False).count().sort_values(by=['Keyword', 'Brand'], ascending=True)
keyword_min = test_min.groupby(['Keyword'], as_index=False).first()

At present there are about 100,000 words in this list, and it takes me an hour to run through it. I'd love thoughts on a faster way to do this.

I think you can get the expected result by doing this:

wordtrends_df = pd.melt(pd.concat((stories_full_recent_df[['Brand', 'Captured_Date', 'Coverage']],
                                   stories_full_recent_df.Keywords.apply(pd.Series)),axis=1),
                        id_vars=['Brand','Captured_Date','Coverage'],value_name='Keyword')\
                  .drop(['variable'],axis=1).dropna(subset=['Keyword'])

An explanation with a small example follows.

Consider an example dataframe:

df = pd.DataFrame({'Brand': ['X', 'Y'],
 'Captured_Date': ['2017-04-01', '2017-04-02'],
 'Coverage': [10, 20],
 'Keywords': [['a', 'b', 'c'], ['c', 'd']]})
#   Brand Captured_Date  Coverage   Keywords
# 0     X    2017-04-01        10  [a, b, c]
# 1     Y    2017-04-02        20     [c, d]

The first thing you can do is expand the Keywords column so that each keyword occupies its own column:

a = df.Keywords.apply(pd.Series)
#    0  1    2
# 0  a  b    c
# 1  c  d  NaN

Concatenate this with the original df without the Keywords column:

b = pd.concat((df[['Captured_Date','Brand','Coverage']],a),axis=1)
#   Captured_Date Brand  Coverage  0  1    2
# 0    2017-04-01     X        10  a  b    c
# 1    2017-04-02     Y        20  c  d  NaN

Melt this last result to create a row per keyword:

c = pd.melt(b,id_vars=['Captured_Date','Brand','Coverage'],value_name='Keyword')
#   Captured_Date Brand  Coverage variable Keyword
# 0    2017-04-01     X        10        0       a
# 1    2017-04-02     Y        20        0       c
# 2    2017-04-01     X        10        1       b
# 3    2017-04-02     Y        20        1       d
# 4    2017-04-01     X        10        2       c
# 5    2017-04-02     Y        20        2     NaN

Finally, drop the useless variable column and drop the rows where Keyword is missing:

d = c.drop(['variable'],axis=1).dropna(subset=['Keyword'])
#   Captured_Date Brand  Coverage Keyword
# 0    2017-04-01     X        10       a
# 1    2017-04-02     Y        20       c
# 2    2017-04-01     X        10       b
# 3    2017-04-02     Y        20       d
# 4    2017-04-01     X        10       c

Now you're ready to count by keywords and dates.
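As a minimal sketch of that last step on the toy frame d above (a plain groupby/size; adapt the aggregation if you need something other than raw counts):

# Count how often each keyword appears on each date
counts = (d.groupby(['Keyword', 'Captured_Date'])
            .size()
            .reset_index(name='Count'))
#   Keyword Captured_Date  Count
# 0       a    2017-04-01      1
# 1       b    2017-04-01      1
# 2       c    2017-04-01      1
# 3       c    2017-04-02      1
# 4       d    2017-04-02      1

On the real data, the same groupby applied to wordtrends_df should replace the output of the slow row-by-row loop and can feed the existing keyword_min logic unchanged.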
