How to iterate through a nested for loop in pandas dataframe?

I am attempting to iterate through a Hacker News dataset and am trying to create 3 categories (i.e. types of posts) found on the HN forum, viz. ask_posts, show_posts and other_posts.

In short, I am trying to find the average number of comments per post for each category (described below).

import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
            total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)


print(total_ask_comments)
print(total_show_comments)

print(avg_ask_comments)
print(avg_show_comments)

The results respectively are:

395976587

250362315

and

43328.21829521829

24646.81187241583

These seem quite high, and I am not sure whether that is caused by the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.

Any and all help/verification of my code is appreciated.

This post doesn't specifically answer the question about looping through dataframes, but it gives you an alternative solution that is faster.

Looping over Pandas dataframes to gather this information is going to be tremendously slow. It's much faster to use filtering to get the information you want.

>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]

You can get your numbers very quickly this way:

>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>> 
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>> 
>>> # Get a ratio of the number ask comments to total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>> 
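The question asks for the average number of comments per post in each category, and the same filtering approach can give that directly. The following is only a minimal sketch: the small DataFrame is a made-up stand-in for the Hacker News CSV (its titles and counts are hypothetical), and it uses .startswith() as in the question's code.

```python
import pandas as pd

# Hypothetical stand-in for the Hacker News CSV; only the two columns used here.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Some story", "Show HN: D"],
    "num_comments": [10, 4, 20, 7, 6],
})

# Lower-case once, then build one filtered dataframe per category.
lower_titles = df["title"].str.lower()
ask_posts = df[lower_titles.str.startswith("ask hn")]
show_posts = df[lower_titles.str.startswith("show hn")]

# mean() is the sum of the column divided by the number of rows in the category.
avg_ask_comments = ask_posts["num_comments"].mean()
avg_show_comments = show_posts["num_comments"].mean()

print(avg_ask_comments)   # 15.0
print(avg_show_comments)  # 5.0
```

On the real file, the only change would be loading df with pd.read_csv as in the question.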

Also you'll get different numbers with .startswith() vs. .contains() (I'm not sure which you want).

>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>> 
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>> 

The pattern argument to .contains() can be a regular expression, which is very useful. So we can select all records that have "ask hn" at the very start of the title, and if we're not sure whether any whitespace comes before it, we can do

>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>> 
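To see where the three variants differ, here is a small sketch on a handful of invented titles (they are not from the dataset). It compares .startswith(), a plain .contains(), and the anchored regular expression.

```python
import pandas as pd

# Invented titles chosen to hit each edge case.
titles = pd.Series([
    "Ask HN: first",        # starts with the phrase
    "  ask hn: padded",     # leading whitespace before the phrase
    "Why I ask HN things",  # phrase appears mid-title
    "Show HN: demo",        # different category
])

starts = titles.str.lower().str.startswith("ask hn")
contains = titles.str.contains("ask hn", case=False)
anchored = titles.str.contains(r"^\s*ask hn", case=False)

print(starts.tolist())    # [True, False, False, False]
print(contains.tolist())  # [True, True, True, False]
print(anchored.tolist())  # [True, True, False, False]
```

Only the plain .contains() picks up the mid-title match, and only the anchored pattern tolerates leading whitespace while still requiring the phrase at the start.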

What's happening in the filter statements is probably difficult to grasp when you're starting out with Pandas. Take, for instance, the expression in the square brackets of df[df.title.str.contains("show hn", case=False)].

What the statement inside the square brackets ( df.title.str.contains("show hn", case=False) ) produces is a column of True and False values: a boolean filter (usually called a boolean mask).

That boolean column is then used to select rows in the dataframe, df[<bool column>] , which produces a new dataframe with the matching records. We can then use that to extract other information, like the sum of the comments column.
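The two steps can be separated to make them visible. This is a sketch on a three-row toy dataframe (the rows are made up for illustration), with the mask stored in a variable before it is used for selection.

```python
import pandas as pd

# Toy data, not taken from the Hacker News file.
df = pd.DataFrame({
    "title": ["Show HN: demo", "Ask HN: why?", "Plain story"],
    "num_comments": [3, 8, 1],
})

# Step 1: the expression alone yields one True/False value per row.
mask = df["title"].str.contains("show hn", case=False)
print(mask.tolist())  # [True, False, False]

# Step 2: indexing with the mask keeps only the rows marked True.
show_posts = df[mask]
print(len(show_posts))                   # 1
print(show_posts["num_comments"].sum())  # 3
```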

Iterating through pandas DataFrame objects is generally slow. Iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and is something you should only do when you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution or the DataFrame.apply() method than to iterate through a DataFrame. List comprehension example:

result = [(x, y, z) for x, y, z in zip(df['column1'], df['column2'], df['column3'])]
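If a for loop really is required, as the question states, the same zip idea can be applied to it. In the question's code the inner loops (for post in ask_posts and for post in show_posts) add the current row's comment count once for every post collected so far, which is why the totals come out so large. The sketch below adds each row's comments exactly once. The DataFrame is a hypothetical stand-in for the CSV, and the counter names ask_count and show_count are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Hacker News CSV.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Some story", "Show HN: D"],
    "num_comments": [10, 4, 20, 7, 6],
})

ask_count = show_count = 0
total_ask_comments = total_show_comments = 0

# zip over the two columns avoids the per-row overhead of iterrows().
for title, comments in zip(df["title"], df["num_comments"]):
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        ask_count += 1
        total_ask_comments += comments   # added once per post, no inner loop
    elif lowered.startswith("show hn"):
        show_count += 1
        total_show_comments += comments

# Guard against an empty category before dividing.
avg_ask_comments = total_ask_comments / ask_count if ask_count else float("nan")
avg_show_comments = total_show_comments / show_count if show_count else float("nan")

print(total_ask_comments, avg_ask_comments)    # 30 15.0
print(total_show_comments, avg_show_comments)  # 10 5.0
```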
