
In a pandas dataframe, count the number of times a condition occurs in one column?

Background

I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:

         Date Hour Location  NO2_Level
0  01/01/2016   00   Street         18
1  01/01/2016   01   Street         39
2  01/01/2016   02   Street        129
3  01/01/2016   03   Street         76
4  01/01/2016   04   Street         40

Goal

For each dataframe, count the number of times NO2_Level is greater than 150 and output the result.

So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.

Problem

Whatever I've tried produces results that I know, on inspection, are incorrect, e.g.:

- the count value for every location in a given year is the same (possible but unlikely)
- for a year when I know the count should be a positive number, every location returns 0

What I've tried

I have tried a lot of approaches to getting this value for each dataframe, such as making the column a Series:

NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()

Using DataFrame.count():

count = df[df['NO2_Level'] >= 150].count()

These two approaches have gotten closest to what I want to output.
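
The two attempts above return different kinds of result, which may be worth checking first. The following is a minimal sketch on made-up values (the numbers are illustrative and not taken from the measurement files): summing a boolean mask gives a single integer, while `count()` on a filtered dataframe gives one non-null count per column.

```python
import pandas as pd

# Illustrative values only; not taken from the real measurement files.
df = pd.DataFrame({
    'Location': ['Street'] * 5,
    'NO2_Level': [18, 151, 129, 176, 150],
})

# Summing a boolean mask: each True counts as 1, so this is a single integer.
mask_count = int((df['NO2_Level'] > 150).sum())

# Filtering and then calling count(): a Series with one non-null count per column.
per_column = df[df['NO2_Level'] > 150].count()

# Length of the filtered frame: another way to get a single integer.
row_count = len(df[df['NO2_Level'] > 150])

print(mask_count)
print(per_column)
print(row_count)
```

Note that the comparison is strict here, so a reading of exactly 150 is not counted; with `>=` it would be.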

Example to test on

import pandas as pd

data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'],
        'Hour': ['00', '01', '02', '03', '04'],
        'Location': ['Street', 'Street', 'Street', 'Street', 'Street'],
        'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
# Count the readings strictly greater than 150
count = (NO2_Level > 150).sum()
count

Expected output

From this, I'm trying to output a single line for each dataframe, in the format Location, year, count (of the condition):

Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158

So the above example, which has no reading above 150, would produce

Street, 2016, 0

Actual output

Every location in a given year gets the same count, and for some years (2014) the count doesn't seem to work at all, even though on inspection there should be a positive number:

Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
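
The loop itself is not shown, so the cause cannot be pinned down from the post alone. For comparison, here is a minimal sketch of a loop that computes the count from the dataframe belonging to the current iteration and builds one output line per dataframe. The `frames` dictionary, its keys, and its values are hypothetical stand-ins for the loaded CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the per-location, per-year CSV files.
frames = {
    ('Kirkstall Road', 2013): pd.DataFrame({'NO2_Level': [18, 160, 151]}),
    ('Corn Exchange', 2013): pd.DataFrame({'NO2_Level': [39, 129, 200]}),
    ('Temple Newsam', 2014): pd.DataFrame({'NO2_Level': [76, 40, 12]}),
}

lines = []
for (location, year), frame in frames.items():
    # Compute the count from this iteration's frame, not from a leftover variable.
    count = int((frame['NO2_Level'] > 150).sum())
    lines.append(f'{location},{year},{count}')

print('\n'.join(lines))
```

If a loop like this gives distinct counts on small hand-made frames but identical counts on the real files, the loading or cleaning step would be the next place to look.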

Here is a solution using a randomly generated sample:

import numpy as np
import pandas as pd

def random_dates(start, end, n):
    # Timestamp.value is in nanoseconds; convert to whole seconds
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']

df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})

# Keep only the year of each date
df['Date'] = df['Date'].dt.strftime("%Y")

print(df)

# For each location and year, count the readings greater than 150
counts = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(counts)

Example df generated:

        Date  Location  NOE_level
0       2018      town        191
1       2017  campaign        187
2       2017      town        137
3       2016    avenue        148
4       2017  campaign        195
5       2018      town        181
6       2018      road        187
7       2018      town        184
8       2016      town        155
9       2016    street        183
10      2018      road        136
11      2017      road        171
12      2018    street        165
13      2015    avenue        193
14      2016  campaign        170
15      2016    street        132
16      2016  campaign        165
17      2015      road        161
18      2018      road        161
19      2015      road        140 

Output:

    Location       Date  count
0     avenue       2015      1
1     avenue       2016      0
2   campaign       2016      2
3   campaign       2017      2
4       road       2015      1
5       road       2017      1
6       road       2018      2
7     street       2016      1
8     street       2018      1
9       town       2016      1
10      town       2017      0
11      town       2018      3

Hopefully this helps.

import pandas as pd

ddict = {
    'Date':['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02',],
    'Hour':['00','01','02','03','04','02'],
    'Location':['Street','Street','Street','Street','Street','Street',],
    'N02_Level':[19,39,129,76,40, 151],
}

df = pd.DataFrame(ddict)

# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Make a Year column
df['Year'] = df['Date'].dt.strftime('%Y')

# Keep rows with N02_Level > 150, then group by location and year and count the rows
df1 = df[df['N02_Level'] > 150].groupby(['Location', 'Year']).size().reset_index(name='Count')

# Iterate over the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

# Alternative without f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))

Sample data:

        Date Hour Location  N02_Level
0 2016-01-01   00   Street         19
1 2016-01-01   01   Street         39
2 2016-01-01   02   Street        129
3 2016-01-01   03   Street         76
4 2016-01-01   04   Street         40
5 2016-01-02   02   Street        151

Output:

Street,2016,1
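
One thing to be aware of: filtering before `groupby` drops any location and year with no readings above 150, so those groups produce no output line at all. If a line with a count of 0 is wanted, one possible variant is to group the boolean mask instead. The data below is made up for illustration, and `'Avenue'` is a hypothetical second location.

```python
import pandas as pd

# Made-up sample; 'Avenue' is a hypothetical location with no readings above 150.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-01', '2016-01-02']),
    'Location': ['Street', 'Street', 'Avenue', 'Avenue'],
    'N02_Level': [19, 151, 40, 76],
})
df['Year'] = df['Date'].dt.year

# Group the boolean mask itself, so groups with no True values still appear with 0.
counts = (
    (df['N02_Level'] > 150)
    .groupby([df['Location'], df['Year']])
    .sum()
    .reset_index(name='Count')
)

for row in counts.itertuples(index=False):
    print(f'{row.Location},{row.Year},{row.Count}')
```

Here the `Avenue` group is kept with a count of 0 instead of disappearing from the result.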
