How do I get the sum of the values in a pandas column that satisfy a specific condition?

Question

This is the CSV I'm working with:

oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...

The function that processes the DataFrame:

def do_process_citation_data(f_path):
    global my_ocan

    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = my_ocan['timespan'].apply(parse)
    # Period parsing on my_ocan['timespan']
    #print(my_ocan['timespan'])
    print(my_ocan.head())

    return my_ocan
#print(my_ocan.head())
                                                 oci  ... author_sc
1  020010107073600090601000000060105060106040500-...  ...        no
2  02007050504361421181514370302080202-0200101000...  ...        no
3  0200101040536030109070002063703010907000500-02...  ...        no
4  020010009033611181136111133000507-020010100083...  ...        no
5  0200100000736090708630363030109630608020004630...  ...        no

[5 rows x 7 columns]

#print(my_ocan.info())

RangeIndex: 213 entries, 1 to 213
Data columns (total 7 columns):
oci           213 non-null object
citing        213 non-null object
cited         213 non-null object
creation      213 non-null datetime64[ns]
timespan      213 non-null int64
journal_sc    213 non-null object
author_sc     213 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)

print(my_ocan['creation'].head())
print(my_ocan['timespan'].head())

1   2016-07-10
2   2018-03-01
3   2018-01-01
4   2017-06-13
5   2017-01-01
Name: creation, dtype: datetime64[ns]
1     486
2    1080
3     730
4     824
5     365
Name: timespan, dtype: int64

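I'm writing a function that returns a two-item tuple: the number of documents created in a given year, and the average "timespan" of the documents created in that year.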

def do_get_citations_per_year(data, year):
    result = tuple()
    y = ocinumber(year)
    n = time(year)
    result = (y, n)

I managed to get the total number of documents by using .loc:

def ocinumber(year):
    result = tuple()
    my_ocan['creation'] = pd.DatetimeIndex(my_ocan['creation']).year
    lenta = len(my_ocan.loc[my_ocan['creation'] == year, 'creation'])
    return lenta
    #i.e running with 2015 return 99

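Unfortunately, when I use the same .loc approach with a different condition, it doesn't return any result. The idea is to take the sum of all values in ['timespan'] for the rows whose ['creation'] matches the input year.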

def time(year):
    my_ocan['creation'] = pd.DatetimeIndex(my_ocan['creation']).year
    t = my_ocan.loc[my_ocan['creation'] == year, 'timespan'].sum()
    return t
    #returns 0, when running with 2015 and with all the others

How can I get the sum of all the ['timespan'] values for rows created in a specific year?

Thanks

Answer 1

I think what you are trying to do is the following:

# Get the number of citations in a year
len(my_ocan[my_ocan["creation"].dt.year==2015].index)

# Get the total timespan in a year
my_ocan[my_ocan["creation"].dt.year==2015]["timespan"].sum()
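
One likely reason the original time() returned 0: ocinumber() has already overwritten my_ocan['creation'] with integer years, so the second pd.DatetimeIndex(...) call reads those integers as nanoseconds since the epoch, and every row ends up in 1970. The sketch below reproduces that on a small made-up frame (the sample values are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical sample data standing in for my_ocan.
df = pd.DataFrame({
    "creation": pd.to_datetime(["2015-03-01", "2015-07-10", "2016-01-01"]),
    "timespan": [365, 730, 486],
})

# First conversion: datetimes -> integer years (this overwrites the column).
df["creation"] = pd.DatetimeIndex(df["creation"]).year
first_pass = df["creation"].tolist()

# Second conversion: the integers are now treated as nanoseconds since
# 1970-01-01, so the extracted year is 1970 for every row.
second_pass = pd.DatetimeIndex(df["creation"]).year.tolist()

print(first_pass)
print(second_pass)
```

Reading the year with .dt.year, without assigning it back to the column, avoids this because the datetime column is never modified.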

The basic logic for filtering a DataFrame in pandas is as follows:

# 1. Establish Filter Logic
my_ocan["timespan"] == 365
# Returns
 1  False
 2  False
 3  False
 4  False
 5  True

#Use this result as a filter
my_ocan[my_ocan["timespan"] == 365]

# This will only return the corresponding rows where the filter returned True, thus for your example data set you'll get a single line of data
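
The same filtering idea can be sketched as a self-contained snippet. The frame below is rebuilt from the five timespan values printed in the question; the > 700 threshold is only an illustration:

```python
import pandas as pd

# The five timespan values shown in the question, with the same 1-based index.
df = pd.DataFrame({"timespan": [486, 1080, 730, 824, 365]}, index=[1, 2, 3, 4, 5])

# A comparison yields a boolean Series with one entry per row.
mask = df["timespan"] == 365

# Indexing with the mask keeps only the rows where it is True.
matching_rows = df[mask]

# .loc takes a row mask and a column label, so filtering and summing fit in one line.
total = df.loc[df["timespan"] > 700, "timespan"].sum()

print(matching_rows)
print(total)
```

Using .loc[mask, column] selects the rows and the column in a single step, which avoids chaining two indexing operations.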

For pandas columns with a datetime type, you can use the dt accessor to reach many datetime functions; see here: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dt-accessor

An example of how to use the dt accessor:

my_ocan["creation"].head()
# Returns
1   2016-07-10
2   2018-03-01
3   2018-01-01
4   2017-06-13
5   2017-01-01

# But using the dt accessor we can quickly get the year
my_ocan["creation"].dt.year.head()
# Returns
1   2016
2   2018
3   2018
4   2017
5   2017
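
The year Series produced by the dt accessor can also serve as a grouping key, which gives the count, sum, and mean for every year in one pass. This is a sketch on the five sample rows printed in the question, not something the original answer proposed:

```python
import pandas as pd

# The five creation/timespan pairs shown in the question.
df = pd.DataFrame({
    "creation": pd.to_datetime(
        ["2016-07-10", "2018-03-01", "2018-01-01", "2017-06-13", "2017-01-01"]
    ),
    "timespan": [486, 1080, 730, 824, 365],
})

# Group rows by creation year and aggregate the timespan column.
per_year = df.groupby(df["creation"].dt.year)["timespan"].agg(["count", "sum", "mean"])

print(per_year)
```

A single year can then be looked up with per_year.loc[2018], which may be more convenient than filtering repeatedly when several years are needed.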

Putting all of this together to create your tuple function:

def get_citations_per_year(df, year):
    # Build the filter from the arguments rather than the global frame and a hard-coded 2015
    in_year = df["creation"].dt.year == year
    citation_count = len(df[in_year].index)
    timespan_sum = df[in_year]["timespan"].sum()
    return (citation_count, timespan_sum)
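
The question asks for the average timespan rather than the sum. A minimal sketch of that variant follows, assuming the creation column is still datetime64; the name citations_per_year and the sample frame are hypothetical:

```python
import pandas as pd

def citations_per_year(df, year):
    """Return (number of rows created in `year`, mean timespan of those rows)."""
    in_year = df["creation"].dt.year == year
    count = int(in_year.sum())  # True counts as 1, so the sum is the row count
    mean_timespan = df.loc[in_year, "timespan"].mean()  # NaN when no rows match
    return count, mean_timespan

# Hypothetical sample built from the rows printed in the question.
sample = pd.DataFrame({
    "creation": pd.to_datetime(
        ["2016-07-10", "2018-03-01", "2018-01-01", "2017-06-13", "2017-01-01"]
    ),
    "timespan": [486, 1080, 730, 824, 365],
})

result = citations_per_year(sample, 2018)
print(result)
```

Passing the frame in as an argument, instead of relying on the global my_ocan, keeps the function free of side effects, so it can be called for several years in any order.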

Answer 1
1 accepted 2020-05-22 09:03:59
