How can I get the sum of values in a pandas column that meet certain conditions?
This is the CSV I am working with:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
The function that processes the DataFrame:
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # drop the first row (the header); iloc selects rows by position
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = my_ocan['timespan'].apply(parse)
    # Period parsing on my_ocan['timespan']
    #print(my_ocan['timespan'])
    print(my_ocan.head())
    return my_ocan
#print(my_ocan.head())
oci ... author_sc
1 020010107073600090601000000060105060106040500-... ... no
2 02007050504361421181514370302080202-0200101000... ... no
3 0200101040536030109070002063703010907000500-02... ... no
4 020010009033611181136111133000507-020010100083... ... no
5 0200100000736090708630363030109630608020004630... ... no
[5 rows x 7 columns]
#print(my_ocan.info())
RangeIndex: 213 entries, 1 to 213
Data columns (total 7 columns):
oci 213 non-null object
citing 213 non-null object
cited 213 non-null object
creation 213 non-null datetime64[ns]
timespan 213 non-null int64
journal_sc 213 non-null object
author_sc 213 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
print(my_ocan['creation'].head())
print(my_ocan['timespan'].head())
1 2016-07-10
2 2018-03-01
3 2018-01-01
4 2017-06-13
5 2017-01-01
Name: creation, dtype: datetime64[ns]
1 486
2 1080
3 730
4 824
5 365
Name: timespan, dtype: int64
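I am writing a function that returns a two-item tuple: the number of documents created in a given year, and the average "timespan" of the documents created in that year.
def do_get_citations_per_year(data, year):
    result = tuple()
    y = ocinumber(year)
    n = time(year)
    result = (y, n)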
I managed to get the total number of documents by using .loc:
def ocinumber(year):
    result = tuple()
    my_ocan['creation'] = pd.DatetimeIndex(my_ocan['creation']).year
    lenta = len(my_ocan.loc[my_ocan['creation'] == year, 'creation'])
    return lenta
# e.g. running with 2015 returns 99
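Unfortunately, when I use the same .loc approach with a different condition, it does not return any result. The idea is to take the sum of all values in ['timespan'] whose ['creation'] matches the input year.
def time(year):
    my_ocan['creation'] = pd.DatetimeIndex(my_ocan['creation']).year
    t = my_ocan.loc[my_ocan['creation'] == year, 'timespan'].sum()
    return t
# returns 0 when running with 2015, and with every other year
How can I get the sum of all values in ['timespan'] for documents created in a specific year?
Thanks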
I think what you are trying to do is the following:
# Get the number of citations in a year
len(my_ocan[my_ocan["creation"].dt.year==2015].index)
# Get the total timespan in a year
my_ocan[my_ocan["creation"].dt.year==2015]["timespan"].sum()
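One plausible explanation for the 0 in the question, assuming ocinumber() ran before time() on the same global my_ocan: each function overwrites the creation column with pd.DatetimeIndex(...).year. The first call turns the datetimes into integer years; on the second call those integers are read as nanoseconds since 1970-01-01, so every year becomes 1970 and no row matches. The small frame below is hypothetical sample data, not the real CSV.

```python
import pandas as pd

# Hypothetical sample data standing in for my_ocan.
df = pd.DataFrame({
    "creation": pd.to_datetime(["2015-03-01", "2015-07-10", "2016-01-01"]),
    "timespan": [365, 730, 100],
})

# First overwrite (as in ocinumber): datetimes are replaced by integer years.
df["creation"] = pd.DatetimeIndex(df["creation"]).year
first_pass = df["creation"].tolist()

# Second overwrite (as in time): the integers are now interpreted as
# nanoseconds since the epoch, so every extracted year is 1970.
df["creation"] = pd.DatetimeIndex(df["creation"]).year
second_pass = df["creation"].tolist()

# No row has creation == 2015 any more, so the sum over an empty selection is 0.
total = df.loc[df["creation"] == 2015, "timespan"].sum()
print(first_pass, second_pass, total)
```

Comparing against .dt.year without assigning back to the column, as in the two lines above, avoids this because the original datetimes are never modified.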
The basic logic for filtering a DataFrame in pandas is as follows:
# 1. Establish Filter Logic
my_ocan["timespan"] == 365
# Returns
1 False
2 False
3 False
4 False
5 True
#Use this result as a filter
my_ocan[my_ocan["timespan"] == 365]
# This will only return the corresponding rows where the filter returned True, thus for your example data set you'll get a single line of data
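As a minimal, self-contained sketch of the same filtering steps, the snippet below rebuilds the five rows printed in the question (the frame name df and the variable names are illustrative assumptions) and stores the boolean mask once so it can be reused for counting and for selecting a column with .loc.

```python
import pandas as pd

# Hypothetical stand-in for my_ocan, using the five rows shown in the question.
df = pd.DataFrame(
    {
        "creation": pd.to_datetime(
            ["2016-07-10", "2018-03-01", "2018-01-01", "2017-06-13", "2017-01-01"]
        ),
        "timespan": [486, 1080, 730, 824, 365],
    },
    index=range(1, 6),
)

mask = df["timespan"] == 365          # boolean Series, one value per row
matching_rows = df[mask]              # keeps only the rows where mask is True
n_matches = int(mask.sum())           # True counts as 1, so this is the row count
timespans = df.loc[mask, "timespan"]  # filter rows and pick a column in one step
```

Using .loc[mask, column] selects rows and column in a single indexing operation, which is generally preferred over chaining two sets of brackets.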
For pandas columns with a datetime dtype, you can use the dt accessor to reach many datetime functions; see here: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#dt-accessor
An example of how to use the dt accessor:
my_ocan["creation"].head()
# Returns
1 2016-07-10
2 2018-03-01
3 2018-01-01
4 2017-06-13
5 2017-01-01
# But using the dt accessor we can quickly get the year
my_ocan["creation"].dt.year.head()
# Returns
1 2016
2 2018
3 2018
4 2017
5 2017
Putting all of this together to create your tuple function:
def get_citations_per_year(df, year):
    # Use the df and year arguments rather than the global frame and a fixed year
    in_year = df["creation"].dt.year == year
    citation_count = len(df[in_year].index)
    timespan_sum = df[in_year]["timespan"].sum()
    return (citation_count, timespan_sum)
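The function above returns the sum of the timespans, while the question asks for the average. A hedged variant that returns the mean instead might look like the sketch below. The name citations_per_year_mean and the sample frame are hypothetical, and returning None when no rows match is just one possible design choice.

```python
import pandas as pd


def citations_per_year_mean(df, year):
    """Return (count, mean timespan) for the rows created in `year`."""
    in_year = df["creation"].dt.year == year
    count = int(in_year.sum())
    # mean() of an empty selection is NaN, so handle the no-match case explicitly.
    mean_timespan = float(df.loc[in_year, "timespan"].mean()) if count else None
    return count, mean_timespan


# Hypothetical sample built from the five rows printed in the question.
sample = pd.DataFrame({
    "creation": pd.to_datetime(
        ["2016-07-10", "2018-03-01", "2018-01-01", "2017-06-13", "2017-01-01"]
    ),
    "timespan": [486, 1080, 730, 824, 365],
})

result_2018 = citations_per_year_mean(sample, 2018)  # two rows: 1080 and 730
result_2015 = citations_per_year_mean(sample, 2015)  # no rows in the sample
print(result_2018, result_2015)
```

Because the function only reads from df and never assigns back to the creation column, it can be called repeatedly for different years.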