
Why is df.reset_index() adding 5 zeros after a decimal in my dataframe?

I am an MPH Epidemiology student in an introductory data science class, with just about no programming experience. I loaded a JSON file in PyCharm and converted it to a dataframe using

pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())  

Then I reset the index using

pub_num = pub_num.reset_index()

After resetting the index, the whole numbers in my dataframe had 5 zeros added after a decimal point. Now I'm trying to plot the dataframe, and I can't plot it correctly because the values aren't being recognized as whole numbers.

Why is it adding zeroes, and how do I get rid of them? It shows up fine in my console, with no zeros. But when I look in the environment and click 'view as dataframe' in the bottom right corner, I can see all the zeroes. (Screenshot showing the console with no zeroes and the dataframe with zeroes.)

I've tried changing back to int using df.astype(int) and changing the precision to 0, but neither has worked.

import json
import pandas as pd
import matplotlib.pyplot as plt

# open the json file and load it as a Python object
with open('Papers.json') as file:
    data = json.load(file)

# convert to pandas dataframe.
papers = pd.read_json('Papers.json')

# creates a dataframe to count the number of publications in each year
pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())
pub_num = pub_num.reset_index()
pub_num.columns = ['Publication_Year', 'Counts']
print(pub_num)

The output of the df is:

       Publication_Year  Counts
0              2010      10
1              2009       5

My code for the plot is this:

plt.scatter(x = 'Publication_Year', y = 'Counts', data = pub_num)
plt.xlabel('Publication Year')
plt.ticklabel_format(useOffset=False)
plt.show()

(Plot using plt.ticklabel_format(useOffset=False))

(Plot if I don't use the plt.ticklabel_format function)

UPDATE: So I took the suggestion of converting to datetime using:

pub_num['Publication_Year'] = pd.to_datetime(pub_num['Publication_Year'],format='%Y')

This is the graph that came out: (Graph using the conversion to years instead of integers.) It's still adding extra numbers after the year, which is why I honestly believe it's because there are zeroes after the decimals in my df, as shown in the first picture.

This has nothing to do with zeroes in your data frame.
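One way to confirm this is to look at the column dtypes rather than at how the IDE viewer renders the values. The sketch below is only illustrative: the small `papers` frame is a hypothetical stand-in for Papers.json, and the `rename_axis(...).reset_index(name=...)` chain is just one way to get the same two columns without relying on how a given pandas version names the `value_counts()` result.

```python
import pandas as pd

# Hypothetical stand-in for the question's data (Papers.json is not available here).
papers = pd.DataFrame({"Publication_Year": [2010] * 10 + [2009] * 5})

# Count publications per year and turn the result into a two-column frame.
pub_num = (
    papers["Publication_Year"]
    .value_counts()
    .rename_axis("Publication_Year")
    .reset_index(name="Counts")
)

# The dtypes show what is actually stored, independent of display formatting.
print(pub_num.dtypes)
print(pd.api.types.is_integer_dtype(pub_num["Publication_Year"]))
```

If both columns report an integer dtype, the stored values are whole numbers, and any trailing zeros come from how the viewer formats them.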

In your first output, you have only two rows.

       Publication_Year  Counts
0              2010          10
1              2009           5

In plotting terms, you'll have two ordered pairs: (2009, 5) and (2010, 10). This means you'll have two points in your graph.

That's exactly what's being output in the link you provided. Since 2010 and 2009 are integers, matplotlib (not pandas) places evenly spaced ticks along the x axis for readability, and some of them fall between the two years. These values don't mean anything, they are just part of the x axis, but you can change them completely with plt.xticks, or with set_xticks and set_xticklabels on the axes object.
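As a rough illustration of controlling those ticks, matplotlib's `MaxNLocator(integer=True)` asks the axis to place ticks only at whole numbers. This is a sketch rather than the only fix; the two-row `pub_num` frame below is rebuilt by hand from the output shown in the question.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import MaxNLocator

# Two-row frame copied from the output shown in the question.
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})

fig, ax = plt.subplots()
ax.scatter(x="Publication_Year", y="Counts", data=pub_num)

# Only allow tick positions at integer values on the x axis.
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.set_xlabel("Publication Year")
ax.set_ylabel("Counts")
```

Call plt.show() afterwards to display the figure. The data are unchanged; only the tick positions differ.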

When you make your values datetime, your data will look something like this:

     Publication_Year  Counts
0          2010-01-01      10
1          2009-01-01       5

Again, you'll have two points in your data frame. Matplotlib will, again, fill in ticks between these points for readability. Since the beginning is January 2009 and the end is January 2010, you'll get ticks such as March, April, July, etc. in between, just for readability.
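If you do want to keep the datetime version, the month ticks can be replaced with one tick per year. The sketch below assumes the same two-row frame as in the question and uses `YearLocator` and `DateFormatter` from `matplotlib.dates`; the `astype(str)` step is a precaution added here so that the `%Y` format is applied to text.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Two-row frame copied from the output shown in the question.
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})

# Convert the integer years to datetimes (1 January of each year).
pub_num["Publication_Year"] = pd.to_datetime(
    pub_num["Publication_Year"].astype(str), format="%Y"
)

fig, ax = plt.subplots()
ax.scatter(pub_num["Publication_Year"], pub_num["Counts"])

# One major tick per year, labelled with the four-digit year only.
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.set_xlabel("Publication Year")
```

Call plt.show() afterwards to display the figure.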

Again, this has nothing to do with decimal points.

If you add plt.xticks([2009, 2010]) just before your plt.show() line, you'll force the plot to have just two ticks: 2009 and 2010.
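Put together with the plotting code from the question, that could look roughly like the sketch below. The frame is rebuilt by hand from the question's output, and taking the tick positions from the column itself (instead of hard-coding 2009 and 2010) is an assumption made here so the same line keeps working when more years are present.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Two-row frame copied from the output shown in the question.
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})

plt.figure()
plt.scatter(x="Publication_Year", y="Counts", data=pub_num)
plt.xlabel("Publication Year")
plt.ylabel("Counts")

# Use the years that actually occur in the data as the only tick positions.
plt.xticks(sorted(pub_num["Publication_Year"].unique()))
```

Call plt.show() afterwards to display the figure.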
