Why is df.reset_index() adding 5 zeros after a decimal in my dataframe?

Question

I am an MPH Epidemiology student in an introductory data science class with just about NO programming experience. I have loaded a JSON file in PyCharm and converted a column to a dataframe of counts using

pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())

Then reset the index using

pub_num = pub_num.reset_index()

After resetting the index, it took the whole numbers that were in my dataframe and added 5 zeros after a decimal point. Now I'm trying to plot the dataframe, and I can't plot it correctly because it's not recognizing the values as whole numbers.

Why is it adding zeroes and how do I get rid of them? It shows up fine in my console, with no zeroes. But when I look in the environment and click 'View as DataFrame' in the bottom right corner, I can see all the zeroes.

(Screenshot: the console with no zeroes and the dataframe view with zeroes.)

I've tried changing back to int using df.astype(int) and setting the display precision to 0, but neither has worked.

import json
import pandas as pd
import matplotlib.pyplot as plt

# open the json file and load it into a Python object
# (not used below; pd.read_json reads the file again)
with open('Papers.json') as file:
    data = json.load(file)

# convert to pandas dataframe.
papers = pd.read_json('Papers.json')

# creates a dataframe to count the number of publications in each year
pub_num = pd.DataFrame(papers['Publication_Year'].value_counts())
pub_num = pub_num.reset_index()
pub_num.columns = ['Publication_Year', 'Counts']
print(pub_num)

The output of the df is:

   Publication_Year  Counts
0              2010      10
1              2009       5

My code for the plot is this:

plt.scatter(x = 'Publication_Year', y = 'Counts', data = pub_num)
plt.xlabel('Publication Year')
plt.ticklabel_format(useOffset=False)
plt.show()

(Image: plot using plt.ticklabel_format(useOffset=False))

(Image: plot without the plt.ticklabel_format call)

UPDATE: So I took the suggestion of transforming to date time using:

pub_num['Publication_Year'] = pd.to_datetime(pub_num['Publication_Year'],format='%Y')

This is the graph that came out:

(Image: graph using the conversion to years instead of integers)

It's still adding extra numbers after the year, which is why I honestly believe the cause is the zeroes after the decimals in my df, as shown in the first picture.

Answer 1

This has nothing to do with zeroes in your data frame.

In your first output, you have only two rows.

   Publication_Year  Counts
0              2010      10
1              2009       5

In plotting terms, you'll have two ordered pairs: (2009, 5) and (2010, 10). This means you'll have two points in your graph.

That is exactly what the plot you linked shows. The x values 2009 and 2010 are integers, but matplotlib (not pandas) chooses where to put the ticks, and by default it spreads evenly spaced ticks across the axis range for readability. That is where the fractional values between 2009 and 2010 come from. These tick values don't say anything about your data; they are just part of the x axis. You can change them with plt.xticks (or ax.set_xticks and ax.set_xticklabels).
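One way to check this for yourself is sketched below. It rebuilds the two-row dataframe from the question by hand (an assumption, since Papers.json isn't available here), prints the column dtypes, and then asks matplotlib which tick positions it picked on its own. The Agg backend line is only there so the sketch runs without opening a window; you can drop it in PyCharm.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; optional when running in an IDE
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for the dataframe printed in the question
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})

# Both columns are stored as integers; there are no decimals in the data
print(pub_num.dtypes)

fig, ax = plt.subplots()
ax.scatter(x="Publication_Year", y="Counts", data=pub_num)
fig.canvas.draw()

# Tick positions chosen automatically by matplotlib for the x axis
auto_ticks = ax.get_xticks()
print(auto_ticks)
```

The data has only two x values, but the printed tick array has more entries than that, because the ticks are generated from the axis range rather than from the dataframe.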

When you convert your values to datetime, your data will look something like this:

  Publication_Year  Counts
0       2010-01-01      10
1       2009-01-01       5

Again, you'll have two points in your plot, and matplotlib will again place ticks between them for readability. Since the range starts in January 2009 and ends in January 2010, you'll get month labels such as March, April and July in between.

Again, this has nothing to do with decimal points.
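If you do want to keep the datetime version, a possible way to get year-only ticks is to set the locator and formatter from matplotlib.dates yourself. This is a sketch under the same assumption as before (a hand-built two-row dataframe); the years are cast to strings before parsing so that format='%Y' applies cleanly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; optional when running in an IDE
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for the dataframe printed in the question
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})
pub_num["Publication_Year"] = pd.to_datetime(
    pub_num["Publication_Year"].astype(str), format="%Y"
)

fig, ax = plt.subplots()
ax.scatter(pub_num["Publication_Year"], pub_num["Counts"])

# One major tick per year, labelled with the four-digit year only
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
fig.canvas.draw()

year_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(year_labels)
```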

If you add plt.xticks([2009, 2010]) just before your plt.show() line, you'll force the plot to have just two ticks: 2009 and 2010. The x axis will then show only those two whole-number labels.
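A minimal sketch of that fix, again assuming a hand-built copy of the two-row dataframe from the question. Instead of typing the years by hand, it takes the tick positions from the Publication_Year column, which is a small variation on the plt.xticks([2009, 2010]) call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; optional when running in an IDE
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for the dataframe printed in the question
pub_num = pd.DataFrame({"Publication_Year": [2010, 2009], "Counts": [10, 5]})

plt.scatter(x="Publication_Year", y="Counts", data=pub_num)
plt.xlabel("Publication Year")
plt.ylabel("Counts")

# Put a tick only at the years that actually occur in the data
years = pub_num["Publication_Year"].sort_values().tolist()
plt.xticks(years)

forced_ticks = [int(tick) for tick in plt.gca().get_xticks()]
print(forced_ticks)
```

In an interactive session, end with plt.show() as in the question to display the figure.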
