Cleaning up my data using Pandas in Python

Made my first ever API call to get 1 row and 7 columns of this (it's the only available report export call to get campaign metrics data by date):

Below is the first row, so cell A1 in Excel.

{'other': False, 'total': {'impressions': 346821, 'taps': 12167, 'installs': 7535, 'newDownloads': 5364, 'redownloads': 2171, 'latOnInstalls': 1878, 'latOffInstalls': 5657, 'ttr': 0.0351, 'avgCPA': {'amount': '1.8', 'currency': 'GBP'}, 'avgCPT': {'amount': '1.1147', 'currency': 'GBP'}, 'localSpend': {'amount': '123.456', 'currency': 'GBP'}, 'conversionRate': 0.6193}, 'metadata': {'campaignId': 219752776, 'campaignName': 'Campaign1', 'deleted': False}}

What I'd like to do is delete everything apart from the localSpend amount (123.456) in each row, then sum it up.

I assume I'll need some loops, however, I'm not sure how I'd go about it because the column names are classed as headings and I'm quite new to pandas.

Thanks in advance!

I assume that your API response is a list of dictionaries and, if I understood correctly, you just want to sum all the localSpend values.

Here's the code:

import pandas as pd

data = [
{
  'other': False,
  'total': {
    'impressions': 346821,
    'taps': 12167,
    'installs': 7535,
    'newDownloads': 5364,
    'redownloads': 2171,
    'latOnInstalls': 1878,
    'latOffInstalls': 5657,
    'ttr': 0.0351,
    'avgCPA': {
      'amount': '1.8',
      'currency': 'GBP'
    },
    'avgCPT': {
      'amount': '1.1147',
      'currency': 'GBP'
    },
    'localSpend': {
      'amount': '123.456',
      'currency': 'GBP'
    },
    'conversionRate': 0.6193
  },
  'metadata': {
    'campaignId': 219752776,
    'campaignName': 'Campaign1',
    'deleted': False
  }
},
{
  'other': False,
  'total': {
    'impressions': 346821,
    'taps': 12167,
    'installs': 7535,
    'newDownloads': 5364,
    'redownloads': 2171,
    'latOnInstalls': 1878,
    'latOffInstalls': 5657,
    'ttr': 0.0351,
    'avgCPA': {
      'amount': '1.8',
      'currency': 'GBP'
    },
    'avgCPT': {
      'amount': '1.1147',
      'currency': 'GBP'
    },
    'localSpend': {
      'amount': '123.456',
      'currency': 'GBP'
    },
    'conversionRate': 0.6193
  },
  'metadata': {
    'campaignId': 219752776,
    'campaignName': 'Campaign1',
    'deleted': False
  }
}
]

# Creating empty list for just localSpend values
rows = list()

for row in data:
    rows.append(row['total']['localSpend'])

# loading list into dataframe
df = pd.DataFrame(rows)

# Converting column type to float
df['amount'] = pd.to_numeric(df["amount"], downcast="float")

# Summing the whole column
print("Total result is:", df['amount'].sum())
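
If the total is all that is needed, the same loop can be collapsed into a single expression without a dataframe. This is only a sketch: the data below is the sample from above trimmed to the keys it touches, and the name total_spend is made up for illustration.

```python
# Same sample shape as above, trimmed to the keys this sketch needs.
data = [
    {"total": {"localSpend": {"amount": "123.456", "currency": "GBP"}}},
    {"total": {"localSpend": {"amount": "123.456", "currency": "GBP"}}},
]

# Convert each amount string to a float and add them up in one pass.
total_spend = sum(float(row["total"]["localSpend"]["amount"]) for row in data)
print(f"Total result is: {total_spend:.3f}")
```

This assumes every row has a total/localSpend/amount entry; a missing key would raise a KeyError.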

flixoflax's answer should work and is nice and straightforward.

It does a big chunk of the work outside of pandas though; here is a pure pandas solution as an example (with the added bonus of retaining all the data if you decide you do want it after all).

The problem with doing the initial loop outside of pandas is that, while it works for the sample data, if the source data is in Excel then pandas can (and IMO should) be used to load the data. In that case it makes no sense to load the data in pandas, loop outside of pandas, and then step back into pandas for processing.
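
One caveat, as an assumption about how the export was saved: if each Excel cell holds the dictionary as text, pandas will read it back as a string rather than a dict, so it would need parsing first. A minimal sketch, using a hand-built Series as a stand-in for the column read from Excel (the names raw and records are hypothetical):

```python
import ast

import pandas as pd

# Stand-in for a column read from Excel: each cell holds the dict as text.
raw = pd.Series([
    "{'other': False, 'total': {'localSpend': {'amount': '123.456', 'currency': 'GBP'}}}",
    "{'other': False, 'total': {'localSpend': {'amount': '123.456', 'currency': 'GBP'}}}",
])

# ast.literal_eval safely parses Python literals (dicts, strings, booleans).
records = raw.apply(ast.literal_eval).tolist()
print(type(records[0]))
```

The resulting list of dicts has the same shape as the data list used in the rest of this answer.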

Import and load the data as normal:

import pandas as pd

# Let's assume data is the result of the pandas Excel read
data = [
{
  'other': False,
  'total': {
    'impressions': 346821,
    'taps': 12167,
    'installs': 7535,
    'newDownloads': 5364,
    'redownloads': 2171,
    'latOnInstalls': 1878,
    'latOffInstalls': 5657,
    'ttr': 0.0351,
    'avgCPA': {
      'amount': '1.8',
      'currency': 'GBP'
    },
    'avgCPT': {
      'amount': '1.1147',
      'currency': 'GBP'
    },
    'localSpend': {
      'amount': '123.456',
      'currency': 'GBP'
    },
    'conversionRate': 0.6193
  },
  'metadata': {
    'campaignId': 219752776,
    'campaignName': 'Campaign1',
    'deleted': False
  }
},
{
  'other': False,
  'total': {
    'impressions': 346821,
    'taps': 12167,
    'installs': 7535,
    'newDownloads': 5364,
    'redownloads': 2171,
    'latOnInstalls': 1878,
    'latOffInstalls': 5657,
    'ttr': 0.0351,
    'avgCPA': {
      'amount': '1.8',
      'currency': 'GBP'
    },
    'avgCPT': {
      'amount': '1.1147',
      'currency': 'GBP'
    },
    'localSpend': {
      'amount': '123.456',
      'currency': 'GBP'
    },
    'conversionRate': 0.6193
  },
  'metadata': {
    'campaignId': 219752776,
    'campaignName': 'Campaign1',
    'deleted': False
  }
}
]

Load the data straight into a dataframe (instead of looping and picking out one key value). Note this leads to a dataframe that uses more memory but it means all the values are available and an initial loop is not needed outside of pandas.

Also, on first load some columns will be dictionaries. This will need to be cleaned up.
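
A quick way to see this is to inspect the dtypes and the type of a single cell. The sketch below uses a one-row version of the sample data, trimmed for brevity:

```python
import pandas as pd

# One trimmed row of the sample data.
data = [
    {
        "other": False,
        "total": {"localSpend": {"amount": "123.456", "currency": "GBP"}},
        "metadata": {"campaignId": 219752776},
    },
]
df = pd.DataFrame(data)

# Nested dicts land in object-dtype columns, one whole dict per cell.
print(df.dtypes)
print(type(df.loc[0, "total"]))
```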

The necessary operations can be done step by step or all at once. For the sake of this example, I'll post both sets for comparison.

# Step by step
# Create dataframe
df = pd.DataFrame(data)
# Split out the 'total' column
df2 = df['total'].apply(pd.Series)
# Split out the 'localSpend' column
df3 = df2['localSpend'].apply(pd.Series)
# Merge the three dataframes back together
result = pd.concat([df, df2, df3], axis=1)
print(f"Total result is:{result['amount'].astype('float64').sum()}")

Or a more concise form, with the split and merges occurring together:

df = pd.DataFrame(data)
df = pd.concat([df, df['total'].apply(pd.Series)], axis=1)
df = pd.concat([df, df['localSpend'].apply(pd.Series)], axis=1)
print(f"Total result is:{df['amount'].astype('float64').sum()}")

Note that df['amount'].astype('float64') is needed because the column has been left as its default dtype (object). It would not be needed if you converted the column to a numeric type, as flixoflax did.

df['amount'] = pd.to_numeric(df["amount"], downcast="float")
print(f"Total result is:{df['amount'].sum():.3f}")
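
To illustrate why the conversion matters, here is a small sketch with a standalone Series of amount strings (the names amounts, as_text and as_number are made up for this example). Summing an object column of strings joins them together rather than adding numbers:

```python
import pandas as pd

# Amounts held as strings in an object-dtype Series, as in the 'amount' column.
amounts = pd.Series(["123.456", "123.456"], dtype=object)

# Summing strings joins them end to end instead of adding numbers.
as_text = amounts.sum()

# Converting to float first gives the numeric total.
as_number = amounts.astype("float64").sum()
print(repr(as_text), as_number)
```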

The final version of the 'amount' column can be split off into its own dataframe or Series, and converted to a float at the same time:

amount = pd.to_numeric(df["amount"], downcast="float")
print(f"Total result is:{amount.sum():.3f}")
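
As an alternative to splitting the nested columns by hand, pandas also provides pd.json_normalize, which flattens nested dictionaries in one call. This is a different technique from the apply(pd.Series) approach above, shown only as a sketch on a trimmed copy of the sample data:

```python
import pandas as pd

# Trimmed copy of the sample data.
data = [
    {
        "other": False,
        "total": {"localSpend": {"amount": "123.456", "currency": "GBP"}},
        "metadata": {"campaignId": 219752776},
    },
    {
        "other": False,
        "total": {"localSpend": {"amount": "123.456", "currency": "GBP"}},
        "metadata": {"campaignId": 219752776},
    },
]

# Nested keys become dotted column names such as 'total.localSpend.amount'.
flat = pd.json_normalize(data)

# The amounts are still strings, so convert before summing.
total_spend = pd.to_numeric(flat["total.localSpend.amount"]).sum()
print(f"Total result is:{total_spend:.3f}")
```

All the other values stay available as flat columns, so nothing is lost if they are needed later.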
