Using Pandas, how would I add an existing dataframe row to another dataframe row with existing values?
For context, I have a dataset made up of the USA's states and territories. I have made a new dataframe with only the 50 states (excluding territories); let's call it States_Only. This is complete. However, the first dataset (let's call it USA_ALL) had both NY and NYC as independent rows, meaning that the values attributed to NY do not already include NYC's recorded data. Because both dataframes originate from the same dataset, the columns match. All values are either NaN/NULL or integers.
For my States_Only data to be complete, the NYC values from USA_ALL need to be added to NY in the States_Only dataframe. How can I achieve this? For clarity, I do not want to append NYC, nor can I use groupby(), because nothing on the software side ties these two rows together (such as a shared identifier); there is only the knowledge that NYC is within NY.
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

if __name__ == '__main__':
    # data prep
    data_path = './assets/'
    out_path = './output'

    # fetch the map data from the CDC AJAX endpoint (the response is JSON)
    endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
    data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()

    # convert to df and export raw data as csv
    df = pd.DataFrame(data["US_MAP_DATA"])
    path = os.path.join(out_path, 'Raw_CDC_Data.csv')
    df.to_csv(path)

    # remove last data point (Total USA)
    df.drop(df.tail(1).index, inplace=True)

    # create DF of just the 50 states (the list below also contains DC)
    state_abbr = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
                  "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
                  "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
                  "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
                  "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
    # .copy() so later edits to states do not trigger SettingWithCopyWarning
    states = df[df['abbr'].isin(state_abbr)].copy()

    # Add NYC from df to NY's existing values (sum of each column) to states
Here is an Excel spreadsheet showing the expected final values in the States_Only dataset. It is included because this data would be hard to read if formatted directly on this forum: Expected Values
While this isn't super clean, it will do the trick:
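import pandas as pd
import requests

endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()
df = pd.DataFrame(data["US_MAP_DATA"])

# drop last row (the USA total)
df = df[:-1]

# split the NY and NYC rows off from the rest of the frame
ny_rows_mask = df["abbr"].isin(["NY", "NYC"])
ny_rows = df.loc[ny_rows_mask]
df = df.loc[~ny_rows_mask]

# sum the two rows column by column, then restore the identifier fields
new_row = ny_rows.sum()
new_row["abbr"] = "NY"
new_row["id"] = 36
new_row["fips"] = 36
new_row["name"] = "New York"

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# infer_objects() restores numeric dtypes after the object-typed row is added.
df = pd.concat([df, new_row.to_frame().T], ignore_index=True).infer_objects()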
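If you would rather update the NY row in place than rebuild it, a variation on the same idea could look like the sketch below. It runs on a small made-up frame rather than the live CDC feed: the column names `tot_cases` and `tot_death` and all the numbers are assumptions for illustration, not the real schema. Only numeric columns are summed, and `Series.add` with `fill_value=0` keeps a NaN on one side from wiping out the value on the other.

```python
import numpy as np
import pandas as pd

# Toy stand-in for USA_ALL; column names and numbers are made up for illustration.
usa_all = pd.DataFrame({
    "abbr": ["NJ", "NY", "NYC", "PR"],
    "name": ["New Jersey", "New York", "New York City", "Puerto Rico"],
    "tot_cases": [100, 200, 50, 10],
    "tot_death": [5.0, np.nan, 3.0, 1.0],
})

# Toy stand-in for States_Only: states only, NYC and territories excluded.
states_only = usa_all[usa_all["abbr"].isin(["NJ", "NY"])].copy()

# Only numeric columns are summed; identifier columns such as name are left alone.
num_cols = states_only.select_dtypes("number").columns

ny_mask = states_only["abbr"] == "NY"
ny = states_only.loc[ny_mask, num_cols].iloc[0]
nyc = usa_all.loc[usa_all["abbr"] == "NYC", num_cols].iloc[0]

# fill_value=0 treats a NaN on one side as 0; if both sides are NaN the result stays NaN.
combined = ny.add(nyc, fill_value=0)

# Write the summed values back into the existing NY row, one column at a time.
for col in num_cols:
    states_only.loc[ny_mask, col] = combined[col]

print(states_only)  # NY now holds tot_cases 250 and tot_death 3
```

Because the row is updated in place, the row order and the index of States_Only stay as they were, and the NYC row never has to enter the frame.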
As an aside, if you haven't already, you should examine some of the data types that Pandas infers from the CSV. The id column probably shouldn't be a number type, for example.
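To illustrate the dtype point, here is a minimal sketch using made-up CSV text rather than the real export; the columns and values are assumptions chosen to show the effect. It compares the types Pandas infers by default with an explicit `dtype` mapping passed to `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up CSV text standing in for an exported file; the values are illustrative only.
csv_text = "id,fips,abbr,tot_cases\n01,01,AL,100\n36,36,NY,250\n"

# Default behaviour: id and fips are inferred as integers, so "01" becomes 1.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Passing dtype keeps identifier columns as text and preserves leading zeros.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"id": str, "fips": str})
print(explicit["id"].tolist())
```

Keeping identifiers as strings also means they can never be summed by accident when rows are combined.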