How to split up two CSV file columns in a new column, showing matches in pandas, using dataframes?

I am trying to clean up a CSV data set before I use it to make a couple of Dash graphs.

One of the columns is UNITMEASURENAME and includes:

Thousand Barrels per day (kb/d)
Thousand Kilolitres (kl)
Thousand Barrels per day (kb/d)
Thousand Kilolitres (kl)
Conversion factor barrels/ktons
Conversion factor barrels/ktons
Thousand Barrels (kbbl)

Another column contains the value for each of the corresponding rows.

There is also a country column and a date column.

What I need to do is split up UNITMEASURENAME into separate columns, taking the values from the column with the numbers.

Would df.pivot_table work?

I have done the following in pandas, but I don't think it will work within Dash for a Plotly graph:

TK = df.loc[df['UNITMEASURENAME']=='Thousand Kilolitres (kl)']

IN = df.loc[df['COUNTRYNAME']=='INDIA']

This isn't making a new column in the actual CSV file.

I want new columns, and then I will save the actual CSV file with them.

{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
 'Year': {0: 2018, 1: 2018, 2: 2018, 3: 2018, 4: 2018},
 'Month': {0: 3, 1: 3, 2: 3, 3: 4, 4: 4},
 'OBSVALUE': {0: 7323.0, 1: 9907.0, 2: 48827.7847, 3: 9868.0, 4: 47066.6794},
 'COUNTRYNAME': {0: 'SAUDI ARABIA',
  1: 'SAUDI ARABIA',
  2: 'SAUDI ARABIA',
  3: 'SAUDI ARABIA',
  4: 'SAUDI ARABIA'},
 'UNITMEASURENAME': {0: 'Conversion factor barrels/ktons',
  1: 'Thousand Barrels per day (kb/d)',
  2: 'Thousand Kilolitres (kl)',
  3: 'Thousand Barrels per day (kb/d)',
  4: 'Thousand Kilolitres (kl)'},
 'alternate_date': {0: '2018-03-01',
  1: '2018-03-01',
  2: '2018-03-01',
  3: '2018-04-01',
  4: '2018-04-01'}}

Header for CSV file:

   Unnamed: 0  Year  Month    OBSVALUE   COUNTRYNAME                  UNITMEASURENAME  alternate_date
0           0  2018      3   7323.0000  SAUDI ARABIA  Conversion factor barrels/ktons      2018-03-01
1           1  2018      3   9907.0000  SAUDI ARABIA  Thousand Barrels per day (kb/d)      2018-03-01
2           2  2018      3  48827.7847  SAUDI ARABIA         Thousand Kilolitres (kl)      2018-03-01
3           3  2018      4   9868.0000  SAUDI ARABIA  Thousand Barrels per day (kb/d)      2018-04-01
4           4  2018      4  47066.6794  SAUDI ARABIA         Thousand Kilolitres (kl)      2018-04-01

It seems that you have a multi-column key (year, month, country name, and maybe alternate_date), which is fine, but it would make pivoting difficult and error-prone. So I will simply give you some code to create new columns based on the values in that one column.
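As a rough illustration of why the key matters: DataFrame.pivot raises a ValueError when the same index/column combination occurs more than once. The sketch below checks for such repeats before reshaping. The sample rows are copied from the question; the choice of key columns and the names `key`, `dupes` and `has_duplicates` are assumptions made for this example.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2018, 2018],
    "Month": [3, 3, 3, 4, 4],
    "OBSVALUE": [7323.0, 9907.0, 48827.7847, 9868.0, 47066.6794],
    "COUNTRYNAME": ["SAUDI ARABIA"] * 5,
    "UNITMEASURENAME": [
        "Conversion factor barrels/ktons",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
    ],
})

# Assumed key: one observation per country, year, month and unit
key = ["COUNTRYNAME", "Year", "Month", "UNITMEASURENAME"]

# keep=False marks every member of a repeated group, not just the later ones
dupes = df[df.duplicated(subset=key, keep=False)]
has_duplicates = not dupes.empty
```

If `has_duplicates` is true for the full file, the repeated rows in `dupes` would need to be aggregated or dropped before a plain pivot can work.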

First, I like to copy the dataframe so that I don't lose my original data:

dfc = df.copy()

Now, let's get a unique list of all the values of that column:

import numpy as np

vals = dfc['UNITMEASURENAME'].values
vals = np.unique(vals)  # sorted array of the distinct unit names

Now let's create a new column for each of the values:

for val in vals:
    # copy OBSVALUE into the new column only on rows whose unit matches val
    dfc[val] = dfc.apply(lambda x: x['OBSVALUE'] if x['UNITMEASURENAME'] == val else None, axis=1)

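If lambda functions are too confusing:

import numpy as np

dfc = df.copy()
vals = dfc['UNITMEASURENAME'].values
vals = np.unique(vals)

def fun(row, val):
    # return the observation only when the row's unit matches val
    if row['UNITMEASURENAME'] == val:
        return row['OBSVALUE']
    else:
        return None

for val in vals:
    # pass val explicitly instead of relying on the global loop variable
    dfc[val] = dfc.apply(fun, axis=1, args=(val,))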

I tested this code.
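The row-by-row apply above can be slow on large files. A vectorized variant of the same idea, offered as a sketch rather than as part of the original answer, uses Series.where to keep OBSVALUE only on matching rows. The sample rows are copied from the question, and non-matching rows end up as NaN rather than None.

```python
import pandas as pd

# Sample rows copied from the question (only the two columns needed here)
df = pd.DataFrame({
    "OBSVALUE": [7323.0, 9907.0, 48827.7847, 9868.0, 47066.6794],
    "UNITMEASURENAME": [
        "Conversion factor barrels/ktons",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
    ],
})

dfc = df.copy()
for val in dfc["UNITMEASURENAME"].unique():
    # Keep OBSVALUE where the unit matches val; all other rows become NaN
    dfc[val] = dfc["OBSVALUE"].where(dfc["UNITMEASURENAME"] == val)
```

The result has one extra column per distinct unit, with the same number of rows as the input.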

I think you could use the pivot method of a pandas DataFrame to create new columns from the categorical values.

df = ... # your dataframe

# We keep 'Unnamed: 0' column as index for later when we merge df and df2
df2 = df.pivot(index='Unnamed: 0', columns='UNITMEASURENAME', values=['OBSVALUE'])

# df2 is a MultiIndex dataframe.. So we access the level needed and then reset_index
df2 = df2['OBSVALUE'].reset_index()

Now you can merge this with the original dataframe to keep the other columns for your analysis:

final_df = pd.merge(df, df2, on='Unnamed: 0')
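On the question of whether df.pivot_table would work: it can, if the multi-column key is used as the index, which gives one row per country and date instead of one row per original observation. The following is a minimal sketch under assumptions: the key columns in `key`, the use of aggfunc="first", and the output file name are choices made for this example, and the sample rows are copied from the question.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2018, 2018],
    "Month": [3, 3, 3, 4, 4],
    "OBSVALUE": [7323.0, 9907.0, 48827.7847, 9868.0, 47066.6794],
    "COUNTRYNAME": ["SAUDI ARABIA"] * 5,
    "UNITMEASURENAME": [
        "Conversion factor barrels/ktons",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
        "Thousand Barrels per day (kb/d)",
        "Thousand Kilolitres (kl)",
    ],
    "alternate_date": ["2018-03-01"] * 3 + ["2018-04-01"] * 2,
})

# Assumed key: the columns that identify one country/date combination
key = ["COUNTRYNAME", "Year", "Month", "alternate_date"]

wide = (
    df.pivot_table(
        index=key,
        columns="UNITMEASURENAME",
        values="OBSVALUE",
        aggfunc="first",  # assumption: take the first value if a combination repeats
    )
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'UNITMEASURENAME' axis label

# Hypothetical output path; uncomment to write the reshaped data to disk
# wide.to_csv("wide_units.csv", index=False)
```

Here `wide` has one column per unit of measure, and units that are missing for a given month show up as NaN.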
