
Splitting a column in pandas dataframe not dropping na

I am going to preface this question by saying I do not control the format of the data in the csv. Nor do I have direct access to the csv, since I can only pull it from an SFTP server that I do not have direct access to. The API shows the same data format as the csv. Here are the two relevant columns of the incoming csv as loaded into the dataframe.

+-----+-------------------------------+-------------+
|     |  Sourcing Event ID (DTRM ID)  |     Site    |
+-----+-------------------------------+-------------+
| 0   |                         1035  |     ,ABC55, |
| 1   |                         1067  |          ,, |
| 2   |                         1181  |          ,, |
| 3   |                         1183  |          ,, |
| 4   |                         1184  |          ,, |
| 5   |                         1264  |          ,, |
| 6   |                         1307  |      ,DEF2, |
| 7   |                         1354  |          ,, |
| 8   |                         1369  |    ,HIJ150, |
| 9   |                         1372  |     ,DEF64, |
| 10  |                         1373  |      ,KLM9, |
| 11  |                         1374  |      ,DEF1, |
| 12  |                         1381  |          ,, |
| 13  |                         1385  |          ,, |
| 14  |                         1391  |          ,, |
| 15  |                         1394  |          ,, |
| 16  |                         1395  |          ,, |
| 17  |                         1402  |          ,, |
| 18  |                         1404  |          ,, |
| 19  |                         1405  |          ,, |
| 20  |                         1406  |          ,, |
| 21  |                         1408  |          ,, |
| 22  |                         1410  |    ,HIJ116, |
| 23  |                         1412  |          ,, |
+-----+-------------------------------+-------------+

From that I do the following (from a previous SO answer):

import pandas as pd

df_sourcing_events = pd.read_csv(wf['local_filename'])

sourcing_events_melt_col = 'Sourcing Event ID (DTRM ID)'
sourcing_events_site_col = 'Site'
print(df_sourcing_events[[sourcing_events_melt_col, sourcing_events_site_col]])

# Strip the leading and trailing commas from the Site column
df_sourcing_events[sourcing_events_site_col] = df_sourcing_events[sourcing_events_site_col].str.lstrip(',')
df_sourcing_events[sourcing_events_site_col] = df_sourcing_events[sourcing_events_site_col].str.rstrip(',')

# Split Site into one column per value, melt to one row per (event, site),
# then try to drop the rows that have no site
df_sourcing_events_sites = pd.concat([df_sourcing_events[sourcing_events_melt_col], df_sourcing_events[sourcing_events_site_col].str.split(',', expand=True)], axis=1)\
    .melt(id_vars=[sourcing_events_melt_col])\
    .sort_values(by=sourcing_events_melt_col)\
    .rename(columns={'value': sourcing_events_site_col})\
    .drop(columns=['variable'])\
    .dropna()

Now you are asking yourself: why strip the leading and trailing commas?

Well, because I have another file, to do with contracts, that has the exact same layout. I did the exact same thing to it, and the exact same code solved the problem there. I cannot for the life of me understand why the output from my code is the following:

+-----+-------------------------------+-----------+
|     |  Sourcing Event ID (DTRM ID)  |    Site   |
+-----+-------------------------------+-----------+
| 0   |                         1035  |     ABC55 |
| 1   |                         1067  |           |
| 2   |                         1181  |           |
| 3   |                         1183  |           |
| 4   |                         1184  |           |
| 5   |                         1264  |           |
| 6   |                         1307  |      DEF2 |
| 7   |                         1354  |           |
| 8   |                         1369  |    HIJ150 |
| 9   |                         1372  |     DEF64 |
| 10  |                         1373  |      KLM9 |
| 11  |                         1374  |      DEF1 |
| 12  |                         1381  |           |
| 13  |                         1385  |           |
| 14  |                         1391  |           |
| 15  |                         1394  |           |
| 16  |                         1395  |           |
| 17  |                         1402  |           |
| 18  |                         1404  |           |
| 19  |                         1405  |           |
| 20  |                         1406  |           |
| 21  |                         1408  |           |
| 22  |                         1410  |    HIJ116 |
| 23  |                         1412  |           |
+-----+-------------------------------+-----------+

It's like the dropna() is just not working at all. I even copied and pasted the working code from the other contracts csv into this area and simply changed the variables in the code to match this csv, and it still doesn't work. I re-checked to make sure the other code is actually working as well.

I tried .dropna(how='any') to no avail. What else should I do?

EDIT:

Answer to The Zack Man:

No, because after that I am doing the following:

df_sourcing_events_final = df_sourcing_events.drop([sourcing_events_site_col], axis=1)

write_dataframe_to_csv_on_s3(df_sourcing_events_sites, s3_bucket, 'sourcing_events_sites.csv')

write_dataframe_to_csv_on_s3(df_sourcing_events_final, s3_bucket, file_name)

I am splitting out a column that holds a list into individual rows and making a new csv from it to load into a separate table.

The rows are not being dropped because those values are empty strings, not NaN. Try:

df = df_sourcing_events_sites
df = df[df.Site != '']
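If you would rather keep the filter inside the original method chain, one option is to turn the empty strings into real NaN just before dropna(). The following is only a sketch on a hypothetical three-row sample (the row with two sites and the short names id_col, site_col and df_sites are made up for illustration); it also uses melt's value_name instead of the separate rename, which is a small swap and not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data in the question
df_sourcing_events = pd.DataFrame({
    "Sourcing Event ID (DTRM ID)": [1035, 1067, 1307],
    "Site": [",ABC55,", ",,", ",DEF2,DEF3,"],
})

id_col = "Sourcing Event ID (DTRM ID)"
site_col = "Site"

# ",," becomes "" after stripping, and splitting "" still yields ""
sites = df_sourcing_events[site_col].str.strip(",").str.split(",", expand=True)

df_sites = (
    pd.concat([df_sourcing_events[id_col], sites], axis=1)
    .melt(id_vars=[id_col], value_name=site_col)
    .drop(columns=["variable"])
    .replace("", np.nan)          # empty strings become real NaN
    .dropna(subset=[site_col])    # so dropna can now remove them
    .sort_values(by=id_col, kind="stable")
    .reset_index(drop=True)
)

print(df_sites)
```

Event 1067 disappears from the result because its only value was an empty string, while event 1307 contributes one row per site.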

dropna() only drops "real" NaN. But sometimes csv files contain missing values that pandas reads as strings. In your case I think those are empty strings "".
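A quick way to check this is to compare isna() with a test for the empty string on a small sample. The Series below is made up for illustration and only mimics the Site values from the question, with one real NaN added for contrast.

```python
import numpy as np
import pandas as pd

# Made-up sample: one real site, one ",," placeholder, one genuine NaN
s = pd.Series([",ABC55,", ",,", np.nan], name="Site")

stripped = s.str.strip(",")

is_missing = stripped.isna()       # True only for the genuine NaN
is_empty = stripped == ""          # True only for the stripped ",,"
after_dropna = stripped.dropna()   # the empty string survives dropna()

print(is_missing.tolist())
print(is_empty.tolist())
print(after_dropna.tolist())
```

The empty string is an ordinary value as far as pandas is concerned, which is why it passes straight through dropna().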

In any case, the read_csv method has an na_values parameter, which you can fill with the string values you want treated as missing. You can try na_values="" but I cannot predict the output of that.
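As far as I can tell, the empty strings here only appear after the commas are stripped, so the raw value in the file is ",," and not "". A sketch of the na_values idea under that assumption is to pass the raw placeholder itself. The csv text below is invented to resemble the question's file, and it assumes the Site field is quoted in the csv.

```python
import io

import pandas as pd

# Invented csv text resembling the question's file (Site field is quoted)
csv_text = (
    'Sourcing Event ID (DTRM ID),Site\n'
    '1035,",ABC55,"\n'
    '1067,",,"\n'
    '1307,",DEF2,"\n'
)

# Treat the raw ",," placeholder as missing at read time
df = pd.read_csv(io.StringIO(csv_text), na_values=[",,"])

# NaN stays NaN through the .str methods, so dropna() now works
sites = df["Site"].str.strip(",").dropna()

print(df["Site"].isna().tolist())
print(sites.tolist())
```

With this approach the rest of the original pipeline can stay unchanged, since the placeholder rows are already NaN before the strip and split steps run.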
