
Splitting a column in pandas dataframe not dropping na

I'll preface this question by saying that I do not control the format of the data in the CSV. I also cannot open the CSV directly, since I can only pull it from an SFTP server that I do not have direct access to. The API shows the data in the same format as the CSV. Here are the two relevant columns of the incoming CSV as loaded into the DataFrame:

+-----+-------------------------------+-------------+
|     |  Sourcing Event ID (DTRM ID)  |     Site    |
+-----+-------------------------------+-------------+
| 0   |                         1035  |     ,ABC55, |
| 1   |                         1067  |          ,, |
| 2   |                         1181  |          ,, |
| 3   |                         1183  |          ,, |
| 4   |                         1184  |          ,, |
| 5   |                         1264  |          ,, |
| 6   |                         1307  |      ,DEF2, |
| 7   |                         1354  |          ,, |
| 8   |                         1369  |    ,HIJ150, |
| 9   |                         1372  |     ,DEF64, |
| 10  |                         1373  |      ,KLM9, |
| 11  |                         1374  |      ,DEF1, |
| 12  |                         1381  |          ,, |
| 13  |                         1385  |          ,, |
| 14  |                         1391  |          ,, |
| 15  |                         1394  |          ,, |
| 16  |                         1395  |          ,, |
| 17  |                         1402  |          ,, |
| 18  |                         1404  |          ,, |
| 19  |                         1405  |          ,, |
| 20  |                         1406  |          ,, |
| 21  |                         1408  |          ,, |
| 22  |                         1410  |    ,HIJ116, |
| 23  |                         1412  |          ,, |
+-----+-------------------------------+-------------+

From that, I do the following (adapted from a previous SO answer):

import pandas as pd

df_sourcing_events = pd.read_csv(wf['local_filename'])

sourcing_events_melt_col = 'Sourcing Event ID (DTRM ID)'
sourcing_events_site_col = 'Site'
print(df_sourcing_events[[sourcing_events_melt_col, sourcing_events_site_col]])

# Strip the leading and trailing commas from the Site column
df_sourcing_events[sourcing_events_site_col] = df_sourcing_events[sourcing_events_site_col].str.lstrip(',')
df_sourcing_events[sourcing_events_site_col] = df_sourcing_events[sourcing_events_site_col].str.rstrip(',')

# Split the sites into columns, then melt them into one row per (event, site)
df_sourcing_events_sites = pd.concat([df_sourcing_events[sourcing_events_melt_col], df_sourcing_events[sourcing_events_site_col].str.split(',', expand=True)], axis=1)\
    .melt(id_vars=[sourcing_events_melt_col])\
    .sort_values(by=sourcing_events_melt_col)\
    .rename(columns={'value': sourcing_events_site_col})\
    .drop(columns=['variable'])\
    .dropna()

Now you may be asking yourself: why strip the leading and trailing commas?

Because I have another file, for contracts, with exactly the same layout. I did exactly the same thing to it with exactly the same code, and that solved the problem. I cannot for the life of me understand why the output from my code here is the following:

+-----+-------------------------------+-----------+
|     |  Sourcing Event ID (DTRM ID)  |    Site   |
+-----+-------------------------------+-----------+
| 0   |                         1035  |     ABC55 |
| 1   |                         1067  |           |
| 2   |                         1181  |           |
| 3   |                         1183  |           |
| 4   |                         1184  |           |
| 5   |                         1264  |           |
| 6   |                         1307  |      DEF2 |
| 7   |                         1354  |           |
| 8   |                         1369  |    HIJ150 |
| 9   |                         1372  |     DEF64 |
| 10  |                         1373  |      KLM9 |
| 11  |                         1374  |      DEF1 |
| 12  |                         1381  |           |
| 13  |                         1385  |           |
| 14  |                         1391  |           |
| 15  |                         1394  |           |
| 16  |                         1395  |           |
| 17  |                         1402  |           |
| 18  |                         1404  |           |
| 19  |                         1405  |           |
| 20  |                         1406  |           |
| 21  |                         1408  |           |
| 22  |                         1410  |    HIJ116 |
| 23  |                         1412  |           |
+-----+-------------------------------+-----------+

It's as if dropna() is not working at all. I even copied and pasted the working code from the contracts CSV into this spot, changing only the variable names to match this CSV, and it still doesn't work. I also re-checked that the other code really does work.

I tried .dropna(how='any') to no avail. What else should I do?

EDIT:

Answer to The Zack Man:

No, because after that I do the following:

df_sourcing_events_final = df_sourcing_events.drop([sourcing_events_site_col], axis=1)

write_dataframe_to_csv_on_s3(df_sourcing_events_sites, s3_bucket, 'sourcing_events_sites.csv')

write_dataframe_to_csv_on_s3(df_sourcing_events_final, s3_bucket, file_name)

I am splitting out a column that is a list into individual rows and making a new csv from it to load into a separate table.

The rows are not being dropped because those values are empty strings, not NaN. Try:

df = df_sourcing_events_sites
df = df[df.Site != '']
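To make the difference concrete, here is a minimal sketch of the behavior. The three-row DataFrame is a hypothetical miniature of the data in the question, not the real file, and the variable names are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the question's data (not the real CSV)
df = pd.DataFrame({
    "Sourcing Event ID (DTRM ID)": [1035, 1067, 1307],
    "Site": [",ABC55,", ",,", ",DEF2,"],
})

# Stripping the commas turns ",," into the empty string "", not NaN
df["Site"] = df["Site"].str.strip(",")

# dropna() only looks for missing values, so the "" row survives
after_dropna = df.dropna()

# A boolean filter on the empty string removes it
filtered = df[df["Site"] != ""]

print(len(after_dropna))          # all rows are still present
print(filtered["Site"].tolist())  # only the non-empty sites remain
```

Checking df["Site"].isna() right before the dropna() call is a quick way to confirm whether pandas sees any of the values as missing.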

dropna() only drops "real" NaN values. Sometimes, though, a CSV file contains missing values that pandas treats as strings. In your case I think those are empty strings ("").

In any case, read_csv has an na_values parameter, to which you can pass the string values you want treated as NaN. You could try na_values="", but I cannot predict what that will produce.
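Judging by the table in the question, the raw values are ",," and only become empty after the commas are stripped, so converting the empty strings to NaN after the split may be more reliable than handling them at read time. Below is a rough sketch of that idea, following the same concat/melt approach as the question. The sample data, including the multi-site row, and the short variable names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

id_col = "Sourcing Event ID (DTRM ID)"
site_col = "Site"

# Hypothetical sample; the multi-site row is invented to show the split
df = pd.DataFrame({
    id_col: [1035, 1067, 1369],
    site_col: [",ABC55,", ",,", ",HIJ150,DEF64,"],
})

# One column per site after removing the surrounding commas
split_sites = df[site_col].str.strip(",").str.split(",", expand=True)

sites = (
    pd.concat([df[id_col], split_sites], axis=1)
    .melt(id_vars=[id_col], value_name=site_col)
    .drop(columns=["variable"])
    .replace("", np.nan)   # turn empty strings into real NaN
    .dropna()              # now the empty rows are actually dropped
    .sort_values(by=[id_col, site_col])
    .reset_index(drop=True)
)

print(sites)
```

With the replace step in place, the dropna() call from the original chain can stay where it is.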
