
Split rows of a DataFrame into two DataFrames based on required columns with Pandas and Python

I have two dataframes, as below:

rows_df

  Id Location Age
0  0           30
1  1       US  20
2  2       

requiredCols_df

  RequiredColumn
0       Location

requiredCols_df specifies which columns are required in rows_df. In this example, Location is required and Age is optional in rows_df.

I want to split rows_df based on requiredCols_df, giving two resulting dataframes: one contains the rows whose required columns are all non-empty, and the other contains the rows that are missing a value in at least one required column.

Expected Result

Rows matched

  Id Location Age
1  1       US  20

Rows that don't match

  Id Location Age
0  0           30
2  2

Please note that:

1. The rows_df contains more than two columns, e.g. 10-30 columns.

2. The requiredCols_df contains more than one row.

3. Please note that Location contains a ' ' (whitespace only) in row 0, and a null (None) in row 2.

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})

# The column names below specify which columns in rows_df must not be empty
requiredCols_df = pd.DataFrame([['Location']],
                               columns=['RequiredColumn'])

4. Update: Both resulting DataFrames should contain the original, unchanged values.

I can do this with a loop, but I want to see if there is a better solution.

If you first convert those placeholder values (whitespace, empty strings, None) to true NaNs...

import numpy as np

req_cols = requiredCols_df.RequiredColumn
# str.strip() maps None to NaN, and stripping leaves '' for whitespace-only cells
rows_df[req_cols] = rows_df[req_cols].apply(lambda s: s.str.strip()).replace('', np.nan)

  Id Location Age
0  0      NaN  30
1  1       US  20
2  2      NaN

Then this is simple:

matched = rows_df.dropna(subset=list(req_cols))
not_matched = rows_df[~rows_df.index.isin(matched.index)]
print(matched)
print(not_matched)

# Output:
  Id Location Age
1  1       US  20

  Id Location Age
0  0      NaN  30
2  2      NaN
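If the original values must survive untouched (update 4 in the question), the same idea can be applied to a throw-away mask instead of overwriting rows_df. A minimal sketch under that assumption:

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

req_cols = requiredCols_df.RequiredColumn
# Build the null-mask on a temporary copy; rows_df keeps ' ' and None as-is
mask = (rows_df[req_cols]
        .replace(r'^\s*$', np.nan, regex=True)
        .notna()
        .all(axis=1))

matched = rows_df[mask]
not_matched = rows_df[~mask]
```

Here not_matched still holds the original ' ' and None values rather than NaN.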

Assuming you don't have a tremendous number of columns, this seems like a task for a loop

df_result = rows_df
for column in requiredCols_df["RequiredColumn"]:
    # keep only rows where the current required column is not null
    df_result = df_result[df_result[column].notnull()]

With this setup:

import numpy as np
import pandas as pd


rows_df = pd.DataFrame(
    {
        "Id": [0, 1, 2],
        "Location": [np.nan, "US", np.nan],
        "Age": [30, 20, np.nan]
    }
)

requiredCols_df = pd.DataFrame(
    {
        "RequiredColumn": ["Location"]
    }
)

We can easily form a boolean series to select what we want:

required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

rows_df[required_column_not_null]
   Id Location   Age
1   1       US  20.0

required_column_null = ~required_column_not_null

rows_df[required_column_null]
   Id Location   Age
0   0      NaN  30.0
2   2      NaN   NaN
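Since requiredCols_df can hold several rows, `all(axis=1)` already scales to that case. For illustration, a sketch with a second required column (the `Name` column is hypothetical, added only for this example):

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({
    "Id": [0, 1, 2],
    "Location": [np.nan, "US", "UK"],
    "Name": ["Ann", np.nan, "Bob"],  # hypothetical extra required column
    "Age": [30, 20, np.nan],         # optional, may stay NaN
})
requiredCols_df = pd.DataFrame({"RequiredColumn": ["Location", "Name"]})

# A row matches only when every required column is non-null
mask = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)
matched, not_matched = rows_df[mask], rows_df[~mask]
```

Row 2 matches even though its optional Age is NaN; rows 0 and 1 each miss one required column.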
Filtering cumulatively handles several required columns (each pass narrows the previous result) as well as whitespace-only cells:

matched = rows_df
for column_name in requiredCols_df.RequiredColumn:
    # drop rows where this required column is null or whitespace-only
    matched = matched[matched[column_name].notnull() & (matched[column_name].str.strip() != '')]

matched
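The same string conditions can also be folded into a single boolean mask over all required columns, which yields the non-matching rows at the same time. A sketch:

```python
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

req = requiredCols_df.RequiredColumn
# A required cell is filled when it is neither null nor whitespace-only
filled = rows_df[req].notna() & rows_df[req].apply(lambda s: s.str.strip() != '')
mask = filled.all(axis=1)

matched, not_matched = rows_df[mask], rows_df[~mask]
```

This keeps the original ' ' and None values in both outputs, as the question's update 4 asks.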
