
Pandas find rows with value in any column

I am reading data from a lot of CSV files as pandas DataFrames, but the format of the files is not consistent. An example:

Unnamed:1 Unnamed:2 .... Unnamed:20
Data      NaN       .... NaN
NaN       Temp      .... NaN
id        name      .... year
.
.

Now I want to find the first row which contains id, ID, or Id, make that row the column names, and drop any rows above it. So finally I will get:

id        name      .... year
.
.

Now the id column may not always be the first column, i.e., the Unnamed:1 column, so I am checking entire rows like so:

df.isin(["id"]).any(axis=1)

The issue with the above code is that I am not sure how to check for all the different ways id may be written, i.e., ID/Id/id. Ideally, I would like to use regex here. I know it can be done without regex for a particular column, like so:

df['Unnamed:1'].str.lower().str.contains('id')

I am just not seeing how to do both at the same time, i.e., check for all the ways id may be written, across all the columns.

You can match the first ID/id/Id substring in any column, use that match to filter out the rows before it, and then convert the first remaining row to the column names:

mask = (df.select_dtypes(object)
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          .any(axis=1)
          .cumsum()
          .gt(0))

df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
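Here is the approach end to end on a small, hypothetical DataFrame shaped like the question's CSV layout (the column names and cell values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical data mimicking the question's layout:
# two junk rows above a header row that contains "Id".
df = pd.DataFrame({
    "Unnamed:1": ["Data", np.nan, "Id", "1", "2"],
    "Unnamed:2": [np.nan, "Temp", "name", "a", "b"],
    "Unnamed:3": [np.nan, np.nan, "year", "2001", "2002"],
})

# Flag rows where any string cell contains "id" (case-insensitive).
# cumsum().gt(0) turns the first True into "this row and everything after".
mask = (df.select_dtypes(object)
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          .any(axis=1)
          .cumsum()
          .gt(0))

df = df[mask].copy()
df.columns = df.iloc[0].rename(None)  # promote the first kept row to header
df = df.iloc[1:].reset_index(drop=True)

print(list(df.columns))  # ['Id', 'name', 'year']
print(df)
```

Note the `na=False` in `str.contains`: without it, NaN cells propagate NaN into the mask instead of False.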

Another idea, testing for exact matches instead of substrings:

mask = df.isin(['id','ID','Id']).any(axis=1).cumsum().gt(0)

df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
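The exact-match variant, again on hypothetical sample data, avoids false hits on cells that merely contain "id" as a substring (e.g. "valid" or "humidity"):

```python
import pandas as pd
import numpy as np

# Hypothetical sample where the header row spells the column exactly "ID",
# and an earlier junk cell ("valid") contains "id" only as a substring.
df = pd.DataFrame({
    "Unnamed:1": ["valid", np.nan, "ID", "1"],
    "Unnamed:2": [np.nan, "Temp", "name", "a"],
})

# isin tests whole-cell equality, so "valid" does not trigger the mask.
mask = df.isin(['id', 'ID', 'Id']).any(axis=1).cumsum().gt(0)

df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)

print(list(df.columns))  # ['ID', 'name']
```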
