I am reading data from a lot of CSV files as pandas DataFrames, but the format of the CSV files is not consistent. An example:
Unnamed:1 Unnamed:2 .... Unnamed:20
Data NaN .... NaN
Nan Temp .... NaN
id name .... year
.
.
Now I want to find the first row which contains id, ID, or Id, make that row the column names, and drop any rows above it. So finally I will get:
id name .... year
.
.
Now the id column may not always be the first column (i.e., the Unnamed:1 column), so I am checking entire rows like so:
df.isin(["id"]).any(axis=1)
The issue with the above code is that I am not sure how to check for all the different ways id may be written, i.e., ID/Id/id. Ideally, I would like to use regex here, but I know it can be done without regex for a particular column like so:
df['Unnamed:1'].str.lower().str.contains('id')
I am just not getting how to do both at the same time, i.e., check for all the ways id may be written across all the columns.
You can do a case-insensitive substring match for id/ID/Id across all columns, filter out the rows before the first match, and then promote the first remaining row to the column names:
mask = (df.select_dtypes(object)
          # case-insensitive substring match; na=False so NaN cells count as no match
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          # True for any row containing a match ...
          .any(axis=1)
          # ... and for every row after the first match
          .cumsum()
          .gt(0))
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
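Putting it together, here is a minimal runnable sketch on made-up data mimicking the layout in the question (the sample values are assumptions, not your actual file):

```python
import pandas as pd
import numpy as np

# Hypothetical frame resembling the question's layout
df = pd.DataFrame({
    'Unnamed:1': ['Data', np.nan, 'id', '1', '2'],
    'Unnamed:2': [np.nan, 'Temp', 'name', 'a', 'b'],
    'Unnamed:20': [np.nan, np.nan, 'year', '2020', '2021'],
})

# Flag every row from the first case-insensitive "id" match onward
mask = (df.select_dtypes(object)
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          .any(axis=1)
          .cumsum()
          .gt(0))

df = df[mask].copy()
# Promote the header row to column names (rename(None) drops the index label)
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)

print(df)
#   id name  year
# 0  1    a  2020
# 1  2    b  2021
```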
Another idea, testing for exact matches instead of substrings:
mask = df.isin(['id','ID','Id']).any(axis=1).cumsum().gt(0)
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
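The exact-match variant is safer when other cells might merely contain "id" as a substring (e.g. "width"). A small sketch on hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical frame; "width" contains "id" but should not match
df = pd.DataFrame({
    'Unnamed:1': ['Data', np.nan, 'ID', '1'],
    'Unnamed:2': [np.nan, 'width', 'name', 'a'],
})

# isin matches whole cell values only, so "width" is ignored
mask = df.isin(['id', 'ID', 'Id']).any(axis=1).cumsum().gt(0)
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)

print(df)
#   ID name
# 0  1    a
```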