I am reading data from a lot of CSV files as pandas DataFrames, but the format of the CSV files is not consistent. An example:
Unnamed:1 Unnamed:2 .... Unnamed:20
Data NaN .... NaN
Nan Temp .... NaN
id name .... year
.
.
Now I want to find the first row which contains id, ID, or Id, make that row the column names, and drop any rows above it. So finally I will get:
id name .... year
.
.
Now the id column may not always be the first column (i.e., the Unnamed:1 column), so I am checking entire rows like so:
df.isin(["id"]).any(axis=1)
The issue with the above code is that I am not sure how to check for all the different ways id may be written, i.e., ID/Id/id. Ideally, I would like to use regex here, but I know it can be done without regex for a particular column like so:
df['Unnamed:1'].str.lower().str.contains('id')
I am just not getting how to do both at the same time, i.e., check for all the ways id may be written across all the columns.
You can do a case-insensitive substring match for id/ID/Id across all columns, filter out the rows before the first match, and then promote the first remaining row to the column names:
mask = (df.select_dtypes(object)
          # case-insensitive substring match; na=False so NaN cells count as no match
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          # True for any row containing a match ...
          .any(axis=1)
          # ... and for every row after the first match
          .cumsum()
          .gt(0))
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
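Putting it together, here is a minimal runnable sketch on made-up data mimicking the layout in the question (the sample values are assumptions, not your actual file):

```python
import pandas as pd
import numpy as np

# Hypothetical frame resembling the question's layout
df = pd.DataFrame({
    'Unnamed:1': ['Data', np.nan, 'id', '1', '2'],
    'Unnamed:2': [np.nan, 'Temp', 'name', 'a', 'b'],
    'Unnamed:20': [np.nan, np.nan, 'year', '2020', '2021'],
})

# Flag every row from the first case-insensitive "id" match onward
mask = (df.select_dtypes(object)
          .apply(lambda x: x.str.contains('id', case=False, na=False))
          .any(axis=1)
          .cumsum()
          .gt(0))

df = df[mask].copy()
# Promote the header row to column names (rename(None) drops the index label)
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)

print(df)
#   id name  year
# 0  1    a  2020
# 1  2    b  2021
```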
Another idea, testing for exact matches instead of substrings:
mask = df.isin(['id','ID','Id']).any(axis=1).cumsum().gt(0)
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)
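The exact-match variant is safer when other cells might merely contain "id" as a substring (e.g. "width"). A small sketch on hypothetical data:

```python
import pandas as pd
import numpy as np

# Hypothetical frame; "width" contains "id" but should not match
df = pd.DataFrame({
    'Unnamed:1': ['Data', np.nan, 'ID', '1'],
    'Unnamed:2': [np.nan, 'width', 'name', 'a'],
})

# isin matches whole cell values only, so "width" is ignored
mask = df.isin(['id', 'ID', 'Id']).any(axis=1).cumsum().gt(0)
df = df[mask].copy()
df.columns = df.iloc[0].rename(None)
df = df.iloc[1:].reset_index(drop=True)

print(df)
#   ID name
# 0  1    a
```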