
Removing the rows from a dataframe until the actual column names are found

I am reading tabular data from an email into a pandas dataframe. There is no guarantee that the column names will be in the first row. Sometimes the data arrives in the following format, where the actual column names are [ID, Name, Year]:

dummy1           dummy2     dummy3
test_column1 test_column2 test_column3
ID     Name        Year
1      John        Sophomore
2      Lisa        Junior
3      Ed          Senior

Sometimes the column names come in the first row, as expected:

ID     Name        Year
1      John        Sophomore
2      Lisa        Junior
3      Ed          Senior

Once I have read the HTML table from the email, how do I remove the initial rows that don't contain the column names? In the first case I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't have to remove anything.

Also, the column names can be in any sequence. Basically, I want to do the following:

1. Check whether one of the column names appears in one of the rows of the dataframe
2. Remove the rows above it
if "ID" in row:
    remove the above rows

How can I achieve this?

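Ugly but effective quick try:

id_name = df.columns[0]
# dtype belongs to the whole column, so test each value for being all digits instead
is_number = df[id_name].astype(str).str.isdigit()
df_clean = df[(df[id_name] == 'ID') | is_number]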
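
A filter of this kind leaves the `ID` row sitting among the data rows, under the old placeholder header. Below is a minimal sketch of promoting that row to the header afterwards. It assumes the first column holds either the string `ID` or digit-only values; the sample frame is made up to mirror the question's first case, and only `df` and `df_clean` come from the snippet above.

```python
import pandas as pd

# Hypothetical sample: wrong header from the parser, real header one row down
df = pd.DataFrame(
    [["test_column1", "test_column2", "test_column3"],
     ["ID", "Name", "Year"],
     ["1", "John", "Sophomore"],
     ["2", "Lisa", "Junior"]],
    columns=["dummy1", "dummy2", "dummy3"],
)

first = df[df.columns[0]].astype(str)
# Keep the header row plus every row whose first cell is all digits
df_clean = df[(first == "ID") | first.str.isdigit()]

if (first == "ID").any():
    # The first kept row is the real header: promote it and drop it from the data
    header = df_clean.iloc[0].to_numpy()
    df_clean = df_clean.iloc[1:].copy()
    df_clean.columns = header
df_clean = df_clean.reset_index(drop=True)
```

When the header is already correct (the second case), no cell equals `ID`, so the promotion step is skipped and the frame passes through unchanged.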

You can first get the index of the row that holds the valid column names, then set the columns and filter accordingly.

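import pandas as pd

df = pd.read_csv("d.csv", sep=r'\s+', header=None)
# index of the row equal to the expected header (assumes exactly one such row, in this order)
col_index = df.index[(df == ["ID", "Name", "Year"]).all(axis=1)].item()

df.columns = df.iloc[col_index].to_numpy()   # set valid columns
df = df.iloc[col_index + 1 :]                # keep only the rows below the header
df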
  ID  Name       Year
3  1  John  Sophomore
4  2  Lisa     Junior
5  3    Ed     Senior

Or, if you want to set ID as the index, use this in place of the filter line:

df = df.iloc[col_index + 1 :].set_index('ID')
df
    Name       Year
ID
1   John  Sophomore
2   Lisa     Junior
3     Ed     Senior
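
The question mentions that the column names can arrive in any order, which an exact comparison against `["ID","Name","Year"]` would miss. Here is a rough sketch of an order-independent variant, assuming the frame was parsed without a header (as with `header=None`) and that the real header is the first row containing every expected name. The function name `promote_header` and the `EXPECTED` set are illustrative, not part of either answer.

```python
import pandas as pd

EXPECTED = {"ID", "Name", "Year"}  # assumed set of real column names


def promote_header(raw: pd.DataFrame, expected=EXPECTED) -> pd.DataFrame:
    """Drop the rows above the first row that contains all expected names,
    then use that row as the header."""
    # True for each row whose cell values cover every expected name, in any order
    is_header = raw.apply(
        lambda row: set(expected).issubset(set(row.astype(str))), axis=1
    )
    if not is_header.any():
        raise ValueError("no row contains all expected column names")
    pos = int(is_header.to_numpy().argmax())  # position of the first match
    out = raw.iloc[pos + 1:].reset_index(drop=True)
    out.columns = raw.iloc[pos].astype(str).to_numpy()
    return out


raw = pd.DataFrame(
    [["dummy1", "dummy2", "dummy3"],
     ["test_column1", "test_column2", "test_column3"],
     ["Year", "ID", "Name"],
     ["Sophomore", "1", "John"],
     ["Junior", "2", "Lisa"]]
)
clean = promote_header(raw)
```

If the header is already in the first row, `pos` is 0 and only that row is consumed, so both cases from the question go through the same code path. The values stay as strings; converting `ID` with `pd.to_numeric` afterwards is one option if numeric types are needed.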
