Remove rows from a dataframe until the actual column names are found

Question

I am reading tabular data from an email into a pandas dataframe. There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format. The column names that will always be there are ["ID", "Name", "Year"]. Sometimes there can be additional columns such as "Age".

dummy1           dummy2     dummy3      dummy4
test_column1 test_column2 test_column3  test_column4
ID     Name        Year                  Age
1      John        Sophomore             20
2      Lisa        Junior                21
3      Ed          Senior                22

Sometimes the column names come in the first row, as expected.

ID     Name        Year
1      John        Sophomore
2      Lisa        Junior
3      Ed          Senior

Once I read the HTML table from the email, how can I remove the initial rows that don't contain the column names ["ID", "Name", "Year"]? In the first case I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't have to remove anything.

Also, the column names can be in any order, and the set of columns can vary, but these 3 columns will always be there: ["ID", "Name", "Year"]. If I do the following, it only works when the dataframe contains exactly these 3 columns ["ID", "Name", "Year"]:

col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item()    # get columns index

df.columns = df.iloc[col_index].to_numpy()   # set valid columns
df = df.iloc[col_index + 1 :]

I should be able to fetch the corresponding column index as long as the row contains any of these 3 columns ["ID", "Name", "Year"]. How can I achieve this? I tried:

col_index = df.index[(["ID","Name","Year"] in df).any(1)].item()

But I am getting an error.

Answer 1

You could stack the dataframe and use isin to find the header row.
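To illustrate what stacking does here, the following is a minimal sketch, assuming the table was loaded so that every line (including the header line) is still an ordinary data row. The sample frame `raw` and the variable names are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Hypothetical sample mirroring the first table in the question,
# loaded without a header so the column names sit in row 2.
raw = pd.DataFrame(
    [
        ["dummy1", "dummy2", "dummy3", "dummy4"],
        ["test_column1", "test_column2", "test_column3", "test_column4"],
        ["ID", "Name", "Year", "Age"],
        ["1", "John", "Sophomore", "20"],
    ]
)

# stack() reshapes the frame into a Series indexed by (row label, column label).
stacked = raw.stack()

# Keep only the cells whose value is one of the expected column names.
hits = stacked[stacked.isin(["ID", "Name", "Year"])]

# Level 0 of the index holds the row labels; the first hit marks the header row.
header_row = hits.index.get_level_values(0)[0]
print(header_row)
```

Note that isin matches a row as soon as it contains any one of the names, so this assumes none of the junk rows above the header happens to contain "ID", "Name" or "Year".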

IIUC, a small function could work. (Personally, I'd change this to take your file I/O read method and return a dataframe starting at that header row.)

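import pandas as pd

# Make sure the initial read keeps every line as data, e.g. pd.read_csv(file, header=None)
def find_columns(dataframe, cols) -> int:
    # Stack into a Series indexed by (row, column), then keep cells matching any expected name
    stack_df = dataframe.stack()
    # The row label of the first match is the header row
    header_row = stack_df[stack_df.isin(cols)].index.get_level_values(0)[0]
    return header_row

header_row = find_columns(df, ["Age", "Year", "ID", "Name"])

# Re-read the file, skipping every line above the header row
new_df = pd.read_csv(file, skiprows=header_row)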

   ID  Name       Year  Age
0   1  John  Sophomore   20
1   2  Lisa     Junior   21
2   3    Ed     Senior   22
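Since the question reads an HTML table rather than a file that can be re-read with skiprows, a variant that works directly on the in-memory dataframe may be handy. The following is only a sketch under stated assumptions: the helper name `promote_header` and the sample data are hypothetical, and it assumes the header line is still an ordinary row of the frame. Unlike the isin approach, it requires all three names to appear in the same row, in any order.

```python
import pandas as pd

REQUIRED = {"ID", "Name", "Year"}  # names the question says are always present


def promote_header(raw: pd.DataFrame, required=REQUIRED) -> pd.DataFrame:
    """Drop leading rows until one contains every required name, then use it as the header.

    Hypothetical helper for illustration; not part of pandas.
    """
    required = set(required)
    # True for each row whose cell values include all required names, in any order.
    is_header = raw.apply(lambda row: required.issubset(set(row)), axis=1)
    if not is_header.any():
        raise ValueError(f"no row contains all of {sorted(required)}")
    pos = int(is_header.to_numpy().argmax())  # position of the first matching row
    out = raw.iloc[pos + 1:].reset_index(drop=True)
    out.columns = raw.iloc[pos].to_list()
    return out


# Made-up sample: two junk rows, then the header row, then the data.
raw = pd.DataFrame(
    [
        ["dummy1", "dummy2", "dummy3", "dummy4"],
        ["test_column1", "test_column2", "test_column3", "test_column4"],
        ["ID", "Name", "Year", "Age"],
        ["1", "John", "Sophomore", "20"],
        ["2", "Lisa", "Junior", "21"],
        ["3", "Ed", "Senior", "22"],
    ]
)
clean = promote_header(raw)

# Second case from the question: the header is already the first row.
raw2 = pd.DataFrame([["ID", "Name", "Year"], ["1", "John", "Sophomore"]])
clean2 = promote_header(raw2)
print(clean)
```

If the reader has already promoted the right row to the column labels, a check such as `REQUIRED.issubset(df.columns)` before calling the helper would avoid the ValueError.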

Solution 1
1 accepted 2020-06-29 12:56:15
