How to clean up extra header information in the middle of a csv using pandas

Question

I have a csv file that I am trying to convert into a data frame. But the data has some extra heading material that gets repeated. For example:

Results Generated Date Time  
Sampling Info  
Time; Data  
1; 4.0  
2; 5.2  
3; 6.1  

Results Generated Date Time  
Sampling Info   
Time; Data  
6; 3.2   
7; 4.1   
8; 9.7

If it is a clean csv file without the extra heading material, I am using

df = pd.read_csv(r'Filelocation', sep=';', skiprows=2)

But I can't figure out how to remove the second set of header info. I don't want to lose the data below the second header set. Is there a way to remove it so the data is clean? The second header set is not always in the same location (basically a data acquisition mistake).
Thank you!

Answer 1

Try to split your text file after the first block of data. Then you can make two dataframes out of it and concatenate them.

with open("yourfile.txt", 'r') as f:
    content = f.read()

# Make a list of subcontent
splitContent = content.split('Results Generated Date Time\nSampling Info\n')

Using "Results Generated Date Time\nSampling Info\n" as the argument for split also removes those lines. This only works if the unnecessary header lines are always identical!

After this you get a list of your data as strings (variable: splitContent), with the fields in each string separated by a delimiter (';'). Use this answer to create dataframes from strings: https://stackoverflow.com/a/22605281/11005812

Another approach could be to save each subcontent as its own file and read it again.

Concatenating dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
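The steps above (split on the repeated header, build one dataframe per chunk, concatenate) can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the helper name read_blocks and the HEADER constant are made up for this example, and it assumes the two unwanted header lines are identical in every block. Trailing whitespace is stripped from each line first, because the sample in the question has trailing spaces that would otherwise stop the split string from matching.

```python
import io

import pandas as pd

# Assumption: the two unwanted header lines are the same in every block.
HEADER = "Results Generated Date Time\nSampling Info\n"


def read_blocks(text: str) -> pd.DataFrame:
    """Split text on the repeated header and concatenate the data blocks.

    read_blocks is a hypothetical helper name used only in this sketch.
    """
    # Strip trailing whitespace per line so HEADER matches exactly.
    normalized = "\n".join(line.rstrip() for line in text.splitlines()) + "\n"

    # Splitting on HEADER also removes it; drop empty leftovers.
    chunks = [c for c in normalized.split(HEADER) if c.strip()]

    frames = []
    for chunk in chunks:
        # Each chunk starts with its own "Time; Data" column header line.
        frame = pd.read_csv(io.StringIO(chunk), sep=";", skipinitialspace=True)
        frame.columns = frame.columns.str.strip()
        frames.append(frame)

    # ignore_index gives one continuous 0..n-1 index across all blocks.
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    sample = (
        "Results Generated Date Time  \nSampling Info  \nTime; Data  \n"
        "1; 4.0  \n2; 5.2  \n3; 6.1  \n\n"
        "Results Generated Date Time  \nSampling Info   \nTime; Data  \n"
        "6; 3.2   \n7; 4.1   \n8; 9.7\n"
    )
    print(read_blocks(sample))
```

For a real file, the text would come from f.read() as in the snippet above. Because the split does not depend on where the header appears, the second header set may sit at any position in the file.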

Answer 2

import pandas as pd

filename = 'filename.csv'

# Read the csv file and split it into lines
with open(filename) as f:
    lines = f.read().split('\n')

# Drop empty lines
list_ = [e for e in lines if e != '']

# Keep only data rows, i.e. drop lines that start with a non-numeric character
list_ = [e for e in list_ if e[0].isdigit()]

# Use int or float depending on the requirements
Time = [float(i.split(';')[0]) for i in list_]
Data = [float(i.split(';')[1].strip()) for i in list_]

# Build the dataframe
df = pd.DataFrame({'Time': Time, 'Data': Data})
df

I hope this will do the work!
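A possible variant of the same filtering idea is to keep only the lines that start with a digit and then hand them back to pd.read_csv, so that pandas infers the column types instead of converting each field by hand. This is a sketch under assumptions rather than part of the original answer: the function name read_numeric_rows is invented here, the column names are taken from the sample in the question, and it assumes every data row starts with a digit (a negative or signed first value would be dropped).

```python
import io

import pandas as pd


def read_numeric_rows(text: str, sep: str = ";") -> pd.DataFrame:
    """Keep only lines that start with a digit and parse them with pandas.

    read_numeric_rows is a hypothetical helper name used only in this sketch.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()  # removes trailing spaces and skips blank lines
        if line and line[0].isdigit():
            rows.append(line)

    # No header row survives the filter, so column names are passed explicitly.
    return pd.read_csv(
        io.StringIO("\n".join(rows)),
        sep=sep,
        names=["Time", "Data"],
        skipinitialspace=True,
    )


if __name__ == "__main__":
    sample = (
        "Results Generated Date Time  \nSampling Info  \nTime; Data  \n"
        "1; 4.0  \n2; 5.2  \n3; 6.1  \n\n"
        "Results Generated Date Time  \nSampling Info   \nTime; Data  \n"
        "6; 3.2   \n7; 4.1   \n8; 9.7\n"
    )
    print(read_numeric_rows(sample).dtypes)
```

Unlike splitting on a fixed header string, this does not require the extra header lines to be identical each time, only that they never start with a digit.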

Answer 1: 0, 2020-06-05 18:41:25

Answer 2: 0, accepted, 2020-06-05 18:50:09
