
How to clean up extra heading info in middle of csv with pandas

I have a csv file that I am trying to convert into a data frame. But the data has some extra heading material that gets repeated. For example:

Results Generated Date Time  
Sampling Info  
Time; Data  
1; 4.0  
2; 5.2  
3; 6.1  

Results Generated Date Time  
Sampling Info   
Time; Data  
6; 3.2   
7; 4.1   
8; 9.7    

If it is a clean csv file without the extra heading material, I am using

df = pd.read_csv(r'Filelocation', sep=';', skiprows=2)  

But I can't figure out how to remove the second set of header info. I don't want to lose the data below the second header set. Is there a way to remove it so the data is clean? The second header set is not always in the same location (basically a data acquisition mistake).
Thank you!

Try to split your text file after the first block of data. Then you can make two dataframes out of it and concatenate them.

with open("yourfile.txt", 'r') as f:
    content = f.read()

# Make a list of subcontent
splitContent = content.split('Results Generated Date Time\nSampling Info\n')

Using "Results Generated Date Time\nSampling Info\n" as the argument to split also removes those lines. This only works if the unnecessary header lines are always identical!

After this you get a list of your data blocks as strings (variable: splitContent), each with its fields separated by the delimiter (';'). Use this answer to create dataframes from strings: https://stackoverflow.com/a/22605281/11005812

Another approach could be to save each subcontent as its own file and read it again.

Concatenating dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
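The split-and-concatenate idea can be sketched end to end as follows. This is only a minimal sketch: the function name read_blocks and the HEADER constant are made up for illustration, and it assumes the two header lines are identical every time they appear (apart from trailing whitespace, which is stripped first).

```python
from io import StringIO

import pandas as pd

# Assumed: the two header lines repeated before every block, as in the question's sample.
HEADER = "Results Generated Date Time\nSampling Info\n"


def read_blocks(text, header=HEADER):
    """Split the text on the repeated header and concatenate the parsed blocks."""
    # Strip trailing whitespace from each line so the header matches exactly.
    cleaned = "\n".join(line.rstrip() for line in text.splitlines()) + "\n"
    frames = []
    for chunk in cleaned.split(header):
        if not chunk.strip():
            continue  # skip the empty piece before the first header
        frame = pd.read_csv(StringIO(chunk), sep=";", skipinitialspace=True)
        frame.columns = frame.columns.str.strip()  # guard against padded column names
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With the content read as in the answer above, calling read_blocks(content) should give a single dataframe with the columns Time and Data.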

import pandas as pd

filename = 'filename.csv'

# Read the csv file and split it into lines
with open(filename) as f:
    lines = f.read().split('\n')

# Drop blank lines, then keep only the lines that start with a digit (the data rows)
rows = [line.strip() for line in lines if line.strip()]
rows = [line for line in rows if line[0].isdigit()]

# Use int or float depending upon the requirements
Time = [float(row.split(';')[0]) for row in rows]
Data = [float(row.split(';')[1].strip()) for row in rows]

df = pd.DataFrame({'Time': Time, 'Data': Data})  # making the dataframe
df

I hope this will do the work!
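The same filtering idea can also be done inside pandas, which does not depend on where the repeated header appears. The sketch below is a variation, not the method from the answer above: it reads every line as text, converts both columns with pd.to_numeric, and drops the rows that fail to convert. The function name read_numeric_rows is hypothetical, and it assumes no line contains more than two ';'-separated fields.

```python
import pandas as pd


def read_numeric_rows(source):
    """Read a ';'-separated file and keep only the rows where both fields are numeric."""
    # Read every line as two string columns; header lines become ordinary rows.
    raw = pd.read_csv(
        source,
        sep=";",
        header=None,
        names=["Time", "Data"],
        dtype=str,
        skipinitialspace=True,
    )
    # Header text and missing fields cannot be converted and become NaN.
    numeric = raw.apply(lambda col: pd.to_numeric(col.str.strip(), errors="coerce"))
    # Drop the rows that came from header lines and renumber the index.
    return numeric.dropna().reset_index(drop=True)
```

Both columns come back as floats here; if Time should be an integer, something like df['Time'].astype(int) can be applied afterwards.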
