Pandas read_csv() from text file where data begin/end are marked by specific strings

I'm reading hundreds of model outputs from text files, where the first nrows are irrelevant text rows about the model run (note: nrows varies from file to file). However, each file also contains comma-separated data that I want to import. In all of the files, this data sits between the lines "BREAK THROUGH @ WT, ITERATION" and "END BREAK THROUGH @ WT" (see below). My current approach of using nrows and skiprows in read_csv() doesn't work because these parameters vary from file to file. Any thoughts on how to import CSV data from text files using string "markers"? Thanks!
The model output/input files I want to read look like this:
text
text 0.314347435514229
text text text text text text text
BREAK THROUGH @ WT, ITERATION
1 0.0
3 0.0
6 0.0
END BREAK THROUGH @ WT
The extracted data in the dataframe would look like this:
1 0.0
3 0.0
6 0.0

Using fake data with a column named "your_column":

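import pandas as pd

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

df = pd.read_csv(...)
# True on the two marker rows; the running count of markers is 1 only
# from the begin marker up to (not including) the end marker.
is_marker = df["your_column"].isin(words)
df = df.loc[(is_marker.cumsum() == 1) & ~is_marker].reset_index(drop=True)
print(df)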
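
To make the idea concrete, here is a minimal sketch on in-memory data. It assumes every line of the file is loaded as one string in "your_column" and that the data rows are comma separated; the sample text and the names is_marker, between, block and result are illustrative only and not part of the answer above.

```python
import pandas as pd

# Hypothetical file content; the comma-separated layout of the data rows is assumed.
sample = """text
text 0.314347435514229
BREAK THROUGH @ WT, ITERATION
1,0.0
3,0.0
6,0.0
END BREAK THROUGH @ WT
trailing text
"""

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

# One row per line, all in a single string column.
df = pd.DataFrame({"your_column": sample.splitlines()})

is_marker = df["your_column"].isin(words)
# cumsum() is 0 before the begin marker, 1 from the begin marker on,
# and 2 from the end marker on, so "== 1" selects the marked block.
between = (is_marker.cumsum() == 1) & ~is_marker
block = df.loc[between, "your_column"].reset_index(drop=True)

# Split the kept lines on commas and convert the pieces to numbers.
result = block.str.split(",", expand=True).astype(float)
print(result)
```

Comparing the running count to 1 rather than relying on bitwise arithmetic keeps the mask readable, but it still assumes exactly one marked block per file.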

It seems I was able to find a solution without regex, but I'm still curious how regex could have simplified my life.

import pandas as pd

beg_id = "BREAK THROUGH @ WT, ITERATION =     1\n"
end_id = "END BREAK THROUGH @ WT"
# for f in cmtp_fnames:
f = 'data/cmtp/PFOS_Dry_LS_1m_AD+R.OUT'
with open(f) as fh:
    data = fh.read()
# Keep only the text between the end of the begin marker and the end marker.
start = data.find(beg_id) + len(beg_id)
data = data[start:data.find(end_id, start)]
data = data.splitlines(False)
# Split each row on commas and drop the third column (label 2).
data = pd.DataFrame(sub.split(",") for sub in data).drop(labels=2, axis=1)
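
As for the regex question, one possible sketch is below. It assumes a single marked block per file and data rows that end in a trailing comma (suggested by the dropped third column above, but not confirmed); the sample text and the names pattern and match are illustrative only.

```python
import io
import re

import pandas as pd

# Hypothetical file content; in practice this would come from open(f).read().
sample = """text
text 0.314347435514229
BREAK THROUGH @ WT, ITERATION =     1
1,0.0,
3,0.0,
6,0.0,
END BREAK THROUGH @ WT
"""

# Match the whole begin-marker line, then lazily capture everything
# up to the line that starts with the end marker.
pattern = re.compile(
    r"^BREAK THROUGH @ WT, ITERATION[^\n]*\n(.*?)^END BREAK THROUGH @ WT",
    flags=re.DOTALL | re.MULTILINE,
)

match = pattern.search(sample)
if match is None:
    raise ValueError("markers not found")

# Parse the captured block as CSV; usecols skips the empty third field
# produced by the (assumed) trailing comma.
df = pd.read_csv(io.StringIO(match.group(1)), header=None, usecols=[0, 1])
print(df)
```

Because the pattern matches the rest of the begin-marker line with `[^\n]*`, it does not depend on the exact spacing in "ITERATION =     1", and handing the captured text to read_csv() lets pandas infer numeric dtypes instead of leaving strings.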
