Pandas read_csv() from text file where data begin/end are marked by specific strings
I'm reading hundreds of model outputs from text files, where the first nrows lines are irrelevant text about the model run (note: nrows varies from file to file). However, each file contains comma-separated data that I want to import. In all of the files, this data sits between the lines "BREAK THROUGH @ WT, ITERATION" and "END BREAK THROUGH @ WT" (see below). My current approach of using nrows and skiprows in read_csv() doesn't work because these parameters vary from file to file. Any thoughts on how to import CSV data from text files using string "markers"? Thanks!
The model output files I want to read look like this:
text
text 0.314347435514229
text text text text text text text
BREAK THROUGH @ WT, ITERATION
1 0.0
3 0.0
6 0.0
END BREAK THROUGH @ WT
The extracted data in the DataFrame would look like this:
1 0.0
3 0.0
6 0.0
Using fake data with a column named "your_column":
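import pandas as pd

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]
df = pd.read_csv(...)
# The running count of marker rows is 1 only between the two markers; ~is_marker drops the markers themselves
is_marker = df["your_column"].isin(words)
df = df.loc[is_marker.cumsum().eq(1) & ~is_marker].reset_index(drop=True)
print(df)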
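To see how the marker mask behaves, here is a minimal sketch on made-up data. It assumes each line of the file has already been loaded as one string per row, that the marker lines match the strings in words exactly, and that the data rows are comma-separated; the sample values and the names between, block, and result are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for one file, loaded one line per row.
df = pd.DataFrame({"your_column": [
    "text",
    "BREAK THROUGH @ WT, ITERATION",
    "1, 0.0",
    "3, 0.0",
    "6, 0.0",
    "END BREAK THROUGH @ WT",
    "text",
]})
words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

is_marker = df["your_column"].isin(words)
# The running count of markers equals 1 from the start marker up to
# (but not including) the end marker; ~is_marker removes the start marker row.
between = is_marker.cumsum().eq(1) & ~is_marker
block = df.loc[between, "your_column"].reset_index(drop=True)

# Split the surviving lines on commas and convert the pieces to numbers.
result = block.str.split(",", expand=True).astype(float)
print(result)
```

Comparing the cumulative sum to 1 gives a true boolean mask, so rows after the end marker (where the count reaches 2) are excluded explicitly.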
Seems like I was able to find a solution without regex, but I'm still curious how regex could have simplified my life.
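import pandas as pd

beg_id = "BREAK THROUGH @ WT, ITERATION = 1\n"
end_id = "END BREAK THROUGH @ WT"
# for f in cmtp_fnames:
f = 'data/cmtp/PFOS_Dry_LS_1m_AD+R.OUT'
with open(f) as fname:
    data = fname.read()
# Discard everything before the start marker
data = data[data.find(beg_id):]
# Keep only the text between the start marker line and the end marker
data = data[data.find(beg_id) + len(beg_id):data.find(end_id)]
data = data.splitlines(False)
# Split each row on commas, then drop the third column (label 2)
data = pd.DataFrame(sub.split(",") for sub in data).drop(labels=2, axis=1)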
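On the regex question, one possible approach is to capture the text between the two markers with a single pattern and hand it to read_csv() through io.StringIO. This is only a sketch: the sample text is made up, and it assumes each file holds exactly one marked block whose rows are comma-separated with no header.

```python
import io
import re

import pandas as pd

# Hypothetical file contents standing in for one model output file.
text = """text
text 0.314347435514229
BREAK THROUGH @ WT, ITERATION = 1
1, 0.0
3, 0.0
6, 0.0
END BREAK THROUGH @ WT
text
"""

# Skip the rest of the start-marker line, then lazily capture everything
# up to the end marker. re.escape() keeps the marker text literal and
# re.DOTALL lets "." match across newlines.
pattern = re.compile(
    re.escape("BREAK THROUGH @ WT, ITERATION")
    + r"[^\n]*\n(.*?)"
    + re.escape("END BREAK THROUGH @ WT"),
    re.DOTALL,
)
match = pattern.search(text)
if match is None:
    raise ValueError("markers not found")

# Parse the captured block as CSV; skipinitialspace drops the blank after each comma.
df = pd.read_csv(io.StringIO(match.group(1)), header=None, skipinitialspace=True)
print(df)
```

For real files, text would come from open(f).read() inside the loop over file names, and read_csv() takes care of the type conversion that the manual split(",") leaves as strings.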