Pandas read_csv() from a text file where the start/end of the data is marked by specific strings

Question

I'm reading hundreds of model outputs from text files, where the first nrows rows are irrelevant text about the model run (note: nrows varies from file to file). However, each file contains comma-separated data that I want to import. In all of the files, this data sits between the lines "BREAK THROUGH @ WT, ITERATION" and "END BREAK THROUGH @ WT" (see below). My current approach of using nrows and skiprows in read_csv() doesn't work because these parameters vary from file to file. Any thoughts on how to import CSV data from text files using string "markers"? Thanks!
The model output/input files I want to read look like this:
text
text 0.314347435514229
text text text text text text text
BREAK THROUGH @ WT, ITERATION
1 0.0
3 0.0
6 0.0
END BREAK THROUGH @ WT
The extracted data in the dataframe would look like this:
1 0.0
3 0.0
6 0.0

Answer 1

Using fake data with a column named "your_column":

import pandas as pd

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

df = pd.read_csv(...)
is_marker = df["your_column"].isin(words)
# The running marker count is odd only between a start and an end marker.
inside = is_marker.cumsum() % 2 == 1
df = df.loc[inside & ~is_marker].reset_index(drop=True)
print(df)
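
To see how this mask behaves, here is a minimal, self-contained sketch. It assumes every line of the file has already been loaded into a single column (the hypothetical "your_column"). The sample rows, the comma-separated layout, and the output column names "iteration" and "value" are made up for illustration and are not taken from the real model files.

```python
import pandas as pd

words = ["BREAK THROUGH @ WT, ITERATION", "END BREAK THROUGH @ WT"]

# Made-up sample: one file line per row, in a hypothetical column "your_column".
df = pd.DataFrame({"your_column": [
    "text",
    "BREAK THROUGH @ WT, ITERATION",
    "1,0.0",
    "3,0.0",
    "6,0.0",
    "END BREAK THROUGH @ WT",
    "trailing text",
]})

is_marker = df["your_column"].isin(words)
# The running count of markers is odd only between a start and an end marker.
inside = is_marker.cumsum() % 2 == 1
block = df.loc[inside & ~is_marker].reset_index(drop=True)

# Split the kept lines into separate columns (column names are assumptions).
result = block["your_column"].str.split(",", expand=True)
result.columns = ["iteration", "value"]
result = result.astype({"iteration": int, "value": float})
print(result)
```

Because the test is on the parity of the marker count, the same mask would also keep the rows of a second start/end pair if a file contained more than one block.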

Answer 2

It seems I was able to find a solution without regex, but I'm still curious how regex could have simplified my life.

import pandas as pd

beg_id = "BREAK THROUGH @ WT, ITERATION =     1\n"
end_id = "END BREAK THROUGH @ WT"
# for f in cmtp_fnames:
f = 'data/cmtp/PFOS_Dry_LS_1m_AD+R.OUT'
with open(f) as fh:
    text = fh.read()
# Keep only the text between the end of the start marker and the end marker.
start = text.find(beg_id) + len(beg_id)
end = text.find(end_id, start)
rows = text[start:end].splitlines()
# Split each row on commas and drop column 2, which is not needed.
data = pd.DataFrame(row.split(",") for row in rows).drop(labels=2, axis=1)
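
On the regex question, one possible sketch is to capture the block between the two marker lines with a single pattern and pass the captured text to read_csv() through io.StringIO. The file contents below are made up, the column names "iteration" and "value" are assumptions, and the pattern assumes both markers start at the beginning of a line.

```python
import io
import re

import pandas as pd

# Made-up file contents; a real file would be read with open(path).read().
text = """text
text 0.314347435514229
BREAK THROUGH @ WT, ITERATION =     1
1,0.0
3,0.0
6,0.0
END BREAK THROUGH @ WT
more text
"""

# Capture everything between the start-marker line and the end-marker line.
# re.MULTILINE lets "^" match at each line start; re.DOTALL lets "." span
# newlines; the lazy ".*?" stops at the first end marker.
pattern = re.compile(
    r"^BREAK THROUGH @ WT, ITERATION[^\n]*\n(.*?)^END BREAK THROUGH @ WT",
    re.DOTALL | re.MULTILINE,
)

match = pattern.search(text)
if match is None:
    raise ValueError("markers not found")

# Hand the captured block to read_csv as if it were a file.
df = pd.read_csv(io.StringIO(match.group(1)), header=None,
                 names=["iteration", "value"])
print(df)
```

Compared with chained str.find() calls, this keeps the marker logic in one place and lets read_csv() handle type conversion. Whether it is simpler is a matter of taste.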

Answer 1: 2023-01-31 07:02:19
Answer 2: 2023-01-31 16:19:06
