Extract data from multiple files (Structure outputs) and print to one file
Please, I need help with extracting values from 400 files. So far I have never done anything similar and I don't know where to start. Since I am not a programmer, I don't know which software would be good to use: R, SAS, Python, the command prompt, bash, or awk. I have some experience with data manipulation/management using SAS and R (mostly "regular" files with rows and columns), and I have run some applications from the command prompt and bash.
Best regards
[LINK - an example of Structure/file output][1]
https://www.dropbox.com/sh/idvoigkky7ldgb7/AAD5foVSKc5Ty6ijc08ge230a?dl=0
Using grep with a PCRE positive look-behind, and the data from the Dropbox link:
$ grep -Pohm 1 "(?<=^Estimated Ln Prob of Data = ).*" job_*
Output:
-5570597.3
-2834943326.2
Switches used:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
-h, --no-filename
Suppress the prefixing of file names on output.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line.
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
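Since the goal is to get everything into one file, the same grep can also keep the filenames: drop -h so each match is prefixed with its file, then convert that prefix into a CSV column. A sketch in a throwaway directory (the sample files and the name results.csv are just for illustration):

```shell
# Demo: the two sample files stand in for the real job_* outputs.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n'    > job_01_01-output_f
printf 'Estimated Ln Prob of Data = -2834943326.2\n' > job_02_01-output_f

# Without -h, each -o match is printed as "filename:match"; sed turns
# the first ":" into "," to produce CSV rows.
grep -Pom 1 '(?<=^Estimated Ln Prob of Data = ).*' job_* \
  | sed 's/:/,/' > results.csv
cat results.csv
# job_01_01-output_f,-5570597.3
# job_02_01-output_f,-2834943326.2
```

The first-colon substitution is safe here because the values are plain negative numbers and contain no colons themselves.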
Another option using awk:
$ for f in job* ; do awk '/^Estimated Ln Prob of Data/{print $NF;exit}' "$f" ; done
and GNU awk:
$ awk '/^Estimated Ln Prob of Data/{print $NF;nextfile}' job_*
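The GNU awk version extends naturally if you also want each value tagged with its source file, since FILENAME holds the file currently being read. A sketch with a made-up sample file:

```shell
# Demo: prints "filename,value" rows; in practice run the one-liner in
# the directory containing the 400 job_* files.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n' > job_01_01-output_f
awk '/^Estimated Ln Prob of Data/{print FILENAME "," $NF; nextfile}' job_*
# job_01_01-output_f,-5570597.3
```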
A simple implementation in Python. Let me know if it works for you.
import glob
import os.path
import re
import uuid


def extract_data(source: str,
                 export: str = None,
                 nested: bool = False,
                 delimit: str = ",",
                 extract: str = "Estimated Ln Prob of Data") -> None:
    """
    Extracts the value of `Estimated Ln Prob of Data` from each file in
    `source` and writes the results to a single text file.

    Args:
        source: Directory which has `job_01_01-output_f` files.
        export: Path of the output file.
        nested: Boolean, if you want to search nested directories as well.
        delimit: Delimiter placed between the filename and the value.
        extract: Keyword whose respective value needs to be extracted.
    """
    regex = r"^\b{}\b.+$".format(extract)
    nest = "**" if nested else "*"
    values = []
    for file in glob.glob(f"{source}/{nest}", recursive=True):
        raw = os.path.basename(file)
        if raw.startswith("job_") and raw.endswith("-output_f"):
            with open(file, "r") as _file:
                match = re.search(regex, _file.read(), re.MULTILINE)
            if match:  # skip files without the keyword instead of crashing
                values.append(f"{raw}{delimit}{match.group().rsplit('= ')[-1]}\n")
    export = export if export else os.path.join(source, f"{uuid.uuid4()}.txt")
    with open(export, "w") as _file:
        _file.writelines(values)


# Where "/home/SOME_USER/Downloads" is the path where you have these 400 files.
extract_data("/home/SOME_USER/Downloads")
For your literal question, a batch solution:
(for /f "tokens=2 delims==" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a)>result.csv
In case you need the filenames too:
(for /f "tokens=1,3 delims=:=" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a,%%b)>result.csv
Firstly, I'm offering this answer to give additional options. I think the best answer is James Brown's grep solution, as learning to be proficient with grep will be a particularly useful skill. Stephan's solution is also handy if you think you might get stuck in a Windows environment, especially a minimal one that won't necessarily have PowerShell.
Here's an option in PowerShell:
Get-Content "job_01_01-output_f" | ForEach-Object { if ($_ -match "Estimated Ln Prob of Data * = * ([-.\d]+)") { $Matches[1]} }
And another option using sed:
sed -ne "s/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/gp" "job_01_01-output_f"
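To collect all 400 files into one file with this approach, the same substitution can be run per file in a loop, prefixing each value with its filename. A sketch assuming exactly one matching line per file and GNU sed (the sample file and the redirect target are illustrative):

```shell
# Demo: the loop writes "filename,value" rows, one per file.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n' > job_01_01-output_f
for f in job_*; do
  printf '%s,' "$f"
  sed -ne 's/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/p' "$f"
done
# job_01_01-output_f,-5570597.3
```

Redirect the loop's output (e.g. `done > results.csv`) to land everything in a single CSV.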