Extract data from multiple files (Structure outputs) and print to one file
Please, I need help with extracting values from 400 files. So far I have never done anything similar and I don't know where to start. Since I am not a programmer, I don't know which software would be good to use: R, SAS, Python, the command prompt, bash, or awk. I have some experience with data manipulation/management using SAS and R (mostly "regular" files with rows and columns), and I have run some applications from the command prompt and bash.
Best regards
[LINK - an example of Structure/file output][1]
https://www.dropbox.com/sh/idvoigkky7ldgb7/AAD5foVSKc5Ty6ijc08ge230a?dl=0
Using grep with a PCRE positive look-behind, and the data from the Dropbox link:
$ grep -Pohm 1 "(?<=^Estimated Ln Prob of Data = ).*" job_*
Output:
-5570597.3
-2834943326.2
Switches used:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
-h, --no-filename
Suppress the prefixing of file names on output.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line.
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines.
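Since the goal is to get everything into one file, the same grep can also keep the filenames: drop -h so each match is prefixed with its file, then convert that prefix into a CSV column. A sketch in a throwaway directory (the sample files and the name results.csv are just for illustration):

```shell
# Demo: the two sample files stand in for the real job_* outputs.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n'    > job_01_01-output_f
printf 'Estimated Ln Prob of Data = -2834943326.2\n' > job_02_01-output_f

# Without -h, each -o match is printed as "filename:match"; sed turns
# the first ":" into "," to produce CSV rows.
grep -Pom 1 '(?<=^Estimated Ln Prob of Data = ).*' job_* \
  | sed 's/:/,/' > results.csv
cat results.csv
# job_01_01-output_f,-5570597.3
# job_02_01-output_f,-2834943326.2
```

The first-colon substitution is safe here because the values are plain negative numbers and contain no colons themselves.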
Another option using awk:
$ for f in job* ; do awk '/^Estimated Ln Prob of Data/{print $NF;exit}' "$f" ; done
and GNU awk:
$ awk '/^Estimated Ln Prob of Data/{print $NF;nextfile}' job_*
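The GNU awk version extends naturally if you also want each value tagged with its source file, since FILENAME holds the file currently being read. A sketch with a made-up sample file:

```shell
# Demo: prints "filename,value" rows; in practice run the one-liner in
# the directory containing the 400 job_* files.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n' > job_01_01-output_f
awk '/^Estimated Ln Prob of Data/{print FILENAME "," $NF; nextfile}' job_*
# job_01_01-output_f,-5570597.3
```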
A simple implementation in Python. Let me know if it works for you.
import glob
import os.path
import re
import uuid


def extract_data(source: str,
                 export: str = None,
                 nested: bool = False,
                 delimit: str = ",",
                 extract: str = "Estimated Ln Prob of Data") -> None:
    """
    Extracts the value of `Estimated Ln Prob of Data` from each file in
    `source` and writes the results to a single text file.

    Args:
        source: Directory which has `job_01_01-output_f` files.
        export: Path of the output file.
        nested: Boolean, if you want to search nested directories as well.
        delimit: Delimiter placed between the filename and the value.
        extract: Keyword whose respective value needs to be extracted.
    """
    regex = r"^\b{}\b.+$".format(extract)
    nest = "**" if nested else "*"
    values = []
    for file in glob.glob(f"{source}/{nest}", recursive=True):
        raw = os.path.basename(file)
        if raw.startswith("job_") and raw.endswith("-output_f"):
            with open(file, "r") as _file:
                match = re.search(regex, _file.read(), re.MULTILINE)
            if match:  # skip files without the keyword instead of crashing
                values.append(f"{raw}{delimit}{match.group().rsplit('= ')[-1]}\n")
    export = export if export else os.path.join(source, f"{uuid.uuid4()}.txt")
    with open(export, "w") as _file:
        _file.writelines(values)


# Where "/home/SOME_USER/Downloads" is the path where you have these 400 files.
extract_data("/home/SOME_USER/Downloads")
For your literal question, a batch solution:
(for /f "tokens=2 delims==" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a)>result.csv
In case you need the filenames too:
(for /f "tokens=1,3 delims=:=" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a,%%b)>result.csv
Firstly, I'm offering this answer to give additional options. I think the best answer is James Brown's grep solution, as learning to be proficient with grep will be a particularly useful skill. Stephan's solution is also handy if you think you might get stuck in a Windows environment, especially a minimal one that won't necessarily have PowerShell.
Here's an option in PowerShell:
Get-Content "job_01_01-output_f" | ForEach-Object { if ($_ -match "Estimated Ln Prob of Data * = * ([-.\d]+)") { $Matches[1]} }
And another option using sed:
sed -ne "s/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/gp" "job_01_01-output_f"
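To collect all 400 files into one file with this approach, the same substitution can be run per file in a loop, prefixing each value with its filename. A sketch assuming exactly one matching line per file and GNU sed (the sample file and the redirect target are illustrative):

```shell
# Demo: the loop writes "filename,value" rows, one per file.
cd "$(mktemp -d)"
printf 'Estimated Ln Prob of Data = -5570597.3\n' > job_01_01-output_f
for f in job_*; do
  printf '%s,' "$f"
  sed -ne 's/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/p' "$f"
done
# job_01_01-output_f,-5570597.3
```

Redirect the loop's output (e.g. `done > results.csv`) to land everything in a single CSV.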