Extract data from multiple files (Structure output) and print to one file

Question

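Please, I need help extracting values from 400 files. I have never done anything like this before and I don't know where to start. Since I am not a programmer, I don't know which tool would be best: R, SAS, Python, the command prompt, bash, or awk. I have some experience with data manipulation/management in SAS and R (mostly "regular" files with rows and columns), and I have used the command prompt and bash to run some applications.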

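I am running Structure (population genetics software) on a cloud computing service.
The output is 400 files/runs. Their names are: job_01_01-output_f; job_01_02-output_f ... job_40_10-output_f
These outputs have no extension (such as .txt), but I usually open them with TextPad or Notepad++.
Each of these 400 files/outputs contains a line: Estimated Ln Prob of Data = -5570597.3
I want to extract the value -5570597.3 from all of these files/outputs and save it as a column in a .csv, .txt, or similar file (one value below the other, in the same order as the files).
Also, this line is not always at the same line number in every file, because its position depends on the number of "parameters".
So I imagine that taking the value after "Estimated Ln Prob of Data =" would be an option.
For scale, one file/output has about 60,000 lines. The files range in size from 800 KB to 5 MB.
I will try to upload a file/output as an example.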

Regards

[Link - Structure output file example][1]

https://www.dropbox.com/sh/idvoigkky7ldgb7/AAD5foVSKc5Ty6ijc08ge230a?dl=0

Answer 1

Using grep with PCRE for a positive lookbehind, and the data from the Dropbox link:

$ grep -Pohm 1 "(?<=^Estimated Ln Prob of Data   = ).*" job_*

Output:

-5570597.3
-2834943326.2

Switches used:

-P, --perl-regexp
          Interpret PATTERNS as Perl-compatible regular expressions (PCREs).

-h, --no-filename
          Suppress the prefixing of file names on output.

-o, --only-matching
          Print only the matched (non-empty) parts of a matching line

-m NUM, --max-count=NUM
          Stop reading a file after NUM matching lines.
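
For readers who have Python but not GNU grep, the lookbehind idea can be sketched with the standard `re` module. This is an illustration only, not part of the original command: `extract_first` and the sample text are hypothetical, and the three spaces before `=` are copied from the pattern above.

```python
import re

# Same idea as the grep pattern: the lookbehind asserts that the prefix
# precedes the match without including it in the result.
# re.MULTILINE makes ^ match at the start of every line.
PATTERN = re.compile(r"(?<=^Estimated Ln Prob of Data   = ).*", re.MULTILINE)


def extract_first(text):
    """Return the first value after the prefix, or None (like grep -m 1)."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical excerpt of a Structure output file.
sample = (
    "Run parameters:\n"
    "Estimated Ln Prob of Data   = -5570597.3\n"
    "Some other line = 1.0\n"
)
print(extract_first(sample))
```

Because the lookbehind needs the exact prefix, a file that uses a different number of spaces before `=` would not match this pattern.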

Another one using awk:

$ for f in job* ; do awk '/^Estimated Ln Prob of Data/{print $NF;exit}' "$f" ; done

And with GNU awk:

$ awk '/^Estimated Ln Prob of Data/{print $NF;nextfile}' job_*
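
The same logic can be written as a short Python sketch for readers more comfortable with that language. It mirrors the awk one-liners: test whether a line starts with the label, take the last whitespace-separated field (awk's `$NF`), and stop reading that file (awk's `exit` or `nextfile`). The function names and the `job_*` glob pattern are assumptions for illustration.

```python
import glob

LABEL = "Estimated Ln Prob of Data"


def last_field(path):
    """Return the last field of the first line starting with LABEL, or None."""
    with open(path, "r") as handle:
        for line in handle:  # reads line by line, not the whole file
            if line.startswith(LABEL):
                # Like awk's $NF; returning here stops reading the file.
                return line.split()[-1]
    return None


def collect(pattern="job_*"):
    """Return (path, value) pairs in sorted path order."""
    return [(path, last_field(path)) for path in sorted(glob.glob(pattern))]


if __name__ == "__main__":
    for _, value in collect():
        print(value)
```

Sorting the paths keeps the values in file-name order, which works here because the job numbers are zero-padded.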

Answer 2

A simple Python implementation. Please let me know whether this works for you.

import glob
import os.path
import re
import uuid
from typing import Optional


def extract_data(source: str,
                 export: Optional[str] = None,
                 nested: bool = False,
                 delimit: str = ",",
                 extract: str = "Estimated Ln Prob of Data") -> None:
  """
  Extracts values of `Estimated Ln Prob of Data` from source and exports
  them to a text file, one `file name, value` pair per line.

  Args:
    source: Directory which has `job_01_01-output_f` files.
    export: Path of the output file. Defaults to a randomly named .txt
      file inside `source`.
    nested: Boolean, if you want to search subdirectories as well.
    delimit: Separator written between the file name and the value.
    extract: Keyword whose respective value needs to be extracted.
  """
  # Match the keyword at the start of a line and capture what follows "=".
  regex = re.compile(r"^{}\s*=\s*(\S+)".format(re.escape(extract)),
                     re.MULTILINE)
  nest = "**" if nested else "*"
  values = []

  # Sort the paths so the output follows the order of the files.
  for file in sorted(glob.glob(f"{source}/{nest}", recursive=True)):
    raw = os.path.basename(file)
    if raw.startswith("job_") and raw.endswith("-output_f"):
      with open(file, "r") as _file:
        match = regex.search(_file.read())
      # Skip files that lack the keyword instead of raising IndexError.
      if match:
        values.append(f"{raw}{delimit}{match.group(1)}\n")

  export = export if export else os.path.join(source, f"{uuid.uuid4()}.txt")
  with open(export, "w") as _file:
    _file.writelines(values)


# Where "/home/SOME_USER/Downloads" is the path where you have these 400 files.
extract_data("/home/SOME_USER/Downloads")

Answer 3

In batch, for your literal question:

(for /f "tokens=2 delims==" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a)>result.csv

If you need the file names too:

(for /f "tokens=1,3 delims=:=" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a,%%b)>result.csv
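
To see why `tokens=1,3 delims=:=` picks out the file name and the value, here is a rough Python sketch of how `for /f` splits a line. It assumes that `findstr` prefixes each match with `filename:` when it searches several files, and that a run of consecutive delimiters counts as one. `split_tokens` is a hypothetical helper, not a Windows API.

```python
import re


def split_tokens(line, delims):
    """Split on any character in delims, treating runs of delimiters as one."""
    parts = re.split("[" + re.escape(delims) + "]+", line)
    # Drop empty strings produced by leading or trailing delimiters.
    return [part for part in parts if part]


# A line as findstr would print it when searching more than one file.
line = "job_01_01-output_f:Estimated Ln Prob of Data   = -5570597.3"
tokens = split_tokens(line, ":=")

# tokens=1,3 means the first and third token (1-based in batch, 0-based here).
file_name, value = tokens[0], tokens[2]
print(file_name + "," + value.strip())
```

In the batch version the value keeps the space that follows `=`, because a space is not among the listed delimiters; the sketch strips it for a cleaner CSV.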

Answer 4

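First, I am offering this answer to provide other options. I think the best answer is James Brown's grep solution, because becoming proficient with grep is a particularly useful skill. If you think you may be stuck in a Windows environment, Stephan's solution is also handy, especially if you are in a minimal environment that does not necessarily have PowerShell.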

Here is an option in PowerShell:

Get-Content "job_01_01-output_f" | ForEach-Object { if ($_ -match "Estimated Ln Prob of Data * = * ([-.\d]+)") { $Matches[1]} }

Another option uses sed:

sed -ne "s/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/gp" "job_01_01-output_f"
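
Both commands rely on a capture group rather than a lookbehind, and the sed expression tolerates any number of spaces around `=`. A minimal Python sketch of the same pattern, looped over every output file instead of a single one, might look like this. The `job_*-output_f` glob, the function names, and the `result.csv` file name are assumptions for illustration.

```python
import glob
import re

# Same shape as the sed expression: optional spaces around "=", then a
# run of digits, dots and minus signs captured as group 1.
VALUE_RE = re.compile(r"Estimated Ln Prob of Data *= *([-.0-9]+)")


def find_value(text):
    """Return the first captured value in text, or None if absent."""
    match = VALUE_RE.search(text)
    return match.group(1) if match else None


def write_column(pattern="job_*-output_f", out_path="result.csv"):
    """Write one value per line, in sorted file-name order."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, "r") as handle:
                value = find_value(handle.read())
            if value is not None:
                out.write(value + "\n")
```

Calling `write_column()` from the directory that holds the outputs would write the values as a single column; files without the line are skipped, so a missing value shows up as a shorter column rather than an error.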

4 answers

Answer 1
2 Accepted 2020-10-06 20:47:59

Answer 2
0 2020-10-06 20:02:58

Answer 3
0 2020-10-06 20:05:20

Answer 4
0 2020-10-06 21:45:23
