
Extract data from multiple files (Structure outputs) and print to one file

Please, I need help with extracting values from 400 files. I have never done anything similar and I don't know where to start. Since I am not a programmer, I don't know which tool would be good to use: R, SAS, Python, command prompt, bash, awk. I have some experience with data manipulation/management using SAS and R (mostly "regular" files with rows and columns), and with running some applications from the command prompt and bash.

  1. I ran Structure (software for population genetics) on cloud computing.
  2. The output was 400 files (runs). Their names are: job_01_01-output_f; job_01_02-output_f … job_40_10-output_f
  3. These outputs don't have any extension (like .txt), but I normally open them using Textpad or Notepad++.
  4. In each of these 400 files there is a line: Estimated Ln Prob of Data = -5570597.3
  5. I would like to extract the numeric value (here -5570597.3) from all these files and save the values into a .csv or .txt file as a column (one under another, in the same order as the files).
  6. Also, this line is not always at the same line number in every file, because its position depends on the number of "parameters".
  7. So I guess something like "take the value that comes after 'Estimated Ln Prob of Data ='" would be an option.
  8. For example, one file has around 60000 lines. The file sizes range from 800 KB to 5 MB.
  9. I will try to upload an example file.

Best regards

[LINK - an example of Structure/file output][1]

https://www.dropbox.com/sh/idvoigkky7ldgb7/AAD5foVSKc5Ty6ijc08ge230a?dl=0

Using grep with a PCRE positive look-behind, on the data from the Dropbox link:

$ grep -Pohm 1 "(?<=^Estimated Ln Prob of Data   = ).*" job_*

Output:

-5570597.3
-2834943326.2

Used switches:

-P, --perl-regexp
          Interpret PATTERNS as Perl-compatible regular expressions (PCREs).

-h, --no-filename
          Suppress the prefixing of file names on output.

-o, --only-matching
          Print only the matched (non-empty) parts of a matching line

-m NUM, --max-count=NUM
          Stop reading a file after NUM matching lines.
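
For readers more comfortable in Python, the same look-behind can be tried on a string before running it over the real files. This is only a sketch: the sample text is made up for illustration, and the names `first_value` and `first_value_flexible` are hypothetical. The second pattern is an assumption of mine rather than part of the grep answer; it uses a capture group so the match does not depend on the exact number of spaces before the `=`.

```python
import re

# Made-up sample; only the "Estimated Ln Prob" line follows the format
# quoted in the question.
SAMPLE = (
    "Run parameters:\n"
    "Estimated Ln Prob of Data   = -5570597.3\n"
    "Some other line = 1.0\n"
)

# Same look-behind as the grep pattern: match only what follows the label.
LOOKBEHIND = re.compile(r"(?<=^Estimated Ln Prob of Data   = ).*", re.MULTILINE)

# Alternative (an assumption, not from the answer): a capture group that
# tolerates any number of spaces or tabs around "=".
FLEXIBLE = re.compile(r"^Estimated Ln Prob of Data[ \t]*=[ \t]*(\S+)", re.MULTILINE)


def first_value(text):
    """Return only the matched part of the first match (like -o -m 1), or None."""
    match = LOOKBEHIND.search(text)
    return match.group(0) if match else None


def first_value_flexible(text):
    """Same idea, but robust to spacing differences around the equals sign."""
    match = FLEXIBLE.search(text)
    return match.group(1) if match else None


print(first_value(SAMPLE))
print(first_value_flexible("Estimated Ln Prob of Data = -1.5\n"))
```

`search` stops at the first match, which plays the role of `-m 1`; returning the matched text instead of the whole line plays the role of `-o`.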

Another using awk:

$ for f in job* ; do awk '/^Estimated Ln Prob of Data/{print $NF;exit}' "$f" ; done

and GNU awk:

$ awk '/^Estimated Ln Prob of Data/{print $NF;nextfile}' job_*
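
The awk approach (print the last field of the first matching line, then move on to the next file) can be sketched in Python as well, which also covers writing the values to a file as a single column. The function names `first_last_field` and `collect` and the output name `result.txt` are hypothetical, and the sketch assumes the files sit in one directory and match `job_*`.

```python
from pathlib import Path

LABEL = "Estimated Ln Prob of Data"


def first_last_field(path, label=LABEL):
    """Return the last whitespace-separated field of the first line that
    starts with `label`, or None if the file has no such line."""
    with open(path, "r") as handle:
        for line in handle:          # stream line by line, like awk
            if line.startswith(label):
                return line.split()[-1]   # awk's $NF; returning stops reading
    return None


def collect(directory, pattern="job_*", out_name="result.txt"):
    """Write one value per line, in sorted file-name order, and return them."""
    directory = Path(directory)
    values = []
    for path in sorted(directory.glob(pattern)):
        if not path.is_file():
            continue
        value = first_last_field(path)
        if value is not None:        # files without the line are skipped
            values.append(value)
    (directory / out_name).write_text("".join(v + "\n" for v in values))
    return values
```

Calling `collect(".")` in the directory holding the outputs would write the column to `result.txt` there. Because each file is read only up to the first match, a 60000-line output is never loaded into memory in full.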

A simple implementation in Python. Let me know if it works for you.

import glob
import os.path
import re
import uuid
from typing import Optional


def extract_data(source: str,
                 export: Optional[str] = None,
                 nested: bool = False,
                 delimit: str = ",",
                 extract: str = "Estimated Ln Prob of Data") -> None:
  """
  Extracts values of `Estimated Ln Prob of Data` from source and exports
  it in a text file.

  Args:
    source: Directory which has `job_01_01-output_f` files.
    export: Path of the output file.
    nested: Boolean, if you want to use nested files as well.
    delimit: Separator written between the file name and the value.
    extract: Keyword whose respective value needs to be extracted.
  """
  regex = re.compile(r"^{}\b.+$".format(re.escape(extract)), re.MULTILINE)
  nest = "**" if nested else "*"
  values = []

  # Sort so the rows follow the file order (job_01_01, job_01_02, ...).
  for file in sorted(glob.glob(f"{source}/{nest}", recursive=True)):
    raw = os.path.basename(file)
    if raw.startswith("job_") and raw.endswith("-output_f"):
      with open(file, "r") as _file:
        match = regex.search(_file.read())
      if match is None:
        # Skip files without the line instead of raising IndexError.
        continue
      value = match.group().rsplit("=")[-1].strip()
      values.append(f"{raw}{delimit}{value}\n")

  # Default to a uniquely named text file inside the source directory.
  export = export if export else os.path.join(source, f"{uuid.uuid4()}.txt")
  with open(export, "w") as _file:
    _file.writelines(values)


# Where "/home/SOME_USER/Downloads" is the path where you have these 400 files.
extract_data("/home/SOME_USER/Downloads")

Batch, for your literal question:

(for /f "tokens=2 delims==" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a)>result.csv

In case you need the filenames too:

(for /f "tokens=1,3 delims=:=" %%a in ('findstr /c:"Estimated Ln Prob of Data" "job_??_??-output_f"') do echo %%a,%%b)>result.csv

Firstly, I'm offering this answer to give additional options. I think the best answer is James Brown's grep solution, as learning to be proficient with grep will be a particularly useful skill. Stephan's solution is also handy if you think you might get stuck in a Windows environment, especially if you're in a minimal one that won't necessarily have PowerShell.

Here's an option in PowerShell:

Get-Content "job_01_01-output_f" | ForEach-Object { if ($_ -match "Estimated Ln Prob of Data * = * ([-.\d]+)") { $Matches[1]} }

And another option using sed:

sed -ne "s/Estimated Ln Prob of Data *= *\([-.0-9]\+\)/\1/gp" "job_01_01-output_f"
