Reading content from multiple text files

Question

I'm looking for help with the following:

I have a directory full of text files that are named with a numerical ID. Each text file contains the body of a news article. Some news articles are split into several parts, so they are spread across different text files.

The names look like this:

1001_1.txt, 1001_2.txt   (these files contain two different parts of the same article)
1002_1.txt
1003_1.txt
1004_1.txt, 1004_2.txt, 1004_3.txt, 1004_4.txt   (these files contain four different parts of the same article; there are never more than 4 parts)

and so on.

Basically, I need a script (PHP, Perl, Ruby or otherwise) that puts the name of the text file (the part before the underscore) in one column, the number after the underscore, if there is one, in a second column, and the content of the text file in a third column.

So you would have a table structure looking like this:

    1001 | 1 | content of the text file
    1001 | 2 | content of the text file
    1002 | 1 | content of the text file
    1003 | 1 | content of the text file

Any help on how I can accomplish this would be appreciated.

There are about 7000 text files that need to be read and imported into a table for future use in a database.

It would be even better if the contents of the _1, _2, etc. files could be put in separate columns, e.g.:

    1001 | 1 | content | 2 | content | 3 | content | 4 | content
    1002 | 1 | content
    1003 | 1 | content

(As I said, the file names go up to _4 at most, so you could have 1001_1.txt, 1001_2.txt, 1001_3.txt and 1001_4.txt, or only 1002_1.txt and 1003_1.txt.)

Answer 1

This is fairly straightforward with File::Find and File::Slurp:

#!/usr/bin/perl

use strict;
use warnings;

use File::Find;
use File::Slurp;

die "Need somewhere to start\n" unless @ARGV;

my %files;
find(\&wanted, @ARGV);

for my $name (sort keys %files) {
    my $file = $files{$name};
    print join( ' | ', $name,
        map { exists $file->{$_} ? ($_, $file->{$_}) : () } 1 .. 4
    ), "\n";
}

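sub wanted {
    # File::Find changes into each directory by default, so test and read
    # the bare file name in $_ rather than the path in $File::Find::name
    my $file = $_;
    return unless -f $file;
    # anchored so that only names such as 1001_1.txt match
    return unless $file =~ /^([0-9]+)_([1-4])\.txt\z/;
    # I do not know what you want to do with newlines;
    # here they are replaced by a literal backslash-n
    $files{$1}->{$2} = join('\n', map { chomp; $_ } read_file $file);
    return;
}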

Output:

1001 | 1 | lsdkjv\nsdfljk\nsdklfjlksjadf\nlsdjflkjdsf | 3 | sadlfkjldskfj
1002 | 1 | ldskfjsdlfjkl

Answer 2

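# Run with the text files as arguments, e.g. *.txt
use strict;
use warnings;

my %content;    # $content{article id}{part number} = text

while (<>) {
    # $ARGV holds the name of the file currently being read
    my ($f, $n) = $ARGV =~ /(\d+)_(\d)\.txt$/;
    next unless defined $f;    # skip files that do not follow the naming scheme
    s/\s+/ /g;                 # collapse whitespace, including the newline
    $content{$f}{$n} .= $_;
}

for my $f (sort { $a <=> $b } keys %content) {
    print join('|',
        $f,
        map { $_ => $content{$f}{$_} } sort keys %{ $content{$f} },
    ), "\n";
}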
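
If the rows are going to be imported into a database table, it may help for every row to have the same number of columns. The following is only a sketch under that assumption: a hypothetical helper, padded_row, that takes an article ID and a hash of its parts (the same shape as $content{$f} above) and always emits four part columns, leaving the content empty for missing parts.

```perl
use strict;
use warnings;

# Hypothetical helper: build one row with exactly four part columns,
# using an empty string as the content of any part that does not exist.
sub padded_row {
    my ($id, $parts) = @_;
    return join ' | ', $id,
        map { ($_, defined $parts->{$_} ? $parts->{$_} : '') } 1 .. 4;
}

# Example: an article that only has parts 1 and 3
print padded_row(1001, { 1 => 'intro', 3 => 'ending' }), "\n";
```

In the script above it could be called as padded_row($f, $content{$f}) inside the final loop.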

Answer 3

Probably not optimal, but it could be your starting point (deliberately over-commented):

#!/usr/bin/perl

use strict;
use warnings;

# results hash
my %res = ();

# for each .txt file
for (glob '*.txt') {
    s/\.txt$//; # replace suffix .txt by nothing
    my $t = ''; # buffer for the file contents
    my($f, $n) = split '_'; # cut the file name ex. 1001_1 => 1001 and 1
    next unless defined $n; # skip names without an underscore

    # read the file contents
    {
        local $/; # slurp mode
        open(my $F, '<', $_ . '.txt') || die "$_.txt: $!"; # open the txt file
        $t = <$F>; # get contents
        close($F); # close the text file
    }

    # transform each \r, \n and \t into a space
    $t =~ s/[\r\n\t]/ /g;
    # start each row with the article id, once
    $res{$f} = "$f | " unless exists $res{$f};
    # appends for example 2 | contents of 1001_2.txt to the results hash
    $res{$f} .= "$n | $t | ";
}

# print the results
for (sort { $a <=> $b } keys %res) {
    # remove the trailing ' | '
    $res{$_} =~ s/\s\|\s$//;
    # print
    print $res{$_} . "\n";
}

# happy ending
exit 0;

3 answers

Answer 1 (2, accepted, 2009-11-05 14:27:28)
Answer 2 (1, 2009-11-05 14:36:24)
Answer 3 (0)
