Reading content from multiple text files

Looking for help in doing this:

I have a directory full of text files that are named with a numerical ID. Each text file contains the body of a news article. Some news articles are split into several parts, so they span several text files.

The names look like this:

1001_1.txt, 1001_2.txt (these files contain two different parts of the same article)
1002_1.txt
1003_1.txt
1004_1.txt, 1004_2.txt, 1004_3.txt, 1004_4.txt (these files contain four different parts of the same article; there are never more than 4 parts)

and so on.

Basically, I need a script (PHP, Perl, Ruby or otherwise) that puts the part of the file name before the underscore in one column, the number after the underscore (if there is one) in a second column, and the content of the text file in a third column.

So you would have a table structure looking like this:

    1001 | 1 | content of the text file
    1001 | 2 | content of the text file
    1002 | 1 | content of the text file
    1003 | 1 | content of the text file

Any help on how I can accomplish this would be appreciated.

There are about 7000 text files that need to be read and imported into a table for later use in a database.

It would be even better if the content of the _1, _2, ... files could be placed in separate columns, e.g.:

    1001 | 1 | content | 2 | content | 3 | content | 4 | content
    1002 | 1 | content
    1003 | 1 | content

(As I said, the file names go up to _4 at most, so you could have 1001_1.txt, 1001_2.txt, 1001_3.txt and 1001_4.txt, or only 1002_1.txt and 1003_1.txt.)

This is fairly straightforward with File::Find and File::Slurp:

#!/usr/bin/perl

use strict;
use warnings;

use File::Find;
use File::Slurp;

die "Need somewhere to start\n" unless @ARGV;

my %files;
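# no_chdir keeps the working directory fixed, so the path in
# $File::Find::name stays valid even for relative start directories
find({ wanted => \&wanted, no_chdir => 1 }, @ARGV);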

for my $name (sort keys %files) {
    my $file = $files{$name};
    print join( ' | ', $name,
        map { exists $file->{$_} ? ($_, $file->{$_}) : () } 1 .. 4
    ), "\n";
}

sub wanted {
    my $file = $File::Find::name;
    return unless -f $file;
    return unless $file =~ /([0-9]{4})_([1-4])\.txt$/;
    # I do not know what you want to do with newlines; here they are
    # replaced by a literal \n so that each record stays on one line
    $files{$1}->{$2} = join('\n', map { chomp; $_ } read_file $file);
    return;
}

Output:

1001 | 1 | lsdkjv\nsdfljk\nsdklfjlksjadf\nlsdjflkjdsf | 3 | sadlfkjldskfj
1002 | 1 | ldskfjsdlfjkl
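
The grouping step used above (split each file name into ID and part number, then collect the parts under their ID) can also be sketched in Python, since the question allows any language. This is only an illustration under assumptions: the helper names group_parts and wide_rows are made up for this example, the input is an in-memory dict rather than a directory, and whitespace is collapsed to single spaces instead of being kept as a literal \n.

```python
import re
from collections import defaultdict

# Same ID_part.txt naming scheme as in the question; fullmatch() below
# requires the whole file name to fit the pattern.
NAME_RE = re.compile(r"(\d+)_([1-4])\.txt")


def group_parts(files):
    """Map article ID -> {part number -> content} for names like 1001_2.txt."""
    articles = defaultdict(dict)
    for name, content in files.items():
        match = NAME_RE.fullmatch(name)
        if match is None:
            continue  # skip names that do not follow the ID_part.txt scheme
        article_id, part = match.group(1), int(match.group(2))
        # Collapse all whitespace runs so each record stays on one line.
        articles[article_id][part] = " ".join(content.split())
    return articles


def wide_rows(articles):
    """Build one 'ID | part | content | ...' line per article."""
    rows = []
    for article_id in sorted(articles, key=int):
        fields = [article_id]
        for part in sorted(articles[article_id]):
            fields += [str(part), articles[article_id][part]]
        rows.append(" | ".join(fields))
    return rows


if __name__ == "__main__":
    sample = {
        "1001_1.txt": "first part\nof the article",
        "1001_3.txt": "third part",
        "1002_1.txt": "a single part",
        "notes.txt": "ignored",
    }
    for row in wide_rows(group_parts(sample)):
        print(row)
```

Missing parts (here 1001_2) are simply left out of the row, as in the Perl output above.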
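
Alternatively, pass the files on the command line (for example as perl script.pl *.txt, where script.pl is whatever you name the file) and read them with the <> operator:

use strict;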
use warnings;
my %content;

while (<>){
    s/\s+/ /g;
    my ($f, $n) = $ARGV =~ /(\d+)_(\d)\.txt$/;
    $content{$f}{$n} .= $_;
}

for my $f (sort keys %content){
    print join('|',
        $f,
        map { $_ => $content{$f}{$_} } sort keys %{$content{$f}},
    ), "\n";
}

Probably not optimal, but it could be your starting point (over-commented on purpose):

#!/usr/bin/perl

use strict;
use warnings;

# results hash
my %res = ();

# foreach .txt files
for (glob '*.txt') {
    s/\.txt$//; # replace suffix .txt by nothing
    my $t = ''; # buffer for the file contents
    my($f, $n) = split '_'; # cut the file name ex. 1001_1 => 1001 and 1

    # read the file contents
    {
        local $/; # slurp mode
        open(my $F, '<', "$_.txt") or die "Cannot open $_.txt: $!"; # open the txt file for reading
        $t = <$F>; # get contents
        close($F); # close the text file
    }

    # transform \r, \n and \t into one space
    $t =~ s/[\r\n\t]/ /g;
    # appends for example "2 | contents of 1001_2.txt | " to the entry for 1001
    $res{$f} .= "$n | $t | ";
}

# print the results
for (sort { $a <=> $b } keys %res) {
    # remove the trailing ' | '
    $res{$_} =~ s/\s\|\s$//;
    # print the ID once, followed by all of its parts
    print "$_ | $res{$_}\n";
}

# happy ending
exit 0;
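
Since the table is meant to end up in a database, a pipe-delimited line can be ambiguous when an article itself contains " | " or line breaks. Below is a minimal sketch in Python (not Perl, as the question allows any language) that writes a quoted CSV file instead, with one row per text file. The function name export_csv and the column names id, part and content are assumptions made for this example, and the files are assumed to be UTF-8 encoded.

```python
import csv
import re
import sys
from pathlib import Path

# Same ID_part.txt naming scheme as in the question.
NAME_RE = re.compile(r"(\d+)_([1-4])\.txt")


def export_csv(src_dir, out_path):
    """Write one (id, part, content) row per matching file; return the row count."""
    rows = []
    for path in Path(src_dir).glob("*.txt"):
        match = NAME_RE.fullmatch(path.name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        content = path.read_text(encoding="utf-8", errors="replace")
        rows.append((int(match.group(1)), int(match.group(2)), content))
    rows.sort()  # numeric order by ID, then by part number
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)  # quotes fields containing commas or newlines
        writer.writerow(["id", "part", "content"])
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Expects two arguments: the source directory and the output CSV path.
    if len(sys.argv) == 3:
        print(export_csv(sys.argv[1], sys.argv[2]), "rows written")
```

Most databases can bulk-load a file like this directly, and the wide layout with up to four content columns can then be produced with a query if it is still needed.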
