Reading content from Multiple Text Files
I'm looking for help with the following.
I have a directory full of text files that are named with a numerical ID. Each text file contains the body of a news article. Some news articles are split into several parts, so they span several text files.
The names look like this:
1001_1.txt, 1001_2.txt (these files contain two different parts of the same article), 1002_1.txt, 1003_1.txt, 1004_1.txt, 1004_2.txt, 1004_3.txt, 1004_4.txt (these files contain four different parts of the same article; there are never more than 4 parts),
and so on.
Basically, I need a script (PHP, Perl, Ruby or otherwise) that puts the name of the text file (the part before the underscore) in one column, the number after the underscore, if there is one, in a second column, and the content of the text file in a third column.
So you would have a table structure looking like this:
1001 | 1 | content of the text file
1001 | 2 | content of the text file
1002 | 1 | content of the text file
1003 | 1 | content of the text file
Any help on how I can accomplish this would be appreciated.
There are about 7000 text files that need to be read and imported into a table for future use in a database.
It would be even better if the content of the _1, _2, etc. files could be placed in separate columns, e.g.:
1001 | 1 | content | 2 | content | 3 | content | 4 | content
1002 | 1 | content
1003 | 1 | content
(Like I said, the file names go up to _4 at most, so you could have 1001_1.txt, 1001_2.txt, 1001_3.txt and 1001_4.txt, or only 1002_1.txt and 1003_1.txt.)
This is fairly straightforward with File::Find and File::Slurp:
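    #!/usr/bin/perl
    use strict;
    use warnings;

    use File::Find;
    use File::Slurp;

    die "Need somewhere to start\n" unless @ARGV;

    # $files{ID}{PART} holds the content of ID_PART.txt
    my %files;
    # no_chdir keeps $File::Find::name usable when a relative start directory is given
    find({ wanted => \&wanted, no_chdir => 1 }, @ARGV);

    for my $name (sort keys %files) {
        my $file = $files{$name};
        # emit "part | content" pairs only for the parts that exist
        print join( ' | ', $name,
            map { exists $file->{$_} ? ($_, $file->{$_}) : () } 1 .. 4
        ), "\n";
    }

    sub wanted {
        my $file = $File::Find::name;
        return unless -f $file;
        return unless $file =~ /([0-9]{4})_([1-4])\.txt$/;
        # I do not know what you want to do with newlines,
        # so each one becomes a literal backslash-n (note the single quotes)
        $files{$1}->{$2} = join('\n', map { chomp; $_ } read_file $file);
        return;
    }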
Output:
1001 | 1 | lsdkjv\nsdfljk\nsdklfjlksjadf\nlsdjflkjdsf | 3 | sadlfkjldskfj
1002 | 1 | ldskfjsdlfjkl
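In the output above, article 1001 has parts 1 and 3 but no part 2, so its row has fewer columns than a complete article would. If the rows are headed for a table with a fixed number of columns, it may be easier to always emit all four part slots and leave the missing ones empty. A minimal sketch of that idea follows; the helper name `wide_row` is hypothetical and the sample texts are made up for illustration.

```perl
use strict;
use warnings;

# Build one pipe-separated row with exactly four part slots.
# $parts is a hash reference mapping a part number (1..4) to its text;
# parts that are missing produce an empty field.
sub wide_row {
    my ($id, $parts) = @_;
    return join ' | ', $id,
        map { ($_, defined $parts->{$_} ? $parts->{$_} : '') } 1 .. 4;
}

# Example: an article with parts 1 and 3 only (made-up content)
print wide_row(1001, { 1 => 'first part', 3 => 'third part' }), "\n";
```

Every row then has the same nine fields (the ID plus four number/content pairs), which tends to be simpler to import than rows of varying width.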
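    use strict;
    use warnings;

    # Usage: perl script.pl *.txt
    # <> reads every file named on the command line; $ARGV is the current file name
    my %content;
    while (<>) {
        s/\s+/ /g;                 # collapse whitespace, including the line ending
        my ($f, $n) = $ARGV =~ /(\d+)_(\d)\.txt$/;
        next unless defined $f;    # skip files not named like ID_PART.txt
        $content{$f}{$n} .= $_;
    }

    for my $f (sort keys %content) {
        print join('|',
            $f,
            map { $_ => $content{$f}{$_} } sort keys %{ $content{$f} },
        ), "\n";
    }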
Probably not optimal, but it could be a starting point (heavily commented on purpose):
    #!/usr/bin/perl
    use strict;
    use warnings;

    # results hash
    my %res = ();

    # for each .txt file
    for (glob '*.txt') {
        s/\.txt$//;                 # replace suffix .txt by nothing
        my $t = '';                 # buffer for the file contents
        my ($f, $n) = split '_';    # cut the file name, e.g. 1001_1 => 1001 and 1

        # read the file contents
        {
            local $/;               # slurp mode
            open(my $F, '<', "$_.txt") or die "$_.txt: $!";    # open the txt file
            $t = <$F>;              # get contents
            close($F);              # close the text file
        }

        # transform each \r, \n and \t into a space
        $t =~ s/[\r\n\t]/ /g;

        # appends for example 1001 | 2 | contents of 1001_2.txt to the results hash
        $res{$f} .= "$f | $n | $t | ";
    }

    # print the results
    for (sort { $a <=> $b } keys %res) {
        # remove the trailing ' | '
        $res{$_} =~ s/\s\|\s$//;
        # print
        print $res{$_} . "\n";
    }

    # happy ending
    exit 0;
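Since the files are meant to end up in a database, a one-row-per-file CSV (the first layout asked for in the question) may be easier to import than pipe-separated lines, because article text can itself contain a `|`. The following is only a sketch under stated assumptions: the helper names `parse_name`, `csv_field` and `rows_for` are hypothetical, file names are assumed to follow the `ID_PART.txt` pattern with parts 1 to 4, and runs of whitespace (including newlines) are collapsed to a single space. It uses core Perl only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split "1001_2.txt" into (1001, 2); return an empty list for any other name.
sub parse_name {
    my ($name) = @_;
    return $name =~ /\A(\d+)_([1-4])\.txt\z/ ? ($1, $2) : ();
}

# Quote one CSV field: wrap it in double quotes and double embedded quotes.
sub csv_field {
    my ($text) = @_;
    $text =~ s/"/""/g;
    return qq{"$text"};
}

# Return one "id,part,content" line per matching file in $dir,
# ordered numerically by ID and then by part number.
sub rows_for {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @found;
    for my $name (readdir $dh) {
        my ($id, $part) = parse_name($name) or next;
        push @found, [ $id, $part, $name ];
    }
    closedir $dh;

    my @rows;
    for my $f (sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @found) {
        my ($id, $part, $name) = @$f;
        open(my $fh, '<', "$dir/$name") or die "Cannot read $dir/$name: $!";
        my $text = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        $text = '' unless defined $text;
        $text =~ s/\A\s+|\s+\z//g;            # trim both ends
        $text =~ s/\s+/ /g;                   # collapse newlines, tabs, spaces
        push @rows, join ',', $id, $part, csv_field($text);
    }
    return @rows;
}

# When run as a script, print the rows for the given directory (default: .)
unless (caller) {
    print "$_\n" for rows_for(@ARGV ? $ARGV[0] : '.');
}
```

Run it as `perl to_csv.pl /path/to/articles > articles.csv` (the script name is just an example). For anything beyond a quick import, a dedicated CSV module is likely a safer choice than the hand-rolled quoting here.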