[英]Is there a Perl module for parsing columnar text?
假設我有一個制表符分隔的文本文件,其中包含按列排列的數據(帶標題)。
不同的列可能會被“堆疊”成類似“工作表”的排列,即有一些分隔符(可能提前知道也可能不知道)允許垂直排列不同的列。
是否有一個 Perl 模塊可以幫助將此文本文件中的列數據解析為數據結構(例如,一個哈希表,其鍵是列標題,值是一個列數據標量數組)?
編輯通過“堆疊”,我的意思是一列文本可能包含多個單獨的數據“向量”,每個向量具有不同的標題和不同的長度。 誠然,這使解析變得復雜。
編輯老實說,我不確定混亂在哪里。 盡管如此,這里有一個例子:
header_one\theader_three
data_1\tdata_7
data_2\tdata_8
data_3\tdata_9
\tdata_10
header_two\tdata_11
data_4\theader_four
data_5\tdata_12
data_6\tdata_13
\tdata_14
該腳本會將其轉換為具有四個鍵的哈希表: header_one
、 header_two
、 header_three
和header_four
,每個鍵引用一個數組引用,該數組引用指向標頭下方的data_n
元素。
如果可能,我會從DBD::CSV開始,盡管您的“堆疊”要求(我不完全理解)可能需要使用Text::CSV_XS進行一些手動解析。
不要被它們的名字所迷惑——它們可以使用任何分隔符進行解析,而不僅僅是逗號。
我認為這與你所說的很接近。 如果列數發生變化,則輸入將被視為不同的表。 可以輕松修改此代碼以識別其他一些標記(例如一行等號),而不是使用列數。
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#setup the parser, here we want tab separated and we allow
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is
#("foo", "bar\tbaz", "quux")
my $p = Text::CSV_XS->new(
{
sep_char => "\t",
allow_loose_quotes => 1,
}
);
my @stacked;
my $cur = 0;
while (<>) {
$p->parse($_) or die $p->error_input;
my @rec = $p->fields;
#normal case, just add the record to the last
#section in @stacked
if (@rec == $cur) {
push @{$stacked[-1]}, \@rec;
next;
}
#if the number of columns don't match then
#we have a new section
push @stacked, [\@rec];
$cur = @rec; #set the new number of columns
}
for my $table (@stacked) {
print "header: ", join("::", @{$table->[0]}), "\n";
for my $i (1 .. $#$table) {
print "data: ", join("::", @{$table->[$i]}), "\n";
}
print "\n";
}
不是很順利,但我一直這樣做:
my $recordType = unpack("A3", $_);
if ($recordType eq "APT")
{
$currentKey = parseFAAAirportAirportRecord($_);
}
elsif ($recordType eq "ATT")
{
parseFAAAirportAttendenceRecord($currentKey, $_);
}
elsif ($recordType eq "RWY")
{
parseFAAAirportRunwayRecord($currentKey, $_);
}
elsif ($recordType eq "RMK")
{
parseFAAAirportRemarkRecord($currentKey, $_);
}
...
sub parseFAAAirportAirportRecord($)
{
my ($line) = @_;
my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion,
$faaFieldOffice, $state, $stateName, $county, $countyState,
$city, $name, $ownershipType, $facilityUse, $ownersName,
$ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager,
$managersAddress, $managersCityStateZip, $managersPhone,
$formattedLat, $secondsLat, $formattedLong, $secondsLong,
$refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph,
$sectional, $distFromTown, $dirFromTown, $acres,
$bndryARTCC, $bndryARTCCid,
$bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname,
$fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone,
$altFss, $altFssName,
$altFssPhone, $notamFacility, $notamD, $arptActDate,
$arptStatusCode, $arptCert,
$naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights,
$militaryJoint, $militaryRights, $nationalEmergency, $milUse,
$inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs
,
$engineRepairs, $bottledOyxgen, $bulkOxygen,
$lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle,
$lens, $landingFee, $isMedical,
$numBasedSEL, $numBasedMEL, $numBasedJet,
$numBasedHelo, $numBasedGliders, $numBasedMilitary,
$numBasedUltraLight,
$numScheduledOperation, $numCommuter, $numAirTaxi,
$numGAlocal, $numGAItinerant,
$numMil, $countEndingDate,
$aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate,
$contractFuel, $transientStorage, $otherServices, $windIndicator,
$icaoId) =
unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " .
"A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " .
"A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " .
"A4 A30 A16 A4 " .
"A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " .
"A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " .
"A6 A6 A6 A6 A6 A6 A10" .
"A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.