
Is there a Perl module for parsing columnar text?

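Let's say I have a tab-delimited text file containing data arranged in columns (with headers).

Different columns may be "stacked" into a "worksheet"-like arrangement, i.e. there is some delimiter (which may or may not be known in advance) that allows different columns to be arranged vertically.

Is there a Perl module that helps parse the column data in this text file into a data structure (for example, a hash table whose keys are the column headers and whose values are arrays of the column's scalar data)?

EDIT: By "stacked", I mean that a single column of text may contain several separate "vectors" of data, each with a different header and a different length. Admittedly, this complicates parsing.

EDIT: Honestly, I'm not sure where the confusion is. Nonetheless, here is an example: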

header_one\theader_three
data_1\tdata_7
data_2\tdata_8
data_3\tdata_9
\tdata_10
header_two\tdata_11
data_4\theader_four
data_5\tdata_12
data_6\tdata_13
\tdata_14

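The script would turn this into a hash table with four keys: header_one, header_two, header_three, and header_four, each referencing an array reference that holds the data_n elements beneath that header.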
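As a rough illustration of the structure described above, here is a minimal sketch in plain Perl. It rests on an assumption the question does not confirm: that a header cell can be told apart from a data cell by a pattern (here, a hypothetical `^header_` prefix). The sub name `parse_stacked` is made up for this example, and empty cells are treated as padding.

```perl
use strict;
use warnings;

# Assumption: a cell is a header if it matches this pattern.
my $is_header = qr/^header_/;

# Hypothetical helper: returns a hash ref of header => [ data cells ].
sub parse_stacked {
    my ($is_header, @lines) = @_;
    my %data;       # header => array ref of values
    my @current;    # header currently active in each column
    for my $line (@lines) {
        chomp $line;
        my @cells = split /\t/, $line, -1;    # -1 keeps trailing empty cells
        for my $col (0 .. $#cells) {
            my $cell = $cells[$col];
            next if $cell eq '';              # empty cell: padding only
            if ($cell =~ $is_header) {
                $current[$col] = $cell;       # a new vector starts here
                $data{$cell}   = [];
            }
            elsif (defined $current[$col]) {
                push @{ $data{ $current[$col] } }, $cell;
            }
        }
    }
    return \%data;
}

# The example from the question, with real tab characters.
my @lines = (
    "header_one\theader_three",
    "data_1\tdata_7",
    "data_2\tdata_8",
    "data_3\tdata_9",
    "\tdata_10",
    "header_two\tdata_11",
    "data_4\theader_four",
    "data_5\tdata_12",
    "data_6\tdata_13",
    "\tdata_14",
);

my $data = parse_stacked($is_header, @lines);
for my $header (sort keys %$data) {
    print "$header: @{ $data->{$header} }\n";
}
```

If headers cannot be recognized by a pattern, the same loop could instead treat the first non-empty cell after a known delimiter as the header; that variant depends on what the delimiter actually is.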

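I would start with DBD::CSV if possible, although your "stacking" requirement (which I don't fully understand) may call for some manual parsing with Text::CSV_XS.

Don't be fooled by their names: they can parse with any delimiter, not just commas.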

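I think this is close to what you are describing. If the number of columns changes, the input is treated as a different table. The code could easily be modified to recognize some other marker (such as a line of equals signs) instead of using the column count.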

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

#setup the parser, here we want tab separated and we allow
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is 
#("foo", "bar\tbaz", "quux")
my $p = Text::CSV_XS->new(
    {
        sep_char           => "\t",
        allow_loose_quotes => 1,
    }
);

my @stacked;
my $cur = 0;
while (<>) {
    $p->parse($_) or die $p->error_input;
    my @rec = $p->fields;
    #normal case, just add the record to the last
    #section in @stacked
    if (@rec == $cur) {
        push @{$stacked[-1]}, \@rec;
        next;
    }
    #if the number of columns don't match then
    #we have a new section
    push @stacked, [\@rec];
    $cur = @rec; #set the new number of columns
}

for my $table (@stacked) {
    print "header: ", join("::", @{$table->[0]}), "\n";
    for my $i (1 .. $#$table) {
        print "data: ", join("::", @{$table->[$i]}), "\n";
    }
    print "\n";
}
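The marker-based variant mentioned above might look roughly like the following sketch. To keep it free of non-core modules it uses a plain `split` on tabs (so no quoting support), which is an assumption of this example; the `parse` and `fields` calls from the code above could be substituted. The sub name `split_on_marker` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Split lines into tables, starting a new table after each marker line.
# Assumption: a marker is a line consisting only of '=' characters.
sub split_on_marker {
    my (@lines) = @_;
    my @stacked = ([]);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^=+$/) {
            # start a new table unless the current one is still empty
            push @stacked, [] if @{ $stacked[-1] };
            next;
        }
        push @{ $stacked[-1] }, [ split /\t/, $line, -1 ];
    }
    pop @stacked unless @{ $stacked[-1] };    # drop a trailing empty table
    return @stacked;
}

my @tables = split_on_marker(
    "a\tb", "1\t2", "3\t4",
    "=====",
    "c\td", "5\t6",
);

for my $table (@tables) {
    my ($header, @rows) = @$table;
    print "header: ", join("::", @$header), "\n";
    print "data: ", join("::", @$_), "\n" for @rows;
    print "\n";
}
```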

It isn't very slick, but this is how I've always done it:

        my $recordType = unpack("A3", $_);

        if ($recordType eq "APT")
        {
            $currentKey = parseFAAAirportAirportRecord($_);
        }
        elsif ($recordType eq "ATT")
        {
            parseFAAAirportAttendenceRecord($currentKey, $_);
        }
        elsif ($recordType eq "RWY")
        {
            parseFAAAirportRunwayRecord($currentKey, $_);
        }
        elsif ($recordType eq "RMK")
        {
            parseFAAAirportRemarkRecord($currentKey, $_);
        }
...
sub parseFAAAirportAirportRecord($)
{
    my ($line) = @_;

    my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion,
        $faaFieldOffice, $state, $stateName, $county, $countyState,
        $city, $name, $ownershipType, $facilityUse, $ownersName,
        $ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager,
        $managersAddress, $managersCityStateZip, $managersPhone,
        $formattedLat, $secondsLat, $formattedLong, $secondsLong,
        $refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph,
        $sectional, $distFromTown, $dirFromTown, $acres,
        $bndryARTCC, $bndryARTCCid,
        $bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname,
        $fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone,
        $altFss, $altFssName,
        $altFssPhone, $notamFacility, $notamD, $arptActDate,
        $arptStatusCode, $arptCert,
        $naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights,
        $militaryJoint, $militaryRights, $nationalEmergency, $milUse,
        $inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs,
        $engineRepairs, $bottledOyxgen, $bulkOxygen,
        $lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle,
        $lens, $landingFee, $isMedical,
        $numBasedSEL, $numBasedMEL, $numBasedJet,
        $numBasedHelo, $numBasedGliders, $numBasedMilitary,
        $numBasedUltraLight,
        $numScheduledOperation, $numCommuter, $numAirTaxi,
        $numGAlocal, $numGAItinerant,
        $numMil, $countEndingDate,
        $aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate,
        $contractFuel, $transientStorage, $otherServices, $windIndicator,
        $icaoId) =
        unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " .
        "A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " .
        "A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " .
        "A4 A30 A16 A4 " .
        "A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " .
        "A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " .
        "A6 A6 A6 A6 A6 A6 A10" .
        "A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);

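The same idea (read the record type from the first three characters, then `unpack` the rest with a fixed-width template) can be sketched more compactly with a lookup table. The layouts, field names, and sample lines below are invented for illustration and are not the real FAA record format; `parse_record` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical layouts: record type => [ unpack template, field names ].
my %layout = (
    APT => [ 'A3 A4 A10', [qw(type id name)] ],
    RWY => [ 'A3 A4 A5',  [qw(type id runway)] ],
);

# Returns a hash ref of named fields, or nothing for unknown record types.
sub parse_record {
    my ($line) = @_;
    my $type = unpack 'A3', $line;
    my $spec = $layout{$type} or return;
    my ($template, $names) = @$spec;
    my %rec;
    @rec{@$names} = unpack $template, $line;    # 'A' strips trailing spaces
    return \%rec;
}

my %airports;
my $current_key;    # key of the most recent APT record
for my $line ("APTKBOSLogan     ", "RWYKBOS04L  ", "RWYKBOS09   ") {
    my $rec = parse_record($line) or next;
    if ($rec->{type} eq 'APT') {
        $current_key = $rec->{id};
        $airports{$current_key} = { name => $rec->{name}, runways => [] };
    }
    elsif ($rec->{type} eq 'RWY' && defined $current_key) {
        # child records attach to the airport seen most recently
        push @{ $airports{$current_key}{runways} }, $rec->{runway};
    }
}

for my $id (sort keys %airports) {
    print "$id ($airports{$id}{name}): @{ $airports{$id}{runways} }\n";
}
```

Keeping the templates next to the field names makes it harder for the two to drift apart than with one long variable list and a separate template string.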
