Batch/DOS: copying the last lines of a file is limited by 65,536 characters

I have a large XML file (1 GB) with the following structure:

 <?xml version='1.0' encoding='windows-1252'?>
 <ext:BookingExtraction>
     <Booking><Code>2016Z00258</Code><Advertiser><Code>00123</Code<Name>LOUIS VUITTON</Name></Advertiser></Booking>
     <Booking><Code>2016Z00259</Code><Advertiser><Code>00124</Code<Name>Adidas</Name></Advertiser></Booking>
 </ext:BookingExtraction>

Because the structure is so simple, my goal is to take the last 150 lines of the XML file, copy them into a new file, and add the opening tag as the first line so that the result is well-formed XML.

The algorithm works fine, but some lines longer than 65,536 characters are split into several lines. I read that DOS limits the number of characters per line to 65,536, which would explain why a carriage return is added after those 65,536 characters.

As a result, the final XML is not well formed, because of the carriage return in the middle of the line. For instance:

 <ext:BookingExtraction>
     <Booking><Code>2016Z00258</Code><Advertiser><Code>00123</Code><Name>LOUIS VUIT
TON</Name></Advertiser></Booking>
</ext:BookingExtraction>

I tried to remove the carriage return characters, but that did not work. Do you have any idea how I could fix this?

@echo off
setLocal EnableDelayedExpansion

::Get XML file
for /r %%a in (extractedBookings_BookingWithoutUnitsContent_PRD_*.xml) do (
    ::echo "%%~dpa" and full path is "%%~nxa"
    set fileName="%%~nxa"
)


::Get the 150 last line of the file 
    echo File path: "%fileName%"    
    for /f %%i in ('find /v /c "" ^< "%fileName%"') do set /a lines=%%i
    echo nb lines: "%lines%"
    set /a startLine=%lines% - 150
    echo Start line "%startLine%"
    more /e +%startLine% "%fileName%" > extractedBookings_BookingWithoutUnitsContent_PRD.xml



::adding opening tag to the new file
    echo ^<?xml version='1.0' encoding='windows-1252'?^> > newFile.xml
    echo ^<ext:BookingExtraction^> >> newFile.xml

::Get the final file
   type extractedBookings_BookingWithoutUnitsContent_PRD.xml >> newFile.xml
   type newFile.xml > extractedBookings_BookingWithoutUnitsContent_PRD.xml

Thank you in advance.

Your question is confusing; the phrase "DOS limit the number of line at 65 536 characters" is imprecise. When the output of the more command is redirected to a disk file, more waits for a character after 65536 lines, and that character is inserted into the output. Also, the maximum line length in the FIND command is 1070 characters (according to this site), so I guess that your file has shorter lines. You just need a method that can cleanly output more than 64K lines.

The solution below is basically your same code, but instead of your more +%startLine% command it uses a set /P command to skip the first lines and a findstr command to output the rest.

@echo off
setLocal EnableDelayedExpansion

rem Get XML file
for /r %%a in (extractedBookings_BookingWithoutUnitsContent_PRD_*.xml) do (
    rem echo "%%~dpa" and full path is "%%~nxa"
    set "fileName=%%~nxa"
)


::Get the 150 last line of the file 
    echo File path: "%fileName%"    
    for /f %%i in ('find /v /c "" ^< "%fileName%"') do set /a lines=%%i
    echo nb lines: "%lines%"
    set /a startLine=%lines% - 150
    echo Start line "%startLine%"

    REM Use a code block to read from redirected input file (and write to output file)
    < "%fileName%" (

       rem adding opening tag to the new file
       echo ^<?xml version='1.0' encoding='windows-1252'?^>
       echo ^<ext:BookingExtraction^>

       REM Skip the first total-150 lines
       for /L %%i in (1,1,%startLine%) do set /P "="

       REM Copy the rest
       findstr "^"

    ) > extractedBookings_BookingWithoutUnitsContent_PRD.xml

This method may still fail if an input line is longer than 1023 characters, because that is the limit of the set /P command.
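
If a scripting language without per-line limits is available, the same "last 150 lines plus header" idea can be done in a single streaming pass. The following is only a minimal sketch in Python, which is not used anywhere in the original answers; the function name tail_to_xml and its parameters are hypothetical, and the header lines are copied from the question.

```python
from collections import deque

# Header lines taken from the question; CRLF mimics what batch "echo" writes.
HEADER = [
    "<?xml version='1.0' encoding='windows-1252'?>\r\n",
    "<ext:BookingExtraction>\r\n",
]


def tail_to_xml(src_path, dst_path, count=150, encoding="windows-1252"):
    """Copy the last `count` lines of src_path to dst_path, preceded by HEADER.

    Hypothetical helper for illustration. Returns the number of lines copied.
    """
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src:
        # A bounded deque drops older lines as newer ones arrive, so memory
        # use stays at `count` lines however large the file is.
        last_lines = deque(src, maxlen=count)
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.writelines(HEADER)
        dst.writelines(last_lines)
    return len(last_lines)
```

As in the batch version, the closing tag counts as one of the 150 lines, and a source file shorter than 150 lines would end up with its own header duplicated.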

As I commented earlier, it's better to parse XML as a hierarchical structure, rather than as predictably-formatted flat text. If that flat text is beautified, uglified, minified, whatever, a flat text scraper will fail.

Your example XML is still a little ambiguous, so I'm assuming you've got a single <ext:BookingExtraction> tag with a ton of <Booking> child nodes you wish to whittle down to the last 150.

Before your example XML can be parsed, though (besides fixing the missing > in </Code>), we need to massage it slightly by defining the namespace to which ext belongs.

Before:

<ext:BookingExtraction>

After:

<ext:BookingExtraction xmlns:ext="http://localhost">

Although strictly speaking that's probably a bogus namespace, it's good enough to make the XML parse-able nevertheless. We can do this programmatically by reading the XML into a variable and performing a regex replace. After that, it's just a simple matter of removing child nodes within a while loop until you reach your 150-element goal.

Save this with a .bat extension, replace "test.xml" with the location of your XML file, and run it.

@if (@CodeSection == @Batch) @then
@echo off & setlocal
cscript /nologo /e:JScript "%~f0" "test.xml" "output.xml"
goto :EOF
@end // end Batch / begin JScript hybrid code

var args = { infile: WSH.Arguments(0), outfile: WSH.Arguments(1) },
    fso = WSH.CreateObject('Scripting.FileSystemObject'),
    file = fso.OpenTextFile(args.infile, 1),
    xml = file.ReadAll(),
    DOM = WSH.CreateObject('MSXML2.DOMDocument.6.0'),
    ns = 'xmlns:ext="http://localhost"',
    xpath = '/ext:BookingExtraction/Booking';

file.Close();
DOM.loadXML(xml.replace(
    /<(ext:BookingExtraction)>/i,
    function($0, $1) { return '<' + $1 + ' ' + ns + '>' }
));

if (DOM.parseError.errorCode) {
    var e = DOM.parseError;
    WSH.StdErr.WriteLine('Error in ' + args.infile + ' line ' + e.line + ' char '
        + e.linepos + ':\n' + e.reason + '\n' + e.srcText);
    WSH.Quit(1);
}

DOM.setProperty('SelectionNamespaces', ns);

while (DOM.selectNodes(xpath).length > 150) {
    var node = DOM.selectSingleNode(xpath)
    node.parentNode.removeChild(node)
}

DOM.save(args.outfile)

... Or it might be a little easier just to strip out the ext: namespace and put it back later. Here's a batch + PowerShell hybrid script that demonstrates this. It's not as fast as the batch + JScript hybrid, and it has the side effect of beautifying all tags whether you want them indented or not. But it does have the advantage of simplicity.

<# : batch portion
@echo off & setlocal

set "infile=test.xml"
set "outfile=out.xml"

powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>

[xml]$xml = (gc $env:infile) -replace "ext:"
$xpath = "/BookingExtraction/Booking"
$deleted = 0

while ($xml.selectNodes($xpath).Count -gt 150) {
    $node = $xml.selectSingleNode($xpath)
    [void]$node.parentNode.removeChild($node)
    $deleted++
}

write-host "Removed $deleted nodes" -f magenta

$xml.save($env:outfile)
(gc $env:outfile) -replace "BookingExtraction", "ext:BookingExtraction" | sc $env:outfile
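
The DOM-trimming idea behind both hybrid scripts (bind the ext prefix, then drop <Booking> children until 150 remain) can also be sketched outside Windows Script Host. This is a rough Python illustration and not part of the original answer; the function name keep_last_bookings is hypothetical, and http://localhost is the same placeholder namespace used above.

```python
import re
import xml.etree.ElementTree as ET

NS_URI = "http://localhost"  # placeholder namespace, as in the JScript version


def keep_last_bookings(xml_text, keep=150):
    """Return the document with only the last `keep` <Booking> children.

    Hypothetical helper for illustration; expects the XML as a str.
    """
    # Bind the otherwise undeclared "ext" prefix on the root element,
    # because the parser rejects an unbound prefix.
    patched = re.sub(
        r"<(ext:BookingExtraction)>",
        r'<\1 xmlns:ext="%s">' % NS_URI,
        xml_text,
        count=1,
    )
    root = ET.fromstring(patched)
    bookings = root.findall("Booking")
    # Remove everything except the trailing `keep` elements.
    for node in bookings[: max(len(bookings) - keep, 0)]:
        root.remove(node)
    # Make the serializer write "ext:" instead of a generated prefix.
    ET.register_namespace("ext", NS_URI)
    return ET.tostring(root, encoding="unicode")
```

Like the DOM-based scripts, this loads the whole document into memory, so it is better suited to moderately sized files than to a 1 GB extract.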

Edit: if dealing with large files (1 GB+), maybe it would actually be better to trim the fat as flat text, rather than manipulating it as structured object data. If you want the last 150 lines, I think it'd be more efficient to start at the bottom and work backwards, rather than starting at the top and skipping millions of lines. Opening the XML file with .NET methods will allow you to seek to the end of the file nearly instantly, then walk up. Try this batch + PowerShell script and see whether it works more efficiently for you:

<# : batch portion
@echo off & setlocal

set "infile=test.xml"
set "outfile=out.xml"

powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>

$lines = 150
$found = 0
$reader = new-object IO.StreamReader((gi $env:infile).FullName)
$stream = $reader.BaseStream
$xml = $reader.ReadLine(), $reader.ReadLine()

$pos = $stream.Seek(0, [IO.SeekOrigin]::End)

while ($found -le $lines) {

    $reader.DiscardBufferedData()
    $stream.Position = --$pos
    $char = $reader.Peek()

    if ($char -eq -1) { break }
    else { if ($char -eq 10) { $found++ } }
}

$reader.DiscardBufferedData()
$stream.Position = ++$pos

$xml += $reader.ReadToEnd()
$reader.Close()

$xml -join "`r`n" | out-file $env:outfile
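
The script above steps back one byte at a time. The same seek-from-the-end idea can read whole blocks instead, which reduces the number of seeks considerably. Below is a minimal sketch in Python (not part of the original answer); tail_offset, write_tail and block_size are hypothetical names, and the byte search assumes a single-byte or UTF-8 encoding such as windows-1252, not UTF-16.

```python
import os
import shutil


def tail_offset(path, lines=150, block_size=64 * 1024):
    """Return the byte offset at which the last `lines` lines of `path` begin."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        if lines <= 0:
            return size
        pos = size
        # Ignore a final newline so it does not count as a line boundary.
        if pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                pos -= 1
        needed = lines
        while pos > 0:
            start = max(pos - block_size, 0)
            f.seek(start)
            block = f.read(pos - start)
            # Walk the newlines in this block from right to left.
            idx = len(block)
            while True:
                idx = block.rfind(b"\n", 0, idx)
                if idx == -1:
                    break
                needed -= 1
                if needed == 0:
                    return start + idx + 1
            pos = start
    # Fewer than `lines` lines in the file: the tail is the whole file.
    return 0


def write_tail(src, dst, header, lines=150):
    """Write `header` (bytes) followed by the last `lines` lines of `src`."""
    offset = tail_offset(src, lines)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(header)
        fin.seek(offset)
        shutil.copyfileobj(fin, fout)
```

Because the tail is copied as raw bytes, line length plays no role and the original line endings are preserved.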
