
How to remove or replace specific characters between two XML tags [linux, python, lxml, sed, awk,…]?

I'm using the lxml library in Python for XML parsing.

In an XML file, I have some bad characters that lead to the error below in Python:

lxml.etree.XMLSyntaxError: CharRef

Before opening and fetching the content of XML file in python, I must remove bad chars from two tags:

1: <essid cloaked="true">....</essid> or <essid cloaked="false">....</essid> .

2: <client-manuf>....</client-manuf>

The XML file is big, so I want to do it with sed, awk, or a similar tool.

    <crypt>0</crypt>
        <total>20    50</total>
        <fragments>0</fragments>
        <retries>0</retries>
    </packets>
    <datasize>0</datasize>
    <wireless-client number="1" type="established" first-time="Thu Feb 15 16:45:43 2018" last-time="Thu Feb 15 16:45:43 2018">
        <client-mac>08:EA:40:D0:55:43</client-mac>
        <client-manuf>SHENZHEN BILIAN ELECTRONIC CO.&#x  ef;&#x  bc;&#x  8c;LTD</client-manuf>
        <essid cloaked="true">&#x   0;&#x   0;&#x   0;&#x   0;&#x   0;</essid>
        <channel>8</channel>
        <maxseenrate>1.000000</maxseenrate>
        <carrier>IEEE 802.11b+</carrier>
        <encoding>CCK</encoding>
        <packets>
            <LLC>0</LLC>
            <data>0</data>
            <crypt>0</crypt>

I want to remove the bad chars from these tags (client-manuf and essid).

From: <client-manuf>SHENZHEN BILIAN ELECTRONIC CO.&#x ef;&#x bc;&#x 8c;LTD</client-manuf>

To (or this): <client-manuf>SHENZHEN BILIAN ELECTRONIC CO. LTD</client-manuf>

To (or this): <client-manuf>SHENZHEN BILIAN ELECTRONIC CO</client-manuf>

-----------------------------------------------

From: <essid cloaked="true">&#x 0;&#x 0;&#x 0;&#x 0;&#x 0;</essid>

From: <essid cloaked="false">&#x 0;&#x WiFi 0;&#x MTN 0;&#x 0;&#x 0;</essid>

To (or this): <essid cloaked="true"></essid>

To (or this): <essid cloaked="true">N/A SSID</essid>

To (or this): <essid cloaked="false">WiFi MTN</essid>

For example, these two fragments are bad chars:

1: 0;

2: &#x

This is my solution, but it doesn't work well for my needs:

sed -e '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/ s/\(&#x\|0;\)//g' a.txt

The right way, with an etree.XMLParser object (lxml.etree only):

import re
from lxml import etree

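tags_to_fix = ['clientssss-manuf', 'client-manuf', 'essid']
parser = etree.XMLParser(recover=True)   # recovery mode: keep parsing past the bad char refs
tree = etree.parse("input.xml", parser)

# match every element whose name is listed in tags_to_fix
xpath = '//*[' + ' or '.join('name()="%s"' % t for t in tags_to_fix) + ']'
for el in tree.xpath(xpath):
    if el.text:   # el.text is None for empty elements; re.sub would raise TypeError
        el.text = re.sub(r'\w{1,2};\s*', '', el.text).strip()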

tree.write("output.xml", encoding="utf-8", pretty_print=True)

The crucial fragment from the resulting output.xml:

...
<packets>
<crypt>0</crypt>
        <total>20    50</total>
        <fragments>0</fragments>
        <retries>0</retries>
    </packets>
    <datasize>0</datasize>
    <wireless-client number="1" type="established" first-time="Thu Feb 15 16:45:43 2018" last-time="Thu Feb 15 16:45:43 2018">
        <client-mac>08:EA:40:D0:55:43</client-mac>
        <clientssss-manuf>SHENZHEN BILIAN ELECTRONIC CO.  LTD</clientssss-manuf>
        <client-manuf>SHENZHEN BILIAN ELECTRONIC CO.  LTD</client-manuf>
        <essid cloaked="true"></essid>
        <channel>8</channel>
        <maxseenrate>1.000000</maxseenrate>
        <carrier>IEEE 802.11b+</carrier>
        <encoding>CCK</encoding>
        <packets>
            <LLC>0</LLC>
            <data>0</data>
            <crypt>0</crypt>
</packets></wireless-client>
...
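
Because the question mentions that the file is big, a line-by-line pre-filter may be worth considering as an alternative to building the whole tree. The following is only a minimal sketch using the standard library: the names clean_line and clean_stream are hypothetical, the junk pattern is an assumption derived from the sample in the question (a literal "&#x", or one or two hex digits followed by ";"), and it assumes each affected element sits on a single line.

```python
import re

# Assumption: opening tag, text and closing tag are all on one line.
ELEMENT_RE = re.compile(r'(<(essid|client-manuf)\b[^>]*>)(.*?)(</\2>)')
# Assumption: debris is "&#x" or 1-2 hex digits plus ";", with optional blanks around it.
JUNK_RE = re.compile(r'\s*(?:&#x|\b[0-9A-Fa-f]{1,2};)\s*')


def clean_line(line):
    """Remove the debris inside <essid> and <client-manuf>; other text is untouched."""
    def fix(match):
        text = JUNK_RE.sub(' ', match.group(3))   # each piece of debris becomes one blank
        text = ' '.join(text.split())             # squeeze blanks and trim both ends
        return match.group(1) + text + match.group(4)
    return ELEMENT_RE.sub(fix, line)


def clean_stream(src, dst):
    """Copy src to dst one line at a time, so memory use stays small."""
    for line in src:
        dst.write(clean_line(line))
```

It could be used as clean_stream(open("input.xml"), open("cleaned.xml", "w")) before handing cleaned.xml to lxml. Note that the hex pattern would also remove legitimate text such as "ab;", so it should be checked against real data first.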

Your sed command didn't look so bad; it just left a lot of whitespace.

Since sed's regex matching is greedy, " *" on either side of the pattern swallows any run of spaces around it.

cat bad.xml | sed '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/ s/ *\(&#x\|0;\) *//g'

On the other hand, if there is some valid text, you might not want the remaining words to run together, so you could put one space in place of each removed pattern:

cat bad.xml | sed '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/ s/ *\(&#x\|0;\) */ /g'

Finally, you might condense multiple spaces into just one:

cat bad.xml | sed '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/{s/ *\(&#x\|0;\) */ /g;s/  */ /g}'

Note that the construct {foo;bar} groups the two commands into a block that runs only on the lines selected by the preceding address pattern. Without the block, the second substitution would affect the whole file.
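
To make the address-plus-block behaviour concrete, here is a rough Python rendering of the command above. It is a sketch, not a replacement for sed: sed_block is a hypothetical name, and the address is written in the simplified form (true|false)"> rather than copied character for character.

```python
import re

# The "address": only lines matching this are handed to the block.
ADDRESS = re.compile(r'<essid cloaked="(true|false)">.*</essid>')


def sed_block(lines):
    """Apply both substitutions to addressed lines only; pass the rest through."""
    out = []
    for line in lines:
        if ADDRESS.search(line):
            line = re.sub(r' *(&#x|0;) *', ' ', line)   # s/ *\(&#x\|0;\) */ /g
            line = re.sub(r' +', ' ', line)             # s/  */ /g
        out.append(line)
    return out
```

A line such as <total>20    50</total> keeps its four spaces because it never enters the block. On the addressed line, the second substitution squeezes every run of spaces, including the leading indentation.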

With another escaped pair of parentheses and an escaped plus (\+ and \| are GNU sed extensions):

cat bad.xml | sed '/<essid cloaked="\(true\|false"\)>*.*<\/essid>/{s/\( *\(&#x\|0;\) *\)\+/ missing essid /g;s/  */ /g}'

you can substitute a repeated occurrence of a pattern with just one thing.

      s/\( *\(&#x\|0;\) *\)\+/ missing essid /;
      ^  (   (pattern1)   )+ / replacement   / (g now obsolete)
         (pattern .......2)

The inner pattern is an alternation: &#x or 0;. The outer pattern is the inner pattern, optionally surrounded by blanks, like

     "0;"
     "0; "
     " 0; "
     " 0;"
     "    0;  "
     "    &#x"

and so on.

You want the outer pattern, let's call it X, to be repeated one or more times, hence the +. But without the parentheses, + would apply only to the last character, not to the whole pattern.
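
The effect of the grouping can be checked quickly with Python's re module, whose syntax needs no backslashes for the parentheses, the | and the +. This is only an illustration on a line modelled on the question's sample.

```python
import re

line = '<essid cloaked="true">&#x   0;&#x   0;&#x   0;</essid>'

# Grouped and repeated: the whole run of debris is a single match.
grouped = re.sub(r'( *(&#x|0;) *)+', ' missing essid ', line)

# Not repeated: every "&#x" and every "0;" is replaced on its own.
ungrouped = re.sub(r' *(&#x|0;) *', ' missing essid ', line)

# Without parentheses, + binds only to the character right before it.
plus_on_char = re.fullmatch(r'0;+', '0;;;') is not None      # "0" then one or more ";"
plus_misses = re.fullmatch(r'0;+', '0;0;') is None           # "0;" is not repeated as a unit
plus_on_group = re.fullmatch(r'(0;)+', '0;0;') is not None   # the group repeats as a unit
```

Here grouped contains the replacement once, while ungrouped contains it six times, once per removed fragment.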

You have to learn this regex language. Find a tutorial. You can't ask about every possible variation you will need in your life.

A good basic understanding pays off very quickly. You don't need to know everything by heart, but you should know the basics and have a good sense of what is possible and what is not. Then keep a reference at hand to look up the rarely used things, and ask only about the hard or complicated stuff.

