
How to find punctuation in a string using Perl

I need to find all the punctuation characters in the text content of a markup-language document.

My Input Sample content:

__DATA__

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht so.chaotisch , und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>

I am using [[:punct:]], but it matches every occurrence in the content, including those inside tags and attribute values.

my $text = do { local $/; <DATA> };

while($text=~m/(.){5}[[:punct:]](.){10}/g)
{
    print "L: $&\n";
}

Output

k rel="styleshee  
 type="text/css"
 href="../styles
g src="../images
17873_140_1.jpg"
 alt="image" cla
s nat&x00FC;rlic
xmlns="http://ww
3.org/1999/xhtml
" xml:lang="de"
ioses:Zeugnis na
x00FC;rlicher Pe
ugnis.nat&x00FC;

But I need to omit the punctuation in element attributes and their values. How can I list only the punctuation that appears in the text content?

To be avoided: www.w3.org and "../styles/97. Needs to be found: der;Verkeh and so.chaotisch

Question Updated:

Please do not remove any content or HTML elements in order to find the punctuation, since we need the exact line number and column number of each match. If the HTML elements were removed, the column numbers would change.

Could someone help me with this?

There is a great answer explaining why you shouldn't try to parse HTML with a regex: https://stackoverflow.com/a/1732454/939457

You can use HTML::Parse and HTML::FormatText to extract the text:

 perl -MHTML::Parse -MHTML::FormatText -0777 -ne \
    'print HTML::FormatText->new->format(parse_html($_))' sample.txt

You will get only the text:

Kerala unterscheidet smtp://suriya@edu/tester sich von anderen indischen
   netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht
   so.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig
   machen aber die Backwaters www.cochin.org

Then you can use your original code. Something like this should work:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parse;
use HTML::FormatText;

my $text = do { local $/; <DATA> };

$text = HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000)->format(parse_html($text));

while($text=~m/(.){5}[[:punct:]](.){10}/g)
{
        print "L: $&\n";
}

__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht so.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>

Note: leftmargin and rightmargin are set to prevent the text wrapping done by the HTML::FormatText module.
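
The formatted text no longer has the line and column positions of the original markup, which the updated question asks for. One possible workaround, shown here only as a rough sketch, is to overwrite each tag with spaces of the same length instead of removing it, so every offset still matches the original input. The function name punct_positions and the sample string are made up for illustration. The sketch assumes that no attribute value contains a literal ">", it uses a regex to find tags (with the caveats from the linked answer), and it will also report the "&" and ";" of character entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns [line, column, character] for every
# punctuation character found outside of tags. Positions refer to the
# ORIGINAL string, because tags are blanked out rather than removed.
sub punct_positions {
    my ($html) = @_;
    my $masked = $html;

    # Replace every character of a tag with a space, except newlines,
    # so that offsets in $masked equal offsets in $html.
    # Assumption: no attribute value contains a literal '>'.
    $masked =~ s{(<[^>]*>)}{ (my $blank = $1) =~ tr/\n/ /c; $blank }ge;

    my @hits;
    while ($masked =~ /([[:punct:]])/g) {
        my $offset = $-[1];                      # 0-based offset of the match
        my $before = substr($masked, 0, $offset);
        my $line   = 1 + ($before =~ tr/\n//);   # 1-based line number
        my $col    = $offset - (rindex($before, "\n") + 1) + 1;  # 1-based column
        push @hits, [$line, $col, $1];
    }
    return @hits;
}

# Made-up sample input spanning two lines.
my $sample = qq{<p class="a.b">der;Verkehr nicht\nso.chaotisch <a href="http://x.org">ok</a>!</p>\n};

printf "line %d, col %d: %s\n", @$_ for punct_positions($sample);
```

For this sample, only the ";" , "." and "!" in the text are reported; the punctuation inside class="a.b" and the href value is skipped. Columns are counted in characters only if the input has been decoded (for example with use utf8 for literals, or an :encoding(UTF-8) layer on the file handle); otherwise they are byte offsets.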
