Extracting img tags from HTML in Perl using regular expressions

Question

I need to extract a CAPTCHA image from a URL and recognize it with Tesseract. My code is:

#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
#    $img1 = $1;
#}
#else 
#{
#    $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
    my $takeImg = $1;
    my @dirs = split('/', $takeImg);
    my $img = $dirs[2];
}
else
{
    print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

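As you can see, I am trying to extract the src of the img tag. The solution from "use shell command tesseract in perl script to print a text output" did not work for me ($img1). I also tried an adapted version ($img2) of the solution from "How can I extract URL and link text from HTML in Perl?".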

In case you need the HTML code of that page, here it is:

<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" 
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>

I get the error that the image was not found. I think the problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.

Answer 1

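Apart from the fact that using regular expressions on HTML is not very reliable, the regular expression in the following code cannot work because it has no capture group, so nothing is assigned to $1.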

if ($html =~ /<img. *?src. *?>/)
{
    $img = $1;
}

If you want to extract part of the text with a regular expression, you need to put that part in parentheses. For example:

$example = "hello world";
$example =~ /(hello) world/;

This sets $1 to "hello".
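To make the difference concrete, here is a minimal sketch that runs two patterns against the img tag from the question's page. The variable names ($tag, $without, $with) are made up for illustration and the patterns are simplified; it only shows that $1 is filled in when the pattern contains parentheses.

```perl
use strict;
use warnings;

# The captcha tag as it appears in the question's HTML.
my $tag = '<img src="/captcha/1533030599.png"/>';

# No capture group: the match succeeds, but $1 is left undefined.
my $without = ($tag =~ /<img.*?src.*?>/) ? $1 : 'no match';

# With a capture group around the attribute value, $1 holds the URL.
my $with = ($tag =~ /<img.*?src="(.*?)"/) ? $1 : 'no match';

print "without group: ", (defined $without ? $without : '(undef)'), "\n";
print "with group:    $with\n";
```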

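The regular expression itself does not make much sense: where you have ". *?", it matches any single character followed by zero or more spaces. Was that a typo for ".*?"? That matches any number of characters but is not as greedy as ".*", so it stops as soon as it finds a match for the next part of the regular expression.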
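The greedy versus non-greedy distinction can be checked with a short sketch. The test string and variable names below are invented for illustration; only the ".*" and ".*?" behavior comes from the explanation above.

```perl
use strict;
use warnings;

# Hypothetical one-line snippet with more than one ">" in it.
my $html = '<img src="/captcha/1.png"/> text <b>bold</b>';

# Greedy: ".*" runs on to the LAST ">" in the string.
my ($greedy) = $html =~ /(<img.*>)/;

# Non-greedy: ".*?" stops at the FIRST ">" after "<img".
my ($lazy) = $html =~ /(<img.*?>)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

With this input the greedy pattern captures the whole string, while the non-greedy pattern captures only the img tag.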

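This regular expression is probably closer to what you are looking for. It matches the first img tag whose src attribute starts with "/captcha/" and stores the image URL in $1: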

$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;

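To break it down: "m%...%" is just another way of writing "/.../" that lets you put slashes in the regular expression without escaping them. "[^>]*" matches zero or more of any character except ">", so it cannot run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which will be the URL. Finally, the "/s" modifier at the end makes "." match newlines too, so $html is treated as one long line of text. It is probably not needed here, since the pattern contains no "." and "[^>]*" already matches newlines, but either way the pattern still works if the img tag happens to be split across several lines.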
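As a rough end-to-end check, the sketch below applies that pattern to the relevant lines of the page from the question. The heredoc is an abbreviated copy of that HTML and the $file variable is an illustrative addition. $img is declared before the if block, on the assumption that later code needs to see it after the block ends.

```perl
use strict;
use warnings;

# Abbreviated copy of the page body shown in the question.
my $html = <<'HTML';
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr>
HTML

# Declared outside the if so the value is still visible afterwards.
my $img = '';
if ($html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s) {
    $img = $1;
}

die "<img> not found\n" unless $img;

# Strip everything up to the last slash to get the bare file name.
(my $file = $img) =~ s{^.*/}{};

print "path: $img\n";
print "file: $file\n";
```

The two decoy lines are skipped because they contain "img src" without a leading "<", so only the real tag satisfies "<img".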

Solution 1
4 Accepted 2018-07-31 14:46:33
