
PHP: using strpos to extract digits from a string

I have a huge HTML document to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumes an extreme amount of CPU time, so we finally decided to use some other method for extraction. I read in some articles that preg_match can be compared in performance with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.

Let's say I have this HTML string:

<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>

I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I run this preg_match_all scan:

'/<li.*?id=".*?([\\d]+)".*?<a.*?>.*?([\\w]+)<\\/a>/s'

You can see the result here: LINK

Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index at which the match starts. But how can I use it to:

  • get all possible matches, not just one
  • extract numbers or text from a desired place in the string

Thank you for all the help and tips ;)
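The two points above (finding every match, then cutting out the wanted piece) can be sketched with strpos's third argument, the start offset, plus substr. This is only a minimal sketch, not a general HTML parser: the function name extractItems is hypothetical, and it assumes every item has exactly the shape shown in the question, with digits at the end of the id and a " - " separator before the label.

```php
<?php
// Sketch only. Assumes each item looks like:
// <li id="...DIGITS" ...><a ...>NUMBER - Label</a></li>
// extractItems is a hypothetical name, not a built-in function.
function extractItems(string $html): array
{
    $items = [];
    $offset = 0;
    $idOpen = '<li id="';

    // strpos accepts a start offset, so calling it in a loop visits every match.
    while (($pos = strpos($html, $idOpen, $offset)) !== false) {
        $idStart = $pos + strlen($idOpen);
        $idEnd = strpos($html, '"', $idStart);
        if ($idEnd === false) {
            break;
        }
        $id = substr($html, $idStart, $idEnd - $idStart);

        // Keep only the run of digits at the end of the id.
        $n = strspn(strrev($id), '0123456789');
        $digits = $n > 0 ? substr($id, -$n) : '';

        // Locate the content of the next <a ...>...</a>.
        $aPos = strpos($html, '<a ', $idEnd);
        $textStart = $aPos === false ? false : strpos($html, '>', $aPos);
        $textEnd = $textStart === false ? false : strpos($html, '</a>', $textStart);
        if ($textEnd === false) {
            break;
        }
        $text = substr($html, $textStart + 1, $textEnd - $textStart - 1);

        // Keep what follows the last " - " separator (the whole text if absent).
        $sep = strrpos($text, ' - ');
        $label = $sep === false ? $text : substr($text, $sep + 3);

        $items[] = [$digits, $label];
        $offset = $textEnd; // continue scanning after this item
    }
    return $items;
}

$html = '<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>';

print_r(extractItems($html));
```

Whether this is actually faster than a well-written regex on the real input is something to measure rather than assume; the sketch also breaks as soon as the markup deviates from the assumed shape.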

This regex finds a match in 24 steps with no backtracking:

(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))

The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can optimize a pattern to minimize backtracking. I got these numbers from the RegexBuddy debugger.
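To try the pattern above from PHP, something like the following sketch could be used. The delimiters and the i flag are additions that are not part of the regex as posted: the sample labels start with an uppercase letter, so without case-insensitive matching [a-z]* would capture "tar" instead of "Star". The variable names are only illustrative.

```php
<?php
$html = '<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>';

// The posted regex, wrapped in / delimiters; the i flag is an assumption added here.
$pattern = '/(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))/i';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $match) {
    // $match[1] is the number from the id, $match[2] the letters of the link text.
    $rows[] = [$match[1], $match[2]];
}

print_r($rows);
```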

Using DOM:

$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10"><a href="/en/star">23 - Star</a></li>
<li id="ncd-bbt-5674" class="che10"><a href="/en/moon">54 - Moon</a></li>
<li id="ertw-cxda-c6543" class="che10"><a href="/en/sun">34,780 - Sun</a></li>
</body>
</html>';


$dom_document = new DOMDocument();

// Parse the HTML string into a DOM tree.
$dom_document->loadHTML($html);

$rootElement = $dom_document->documentElement;

$res = [];

// Take the last dash-separated segment of each <li> id.
foreach ($rootElement->getElementsByTagName('li') as $tag)
{
   $data = explode('-', $tag->getAttribute('id'));
   $res['li_id'][] = end($data);
}

// Collect the text content of each <a> element.
foreach ($rootElement->getElementsByTagName('a') as $tag)
{
   $res['a_node'][] = $tag->textContent;
}
print_r($res);

Output:

Array
(
    [li_id] => Array
        (
            [0] => 16451
            [1] => 5674
            [2] => c6543
        )

    [a_node] => Array
        (
            [0] => 23 - Star
            [1] => 54 - Moon
            [2] => 34,780 - Sun
        )

)
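The output above still contains the letter in c6543 and the leading numbers in the link text, while the question asks for digits only and letters only. One possible post-processing step, shown here as a sketch that starts from the array printed above (the variable names $ids and $labels are only illustrative), is to strip the unwanted characters with preg_replace:

```php
<?php
// The array as printed by the DOM example above.
$res = [
    'li_id'  => ['16451', '5674', 'c6543'],
    'a_node' => ['23 - Star', '54 - Moon', '34,780 - Sun'],
];

// Remove everything that is not a digit from each id segment.
$ids = array_map(function ($id) {
    return preg_replace('/\D+/', '', $id);
}, $res['li_id']);

// Remove everything that is not an ASCII letter from each link text.
$labels = array_map(function ($text) {
    return preg_replace('/[^a-z]+/i', '', $text);
}, $res['a_node']);

print_r($ids);
print_r($labels);
```

Note that this assumes labels consist of ASCII letters only; a multi-word label would lose its spaces.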
