
preg_match_all not working with html tags

I am trying to get the content of the <tbody> tag from this page.

There is only one table, with only one <tbody> tag, and I want to get all the rows from this table.

I tried to do it this way:

$page = file_get_contents('http://pk.zntu.edu.ua/fakultety-ta-napryamy-pidhotovky/derzhavne-zamovlennya-2011-bakalavr');

preg_match_all("/<tbody>(.+?)<\/tbody>/is", $page, $output_array);

var_dump($output_array);

And I receive empty arrays:

array(2) { [0]=> array(0) { } [1]=> array(0) { } }

I have tried different variants of the pattern, such as:

  • "/<tbody>(.*?)<\\/tbody>/is"
  • "/<tbody>.+?<\\/tbody>/is"
  • "/<tbody>.*?<\\/tbody>/is"
  • "/<tbody>.+<\\/tbody>/is"
  • "/<tbody>.*<\\/tbody>/is"

But none of them work.

With PCRE and Regex Library, all of these should be okay.

I don't know what the problem is. Please help.

Your pattern is very simple, and the regex above should be fine, but I think the problem comes from file_get_contents. I counted the number of lines in the $page variable and got this:

71220

But when I open the website in a browser, copy the page source, and count the lines manually, there are only about 1787.

What does this mean?

It may mean that the code stored in the $page variable is not the same as the HTML you see when you open the website in a browser. When a browser opens a page, many things can happen (for example, listener methods may run), but when you download the source directly into a PHP variable, some of those methods may never be executed, which can leave you with incomplete HTML.

Another piece of evidence supporting my assumption is that I cannot even find the keyword tbody in your $page variable.
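The checks described above (counting lines and searching for the tbody keyword) could be sketched roughly as follows. This is only an illustration: describe_html is a hypothetical helper name, and the short $page string stands in for whatever file_get_contents actually returned, so the numbers it prints are not the ones from the real page.

```php
<?php
// Hypothetical helper (not from the original post): summarize what was
// actually downloaded, so it can be compared with the browser's "view source".
function describe_html(string $html): array
{
    return [
        'bytes'     => strlen($html),
        'lines'     => substr_count($html, "\n") + 1,
        // Case-insensitive search; also finds <TBODY> and <tbody class="...">.
        'has_tbody' => stripos($html, '<tbody') !== false,
    ];
}

// Stand-in for the result of file_get_contents(); this sample has no <tbody>.
$page = "<html>\n<body>\n<table><tr><td>1</td></tr></table>\n</body>\n</html>";

$info = describe_html($page);
var_dump($info);
```

If has_tbody comes back false for the downloaded string, no variant of the regex can match, and the thing to investigate is the download rather than the pattern.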

The tbody tag may also contain attributes, so you need to match those attributes as well in order to get the content of the tbody tag.

'/<tbody\b[^>]*>(.*?)<\/tbody>/is'
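To see the difference between the two patterns, here is a small sketch run against an inline sample rather than the real page. The markup, including the class attribute, is invented for illustration, and the final step that splits the captured body into rows is an assumption about how the rows might then be collected.

```php
<?php
// Sample markup invented for illustration; the class attribute is an assumption.
$html = '<table><tbody class="rows">' . "\n"
      . '<tr><td>A</td></tr>' . "\n"
      . '<tr><td>B</td></tr>' . "\n"
      . '</tbody></table>';

// Strict pattern from the question: only matches a bare <tbody> with no attributes.
$strict = preg_match_all('/<tbody>(.+?)<\/tbody>/is', $html, $m1);

// Attribute-tolerant pattern from the answer.
$tolerant = preg_match_all('/<tbody\b[^>]*>(.*?)<\/tbody>/is', $html, $m2);

// Split the captured tbody content into individual rows.
$rowCount = $tolerant
    ? preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $m2[1][0], $rows)
    : 0;

var_dump($strict, $tolerant, $rowCount);
```

On this sample the strict pattern finds nothing, while the tolerant one captures the tbody content, from which the two rows can be extracted.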
