
Regular Expressions or web scraping in PHP for replacing {*} and {%}

I'm trying to build a web scraping script like feed43.com. Details: I have HTML like the following.

<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>

I write a pattern like the following:

<div id="latest_header"{*}getNews('{%}'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>

The result should follow these rules:

{*} - ignore everything
{%} - use this as a value for a variable

That is, the result should be every occurrence of the given pattern. In the case above:

{%1} - 79 {%2} - 2 DAY SEMINAR {%3} - NATIONAL SEMINAR

{%1} - 78 {%2} - 2 DAYS WORKSHOP {%3} - INTERNATIONAL WOR

I wasn't able to implement this with regular expressions, and I read in many places that regex is not a feasible way to traverse HTML pages. I moved to simple_html_dom, but had no luck getting the above done in such an easy way. At least, I wasn't able to reproduce the above behavior with it.

The placeholders {*} and {%} are the ones used to create a pattern when one uses feed43.com to create a feed for a website.

Your regex is incorrect. Use proper quantifiers to ignore items, and use capture groups for capturing the match subsections:

/<div id="latest_header"(?>.*?getNews\(')(?>(.*?)'\))(?>.*?&nbsp;)(?>(.*?)<br>)(?>.*?\.gif">)(.*?)<\/label>/s

* Atomic groups are used to eliminate backtracking. Without them, this regex would spend a lot of time backtracking, which is one of the major caveats of parsing HTML with regex.

This will be your match:

MATCH 1: [Group 1: 79] [Group 2: 2 DAY SEMINAR ] [Group 3: NATIONAL SEMINAR..]
MATCH 2: [Group 1: 78] [Group 2: 2 DAYS WORKSHOP ] [Group 3: INTERNATIONAL WOR..]
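
To show how the pattern above might be used from PHP, here is a minimal sketch that runs it with preg_match_all over the two entries from the question. The variable names ($html, $pattern, $rows) and the output format are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample markup copied from the question (two news entries).
$html = <<<'HTML'
<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
HTML;

// The pattern from the answer, written as a single-quoted PHP string,
// so the two literal apostrophes are escaped as \'.
$pattern = '/<div id="latest_header"(?>.*?getNews\(\')(?>(.*?)\'\))(?>.*?&nbsp;)(?>(.*?)<br>)(?>.*?\.gif">)(.*?)<\/label>/s';

$rows = [];
// PREG_SET_ORDER groups the results per match instead of per capture group.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $rows[] = [
            'id'    => $m[1],
            'title' => trim($m[2]), // drop the trailing space before <br>
            'desc'  => $m[3],       // still ends with ".." as in the page
        ];
    }
}

foreach ($rows as $row) {
    printf("%s | %s | %s\n", $row['id'], $row['title'], $row['desc']);
}
```

The /s modifier matters here because each entry spans two lines, so the dot has to match newlines as well. Group 3 keeps the trailing ".." because the pattern stops at </label>; adding \.\. before <\/label> would leave it out.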

Here is a regex demo.

This may not be directly relevant, but the following open source project achieves what I wanted:

hFeeds

All I actually wanted was to be able to create RSS feeds for any webpage, the way Feed43.com does. hFeeds works exactly like Feed43.com and is just as easy to use. The only difference is that it uses {h} in place of {%} and {i} in place of {*}. As far as I can see, it generates the regular expression from the pattern.
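
For anyone curious how such a pattern could be turned into a regular expression, here is a rough sketch using the {*} and {%} placeholders from the question. This is not code from hFeeds or Feed43; templateToRegex is a hypothetical helper, and treating both placeholders as lazy ".*?" is an assumption about how they should behave.

```php
<?php
/**
 * Convert a Feed43-style template into a PCRE pattern.
 * {*} becomes a lazy "skip anything", {%} becomes a lazy capture group.
 * Hypothetical helper, written only for illustration.
 */
function templateToRegex(string $template): string
{
    // Split on the placeholders but keep them in the resulting list.
    $parts = preg_split(
        '/(\{\*\}|\{%\})/',
        $template,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $regex = '';
    foreach ($parts as $part) {
        if ($part === '{*}') {
            $regex .= '.*?';   // ignore everything up to the next literal
        } elseif ($part === '{%}') {
            $regex .= '(.*?)'; // capture everything up to the next literal
        } else {
            // Literal text: escape regex metacharacters and the "/" delimiter.
            $regex .= preg_quote($part, '/');
        }
    }

    // /s lets "." match newlines, since one entry may span several lines.
    return '/' . $regex . '/s';
}

// The template and the markup are the ones from the question.
$template = '<div id="latest_header"{*}getNews(\'{%}\'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>';

$html = <<<'HTML'
<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
HTML;

$items = [];
preg_match_all(templateToRegex($template), $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Drop the full match at index 0 and trim each captured value.
    $items[] = array_map('trim', array_slice($m, 1));
}

print_r($items);
```

With the sample markup this should give 79 / 2 DAY SEMINAR / NATIONAL SEMINAR for the first entry and 78 / 2 DAYS WORKSHOP / INTERNATIONAL WOR for the second, matching the {%1} to {%3} values listed in the question. The sketch does not use atomic groups, so on large pages it may backtrack more than the hand-written pattern in the answer above.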

But thanks, all, for your answers.
