How do you process invalid HTML in PHP?

Original 2012-07-18 17:19:39 0 3 php/ html/ regex/ parsing

I've seen this question, which is very nice and informative. However, it doesn't deal with a rather common scenario.

Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of that website didn't care much about his code, and the pages contain seriously malformed markup "that kinda works". I need to take information from that website.

How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.

Is it possible? Do I have to revert to RegExp?

3 answers

You need a DOM parser. PHP has one. And then there are some alternatives (and more... just google for them). You can even run the "garbled HTML" through HTML Purifier if you want.

I don't know how you are scraping the site, but working with RegExp will allow you to add many conditions to the scraping code. This may take time, depending on the number of footprints and your RegExp skills.

You may also run Tidy on the site's HTML, but IMO this will lead to strange results as well.

Does it have to be PHP? Python has a wonderful library called Beautiful Soup ( "You didn't write that awful page. You're just trying to get some data out of it" ). From my experience I'd recommend it so much that I'd say if you have the option, write a quick Python script to parse your nodes into a clean file that your PHP can pick up.

(I know PHP is in the title and this doesn't directly answer your question. Apologies if you don't have the option of (or dislike) Python, I just wanted to present a good alternative.)
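To make the "quick Python script" idea concrete, here is a minimal sketch. Beautiful Soup is a third-party package, so this version assumes only the standard library's `html.parser`, which is also lenient about unclosed tags and unquoted attributes. The class name `LinkTextExtractor`, the helper `extract_links`, and the sample markup are made up for illustration; pulling out links is just a stand-in for whatever data you actually need.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags, tolerating broken markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A new <a> while one is still open means the old one was never closed.
            self._flush()
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._chunks.append(data)

    def _flush(self):
        if self._href is not None:
            # Collapse runs of whitespace in the collected link text.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
        self._href = None
        self._chunks = []

    def close(self):
        super().close()
        self._flush()  # pick up a link left open at end of input


def extract_links(html):
    """Hypothetical helper: parse a string and return its links."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Unclosed <p>, <b>, <div>, and <a>, plus an unquoted attribute value.
    broken = '<p>Intro<a href="/one">First <b>link</a><div><a href=/two>Second'
    for href, text in extract_links(broken):
        print(href, text)
```

The script could write its results to a file (JSON, CSV) for the PHP side to pick up. If installing packages is an option, Beautiful Soup gives you a real tree to query instead of hand-written callbacks.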

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 4 ACCEPTED 2012-07-18 17:23:29

solution2 0 2012-07-18 17:28:40

solution3 0 2012-07-18 17:30:54
