
PHP Extract Text from Webpage

Is it possible, with PHP, to connect to a URL such as http://en.wikipedia.org/wiki/Wiki and extract every word that starts with a given prefix, such as "Exa" or "ins", so that the resulting PHP page prints out all the words it found? For example, with "Exa", the word "Example" would be printed each time an instance of "Example" is found. The same goes for words that start with "ins".

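$url = 'http://en.wikipedia.org/wiki/Wiki';
// Fetch the page and drop the HTML tags so class names and attributes are not matched.
$data = strip_tags(file_get_contents($url));
$matches = array();
// Collect every whole word that starts with "Exa" or "ins" (case-sensitive).
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
foreach ($matches[0] as $match) {
    echo "Match: '" . $match . "'\r\n";
}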

Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...

Edit: I changed it, it should work now... (\B => \b, and strip_tags to prevent HTML classes from being matched).

I don't have a full answer with an example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. The string will contain all the HTML, so you will probably need a fair amount of regex work to eliminate the tags if you don't want them.

Read the page into a string using file_get_contents. Then use one of the various string functions to examine the page.
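As a rough illustration of the string-function route, here is a minimal sketch that splits the text into words with str_word_count and compares prefixes with strncmp. The helper name wordsWithPrefix and the sample markup are made up for this example; in real use the markup would come from file_get_contents($url).

```php
<?php
// Hypothetical helper: return every word in $text that starts with one of $prefixes.
function wordsWithPrefix(string $text, array $prefixes): array
{
    $found = [];
    // str_word_count(..., 1) returns the words found in the string as an array.
    foreach (str_word_count($text, 1) as $word) {
        foreach ($prefixes as $prefix) {
            // strncmp compares only the first strlen($prefix) bytes, case-sensitively.
            if (strncmp($word, $prefix, strlen($prefix)) === 0) {
                $found[] = $word;
                break; // do not report the same word twice
            }
        }
    }
    return $found;
}

// Sample markup standing in for file_get_contents($url) (an assumption for this sketch).
$html = '<p>For Example, this page instructs you to install an <b>Example</b>.</p>';
$words = wordsWithPrefix(strip_tags($html), ['Exa', 'ins']);
echo implode("\n", $words), "\n";
```

Note that str_word_count only treats letters, apostrophes and hyphens as word characters and depends on the current locale, so pages with digits or non-ASCII text inside words may need a different splitter.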

Yes, this is possible. A potential approach would be to:

  1. Use something like fopen (if allow_url_fopen is enabled; failing that, use cURL) to grab the external web page content.

  2. Remove the (presumably not required) HTML tags via strip_tags.

  3. Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
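The three steps above could be sketched roughly as follows. The function names fetchPage and tokensWithPrefix, the delimiter set, and the sample markup are assumptions made for this example, not part of the answer. fetchPage uses file_get_contents in place of fopen for brevity; both rely on allow_url_fopen.

```php
<?php
// Hypothetical helper for step 1: fetch a URL, falling back to cURL
// when allow_url_fopen is disabled. Returns '' on failure.
function fetchPage(string $url): string
{
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = file_get_contents($url);
    } else {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        $body = curl_exec($ch);
    }
    return is_string($body) ? $body : '';
}

// Hypothetical helper for steps 2 and 3: strip tags, then tokenise with strtok.
function tokensWithPrefix(string $html, array $prefixes): array
{
    $text = strip_tags($html);
    $found = [];
    // Delimiters: whitespace and common punctuation (an assumed, non-exhaustive set).
    $delims = " \t\r\n.,;:!?()[]\"'";
    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        foreach ($prefixes as $prefix) {
            if (strncmp($tok, $prefix, strlen($prefix)) === 0) {
                $found[] = $tok;
                break;
            }
        }
    }
    return $found;
}

// Sample markup used here instead of fetchPage($url), so the sketch runs offline.
$sample = '<p>Exact instances: the inspector (Example) insists.</p>';
$tokens = tokensWithPrefix($sample, ['Exa', 'ins']);
echo implode("\n", $tokens), "\n";
```

For a live page, the call would be something like tokensWithPrefix(fetchPage('http://en.wikipedia.org/wiki/Wiki'), ['Exa', 'ins']). One caveat: strip_tags removes tags without inserting spaces, so words from adjacent elements can end up glued together.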


 