PHP Extract Text from Webpage

Question

Is it possible to use PHP to connect to a URL such as http://en.wikipedia.org/wiki/Wiki and extract every word that starts with a prefix like "Exa" or "ins", so that the resulting PHP page prints out all the words it found? For example, with "Exa", the word "Example" would be printed each time an instance of "Example" is found. The same goes for words that start with "ins".

Answer 1

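$url = 'http://en.wikipedia.org/wiki/Wiki';
// Fetch the page and drop the HTML tags so class names and attributes are not matched.
$data = strip_tags(file_get_contents($url));
$matches = array();
// Match whole words that start with "Exa" or "ins" (case-sensitive).
preg_match_all('/\b(?:Exa|ins)\w*/', $data, $matches);
// $matches[0] holds every full match, in document order.
foreach ($matches[0] as $match) {
    echo "Match: '" . $match . "'\r\n";
}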

Probably something like this, though I'm not so sure about the regex; I haven't tested it yet.

Edit: I changed it, so it should work now (\B became \b, and strip_tags was added to prevent HTML class names from being matched).

Answer 2

I don't have a full answer with an example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. That string will contain all the HTML, so you will probably need a fair amount of regex work to eliminate the tags if you don't want them.
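As a rough illustration of the tag clean-up mentioned above, here is a minimal sketch, assuming PHP 7 or later. The helper name htmlToText and the sample markup are made up for this example, and it leans on the built-in strip_tags for most of the work rather than hand-written regexes.

```php
<?php
// Hypothetical helper (not from the original answer): reduce an HTML string to plain text.
function htmlToText(string $html): string
{
    // Drop <script> and <style> blocks entirely; strip_tags() alone would keep their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    // Put a space before each tag so text from adjacent elements does not fuse into one word.
    $html = str_replace('<', ' <', $html);
    // Remove the remaining tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Sample markup invented for the demonstration.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>An Example</p><p>of inserting &amp; more</p>'
      . '<script>var x = 1;</script></body></html>';

echo htmlToText($html), "\n";
```

For heavily nested or malformed pages a real HTML parser is likely to be more reliable than this kind of string clean-up.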

Answer 3

Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.
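One way this could look with plain string functions is sketched below, assuming PHP 7 or later. The helper name wordsWithPrefix and the sample sentence are invented for illustration; in real use the text would come from file_get_contents. Note that str_word_count only recognises ASCII letters, apostrophes and hyphens by default, so accented words may be split.

```php
<?php
// Hypothetical helper: return every word in $text that starts with one of $prefixes.
function wordsWithPrefix(string $text, array $prefixes): array
{
    $found = [];
    // Format 1 makes str_word_count() return the words themselves as an array.
    foreach (str_word_count($text, 1) as $word) {
        foreach ($prefixes as $prefix) {
            // strncmp() compares only the first strlen($prefix) bytes, case-sensitively.
            if (strncmp($word, $prefix, strlen($prefix)) === 0) {
                $found[] = $word;
                break; // Do not report the same word twice.
            }
        }
    }
    return $found;
}

// Stand-in for strip_tags(file_get_contents($url)); the sentence is made up.
$text = 'For Example, installing PHP is easy. Exactly one instance is needed; basins do not count.';

print_r(wordsWithPrefix($text, ['Exa', 'ins']));
```

The match is case-sensitive, so "Instance" would not be reported for the prefix "ins"; swapping strncmp for strncasecmp would relax that.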

Answer 4

Yes, this is possible. A potential approach would be to:

Use something like fopen (if allow_url_fopen is enabled; failing that, use cURL) to grab the external web page content.
Remove the (presumably not required) HTML tags via strip_tags.
Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.
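
The three steps above could be sketched roughly as follows, assuming PHP 7 or later. The function names fetchPage and tokensWithPrefix, the delimiter list, and the sample markup are all assumptions made for this example, and the fetch helper is left uncalled so the snippet runs without network access.

```php
<?php
// Hypothetical helper: fetch a URL with fopen() when URL wrappers are allowed, else with cURL.
function fetchPage(string $url): ?string
{
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $handle = @fopen($url, 'r');
        if ($handle === false) {
            return null;
        }
        $body = stream_get_contents($handle);
        fclose($handle);
        return $body === false ? null : $body;
    }
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the body instead of printing it.
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Follow redirects.
        $body = curl_exec($ch);
        return $body === false ? null : $body;
    }
    return null; // Neither transport is available.
}

// Hypothetical helper: strip the tags, then walk the text token by token with strtok().
function tokensWithPrefix(string $html, array $prefixes): array
{
    $found = [];
    $delimiters = " \t\r\n.,;:!?()[]{}\"'/"; // Assumed set of word separators.
    $token = strtok(strip_tags($html), $delimiters);
    while ($token !== false) {
        foreach ($prefixes as $prefix) {
            // strpos() === 0 means the token begins with the prefix.
            if (strpos($token, $prefix) === 0) {
                $found[] = $token;
                break;
            }
        }
        $token = strtok($delimiters); // Later calls continue on the same string.
    }
    return $found;
}

// Sample markup invented for the demonstration; real input would be fetchPage($url).
$html = '<p class="instance">An Example: we are installing PHP.</p>';
print_r(tokensWithPrefix($html, ['Exa', 'ins']));
```

Because strip_tags runs first, the class name "instance" inside the tag is not reported, only words from the visible text.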
