Grabbing proxies from website in Java?

I have been having trouble trying to get proxies from hidemyass. I was wondering if anybody could either tell me what I'm doing wrong or show me a way of fixing the following:

public void loadProxies() 
{
    proxies.clear();
    String html = null;
    String url = "http://hidemyass.com/proxy-list/";
    int page = 1;
    Pattern REPLACECRAP = Pattern.compile("<(span|div) style=\"display:none\">[\\s\\d\\s]*</(span|div)>");
    while (page <= this.pages) {
        status = "Scraping Proxies " + page + "/40";
        try {
            html = Jsoup.connect(url + page).get().html();
            org.jsoup.select.Elements ele = Jsoup.parse(html).getElementsByAttributeValueMatching("class", "altshade");
            for (Iterator localIterator = ele.iterator(); localIterator.hasNext();) { 
                Object s = localIterator.next();
                org.jsoup.select.Elements ele1 = Jsoup.parse(s.toString()).children();
                String text = ele1.toString().substring(ele1.toString().indexOf("</span>"), ele1.toString().indexOf("<span class=\"country\""));
                org.jsoup.select.Elements ele2 = Jsoup.parse(text).children();
                Matcher matcher = REPLACECRAP.matcher(ele2.toString());
                String better = matcher.replaceAll("");
                ele2 = Jsoup.parse(better).children();
                String done = ele2.text();
                String port = done.substring(done.lastIndexOf(" ") + 1);
                String ip = done.substring(0, done.lastIndexOf(" ")).replaceAll(" ", "");
                proxies.add(ip + ":" + port);
            }
            page++;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This does get some part of the proxy from the website although it seems to be mixing bits together like this:

PROXY:98210.285995154180237.6396219.54:3128
PROXY:58129158250.246.179237.4682139176:1080
PROXY:5373992110205212248.8199175.88107.15141185249:8080
PROXY:34596887144221.4.2449100134138186248.231:9000

Those are some of the results I get when running the above code, whereas I would want something like PROXY:210:197:182:294:8080

Any help with this would be greatly appreciated.

Unless you really want to do it this way, consider taking a look at http://import.io, which provides a tool to parse anything you want and export it as an API. If you're using Java, you can try http://thuzhen.github.io/facilitator/, which will help you get your data very quickly.

Parsing this website is going to take more than running a regex over the source.

It has been designed to make scraping difficult, mixing random data with display:none in with data that you're looking for.

To parse this correctly, you'll need to pick out the data marked as display:inline, and also parse the inline CSS before each row, which marks elements with certain class names as inline or none as appropriate.
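The CSS-parsing step could be sketched roughly as follows. This is a minimal illustration using only java.util.regex, assuming the inline rules have the simple form `.name{display:none}` or `.name{display:inline}`; the class name `HiddenClassFinder`, the method `hiddenClasses`, and the sample CSS string are hypothetical and not taken from the site's real markup.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiddenClassFinder {
    // Matches rules such as ".HE8g{display:none}". Assumption: class names
    // consist only of letters, digits, '_' and '-'.
    private static final Pattern HIDDEN_RULE = Pattern.compile(
            "\\.([A-Za-z0-9_-]+)\\s*\\{\\s*display\\s*:\\s*none\\s*;?\\s*\\}");

    // Returns the class names that the given CSS text marks as display:none.
    public static Set<String> hiddenClasses(String css) {
        Set<String> hidden = new LinkedHashSet<>();
        Matcher m = HIDDEN_RULE.matcher(css);
        while (m.find()) {
            hidden.add(m.group(1));
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Made-up sample in the same shape as the rules the site emits.
        String css = ".HE8g{display:none}\n.rI6a{display:inline}\n"
                + ".aHd-{display:none}\n.Ln16{display:inline}";
        System.out.println(hiddenClasses(css)); // prints [HE8g, aHd-]
    }
}
```

Collecting only the hidden names is enough here, since anything not explicitly hidden can be treated as visible.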

Also, when the website is designed to make scraping as difficult as possible, I'd expect them to regularly change the source in ways that will break scrapers that currently work.

HideMyAss uses a variety of tactics. And despite what people always say about "you can't do that with regex!", yes you can, at least with the help of regex: I wrote a scraper for HideMyAss that relies on it heavily. In addition to what you've taken out, you need to check for inline CSS like:

.HE8g{display:none}
.rI6a{display:inline}
.aHd-{display:none}
.Ln16{display:inline}

and remove any elements matching display none in the inline css:

<span class="HE8g">48</span>

These are interjected throughout the IP addresses, as are empty spans. As far as I remember, there are no empty divs that you need to worry about, but it wouldn't hurt to check for them.
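The removal steps described above might look something like the sketch below. It is not the scraper mentioned in this answer, just an illustration under stated assumptions: the hidden class names are already known and passed in, hidden elements are never nested inside each other, and attributes are written exactly as `class="..."` or `style="display:none"`. The names `RowCleaner` and `cleanIp` and the sample cell HTML are made up for the example.

```java
import java.util.Set;
import java.util.regex.Pattern;

public class RowCleaner {
    // Strips the obfuscation from one IP cell and returns the visible text.
    public static String cleanIp(String cellHtml, Set<String> hiddenClasses) {
        String s = cellHtml;
        // 1. Drop the inline <style> block itself.
        s = s.replaceAll("(?s)<style[^>]*>.*?</style>", "");
        // 2. Drop spans/divs hidden through an inline style attribute.
        s = s.replaceAll(
                "(?s)<(span|div)\\s+style=\"display:\\s*none\"[^>]*>.*?</\\1>", "");
        // 3. Drop spans/divs whose class the inline CSS marked display:none.
        for (String cls : hiddenClasses) {
            s = s.replaceAll("(?s)<(span|div)\\s+class=\""
                    + Pattern.quote(cls) + "\"[^>]*>.*?</\\1>", "");
        }
        // 4. Drop empty spans.
        s = s.replaceAll("<span[^>]*>\\s*</span>", "");
        // 5. Strip the remaining tags and any whitespace.
        return s.replaceAll("<[^>]+>", "").replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Made-up cell in the same style as the obfuscated rows.
        String cell = "<style>.HE8g{display:none}.rI6a{display:inline}</style>"
                + "<span class=\"HE8g\">48</span><span style=\"display:none\">12</span>"
                + "<span class=\"rI6a\">210</span><span></span>."
                + "<div style=\"display:none\">7</div>"
                + "<span style=\"display: inline\">197</span>.182"
                + "<span class=\"HE8g\">9</span>.<span class=\"rI6a\">94</span>";
        System.out.println(cleanIp(cell, Set.of("HE8g"))); // prints 210.197.182.94
    }
}
```

Doing the removals one pass at a time makes it easy to print the intermediate string after each step and see which obfuscation is still leaking through.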

There are a few gotchas but the obfuscated html is very predictable and has been for years.

It was easiest for me to solve by running against the same HTML source and removing the obfuscations step by step.

I know this is an old question, but good luck to anyone reading.

