Trouble grabbing text from a website using Jsoup

I'm trying to grab a price from an Amazon link.

Here's the HTML I'm focusing on:

<div class="buying" id="priceBlock">
    <table class="product">
        <tbody>
            <tr id="actualPriceRow">
                <td class="priceBlockLabelPrice" id="actualPriceLabel">Price:</td>
                <td id="actualPriceContent">
                    <span id="actualPriceValue">
                        <b class="priceLarge">
                                $1.99
                        </b>
                    </span>

                </td>
            </tr>
        </tbody>
    </table>
</div>                

I'm trying to grab that $1.99 text.

Here's the code I'm using to try to grab it:

protected Void doInBackground(Void... params) {
            try {
                // Connect to the web site
                Document document = Jsoup.connect(url).get();
                // Select the product table
                Elements trs = document.select("table.product");

                for (Element tr : trs)
                {
                    Elements tds = tr.select("b.priceLarge");
                    Element price1 = tds.first();
                    String str1 = price1.text();
                    System.out.println(str1);
                    String str2 = str1.replaceAll( "[$,]", "" );
                    double aInt = Double.parseDouble(str2);
                    System.out.println("Price: " + aInt);

                }

            } catch (IOException e) {
                e.printStackTrace();
            }

            return null;
        }

Why isn't this code working?

You have to set a user agent so the site won't reject your request as coming from a bot. You should also set an explicit timeout instead of relying on the default, which might be too short for you. Three seconds is a reasonable choice, but feel free to change it. If you don't want any limit, use timeout(0), which waits for as long as the server takes to respond. Your DOM traversal is also more roundabout than it needs to be, and it is what leads to the NullPointerException: first() returns null when the selector matches nothing, and calling text() on that null throws. Try this:

String url = "http://www.amazon.com/dp/B00H2T37SO/?tag=stackoverfl08-20";
// Send a browser-like user agent and set an explicit 3-second timeout
Document doc = Jsoup
                .connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36")
                .timeout(3000)
                .get();

// Select the price elements directly with a single CSS query
Elements prices = doc.select("table.product b.priceLarge");
for (Element pr : prices)
{
    String priceWithCurrency = pr.text();
    System.out.println(priceWithCurrency);
    // Strip the dollar sign and thousands separators before parsing
    String priceAsText = priceWithCurrency.replaceAll( "[$,]", "" );
    double priceAsNumber = Double.parseDouble(priceAsText);
    System.out.println("Price: " + priceAsNumber);
}
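
The parsing step in the loop above still assumes that the scraped text is always a clean price. As a minimal sketch of a more defensive version, the helper below returns an empty Optional instead of throwing when the text is missing or is not a number. The class name PriceParser and the method parsePrice are hypothetical names chosen for illustration; they are not part of Jsoup. It uses only the standard library, so it can be tried without a network connection.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper (not part of Jsoup): turns scraped price text into a number.
final class PriceParser {

    private PriceParser() {
    }

    // Returns the parsed price, or Optional.empty() when the text is null,
    // blank, or cannot be parsed as a number after removing "$" and ",".
    static Optional<BigDecimal> parsePrice(String text) {
        if (text == null) {
            return Optional.empty();
        }
        // Same cleanup as replaceAll("[$,]", "") in the answer, plus trimming.
        String cleaned = text.replaceAll("[$,]", "").trim();
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            // Text such as "Currently unavailable" ends up here.
            return Optional.empty();
        }
    }
}
```

Inside the loop, this could be called as PriceParser.parsePrice(pr.text()), printing the value only when it is present. BigDecimal is used here instead of double because it keeps the decimal digits exactly as written; that is a design choice of this sketch, not something the original answer requires.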
