Use jsoup to extract text from 'form' class with variable page data

First post here, so I'll do my best to keep this specific. I have been using Jsoup to extract data from a host of web pages to bring into a utility app. I have come across a page which updates its data dynamically based on the user's selection from a drop-down box. I can see the data when I inspect the HTML in Chrome, but I cannot seem to extract it. I can extract all the text elements around it, but anything dynamically generated won't come out.

The page I'm looking at has the form class below. Apologies for the wrapping; I couldn't get rid of it.

 <form class="variations_form cart" method="post" enctype="multipart/form-data" data-product_id="8044" data-product_variations="[{&quot;variation_id&quot;:8047,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:19.70,&quot;display_regular_price&quot;:19.70,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;500g&quot;},&quot;image_src&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/08\\/LABELS_500g-FOOD-Vann-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/08\\/LABELS_500g-FOOD-Vann.png&quot;,&quot;image_title&quot;:&quot;LABELS_500g-FOOD Vann&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/08\\/LABELS_500g-FOOD-Vann-746x1024.png 746w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/08\\/LABELS_500g-FOOD-Vann-475x652.png 475w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/08\\/LABELS_500g-FOOD-Vann.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\\&quot;price\\&quot;><span class=\\&quot;amount\\&quot;>$19.70<\\/span><\\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-500&quot;,&quot;weight&quot;:&quot;.5 
kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>500g<\\/p>\\n&quot;},{&quot;variation_id&quot;:8045,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:13.50,&quot;display_regular_price&quot;:13.50,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;1kg&quot;},&quot;image_src&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_1kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_1kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_1kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_1kg-FOOD-Van-746x1024.png 746w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_1kg-FOOD-Van-475x652.png 475w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_1kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\\&quot;price\\&quot;><span class=\\&quot;amount\\&quot;>$13.50<\\/span><\\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-1kg&quot;,&quot;weight&quot;:&quot;1 
kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>1kg<\\/p>\\n&quot;},{&quot;variation_id&quot;:8046,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:199.95,&quot;display_regular_price&quot;:199.95,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;3kg&quot;},&quot;image_src&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_3kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_3kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_3kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_3kg-FOOD-Van-746x1024.png 746w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_3kg-FOOD-Van-475x652.png 475w, http:\\/\\/www.sourcewebsite.com\\/wp-content\\/uploads\\/2014\\/09\\/LABELS_3kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\\&quot;price\\&quot;><span class=\\&quot;amount\\&quot;>$199.95<\\/span><\\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-3kg&quot;,&quot;weight&quot;:&quot;3 kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>3kg<\\/p>\\n&quot;}]"> <table 
class="variations" cellspacing="0"> <tbody> <tr> <td class="label"> <label for="size">Size</label> </td> <td class="value"> <select id="size" class="" name="attribute_size" data-attribute_name="attribute_size"> <option value="">Choose an option</option> <option value="500g">500g</option> <option value="1kg" selected="selected">1kg</option> <option value="3kg">3kg</option> </select><a class="reset_variations" href="#" style="visibility: visible; display: block;">Clear selection</a> </td> </tr> </tbody> </table> <div class="angelleye_buton_box_relative" style="position: relative;"> <div class="single_variation_wrap"> <div class="woocommerce-variation-description" style="border: 1px solid transparent;"> <p>1kg</p> </div> <div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span> </div> <div class="variations_button"> <div class="quantity"> <input type="number" step="1" name="quantity" value="1" title="Qty" class="input-text qty text" size="4" min="1"> </div> <button type="submit" class="single_add_to_cart_button button alt">Add to basket</button> <input type="hidden" name="add-to-cart" value="8044"> <input type="hidden" name="product_id" value="8044"> <input type="hidden" name="variation_id" class="variation_id" value="8045"> </div> </div> <div class="blockUI blockOverlay angelleyeOverlay" style="display:none;z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; opacity: 0.6; cursor: default; position: absolute; background: url(http://www.sourcewebsite.com/wp-content/plugins/woocommerce/assets/images/select2-spinner.gif) 50% 50% / 16px 16px no-repeat rgb(255, 255, 255);"></div> </div> </form> 

I am trying to extract the price "13.50" from the div below.

 <div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span> </div> 

My code is below:

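    // Needs: android.os.AsyncTask, java.io.IOException, org.jsoup.Jsoup,
    // org.jsoup.nodes.Document, org.jsoup.select.Elements
    private class ParseFoodPriceURL extends AsyncTask<String, Void, String> {

        @Override
        protected String doInBackground(String... strings) {
            StringBuilder buffer = new StringBuilder();
            try {
                // Fetches and parses the HTML as served; no JavaScript is executed.
                Document doc = Jsoup.connect(strings[0]).get();
                Elements foodPrice = doc.select("div.single_variation_wrap > div.single_variation");
                // On the rendered page this text already starts with "$".
                String priceTextSelection = foodPrice.text();
                buffer.append("Price: ").append(priceTextSelection);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return buffer.toString();
        }
    }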

Jsoup is not a browser, so it will not interpret or execute JavaScript. If the content of a website is generated dynamically, you can't use Jsoup directly. Two options come to mind:

  1. Identify the AJAX calls directly and get the information via these calls. Often the response is not HTML but JSON, so you may need other parsing libraries. This option is fast, but you need to investigate and understand how the webpage works.

  2. Use Selenium WebDriver with a real browser engine (PhantomJS, for example). This will load the website like a real browser, and you can access its contents much as you would with Jsoup. This is relatively easy to program, but slow and resource-hungry. If you run on Android, this may be too much. In any case, for Android the right tool for this seems to be Selenoid.
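
For this particular page there may be a shortcut in the spirit of option 1. The form in the question carries a `data-product_variations` attribute that appears to hold every variation's `display_price` as JSON. If that attribute is also present in the HTML that Jsoup downloads (an assumption worth checking against the raw page source), its value could be read with something like `doc.select("form.variations_form").attr("data-product_variations")` and searched for the wanted size. The sketch below covers only that second step and uses the JDK alone. The class name `VariationPrices` and the method `priceForSize` are made up for illustration, and the regular expression assumes `display_price` comes before `attributes` inside each variation, as it does in the markup above. A real JSON parser would be more robust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Jsoup or of the asker's app.
final class VariationPrices {

    // Finds a variation's display_price followed, within the same JSON
    // object, by its attribute_size. Assumes that key order.
    private static final Pattern PRICE_THEN_SIZE = Pattern.compile(
            "\"display_price\"\\s*:\\s*([0-9]+(?:\\.[0-9]+)?)"    // group 1: price
            + "[^{}]*?\"attributes\"\\s*:\\s*\\{[^{}]*?"
            + "\"attribute_size\"\\s*:\\s*\"([^\"]*)\"");          // group 2: size

    // Returns the price text for the given size, or empty if no variation matches.
    static Optional<String> priceForSize(String variationsJson, String size) {
        Matcher m = PRICE_THEN_SIZE.matcher(variationsJson);
        while (m.find()) {
            if (m.group(2).equals(size)) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Trimmed-down copy of the attribute value from the question,
        // after HTML entities such as &quot; have been decoded.
        String json =
                "[{\"variation_id\":8047,\"display_price\":19.70,\"display_regular_price\":19.70,"
                + "\"attributes\":{\"attribute_size\":\"500g\"},\"sku\":\"FOOD-Vanilla-500\"},"
                + "{\"variation_id\":8045,\"display_price\":13.50,\"display_regular_price\":13.50,"
                + "\"attributes\":{\"attribute_size\":\"1kg\"},\"sku\":\"FOOD-Vanilla-1kg\"}]";

        System.out.println(priceForSize(json, "1kg").orElse("not found")); // prints 13.50
    }
}
```

Because the attribute lists all sizes at once, this avoids simulating the drop-down selection. If the attribute turns out to be missing from the downloaded HTML, the two options above still apply.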
