
Deal with whitespaces in CSS class names with Jsoup

I want to select some supermarket product info from this page:

http://www.angeloni.com.br/super/index?grupo=15022

For that I should select <ul> tags with class "lstProd ":

If the class name were "lstProd" it would be easy, but the problem is the whitespace at the end of the name. I couldn't make Jsoup deal with it.

I tried the code below and other approaches, but I always get an empty list.

    org.jsoup.nodes.Document doc = Jsoup.connect("http://www.angeloni.com.br/super/index?grupo=15022").get();
    org.jsoup.select.Elements list = doc.select("ul.lstProd  ");

The code snippet from the HTML page that I want to get:

<ul class="lstProd  ">
    <li>
        <span class="cod">CÓD. 1341372</span>
        <span class="lnkImgProd">
            <a href="/super/produto?grupo=15022&amp;idProduto=1341372">
                <img src="http://assets.angeloni.com.br/files/images/7/1B/C6/1341372_1_V.jpg" width="120" height="120"
                     alt="Creme Dental SORRISO Super Refrescante Tubo 90g">
            </a>
                    </span>
        <div class="RgtDetProd">
            <div class="boxInfoProd">
                <span class="descr">
                    <a href="/super/produto?grupo=15022&amp;idProduto=1341372">Creme Dental SORRISO Super Refrescante
                        Tubo 90g</a>

                                    </span>

                <ul class="lstProdFlags after">
                </ul>
            </div>
...

I think you are facing two completely separate problems:

  1. Jsoup does not load the site you think it loads. The website you specified renders its contents via JavaScript and loads some content through AJAX after the initial page load. Jsoup can't deal with this. You either need to investigate the AJAX calls and fetch them directly with Jsoup, or use something like Selenium WebDriver to load the page in a real browser, which will render everything as you expect.

  2. CSS class names can't contain spaces for practical purposes [1]. In HTML, spaces are used as separators between class names. Hence <ul class="lstProd "> is the same as <ul class="lstProd">. In CSS selectors, however, a class name is specified by .className, i.e. a dot followed by the class name. You can concatenate several classes like this: element.select(".className1.className2")

[1] Technically you can put spaces in CSS class names, but you need to escape them with '\\ '. See https://mathiasbynens.be/notes/css-escapes or "Which characters are valid in CSS class names/selectors?"
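To illustrate point 2, here is a minimal sketch, assuming Jsoup is on the classpath. The HTML is a trimmed inline stand-in for the markup in the question (not the live page), and the names TrailingSpaceClassDemo and countMatches are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TrailingSpaceClassDemo {

    // Inline stand-in for the question's markup, so no network access is needed.
    static final String HTML =
        "<div id=\"abaProd\">"
      + "<ul class=\"lstProd  \"><li><span class=\"cod\">COD. 1341372</span></li></ul>"
      + "<ul class=\"lstProdFlags after\"></ul>"
      + "</div>";

    // Counts the elements in the given HTML that match the CSS query.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements found = doc.select(cssQuery);
        return found.size();
    }

    public static void main(String[] args) {
        // The trailing spaces in class="lstProd  " are separators, not part of the name.
        System.out.println(countMatches(HTML, "ul.lstProd"));            // 1
        // Two class names on one element are chained with dots.
        System.out.println(countMatches(HTML, "ul.lstProdFlags.after")); // 1
    }
}
```

If a selector like this matches on inline HTML but returns nothing for the live URL, the likely cause is point 1: the element is not in the HTML that Jsoup downloaded.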

edit: be more precise about CSS class names

CSS class names CAN contain whitespaces.
And <ul class="lstProd "> is NOT the same as <ul class="lstProd">.

I can also see that you have multiple <ul> elements with the same class name.
A better way to inspect or traverse such elements is with nth-child.
So to find your required selector you can use #abaProd > ul:nth-child(4)
For more details, see the documentation on nth-child.
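As a rough illustration of the nth-child approach, here is a small sketch, assuming Jsoup is on the classpath. The page layout is invented (three placeholder siblings before the list), because the real structure of #abaProd is not shown in the question. NthChildDemo and firstText are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NthChildDemo {

    // Hypothetical layout: the product list is the fourth child of #abaProd.
    static final String HTML =
        "<div id=\"abaProd\">"
      + "<h3>Title</h3>"
      + "<p>Filter</p>"
      + "<p>Sort</p>"
      + "<ul class=\"lstProd  \"><li>Creme Dental</li></ul>"
      + "</div>";

    // Returns the text of the first element matching the query, or null if none matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(cssQuery).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // The fourth child of #abaProd is the <ul>, so this matches.
        System.out.println(firstText(HTML, "#abaProd > ul:nth-child(4)")); // Creme Dental
        // The second child is a <p>, not a <ul>, so nothing matches.
        System.out.println(firstText(HTML, "#abaProd > ul:nth-child(2)")); // null
    }
}
```

Note that nth-child counts all element siblings, not only <ul> elements, so the index depends on everything that precedes the list inside the parent.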
