Using jsoup to scrape data between specific tags on a webpage

Question

I am developing a program that collects the five most recent fanfiction stories added to my Ao3 (Archive of Our Own) fandom. These stories are then added to an ArrayList that holds the fanfiction submissions from the past week. At the end of every week I plan to dump the ArrayList's contents into a text file that I can paste into a Reddit post for my subreddit. To prevent duplicates, I want to compare the newly parsed stories with the stories already held in the ArrayList.

(Additional info: the bot will check the webpage every 30 minutes.)

The part I'm getting caught up on is the actual parsing of the webpage and getting the content from between the HTML tags.

I looked up CSS selectors, but I'm still thoroughly confused, as almost every example used what seems like an easy website to scrape, such as IMDb.

From basic research, it looks like the stories in the main body of the page are all inside an ordered list tag.

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">...</li>
    <li class="work blurb group" id="work_9656693" role="article">...</li>
    <li class="work blurb group" id="work_11814486" role="article">...</li>
    <!-- Goes on for ~20 more stories -->
    <li class="work blurb group" id="work_11687247" role="article">...</li>
</ol>

So for clarity's sake, each list item is a single story within the ordered list. Everything inside one list item is shown below (ordered list tag added for context).

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">
  <!--title, author, fandom-->
  <div class="header module">
    <h4 class="heading">
      <a href="/works/10504812">Pocket Healer</a>
      by

      <!-- do not cache -->
      <a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a> 
    </h4>
    <h5 class="fandoms heading">
      <span class="landmark">Fandoms:</span>
      <a class="tag" href="/tags/Overwatch%20(Video%20Game)/works">Overwatch (Video Game)</a>
      &nbsp;
    </h5>
    <!--required tags-->
    <ul class="required-tags">
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="No Archive Warnings Apply"><span class="text">No Archive Warnings Apply</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="F/F"><span class="text">F/F</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="Work in Progress"><span class="text">Work in Progress</span></span></a></li>
</ul>
    <p class="datetime">17 Aug 2017</p>
  </div>
  <!--warnings again, cast, freeform tags-->
  <h6 class="landmark heading">Tags</h6>
  <ul class="tags commas">
    <li class="warnings"><strong><a class="tag" href="/tags/No%20Archive%20Warnings%20Apply/works">No Archive Warnings Apply</a></strong></li><li class="relationships"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works">Fareeha "Pharah" Amari/Angela "Mercy" Ziegler</a></li><li class="characters"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari/works">Fareeha "Pharah" Amari</a></li> <li class="characters"><a class="tag" href="/tags/Angela%20%22Mercy%22%20Ziegler/works">Angela "Mercy" Ziegler</a></li> <li class="characters"><a class="tag" href="/tags/Winston%20(Overwatch)/works">Winston (Overwatch)</a></li> <li class="characters"><a class="tag" href="/tags/Lena%20%22Tracer%22%20Oxton/works">Lena "Tracer" Oxton</a></li><li class="freeforms"><a class="tag" href="/tags/Tiny%20Pharah%20and%20Tiny%20Mercy/works">Tiny Pharah and Tiny Mercy</a></li> <li class="freeforms"><a class="tag" href="/tags/Fluff/works">Fluff</a></li> <li class="freeforms last"><a class="tag" href="/tags/Cute/works">Cute</a></li>
  </ul>
  <!--summary-->
    <h6 class="landmark heading">Summary</h6>
    <blockquote class="userstuff summary">
      <p>Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.</p>
    </blockquote>
  <!--stats-->

  <dl class="stats">
      <dt class="language">Language:</dt>
      <dd class="language">English</dd>
    <dt class="words">Words:</dt>
    <dd class="words">35,143</dd>
    <dt class="chapters">Chapters:</dt>
    <dd class="chapters">10/11</dd>
    <dt class="comments">Comments:</dt>
    <dd class="comments"><a href="/works/10504812?show_comments=true&amp;view_full_work=true#comments">168</a></dd>
    <dt class="kudos">Kudos:</dt>
    <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd>
    <dt class="bookmarks">Bookmarks:</dt>
    <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd>
    <dt class="hits">Hits:</dt>
    <dd class="hits">5890</dd>
  </dl>
</li>
</ol>

Basically, I want to extract the title, author, URL, summary, and rating.

So far I've located the items I want to extract, but I have no idea how to actually extract them.

Title:

<a href="/works/10504812">Pocket Healer</a>

Author:

<a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a>

URL:

<li class="work blurb group" id="work_10504812" role="article">
<!--(http://archiveofourown.com/works/<the number after 'work_'>)-->

Summary:

<blockquote class="userstuff summary">
    <p> (SUMMARY GOES HERE) </p>
</blockquote>

Rating:

<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>

Additional question: is it possible to iterate through the contents of the ordered list with something like a for loop?

The code I currently have for opening the webpage is below.

    while (true) {
        try {

            String url = "http://archiveofourown.org/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works";
            Document doc = Jsoup.connect(url).get();

            //Returns element of webpage
            doc.select("<Narrow down to ordered list>");

            //Run for loop to run through first 5 items of 
            Thread.sleep(THIRTY_MINUTES);

        }
        catch (Exception ex) {
            ex.printStackTrace();
        }

    }

Answer 1

You can use the Document.select(String cssSelector) method, which returns an Elements collection that you can iterate over. For example, the selector ol.work > li returns all li elements that are direct children of the ol.work element. You can use this to iterate over all the stories.

Consider the following code:

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Each direct <li> child of <ol class="work ..."> is one story.
Elements stories = doc.select("ol.work > li");

for (Element li : stories) {
    // Every select call below is scoped to the current <li>.
    String title = li.select("h4.heading a").first().text();   // first link in the heading is the title
    String author = li.select("h4.heading a[rel=author]").text();
    String id = li.attr("id").replace("work_", "");             // "work_10504812" -> "10504812"
    String url = "http://archiveofourown.com/works/" + id;
    String summary = li.select("blockquote.summary").text();
    String rating = li.select("span.rating").text();

    System.out.println("Title: " + title);
    System.out.println("Author: " + author);
    System.out.println("ID: " + id);
    System.out.println("URL: " + url);
    System.out.println("Summary: " + summary);
    System.out.println("Rating: " + rating);
}

In this example we loop over all the li elements and extract the expected content. As you can see, every select call is made on the current li element, so it only searches inside that story. The text() method returns the content of the matched element(s) as plain text, with any tags removed.

Running this code against the HTML from your question produces the following output:
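
The question also asks for only the five most recent stories and for a URL built from the number after work_. Those two steps can be sketched without jsoup, as below. The names BlurbHelpers, workId, workUrl, and firstN are hypothetical and not part of jsoup. Because jsoup's Elements is a java.util.List of Element, passing doc.select("ol.work > li") to firstN with n = 5 should limit the loop to the first five stories, assuming the page lists the newest works first.

```java
import java.util.Arrays;
import java.util.List;

class BlurbHelpers {
    // Strips the "work_" prefix from an id attribute such as "work_10504812".
    static String workId(String liId) {
        return liId.startsWith("work_") ? liId.substring("work_".length()) : liId;
    }

    // Builds the story URL in the same form the answer's loop prints.
    static String workUrl(String liId) {
        return "http://archiveofourown.com/works/" + workId(liId);
    }

    // Returns a view of at most n leading items (assumes n >= 0).
    static <T> List<T> firstN(List<T> items, int n) {
        return items.subList(0, Math.min(n, items.size()));
    }

    public static void main(String[] args) {
        System.out.println(workUrl("work_10504812"));
        // prints http://archiveofourown.com/works/10504812
        System.out.println(firstN(Arrays.asList("a", "b", "c", "d", "e", "f"), 5));
        // prints [a, b, c, d, e]
    }
}
```

Using Math.min keeps firstN safe when the page holds fewer than five stories.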

Title: Pocket Healer
Author: OverNoot
ID: 10504812
URL: http://archiveofourown.com/works/10504812
Summary: Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.
Rating: General Audiences
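
For the duplicate check mentioned in the question, one option is to remember the work IDs already collected and skip any story whose ID has been seen. The sketch below is only an illustration under that assumption. WeeklyStories, addIfNew, and entries are hypothetical names, and the one-line entry format is made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WeeklyStories {
    private final List<String> entries = new ArrayList<>(); // one formatted line per story
    private final Set<String> seenIds = new HashSet<>();    // work IDs collected so far

    // Adds the story only if its work ID is new; returns true when it was added.
    boolean addIfNew(String id, String title, String author) {
        if (!seenIds.add(id)) {
            return false; // already collected during an earlier check
        }
        entries.add(title + " by " + author + " (http://archiveofourown.com/works/" + id + ")");
        return true;
    }

    // Returns a copy so callers cannot modify the internal list.
    List<String> entries() {
        return new ArrayList<>(entries);
    }

    public static void main(String[] args) {
        WeeklyStories week = new WeeklyStories();
        System.out.println(week.addIfNew("10504812", "Pocket Healer", "OverNoot")); // prints true
        System.out.println(week.addIfNew("10504812", "Pocket Healer", "OverNoot")); // prints false
        System.out.println(week.entries().size());                                  // prints 1
    }
}
```

Keying on the work ID rather than the title means two different stories with the same title are both kept.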

I hope it helps.

Solution 1
0 Accepted 2017-08-18 16:36:25
