
How to structure the results of a successful web scrape using Apify and Puppeteer?

Using Apify and Puppeteer, I want to scrape the data table from the following URL:

https://en.wikipedia.org/wiki/List_of_hedge_funds

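I want the result to be an array of objects. Each element of the array should represent one row (<tr>) of the source table and be a JS object with the following properties.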

{ firmName, firmUrl, hq, hqUrl, aum, }

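where:

  • firmName is the innerText of the first <td> element in each row.
  • firmUrl is the href attribute of the link in the first <td> element in each row.
  • hq is the innerText of the second <td> element in each row.
  • hqUrl is the href attribute of the link in the second <td> element in each row.
  • aum is the innerText of the third <td> element in each row.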
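
The mapping described in the list above can be sketched as a small pure function. This is only an illustration: it assumes each row has already been reduced to an array of `{ text, href }` cells, which is a hypothetical intermediate shape and not something Apify provides, and `toFundRecord` is a made-up name.

```javascript
// Hypothetical helper: turns one table row, given as an array of
// { text, href } cells, into the record shape described above.
function toFundRecord(cells) {
  // Missing cells fall back to empty objects so the function never throws.
  const [firm = {}, hq = {}, aum = {}] = cells;
  return {
    firmName: (firm.text || '').trim(),
    firmUrl: firm.href || null,
    hq: (hq.text || '').trim(),
    hqUrl: hq.href || null,
    aum: (aum.text || '').trim(),
  };
}

// Example row using the values quoted in the question.
const example = toFundRecord([
  { text: 'Bridgewater Associates', href: '/wiki/Bridgewater_Associates' },
  { text: 'Westport, Connecticut', href: '/wiki/Westport,_Connecticut' },
  { text: '$132,050' },
]);
console.log(example);
```

Keeping the record construction separate from the DOM lookup makes the target shape easy to check without running a scrape.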

Specifically, I would like to see an object like the following returned to me.

What I would like to see, alternative A:
[
  {
    "url": "https://en.wikipedia.org/wiki/List_of_hedge_funds",
    "pageTitle": "List of hedge funds - Wikipedia",
    "links": {
      firmName: "Bridgewater Associates",
      firmUrl: "/wiki/Bridgewater_Associates",
      hq: "Westport, Connecticut",
      hqUrl: "/wiki/Westport,_Connecticut",
      aum: "$132,050",
    },  
  },
  // ...x39 more times
]

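Alternatively, the object might look like the following (I don't know which of the two is possible, which is part of my confusion).

What I would like to see, alternative B: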
// The function accepts a single argument: the "context" object.
// For a complete list of its properties and functions,
// see https://apify.com/apify/web-scraper#page-function 
async function pageFunction( context ) {
    const url = 'https://en.wikipedia.org/wiki/List_of_hedge_funds';
    const TITLE_SELECTOR = 'title';
    const ANCHOR_SELECTOR = 'tr > td > a';
    const HREF_SELECTOR = 'href';

    // jQuery is handy for finding DOM elements and extracting data from them.
    //  To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    const pageTitle = $( TITLE_SELECTOR ).first().text();
    const anchorTag = $( ANCHOR_SELECTOR );
    const links = [];
    anchorTag.each((index, item,) => {
      const link = $(item).attr( HREF_SELECTOR );
      if( link ) links.push( link );
    });

    return {
      url: context.request.url,
      pageTitle,
      links,
    };
}

Instead, I am actually seeing the following result.

What I actually see:
[
  {
    "url": "https://en.wikipedia.org/wiki/List_of_hedge_funds",
    "pageTitle": "List of hedge funds - Wikipedia",
    "links": [
      "/wiki/Bridgewater_Associates",
      "/wiki/Westport,_Connecticut",
      "/wiki/Renaissance_Technologies",
      "/wiki/East_Setauket,_New_York",
      "/wiki/Man_Group",
      "/wiki/London",
      "/wiki/AQR_Capital_Management",
      "/wiki/Greenwich,_Connecticut",
      "/wiki/Two_Sigma_Investments",
      "/wiki/New_York_City,_New_York",
      "/wiki/Millennium_Management,_LLC",
      "/wiki/New_York_City,_New_York",
      "/wiki/Elliott_Management",
      "/wiki/New_York_City,_New_York",
      "/wiki/BlackRock",
      "/wiki/New_York_City,_New_York",
      "/wiki/Citadel_LLC",
      "/wiki/Chicago,_IL",
      "/wiki/Davidson_Kempner_Capital_Management",
      "/wiki/New_York_City,_New_York",
      "/wiki/Viking_Global_Investors",
      "/wiki/Greenwich,_Connecticut",
      "/wiki/Baupost_Group",
      "/wiki/Boston,_MA",
      "/wiki/DE_Shaw_%26_Co.",
      "/wiki/New_York_City,_New_York",
      "/wiki/Farallon_Capital",
      "/wiki/San_Francisco,_CA",
      "/wiki/Marshall_Wace",
      "/wiki/London",
      "/wiki/The_Children%27s_Investment_Fund_Management",
      "/wiki/London",
      "/wiki/Wellington_Management_Company",
      "/wiki/Boston,_MA",
      "/wiki/Winton_Group",
      "/wiki/London",
      "/wiki/Capula_Investment_Management",
      "/wiki/London",
      "/wiki/York_Capital_Management",
      "/wiki/New_York_City,_NY"
    ]
  }
]
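
The list comes out flat because the selector `tr > td > a` matches every link in every cell, with no grouping by row. In this particular output the entries alternate between a firm link and a headquarters link, so as a stopgap they could be paired up after the fact. The sketch below does that under the assumption that every row contributes exactly two links; `pairLinks` is a hypothetical helper name. It is fragile (one row with a missing or extra link shifts every later pair) and it cannot recover the names or the AUM figures, so grouping by row inside the pageFunction is the more robust route.

```javascript
// Hypothetical helper: pairs up a flat [firmUrl, hqUrl, firmUrl, hqUrl, ...]
// list. Assumes exactly two links per table row; a trailing unpaired link
// is dropped.
function pairLinks(links) {
  const pairs = [];
  for (let i = 0; i + 1 < links.length; i += 2) {
    pairs.push({ firmUrl: links[i], hqUrl: links[i + 1] });
  }
  return pairs;
}

// First four entries of the result shown above.
const firstRows = pairLinks([
  '/wiki/Bridgewater_Associates',
  '/wiki/Westport,_Connecticut',
  '/wiki/Renaissance_Technologies',
  '/wiki/East_Setauket,_New_York',
]);
console.log(firstRows);
```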

I am using the following code as my pageFunction.

Page function:
// The function accepts a single argument: the "context" object.
// For a complete list of its properties and functions,
// see https://apify.com/apify/web-scraper#page-function
async function pageFunction( context ) {
    const url = 'https://en.wikipedia.org/wiki/List_of_hedge_funds';
    const TITLE_SELECTOR = 'title';
    const ANCHOR_SELECTOR = 'tr > td > a';
    const HREF_SELECTOR = 'href';

    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    const pageTitle = $( TITLE_SELECTOR ).first().text();
    const anchorTag = $( ANCHOR_SELECTOR );
    const links = [];
    anchorTag.each((index, item,) => {
      const link = $(item).attr( HREF_SELECTOR );
      if( link ) links.push( link );
    });

    return {
      url: context.request.url,
      pageTitle,
      links,
    };
}

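How do I need to change my code?

Your code looks good; you only need to change how the data in the table is parsed. Here is a working example of a pageFunction.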

// The function accepts a single argument: the "context" object.
// For a complete list of its properties and functions,
// see https://apify.com/apify/web-scraper#page-function
async function pageFunction( context ) {
    const TITLE_SELECTOR = 'title';
    // Select whole table rows instead of individual links,
    // so that the cells of one row stay together.
    const LINE_SELECTOR = '.wikitable tr';

    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    const pageTitle = $( TITLE_SELECTOR ).first().text();
    const lines = $( LINE_SELECTOR );
    const links = [];
    lines.each((index, item) => {
        const columns = $(item).find('td');
        // .eq() is zero-based: cell 0 is skipped here, cell 1 holds the
        // firm and cell 2 the headquarters.
        const link = {
          firmName: columns.eq(1).text().trim(),
          firmUrl: columns.eq(1).find('a').eq(0).attr('href'),
          hq: columns.eq(2).text().trim(),
          hqUrl: columns.eq(2).find('a').eq(0).attr('href'),
          // Assumes the AUM figure sits in the next cell (index 3).
          aum: columns.eq(3).text().trim(),
        };
        // Rows without a firm link (such as header rows) are skipped.
        if (link.firmUrl) {
            links.push(link);
        }
    });

    return {
      url: context.request.url,
      pageTitle,
      links,
    };
}
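
The `href` values extracted this way are site-relative (for example `/wiki/Bridgewater_Associates`). If absolute URLs are wanted, the standard `URL` constructor can resolve them against the page URL. A minimal sketch, where `absolutize` is a hypothetical helper name and not part of Apify or jQuery:

```javascript
// Hypothetical helper: resolves a relative href against the page URL.
// Returns null when the cell had no link (href is undefined or empty).
function absolutize(href, pageUrl) {
  return href ? new URL(href, pageUrl).href : null;
}

const pageUrl = 'https://en.wikipedia.org/wiki/List_of_hedge_funds';
console.log(absolutize('/wiki/Bridgewater_Associates', pageUrl));
// → https://en.wikipedia.org/wiki/Bridgewater_Associates
```

Inside a pageFunction, `context.request.url` could serve as the base, for instance `absolutize(link.firmUrl, context.request.url)`.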

