
Cheerio Not Parsing HTML Correctly

I have an array of rows that I parsed out of an HTML table and stored in a list. Each row in the list is a string that looks (something) like this:

["<td headers="DOCUMENT" class="t14data"><a target="6690-Exhibit-C-20190611-1" href="http://www.fara.gov/docs/6690-Exhibit-C-20190611-1.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">6690</td><td headers="REGISTRANTNAME" class="t14data">SKDKnickerbocker LLC</td><td headers="DOCUMENTTYPE" class="t14data">Exhibit C</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>","<td headers="DOCUMENT" class="t14data"><a target="5334-Supplemental-Statement-20190611-30" href="http://www.fara.gov/docs/5334-Supplemental-Statement-20190611-30.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">5334</td><td headers="REGISTRANTNAME" class="t14data">Commonwealth of Dominica Maritime Registry, Inc.</td><td headers="DOCUMENTTYPE" class="t14data">Supplemental Statement</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>"]

The HTML is pulled from the page with puppeteer's page.evaluate function.

I'd then like to parse this HTML with cheerio, which I find simpler and easier to understand. However, when I pass each of the HTML strings into cheerio, it fails to parse them correctly. Here's the function I'm currently using:

    let data = res.map((tr) => {
        let $ = cheerio.load(tr);
        const link = $("a").attr("href");
        const number = $("td[headers='REGISTRATIONNUMBER']").text();
        const name = $("td[headers='REGISTRANTNAME']").text();
        const type = $("td[headers='DOCUMENTTYPE']").text();
        const date = $("td[headers='STAMPED/RECEIVEDDATE']").text();
        return { link, number, name, type, date };
    });

For some reason, only the "a" tag works correctly for each row: the "link" variable is correctly defined, but none of the others are. When I use $("*") to return a list of what should be all of the td's, it returns an unusual node list:

[Image: screenshot of the returned node list]

What am I doing wrong, and how can I get access to the td's with the various headers, and their text content? Thanks!

It usually looks more like this:

    // Assumes $ = cheerio.load(fullPageHtml) and res is a Cheerio selection of <tr> elements
    let data = res.map((i, tr) => {
        const link   = $(tr).find("a").attr("href");
        const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
        const name   = $(tr).find("td[headers='REGISTRANTNAME']").text();
        const type   = $(tr).find("td[headers='DOCUMENTTYPE']").text();
        const date   = $(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text();
        return { link, number, name, type, date };
    }).get();

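Keep in mind that cheerio's map passes its callback arguments in the reverse order from JavaScript's Array map: (index, element) rather than (element, index). The trailing .get() converts the resulting Cheerio object into a plain array.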
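As for why the original per-row approach found only the link: each string is a run of bare `<td>` elements, and a spec-compliant HTML5 parser (which cheerio uses by default) ignores `<td>` start tags that appear outside a table. That is most likely why the `<a>` survives while the cells do not. A minimal sketch of one workaround, assuming the row strings look like the sample in the question, is to wrap each string in table markup before loading it. The helper name `wrapRowHtml` is hypothetical and not part of cheerio:

```javascript
// Hypothetical helper: wraps the inner HTML of a table row (a run of <td> cells)
// in the table markup an HTML parser needs in order to keep the cells.
function wrapRowHtml(cellsHtml) {
  return `<table><tbody><tr>${cellsHtml}</tr></tbody></table>`;
}

// Shortened sample row, modeled on the strings shown in the question.
const sampleRow =
  '<td headers="REGISTRATIONNUMBER" class="t14data">6690</td>' +
  '<td headers="DOCUMENTTYPE" class="t14data">Exhibit C</td>';

const wrappedRow = wrapRowHtml(sampleRow);
console.log(wrappedRow);
```

With this in place, the loop in the question would call `cheerio.load(wrapRowHtml(tr))` instead of `cheerio.load(tr)`, and the `td[headers='...']` selectors should then match.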

I found the solution. I'm simply returning the full HTML through puppeteer instead of trying to get individual rows, and then using the above suggestion (from @pguardiario) to parse the text:

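    const cheerio = require("cheerio");
    const moment = require("moment");

    // Grab the whole page's HTML instead of individual rows
    const res = await page.evaluate(() => {
        return document.body.innerHTML;
    });

    const $ = cheerio.load(res);
    const trs = $(".t14Standard tbody tr.highlight-row");

    // Cheerio's map passes (index, element); .get() converts the result to a plain array
    const data = trs.map((i, tr) => {
        const link = $(tr).find("a").attr("href");
        const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
        const registrant = $(tr).find("td[headers='REGISTRANTNAME']").text();
        const type = $(tr).find("td[headers='DOCUMENTTYPE']").text();
        // Parse the MM/DD/YYYY text explicitly, then convert to an epoch-milliseconds string
        const date = moment($(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text(), "MM/DD/YYYY").valueOf().toString();
        return { link, number, registrant, type, date };
    }).get();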
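The date step above converts text such as 06/11/2019 into an epoch-milliseconds string. If pulling in moment only for that step feels heavy, a dependency-free sketch might look like the following. It assumes the dates are always MM/DD/YYYY and, unlike moment's default local-time parsing, interprets them as midnight UTC; `usDateToEpochMs` is a hypothetical name introduced here for illustration:

```javascript
// Hypothetical helper: converts an MM/DD/YYYY string to epoch milliseconds
// (returned as a string), treating the date as midnight UTC.
// Returns null when the input does not match MM/DD/YYYY.
function usDateToEpochMs(text) {
  const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(text.trim());
  if (!match) return null;
  // match[0] is the whole string; the capture groups follow it.
  const [, month, day, year] = match.map(Number);
  // Date.UTC expects a zero-based month.
  return Date.UTC(year, month - 1, day).toString();
}

console.log(usDateToEpochMs("06/11/2019")); // "1560211200000"
console.log(usDateToEpochMs("not a date")); // null
```

Because this uses UTC rather than the machine's local zone, the values can differ from the moment-based version by the local UTC offset.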
