
Scrape web site generated by JavaScript

I think this is a really challenging one!

I write a website for my local football league, www.rdyfl.co.uk, and include JavaScript code snippets from the FA's Full-Time system, where we generate our fixtures, to link in tables, fixtures, recent results, etc.

For another feature I want to add to the site, I need to scrape the 'Upcoming Fixtures' for each age group and division, but when I examine the source I have two problems.

  1. The fixtures content is generated by JavaScript, so I need to see the generated source and not just the original page source.

  2. When I view the generated source using Firefox, the team names are actually further JavaScript links and not the names themselves.

I basically want to download the fixtures on a regular basis and write them to a MySQL database.

I have asked the FA, and they have no other options available for accessing the data.

Having never coded for scraping before, can anyone point me to a simple solution, or does anyone fancy the challenge?

This question was asked a long time ago, but I noticed it was active today 🤷.

You should be able to scrape the website using a headless browser such as Puppeteer. With Puppeteer you can open a URL and execute JavaScript or interact with the website as you would with an ordinary browser. Parsing the output DOM and storing it should then be relatively straightforward.
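As a rough illustration of that render, parse and store flow, here is a minimal Python sketch. Several things in it are assumptions rather than anything from the question: it swaps in Playwright's Python API as the headless browser instead of Puppeteer (install with `pip install playwright` and `playwright install chromium`), it assumes the team names are simply the visible text of `<a>` elements in the rendered page, and it uses the standard library's `sqlite3` with a hypothetical `team` table as a stand-in for MySQL.

```python
import sqlite3
from html.parser import HTMLParser


def render_html(url: str) -> str:
    """Load url in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so the parsing and storage helpers below work
    # even when Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # "networkidle" waits until the page has stopped making requests.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element, ignoring its href."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.texts.append(text)


def link_texts(html: str) -> list:
    """Return the text of each link in html, in document order."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def store_names(conn, names) -> None:
    """Insert names into the hypothetical team table, skipping duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS team (name TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO team (name) VALUES (?)", [(n,) for n in names]
    )
    conn.commit()
```

A scheduled job could then call `store_names(conn, link_texts(render_html(url)))`. In practice you would probably narrow the parser to the fixtures table rather than every link on the page, and swap the SQLite connection for a MySQL driver.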

There are plenty of articles on scraping with Puppeteer.

The latest version of OutWit Hub does a pretty good job on dynamic content. The source that OutWit scrapes to extract links, images, documents, tables and text is the updated DOM. You can certainly build a job to grab what you need using these. Custom scrapers are still applied to the static source in version 1.0.3, but version 1.1.x (still in beta) will offer the choice between the static source and the dynamically modified DOM.

Scraping content produced by JavaScript is challenging. AFAIK you will need to do this with AJAX. Hopefully the content has some CSS classes, or at least some ids, that you can grab with jQuery. Do you have ids or classes that you can grab?
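To show what grabbing by class might look like once you have the generated markup, here is a small sketch using Python's standard `html.parser` rather than jQuery. The class name `team` and the sample markup in the usage note are made up for illustration, since the real page structure isn't shown in the question, and the depth counting assumes reasonably well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextParser(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self._depth = 0      # nesting depth inside the current matching element
        self._chunks = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1          # a child of the matching element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The matching element just closed: normalise and keep its text.
            self.matches.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def texts_with_class(html, wanted_class):
    """Return the text of each element in html whose class list has wanted_class."""
    parser = ClassTextParser(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.matches
```

For example, `texts_with_class('<td class="home team"><a href="javascript:void(0)">Rovers U12</a></td>', "team")` picks out the cell's text regardless of the link wrapped around it.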
