简体   繁体   English

如何抓取网页并从中提取信息?

[英]How to scrape webpages and extract information from them?

As a student in chemistry, I repeatedly have to look up molecules and get their SMILES string.作为一名化学专业的学生,我必须反复查找分子并获取它们的 SMILES 字符串。 SMILES strings are a mechanism that help us recreate the molecule in all sorts of chemical software. SMILES 字符串是一种机制,可以帮助我们在各种化学软件中重新创建分子。

For example, consider Alanine.例如,考虑丙氨酸。 I will search Alanine and go to the PubChem link .我将搜索丙氨酸和 go 到PubChem 链接 There I will look for the "Canonical SMILES" section, and copy paste the SMILES string into the code I am using.在那里,我将查找“Canonical SMILES”部分,并将 SMILES 字符串复制粘贴到我正在使用的代码中。

If it is just one molecule, I might as well do the above.如果只是一个分子,我还不如做上面的。 However, I now have to do this for 20 molecules.但是,我现在必须对 20 个分子执行此操作。 This seems like a lot of googling, clicking, and copy-pasting.这似乎需要大量的谷歌搜索、点击和复制粘贴。

Is there a way to automate this process?有没有办法自动化这个过程? Are there python libraries I can use for such a process?是否有可用于此类过程的 python 库? Can you do the same tricks using grep/awk on information on webpages?你能在网页信息上使用 grep/awk 做同样的技巧吗?

The module I use to scrape webpages might be able to help?我用来抓取网页的模块可能会有所帮助? All of the other web scraping modules are really complicated but have more features.所有其他 web 抓取模块都非常复杂,但功能更多。 the requests module just gets the exact data from a website, if you scrape a.html document it would return somthing that looked like this <html><head><title>test</title></head></html> , just the raw data. requests 模块只是从网站获取确切的数据,如果您抓取 a.html 文档,它将返回类似于<html><head><title>test</title></head></html>的内容,只是原始数据。 it could be more helpful for getting more information but it can be a little bit more frustrating if you just want a specific part of the page.它可能对获取更多信息更有帮助,但如果您只想要页面的特定部分,它可能会更令人沮丧。

the code to use it looks like this使用它的代码看起来像这样

import requests

data = requests.get("google.com")
print(data)

Before you do any of this most websites have an API that can return exactly the data you need from the site within your code, in the footer they should have a developers link if they have an API在您执行任何此操作之前,大多数网站都有一个 API 可以在您的代码中从该站点准确返回您需要的数据,如果他们有一个 API,在页脚中应该有一个开发人员链接

To return the html document.返回 html 文档。 (make sure to pip install requests!) (确保 pip 安装请求!)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM