
Extract specific information from HTML with Python

I am trying to extract information such as prices and vendors from Amazon.

The way I do this now is to search the page for keywords such as "price" and then find the information I want from there.

The problem is that if a website such as Amazon changes its page layout even a little, the code might not work anymore.

I am wondering if there is a better way or algorithm for doing this kind of thing.

Thanks!

You want to access data from a website. What you describe is essentially a handcrafted API (Application Programming Interface).

One of the main flaws of a handcrafted API is exactly what you mentioned: the owner of the webpage could make a small change that would render your API unusable.
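To make that fragility concrete, here is a minimal sketch of a handcrafted extractor built on Python's standard `html.parser`. The markup, the `id="price"` attribute, and the names `PriceParser` and `extract_price` are all made up for illustration; they are not Amazon's actual page structure.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Grab the first non-empty text that follows a tag with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if self.price is None and dict(attrs).get("id") == self.target_id:
            self._inside = True

    def handle_data(self, data):
        # Simplification: takes the first non-blank text after the matching tag
        if self._inside and data.strip():
            self.price = data.strip()
            self._inside = False


def extract_price(html, target_id="price"):
    parser = PriceParser(target_id)
    parser.feed(html)
    return parser.price


# Hypothetical markup before and after a small site change
old_page = '<div><span id="price">$19.99</span></div>'
new_page = '<div><span id="priceblock">$19.99</span></div>'

print(extract_price(old_page))  # $19.99
print(extract_price(new_page))  # None
```

Renaming a single attribute is enough to make the extractor return nothing, even though the price is still on the page.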

Generally, it is a better idea to use an API provided by the owners of the website. These APIs are created by the website owners themselves, so they have direct access to the data, and they bypass all of the messy formatting that sits between you and the data you want when you scrape HTML.
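As a rough illustration of why structured data is easier to work with, the sketch below parses a JSON response with the standard `json` module. The response body and its field names (`items`, `vendor`, `price`, `amount`, `currency`) are hypothetical and are not the actual schema of any Amazon API; the helper `summarize_items` is likewise just an example name.

```python
import json

# Hypothetical response body; the field names are illustrative only
response_body = """
{
  "items": [
    {
      "title": "Example Widget",
      "vendor": "Example Seller",
      "price": {"amount": 19.99, "currency": "USD"}
    }
  ]
}
"""


def summarize_items(body):
    """Return (vendor, amount, currency) tuples from the JSON payload."""
    payload = json.loads(body)
    return [
        (item["vendor"], item["price"]["amount"], item["price"]["currency"])
        for item in payload["items"]
    ]


print(summarize_items(response_body))
```

The fields are addressed by name, so there is no dependence on tag names, attributes, or page layout.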


Specifically, Amazon's price API is located here.

IMPORTANT:

As mentioned here, please read Section 4b of the Licensing Agreement:

(b) You will use Product Advertising Content only (i) in a lawful manner; (ii) in accordance with the terms of this License Agreement and within the express scope of the license granted in Section 6; and (iii) to send end users to and drive sales on the Amazon Site. You will not use the Product Advertising API, Data Feed, or Product Advertising Content with any site or application, or in any other manner, that does not have the principal purpose of advertising and marketing the Amazon Site and driving sales of products and services on the Amazon Site.
