Web Scraping data with BS4 - Python

I have been trying to export a web-scraped document from the code below.

import pandas as pd
import requests
from bs4 import BeautifulSoup 

url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"

data  = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')

cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find('tr'): ##for row in soup.find("tbody").find_all('tr'):
    col = row.find("td")
    Name = col[0].text
    Exchange = col[1].text
    Sector = col[2].text
    cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True) 

but I am receiving the error 'TypeError: 'int' object is not subscriptable'. Can anyone help me figure this out?

You need to know the difference between .find() and .find_all().

find_all() returns a list of every matching element, while find() returns just the first match.

In your code, soup.find('tbody').find('tr') returns a single Tag, so the loop iterates over that tag's children rather than over the table rows; for the text nodes among those children, row.find("td") resolves to the string method str.find(), which returns an int, and indexing that int is what raises 'TypeError: 'int' object is not subscriptable'.

Since you need to iterate over all the <tr> elements, and in turn over the <td> elements inside each <tr>, you have to use find_all() in both places.
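To see the difference concretely, here is a minimal sketch (the tiny HTML string is made up purely for illustration):

from bs4 import BeautifulSoup

# Made-up two-row table, just to contrast find() and find_all()
html = "<table><tbody><tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tbody").find("tr")      # a single Tag (the first <tr>)
all_rows = soup.find("tbody").find_all("tr")   # a list of Tags (every <tr>)

print(type(first_row))                     # <class 'bs4.element.Tag'>
print(len(all_rows))                       # 2
print(all_rows[1].find_all("td")[0].text)  # 'C' -- indexing works on the list from find_all()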

You can try this out:

import pandas as pd
import requests
from bs4 import BeautifulSoup 

url="https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"

data = requests.get(url).text
soup = BeautifulSoup(data, 'lxml')

cse = pd.DataFrame(columns=["Name", "Exchange", "Sector"])
for row in soup.find('tbody').find_all('tr'):   # find_all() gives a list of every <tr>
    col = row.find_all("td")                    # find_all() again, so col is an indexable list of <td> Tags
    Company_Name = col[0].text
    Exchange_code = col[1].text
    Industry = col[2].text
    cse = cse.append({"Name":Company_Name,"Exchange":Exchange_code,"Sector":Industry}, ignore_index=True)
Output:

                                          Name  ...                       Sector
0           Abans Electricals PLC (ABAN.N0000)  ...                   Housewares
1               Abans Finance PLC (AFSL.N0000)  ...            Finance Companies
2           Access Engineering PLC (AEL.N0000)  ...                 Construction
3                   ACL Cables PLC (ACL.N0000)  ...       Industrial Electronics
4                ACL Plastics PLC (APLA.N0000)  ...          Industrial Products
..                                         ...  ...                          ...
145      Lanka Hospital Corp. PLC (LHCL.N0000)  ...         Healthcare Provision
146                 Lanka IOC PLC (LIOC.N0000)  ...             Specialty Retail
147     Lanka Milk Foods (CWE) PLC (LMF.N0000)  ...                Food Products
148  Lanka Realty Investments PLC (ASCO.N0000)  ...       Real Estate Developers
149               Lanka Tiles PLC (TILE.N0000)  ...  Building Materials/Products

[150 rows x 3 columns]
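One caveat worth noting: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above will fail on a recent pandas. A common workaround, sketched here under that assumption, is to collect the rows in a plain list and build the DataFrame once at the end:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.marketwatch.com/tools/markets/stocks/country/sri-lanka/1"

data = requests.get(url).text
soup = BeautifulSoup(data, "lxml")

records = []  # collect plain dicts, then build the DataFrame in one go
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    records.append({"Name": col[0].text, "Exchange": col[1].text, "Sector": col[2].text})

cse = pd.DataFrame(records, columns=["Name", "Exchange", "Sector"])
print(cse)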
