
Get a specific value with BeautifulSoup (parsing)

I'm trying to extract information from a website using Python (BeautifulSoup).

I want to extract the following data (just the figures):

EPS (Basic)

from: https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter

The relevant markup was shown in a screenshot of the page source (image not preserved).

I wrote this code:

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url_is = 'https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter'

# Download the page and parse it with the lxml parser
read_data = ur.urlopen(url_is).read()
soup_is = BeautifulSoup(read_data, 'lxml')

# Print the text of every table row that has the class "mainRow"
cells = soup_is.find_all('tr', {'class': 'mainRow'})
for cell in cells:
    print(cell.text)

But I'm not able to extract the figures for EPS (Basic).

Is there a way to extract just the data, sorted by column?

Try the following CSS selector, which checks whether a td tag contains the text EPS (Basic).

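from bs4 import BeautifulSoup
import urllib.request as ur

url_is = 'https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter'
read_data = ur.urlopen(url_is).read()
soup_is = BeautifulSoup(read_data, 'lxml')

# Find the title cell whose text contains "EPS (Basic)".
# Note: newer soupsieve releases deprecate :contains() in favour of :-soup-contains().
row = soup_is.select_one('tr.mainRow>td.rowTitle:contains("EPS (Basic)")')

# row is the title <td>; its parent is the <tr>, so collect every non-empty cell in that row
print([cell.text for cell in row.parent.select('td') if cell.text != ''])

Output: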

[' EPS (Basic)', '2.47', '2.20', '3.05', '5.04', '2.58']
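
The same lookup can also be sketched without the `:contains` pseudo-class, by looping over the rows and comparing the title text in Python. The sketch below runs offline against an inline snippet: `SAMPLE_HTML` and the helper name `extract_row` are hypothetical, the markup is only an assumption based on the class names used in the selector above, and the second row's values are made up. The live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table markup (assumed structure, not the real page).
SAMPLE_HTML = """
<table>
  <tr class="mainRow">
    <td class="rowTitle"> EPS (Basic)</td>
    <td>2.47</td><td>2.20</td><td>3.05</td><td>5.04</td><td>2.58</td>
    <td></td>
  </tr>
  <tr class="mainRow">
    <td class="rowTitle"> EPS (Basic) Growth</td>
    <td>n/a</td><td>n/a</td><td>n/a</td><td>n/a</td><td>n/a</td>
    <td></td>
  </tr>
</table>
"""


def extract_row(html, label):
    """Return the non-empty cell texts of the mainRow whose title equals label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.select("tr.mainRow"):
        title = tr.select_one("td.rowTitle")
        # Exact comparison after stripping whitespace, instead of a substring match
        if title is not None and title.get_text(strip=True) == label:
            cells = [td.get_text(strip=True) for td in tr.select("td")]
            return [text for text in cells if text]
    return None


print(extract_row(SAMPLE_HTML, "EPS (Basic)"))
```

Because the comparison is exact, a longer label such as "EPS (Basic) Growth" is not picked up by accident, and a missing row gives `None` rather than an `AttributeError` on `row.parent`.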

To print it as a DataFrame:

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url_is = 'https://www.marketwatch.com/investing/stock/aapl/financials/income/quarter'
read_data = ur.urlopen(url_is).read()
soup_is = BeautifulSoup(read_data, 'lxml')

# Same lookup as before: title cell first, then all non-empty cells of its row
row = soup_is.select_one('tr.mainRow>td.rowTitle:contains("EPS (Basic)")')
data = [cell.text for cell in row.parent.select('td') if cell.text != '']

# A flat list becomes a single column, so transpose it to get one row
df = pd.DataFrame(data)
print(df.T)

Output:

              0     1     2     3     4     5
0   EPS (Basic)  2.47  2.20  3.05  5.04  2.58
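
If the goal is "just the figures", one possible follow-up is to move the label into the index and convert the remaining cells to floats. This is a minimal sketch that starts from the list printed above; the helper name `row_to_frame` is hypothetical, and the columns keep their default integer names because the period headers are not part of the extracted row.

```python
import pandas as pd

# The list produced by the scraping step above
data = [' EPS (Basic)', '2.47', '2.20', '3.05', '5.04', '2.58']


def row_to_frame(cells):
    """Build a one-row DataFrame: first cell becomes the index label, the rest become floats."""
    label, *values = cells
    numbers = [float(value) for value in values]
    return pd.DataFrame([numbers], index=[label.strip()])


df = row_to_frame(data)
print(df)
```

With numeric columns the values can be sorted or aggregated directly, for example `df.loc['EPS (Basic)'].sort_values()`.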

