How can I edit web scraped text data using Python?

I'm trying to build my first web scraper to print out how the stock market is doing on Yahoo Finance. I have found out how to isolate the information I want, but it comes back very messy. How can I manipulate this data to present it in a more readable way?

import requests 
from bs4 import BeautifulSoup


#Import your website here
html_text = requests.get('https://finance.yahoo.com/').text

soup = BeautifulSoup(html_text, 'lxml')

#Find the part of the webpage where your information is in
sp_market = soup.find('h3', class_ = 'Maw(160px)').text
print(sp_market)

The output here is: S&P 5004,587.18+65.64(+1.45%)

I want to grab these elements, such as the label and the percentage, and isolate them so I can print them the way I want. Does anyone know how? Thanks so much!

Edit: ((S&P 500
4,587.18+65.64(+1.45%)))

For simple splitting you could use the built-in .split(separator) method (e.g. first split by 'x', then by 'y', then by 'z', with x, y, z being separators). Since this is not efficient, and if you have more complex patterns that look the same for different elements (here: stocks), take a look at Python's re (regular expression) module.

string = "Stock +45%"
# Label, a space, a sign, one or more digits, then '%'
pattern = r'[A-Za-z]+ [+-][0-9]+%'

Then consider using a function such as re.findall or re.search.
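
As a minimal sketch of that idea, the snippet below applies re.search with capture groups to the sample string above. The pattern and the group names (label, change) are illustrative assumptions, not something the original answer specifies.

```python
import re

string = "Stock +45%"

# Assumed pattern: a word label, whitespace, then a signed integer percentage.
pattern = r"(?P<label>[A-Za-z]+)\s+(?P<change>[+-]\d+)%"

match = re.search(pattern, string)
if match:
    label = match.group("label")
    change = int(match.group("change"))  # int() accepts a leading '+' or '-'
    print(label, change)
    # Stock 45
```

Named groups keep the extraction readable when the same pattern is reused for several stocks.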

I assume that the format is always S&P 500\n[number][+/-][number]([+/-][number]%).

If that is the case, we could do the following.

import re

# [your existing code]

# e.g.
# sp_market = 'S&P 500\n4,587.18+65.64(+1.45%)'

# Split the label from the line that holds the numbers
label, line2 = sp_market.split('\n')
# Collect the signs in order: [sign of change, sign of percent]
pm = re.findall(r"[+-]", line2)
# Split on signs, parentheses and '%'; the trailing empty string is discarded
total, change, percent, _ = re.split(r"[+\-()%]+", line2)
total = float(total.replace(',', ''))
change = float(change)
if pm[0] == '-':
    change = -change
percent = float(percent)
if pm[1] == '-':
    percent = -percent

print(label, total, change, percent)
# S&P 500 4587.18 65.64 1.45
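
Under the same format assumption, the whole string could also be matched with one regular expression that keeps the signs attached to the numbers. This is only a sketch: the names QUOTE_RE, to_float and parse_quote are hypothetical, and the second test string is made-up sample data.

```python
import re

# Assumed input, matching the format described above.
sp_market = "S&P 500\n4,587.18+65.64(+1.45%)"

# One pattern for the whole string; the group names are illustrative.
QUOTE_RE = re.compile(
    r"(?P<label>.+)\n"
    r"(?P<total>[\d,]+(?:\.\d+)?)"
    r"(?P<change>[+-][\d,]+(?:\.\d+)?)"
    r"\((?P<percent>[+-]\d+(?:\.\d+)?)%\)"
)


def to_float(text):
    """Drop thousands separators; float() handles a leading sign itself."""
    return float(text.replace(",", ""))


def parse_quote(text):
    """Return (label, total, change, percent) or raise ValueError."""
    m = QUOTE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unexpected quote format: {text!r}")
    return (m["label"], to_float(m["total"]),
            to_float(m["change"]), to_float(m["percent"]))


print(parse_quote(sp_market))
# ('S&P 500', 4587.18, 65.64, 1.45)
```

Raising on a failed match makes a changed page layout visible immediately instead of producing wrong numbers.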

I'm not sure, because the question does not provide an expected result, but you can "isolate" the information with stripped_strings.

This will give you a list of "isolated" values you can process:

list(soup.find('h3', class_ = 'Maw(160px)').stripped_strings)

#Output
['S&P 500', '4,587.18', '+65.64', '(+1.45%)']
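
To try this without hitting the live site, here is a small offline sketch. The HTML is hypothetical markup loosely modelled on the element in the question; the real Yahoo Finance page may be structured differently and changes over time. It uses the built-in html.parser instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page source).
html = """
<h3 class="Maw(160px)">
  <a>S&amp;P 500</a><br>
  <span>4,587.18</span>
  <span><span>+65.64</span><span>(+1.45%)</span></span>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3", class_="Maw(160px)")

# .stripped_strings yields each text node separately, with surrounding
# whitespace removed and whitespace-only nodes skipped.
values = list(h3.stripped_strings)
print(values)
# ['S&P 500', '4,587.18', '+65.64', '(+1.45%)']
```

By contrast, .text concatenates all text nodes into a single string, which is why the original output looked run together.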

For example, stripping the characters "()%":

[x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]

#Output
['S&P 500', '4,587.18', '+65.64', '+1.45']

The simplest way to print the data less sloppily is to join() the values with whitespace:

' '.join([x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])

#Output
S&P 500 4,587.18 +65.64 +1.45

You can also create a dict() and print the key/value pairs:

for k, v in dict(zip(['Symbol','Last Price','Change','% Change'], [x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])).items():
    print(f'{k}: {v}')

#Output
Symbol: S&P 500
Last Price: 4,587.18
Change: +65.64
% Change: +1.45
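
If the values are needed as numbers rather than strings, one possible follow-up is to convert them before building the dict. The sketch below assumes the list has exactly the four entries shown in the stripped_strings output above; to_number is a hypothetical helper name.

```python
# Assumed input: the list produced by stripped_strings above.
values = ['S&P 500', '4,587.18', '+65.64', '(+1.45%)']


def to_number(text):
    """Convert strings such as '4,587.18', '+65.64' or '(+1.45%)' to float."""
    return float(text.strip('()%').replace(',', ''))


keys = ['Symbol', 'Last Price', 'Change', '% Change']
symbol, *numbers = values
quote = dict(zip(keys, [symbol] + [to_number(n) for n in numbers]))

# Format specs put the thousands separator and explicit signs back for display.
summary = (f"{quote['Symbol']}: {quote['Last Price']:,.2f} "
           f"({quote['Change']:+.2f}, {quote['% Change']:+.2f}%)")
print(summary)
# S&P 500: 4,587.18 (+65.64, +1.45%)
```

Keeping floats in the dict means the values can be compared or summed later, while the formatting stays a display concern.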
