How to parse code after it has been stripped of styles and elements in Python
This is a very basic question about HTML parsing:
I am new to Python (and to coding and computer science in general) and am teaching myself to parse HTML. I have imported both the pattern and Beautiful Soup modules to parse with. I found this code on the internet to cut out all formatting.
import requests
import json
import urllib
from lxml import etree
from pattern import web
from bs4 import BeautifulSoup
url = "http://webrates.truefx.com/rates/connect.html?f=html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
This produces the following output:
EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988
From this point, how can I parse further if I only want the string 'USD/CHF' or a particular data point?
Is there an easier way to scrape and parse web pages? Any suggestions would be great!
System specs: Windows 7 64-bit, IDE: IDLE, Python: 2.7.5
Thank you all in advance, Rusty
Keep it simple. Find the cell by its text (USD/CHF, for example) and get the following siblings:
text = 'USD/CHF'
cell = soup.find('td', text=text)
for td in cell.next_siblings:
    print td.text
Prints:
1426561775912
1.00
768
1.00
782
1.00655
1.00879
1.00682
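If the goal is to look up any pair rather than just one, the cells can be grouped by row and keyed on the first cell. The following is only a sketch, and it swaps Beautiful Soup for the standard library's html.parser (Python 3 import path) so that it runs without extra packages. The markup in SAMPLE is an assumption (one <tr> per pair, one <td> per value); the first row reuses the values printed above, the second row is a made-up placeholder, and RowCollector and rows_by_pair are hypothetical names.

```python
from html.parser import HTMLParser  # Python 3 standard library


# Assumed markup: one <tr> per currency pair, one <td> per value.
SAMPLE = (
    "<table>"
    "<tr><td>USD/CHF</td><td>1426561775912</td><td>1.00</td><td>768</td>"
    "<td>1.00</td><td>782</td><td>1.00655</td><td>1.00879</td>"
    "<td>1.00682</td></tr>"
    # Placeholder row, not real data; it only shows that rows stay separate.
    "<tr><td>AAA/BBB</td><td>0</td><td>0.00</td></tr>"
    "</table>"
)


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self._in_cell:
            self._row[-1] += data.strip()


def rows_by_pair(html):
    """Map the first cell of each row to the list of remaining cells."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    return {row[0]: row[1:] for row in collector.rows if row}


rates = rows_by_pair(SAMPLE)
print(rates["USD/CHF"])
```

With the rows in a dictionary, a single value is just an index away, for example rates["USD/CHF"][0] for the first cell after the pair name. The same grouping could be done in Beautiful Soup by iterating over the rows and reading each row's cells.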
In my experience, Beautiful Soup is about as easy as it gets. I would write a regex to pull out the number that follows a given string of characters. I hope this gets you on the right track.
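One way the regex idea might look, as a rough sketch: it assumes every record in the flattened text is a pair name of the form AAA/BBB followed directly by a run of digits and dots. The sample string is the first three records from the question's output; RECORD and raw_data_for are hypothetical names chosen for illustration.

```python
import re

# First three records of the flattened text shown in the question.
text = (
    "EUR/USD14265522866931.056661.056751.056081.057911.05686"
    "USD/JPY1426552286419121.405121.409121.313121.448121.382"
    "GBP/USD14265522866821.482291.482361.481941.483471.48281"
)

# A pair name is three capitals, a slash, three capitals; the data after it
# is a run of digits and dots that ends where the next pair name begins.
RECORD = re.compile(r"([A-Z]{3}/[A-Z]{3})([\d.]+)")


def raw_data_for(pair, text):
    """Return the undivided run of digits and dots after `pair`, or None."""
    for name, digits in RECORD.findall(text):
        if name == pair:
            return digits
    return None


print(raw_data_for("USD/JPY", text))
```

Note that this only recovers the whole run for a pair. Because get_text() has already discarded the cell boundaries, the run cannot be split back into individual values without guessing at their widths, which is a good reason to read the values from the table cells instead.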
You can try something quick and dirty like this. Obviously, code like this has to change depending on the string itself. More advanced approaches would use Python's regex library, but sometimes it's nice to keep it simple.
digits = []
rest = text[text.find("USD/CHF") + 7:]  # +7 to start just after the 7-character label "USD/CHF"
for ch in rest:
    # keep collecting while the character is still part of a number
    if ch.isdigit() or ch == ".":
        digits.append(ch)
    else:
        break
print "".join(digits)
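The same idea can be wrapped in a small helper so the label and its length are not hard-coded. This is a sketch only: digits_after is a hypothetical name, and the sample string is the USD/CHF record from the question's output followed by the start of the next record. Unlike the loop above, it returns None when the label is missing, whereas text.find would return -1 and the slice would silently start at the wrong place.

```python
from itertools import takewhile


def digits_after(text, marker):
    """Return the run of digits and dots that directly follows `marker`."""
    start = text.find(marker)
    if start == -1:
        return None  # marker not present in the text
    rest = text[start + len(marker):]
    # Stop at the first character that is neither a digit nor a dot.
    return "".join(takewhile(lambda ch: ch.isdigit() or ch == ".", rest))


sample = (
    "USD/CHF14265522866361.008041.008291.006551.008791.00682"
    "EUR/JPY1426552286635128.284"
)
print(digits_after(sample, "USD/CHF"))
```

The print call is written with parentheses so the snippet runs under Python 3 and, for a single argument, under Python 2.7 as well.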