How to parse further in Python after removing style and script elements

Question

This is a very basic question regarding HTML parsing:

I am new to Python (and to coding, computer science, etc.) and am teaching myself to parse HTML. I have imported both the pattern and Beautiful Soup modules to parse with. I found this code on the internet to cut out all formatting.

import requests
import json
import urllib
from lxml import etree
from pattern import web
from bs4 import BeautifulSoup


url = "http://webrates.truefx.com/rates/connect.html?f=html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)


print(text)

This produces the following output:

EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988

From this point, how can I parse further if I only want the string 'USD/CHF' or a particular data point?

Is there an easier method to web-scrape and parse with? Any suggestions would be great!

System specs: Windows 7 64-bit; IDE: IDLE; Python: 2.7.5

Thank you all in advance, Rusty

Answer 1

Keep it simple. Find the cell by text (USD/CHF, for example) and get the following siblings:

text = 'USD/CHF'
cell = soup.find('td', text=text)  # the <td> whose text is exactly 'USD/CHF'
for td in cell.next_siblings:      # the remaining cells in the same row
    print(td.text)

Prints:

1426561775912
1.00
768
1.00
782
1.00655
1.00879
1.00682
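
A slightly more defensive variant of the same idea, as a minimal sketch. The sample HTML is an assumption: the page's real markup is not shown in the question, so the two rows are reconstructed from the values quoted in this thread, and `row_values` is a hypothetical helper name. It uses `find_next_siblings("td")`, which returns only tags and so skips any whitespace text nodes between cells, and `string=`, the newer spelling of the `text=` argument in recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Assumed markup: one <tr> per currency pair, one <td> per field.
SAMPLE_HTML = """
<table>
  <tr><td>EUR/CHF</td><td>1426552286655</td><td>1.06</td><td>512</td>
      <td>1.06</td><td>544</td><td>1.06349</td><td>1.06628</td><td>1.06418</td></tr>
  <tr><td>USD/CHF</td><td>1426561775912</td><td>1.00</td><td>768</td>
      <td>1.00</td><td>782</td><td>1.00655</td><td>1.00879</td><td>1.00682</td></tr>
</table>
"""


def row_values(soup, symbol):
    """Return the text of every cell that follows the symbol's cell, or None."""
    cell = soup.find("td", string=symbol)
    if cell is None:  # symbol not present on the page
        return None
    # find_next_siblings("td") yields only <td> tags, never bare whitespace
    return [td.get_text(strip=True) for td in cell.find_next_siblings("td")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
values = row_values(soup, "USD/CHF")
print(values)
```

Returning a list rather than printing inside the loop makes it easy to pick one data point by position, for example `values[0]` for the timestamp.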

Answer 2

In my experience, Beautiful Soup is pretty much as easy as it gets. I would write a regex to strip out the number that follows a string of characters. I hope this gets you on the right track.
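
One way the regex idea might look, as a rough sketch run against an excerpt of the flattened output from the question. It rests on assumptions that are not stated anywhere in the thread: that the timestamp is always 13 digits, and that prices carry 3 decimal places for JPY pairs and 5 otherwise (which is what the sample values suggest). `parse_flat` is a hypothetical name.

```python
import re

# Excerpt of the flattened text printed in the question (two of the ten pairs).
flat_text = (
    "USD/JPY1426552286419121.405121.409121.313121.448121.382"
    "USD/CHF14265522866361.008041.008291.006551.008791.00682"
)

# symbol, 13-digit timestamp (assumed width), then the run of glued prices
RECORD = re.compile(r"([A-Z]{3}/[A-Z]{3})(\d{13})([\d.]+)")


def parse_flat(text):
    """Map each currency pair to its timestamp and list of price strings."""
    records = {}
    for symbol, timestamp, blob in RECORD.findall(text):
        # Assumption: JPY pairs have 3 decimals, all other pairs have 5.
        decimals = 3 if symbol.endswith("JPY") else 5
        prices = re.findall(r"\d+\.\d{%d}" % decimals, blob)
        records[symbol] = {"timestamp": timestamp, "prices": prices}
    return records


records = parse_flat(flat_text)
print(records["USD/CHF"]["prices"])
```

Because the numbers are glued together once the tags are stripped, the split depends entirely on those fixed widths. Working on the table cells before calling `get_text()` avoids that guesswork.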

Answer 3

You can try something quick and dirty like this. Obviously, code like this will change based on the string itself. More advanced ways would use Python's regex library. But sometimes it's nice to keep it simple.

symbol = "USD/CHF"
start = text.find(symbol)
digits = []
if start != -1:  # find() returns -1 when the symbol is absent
    # walk the characters right after the symbol, keeping digits and dots
    for ch in text[start + len(symbol):]:
        if ch.isdigit() or ch == ".":
            digits.append(ch)
        else:
            break
print("".join(digits))

3 answers

Answer 1
2, accepted, 2015-03-17 03:10:18

Answer 2
1, 2015-03-17 01:52:53

Answer 3
1, 2015-03-17 02:05:23
