How to parse code after it has been stripped of styles and elements in Python

This is a very basic question regarding HTML parsing:

I am new to Python (and to coding and computer science in general) and am teaching myself to parse HTML. I have imported both the pattern and Beautiful Soup modules to parse with. I found this code on the internet to strip out all formatting.

import requests
import json
import urllib
from lxml import etree
from pattern import web
from bs4 import BeautifulSoup


url = "http://webrates.truefx.com/rates/connect.html?f=html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)


print(text)

This produces the following output:

EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988

From this point, how can I parse further if, say, I only want the string 'USD/CHF' or a particular data point?

Is there an easier way to scrape and parse web pages? Any suggestions would be great!

System specs: Windows 7 64-bit; IDE: IDLE; Python: 2.7.5

Thank you all in advance, Rusty

Keep it simple. Find the cell by its text (USD/CHF, for example) and get its following siblings:

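text = 'USD/CHF'
# locate the <td> whose text is exactly the currency pair
cell = soup.find('td', text=text)
# the remaining cells of that row follow it as siblings
for td in cell.next_siblings:
    print(td.text)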

Prints:

1426561775912
1.00
768
1.00
782
1.00655
1.00879
1.00682
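
If you want numbers rather than strings, the printed cells can be recombined. This is only a sketch under a few assumptions that the thread does not confirm: that the first cell is a timestamp in milliseconds since the Unix epoch, and that cells 2-3 and 4-5 are each one price split into two parts (the names `bid` and `offer` are my own labels). The `cells` list is just the output above typed in by hand.

```python
from datetime import datetime, timedelta

# Cell texts exactly as printed above for the USD/CHF row.
cells = ["1426561775912", "1.00", "768", "1.00", "782",
         "1.00655", "1.00879", "1.00682"]

# Assumption: the first cell is milliseconds since 1970-01-01 UTC.
timestamp_ms = int(cells[0])
when = datetime(1970, 1, 1) + timedelta(milliseconds=timestamp_ms)

# Assumption: each price is split across two cells ("1.00" + "768").
# The names bid/offer are hypothetical labels, not taken from the page.
bid = float(cells[1] + cells[2])
offer = float(cells[3] + cells[4])

# The last three cells are already complete numbers.
others = [float(c) for c in cells[5:]]

print(when)
print(bid)
print(offer)
print(others)
```

Joining the strings before calling `float` avoids any arithmetic on the fractional part, so the value is exactly what the page displayed.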

In my experience, Beautiful Soup is about as easy as it gets. I would write a regex to pull out the number that follows a given string of characters. I hope this gets you on the right track.

You can try something quick and dirty like this. Obviously, code like this has to change depending on the string itself. More advanced approaches would use Python's regex library, but sometimes it's nice to keep it simple.

digits = []
# slice off everything up to and including the label; +7 is len("USD/CHF")
remainder = text[text.find("USD/CHF") + 7:]
for ch in remainder:
    # collect digits and decimal points until the next pair name starts
    if ch.isdigit() or ch == ".":
        digits.append(ch)
    else:
        break
print("".join(digits))
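
For the regex route mentioned above, here is a rough sketch that works on the flattened text from the question. It rests on assumptions I am inferring from the sample output, not from any documentation: every row is a pair name like `USD/CHF`, then a 13-digit timestamp, then several prices run together, and all prices in one row have the same number of decimal places. The function name `parse_rates` and the shortened sample string are made up for illustration.

```python
import re

# Flattened output from the question, shortened to three pairs.
text = ("EUR/USD14265522866931.056661.056751.056081.057911.05686"
        "USD/JPY1426552286419121.405121.409121.313121.448121.382"
        "USD/CHF14265522866361.008041.008291.006551.008791.00682")

# Assumption: pair name, 13-digit timestamp, then a run of digits and dots.
ROW = re.compile(r"([A-Z]{3}/[A-Z]{3})(\d{13})([\d.]+)")


def parse_rates(flat_text):
    """Map each pair name to (timestamp, list of prices)."""
    rates = {}
    for symbol, timestamp, rest in ROW.findall(flat_text):
        # Assumption: all prices in a row share the decimal count of the
        # last one, which is the number of characters after the final dot.
        decimals = len(rest) - rest.rfind(".") - 1
        values = re.findall(r"\d+\.\d{%d}" % decimals, rest)
        rates[symbol] = (int(timestamp), [float(v) for v in values])
    return rates


rates = parse_rates(text)
timestamp, prices = rates["USD/CHF"]
print(timestamp)
print(prices)
```

Once the rows are in a dictionary, getting a particular data point is an ordinary lookup such as `rates["USD/JPY"][1][0]`. If the page layout changes, parsing the table cells with Beautiful Soup is likely to be more robust than splitting the flattened text.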
