Extracting text from a noisy string in Python

I have some HTML documents and I want to extract one very particular piece of text from each of them. This text is always located as

<div class = "fix">text </div>

Sometimes, though, there are other opening divs nested inside it as well, something like:

 <div class = "fix"> part of text <div something> other text </div> some more text </div>

Now I want to extract all the text that falls between the

 <div class = "fix">                     </div>

markup. How do I do this?

I would use the BeautifulSoup library. It is built for exactly this: as long as your data is correct HTML, it should find what you're looking for. The documentation is reasonably good, and the library is straightforward to use, even for beginners. If your file lives on the web rather than on disk, grab the HTML with urllib first.

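from bs4 import BeautifulSoup

# html_doc is a string holding the HTML source
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div", attrs={"class": "fix"})
text = div.get_text()  # also includes the text of nested tags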

If more than one element matches, use find_all instead. This should give you (roughly) what you're looking for.

Edit: Fixed the example (class is a reserved word in Python, so you can't pass it as the usual keyword argument, attr="blah").
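For illustration, here is a slightly fuller sketch of the same idea. It is only one way to put the pieces together, not the answerer's exact code. The helper names `fetch_html` and `extract_fix_text` are made up for this example, and the second `<div class="fix">` in the sample markup is added just to show `find_all` returning more than one match. `fetch_html` is defined for the "file is on the web" case and is not called here.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and decode it (hypothetical helper, not from the answer)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_fix_text(html_doc):
    """Return the text of every <div class="fix">, nested tags included."""
    soup = BeautifulSoup(html_doc, "html.parser")
    return [
        # Join the text fragments with a single space and trim each one.
        div.get_text(" ", strip=True)
        for div in soup.find_all("div", attrs={"class": "fix"})
    ]


# Sample markup based on the question; the second div is added for illustration.
html_doc = (
    '<div class = "fix"> part of text <div something> other text </div>'
    " some more text </div>"
    '<div class="fix">second block</div>'
)
texts = extract_fix_text(html_doc)
print(texts)
```

Beautiful Soup 4 also accepts `class_="fix"` as a keyword argument, which avoids the reserved-word problem without spelling out an `attrs` dictionary.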

Here's a really simple solution that uses a non-greedy regex to remove all HTML tags:

import re
s =  "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)

The values are then:

print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
 part of text  other text  some more text
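To see why the pattern has to be non-greedy, the short sketch below compares it with the greedy form on the same string. The variable names and the final whitespace clean-up step are additions for illustration, not part of the answer above.

```python
import re

s = '<div class = "fix"> part of text <div something> other text </div> some more text </div>'

# Greedy: '.*' runs from the first '<' to the LAST '>', so the whole string is removed.
greedy = re.sub(r'<.*>', '', s)

# Non-greedy: '.*?' stops at the first '>' it meets, so each tag is removed on its own.
non_greedy = re.sub(r'<.*?>', '', s)

# Optional clean-up: collapse the runs of spaces left behind by the removed tags.
cleaned = " ".join(non_greedy.split())

print(repr(greedy))
print(repr(non_greedy))
print(repr(cleaned))
```

Note that the regex strips tags from whatever string it is given, so this approach assumes `s` already holds only the `<div class = "fix">` fragment rather than a whole page.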
