
Extract Unicode text from an XML document with BeautifulSoup

I have this code:

for fileid in wordlist.fileids()[4:5]:
    url = open(fileid, 'r').read()
    soup = BeautifulSoup(url)
    find_all = soup.find_all("speech", soup)
    soup_sub = re.sub("<.+?>", "", str(find_all))
    print fileid
    print soup_sub

It reads local XML files and gets a certain element from each one. Then it strips the XML markup and prints a list. A snippet of that list is shown below. You can see that there are a lot of Unicode escapes in it. How can I get this Unicode out of that list?

<p>\nIk heet de minister van Sociale Zaken en Werkgelegenheid van harte welkom. Er hebben zich vijf sprekers voor dit VAO aangemeld.\u200a\n, \nVoorzitter. Ik wil drie moties indienen. Dit wordt topsport voor mij.\u200a\n\nMotie\nDe Kamer,\u200a\ngehoord de beraadslaging,\u200a\noverwegende dat bedrijfsongevallen wel bij de inspectie gemeld moeten worden en beroepsziekten niet;\u200a\noverwegende dat door registratie van beroepsziekten optimaal preventief beleid gevoerd kan worden;\u200a\</p>

First of all, if you are parsing XML with BeautifulSoup, pick the right parser for the job (and have lxml installed). You can pass an open file object to BeautifulSoup; there is no need to read it all into memory before parsing:

from bs4 import BeautifulSoup

with open(fileid, 'r') as xml_file:
    soup = BeautifulSoup(xml_file, 'xml')  # 'xml' selects the lxml-based XML parser

Next, don't use str(find_all); that turns all your element objects into a single (byte) string, and you won't be able to access the original Unicode text contents anymore.
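To see where the \u200a escapes in the printed output come from, here is a minimal sketch. It assumes Python 3 and uses a plain string as a stand-in for a parsed element (the variable names are illustrative, not from the question). Converting a list to a string uses repr() for each item, so non-printable characters show up as escape sequences even though the underlying text holds the real characters.

```python
# Python 3 sketch; a plain string stands in for the text of one parsed element.
texts = ["Voorzitter. Ik wil drie moties indienen.\u200a\n"]

# str() of a list calls repr() on each item, which escapes characters
# such as U+200A (hair space) and the newline.
as_list_string = str(texts)
print(as_list_string)

# Printing the item itself emits the real characters, with no escape sequences.
print(texts[0])
```

The escapes are therefore an artifact of stringifying the whole list, not something stored in the text itself.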

Use the Element.get_text() method to extract the text from each element:

speech_elements = soup.find_all("speech")
speech_text = [elem.get_text() for elem in speech_elements]

This ensures that you still get the full Unicode contents, not some str() conversion; you now have a list with one unicode object per <speech> element found.
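If the goal is also to get rid of the hair spaces (U+200A) and stray newlines in the extracted text, one possible follow-up is sketched below. It assumes Python 3; the helper clean_speech_text is a hypothetical name, and the sample list is a stand-in for the result of get_text(), shortened from the output shown in the question.

```python
import re


def clean_speech_text(text):
    """Replace U+200A (hair space) with a plain space and collapse whitespace runs."""
    # The explicit replace documents the intent; the regex below then
    # collapses any run of whitespace (including newlines) to one space.
    text = text.replace("\u200a", " ")
    return re.sub(r"\s+", " ", text).strip()


# Stand-in for the list built with get_text(); sample taken from the question's output.
speech_text = [
    "\nIk heet de minister van Sociale Zaken en Werkgelegenheid van harte welkom.\u200a\n",
    "\nVoorzitter. Ik wil drie moties indienen.\u200a\n",
]

cleaned = [clean_speech_text(text) for text in speech_text]
for line in cleaned:
    print(line)
```

Cleaning each string separately keeps one entry per <speech> element, so the list structure survives until you decide how to join or print it.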
