
How to strip SGML tags from a text file using Python?

Original post: 2016-11-10 16:14:45 | Tags: python, regex, unicode, beautifulsoup, sgml

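I recently came across Standard Generalized Markup Language (SGML). I have obtained a corpus in SGML format from the EMILLE/CIIL corpus. Here is the documentation for that corpus:

EMILLE corpus documentation

I only want to extract the text contained in the files. According to the documentation, the encoding and markup of the corpus are as follows:

The text is encoded as two-byte Unicode text. Further information on Unicode is linked from the documentation. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.

I am having a hard time stripping these tags. I tried regular expressions and Beautiful Soup, but neither worked. Here is a sample text file. The language I want to keep is Punjabi.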
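One possible stumbling block is the encoding described in the documentation excerpt: if "two-byte Unicode" means UTF-16, reading the file as UTF-8 or ASCII would yield garbage before any tag stripping starts. The sketch below shows one way to decode such a file. The function name `read_two_byte_text` is hypothetical, and the little-endian fallback for files without a byte-order mark is an assumption that may not hold for every corpus file.

```python
import codecs


def read_two_byte_text(path):
    """Read a file assumed to be UTF-16 and return its contents as str."""
    with open(path, 'rb') as handle:
        raw = handle.read()
    # If a byte-order mark is present, the generic 'utf-16' codec
    # consumes it and picks the matching byte order.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode('utf-16')
    # No BOM: little-endian is only a guess (assumption, not from the corpus docs).
    return raw.decode('utf-16-le')
```

If the decoded string still looks wrong, trying `'utf-16-be'` in the fallback branch would be the next thing to check.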

2 Answers

Try the following:

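from bs4 import BeautifulSoup
import requests

# Assuming this is the URL where the file is
html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm').content

# Name the parser explicitly so Beautiful Soup does not have to guess one
bsObj = BeautifulSoup(html, 'html.parser')

# Collect every <p> element and print its text content
textData = bsObj.find_all('p')

for item in textData:
    print(item.get_text())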
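For a corpus file that is already on disk and decoded to a string, a parser-based approach can also be sketched with the standard library alone. Note that this swaps in `html.parser.HTMLParser` instead of Beautiful Soup; it is a lenient HTML tokenizer, not a real SGML parser. The class and function names are hypothetical, and the assumption that the file header sits inside a `cesHeader` element should be checked against the actual files.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data, skipping anything inside one header element."""

    def __init__(self, skip_tag='cesheader'):
        super().__init__(convert_charrefs=True)
        self.skip_tag = skip_tag  # HTMLParser reports tag names in lower case
        self.depth = 0            # greater than 0 while inside the skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def extract_text(markup):
    """Return the text outside the header, with whitespace normalized."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())
```

Calling `extract_text(decoded_string)` returns the body text as one string; the header is skipped so that provenance details do not end up mixed into the Punjabi text.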

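Alternatively, you can use a simple regular expression. If the data is a string containing tags that start with < and end with >, everything between those two characters is discarded. You can then collapse runs of whitespace into a single space and strip the result.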

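import re
data = re.sub(r'<.*?>', '', data)  # drop each tag (non-greedy, so one tag per match)
data = re.sub(r'\s+', ' ', data)   # collapse whitespace runs into a single space
data = data.strip()                # trim leading and trailing whitespace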
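The regular-expression steps above can be wrapped into a small function, sketched here with two adjustments. First, `<.*?>` does not match a tag that spans several lines unless `re.DOTALL` is set, so this version uses `<[^>]*>` instead. Second, tags are replaced with a space rather than an empty string so that words in adjacent elements are not glued together. The name `strip_sgml` is hypothetical, and the `cesHeader` element name is an assumption about how the corpus files store their header.

```python
import re

# Assumed header element; DOTALL lets '.' cross line breaks inside the header.
HEADER = re.compile(r'<cesHeader\b.*?</cesHeader>', re.DOTALL | re.IGNORECASE)
# [^>] matches newlines too, so tags split across lines are still removed.
TAG = re.compile(r'<[^>]*>')
SPACE = re.compile(r'\s+')


def strip_sgml(data, drop_header=True):
    """Remove tags (and optionally the header block) from an SGML string."""
    if drop_header:
        data = HEADER.sub(' ', data)
    data = TAG.sub(' ', data)
    return SPACE.sub(' ', data).strip()
```

Like any regex-based approach, this would misbehave on text that contains a literal `<` or `>` outside of markup, so it is best treated as a quick first pass.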


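Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the address of the original post. For any questions, contact yoyou2525@163.com.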
