How do I extract the content between tags in Python?

Question

I have a text file of roughly 500,000 lines that contains fairly random HTML syntax. The file is structured roughly like this:

content <title> title1 </title> more words 

title contents2 title more words <body> <title> title2 </title> 

<body><title>title3</title></body>

I want to extract everything between the <title> tags:

title1
title2 
title3

This is what I have tried so far:

content_list = []

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors = 'ignore') as openfile2:
    for line in openfile2:
        for item in line.split("<title>"):
            if "</title>" in item:
                content = (item [ item.find("<title>")+len("<title>") : ])
                content_list.append(content)

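But this approach does not retrieve all of the tags. I think this may be because some tags are joined to other text with no space in between, e.g. <body><title>.

I have considered replacing every '<' and '>' with a space and applying the same approach, but if I do that I get "contents2" as output.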

Answer 1

I believe you can do this with BeautifulSoup.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('file_to_read.txt', 'r'), 'html.parser')
print(soup.find_all('title'))
# [<title> title1 </title>, <title> title2 </title>, <title>title3</title>]

print(soup.find_all('title')[0].text)
# ' title1 '

Answer 2

An example that keeps the style of your code:

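content_list = []  # collected titles

with open('file.txt', 'r') as file:
    for line in file:
        # Each piece after the first starts right behind an opening <title> tag
        for item in line.split('<title>'):
            if '</title>' in item:
                # Keep only the text before the closing tag, minus surrounding whitespace
                content_list.append(item.split('</title>')[0].strip())
print(content_list)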

Either way, BeautifulSoup is the best option in my opinion.
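
The line-by-line split only finds a title whose opening and closing tags sit on the same line. A minimal sketch of a regular-expression variant that works on the whole text at once, assuming the tags are plain <title> ... </title> with no attributes. The name extract_titles and the sample string are illustrative, not from the original post:

```python
import re

# Non-greedy match of everything between <title> and </title>.
# DOTALL lets a title span line breaks; IGNORECASE also accepts <TITLE>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_titles(text):
    """Return the stripped text of every <title>...</title> pair in text."""
    return [match.strip() for match in TITLE_RE.findall(text)]


# Sample modelled on the structure shown in the question.
sample = (
    "content <title> title1 </title> more words \n"
    "title contents2 title more words <body> <title> title2 </title> \n"
    "<body><title>title3</title></body>\n"
)
print(extract_titles(sample))  # ['title1', 'title2', 'title3']
```

A regex like this is fine for a quick pass over loosely structured text, but it does not understand HTML, so an unclosed or nested tag will give odd results. That is where a real parser is the safer choice.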

Answer 3

Try running:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r'), 'html.parser')
content_list = []
contents = soup.find_all('title')
for content in contents:
    print(content.get_text().strip())
    content_list.append(content.get_text().strip())
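
If loading a file of about 500,000 lines into one soup object turns out to be too heavy, the standard library parser can be fed one line at a time instead. This is only a sketch of that alternative, not part of the answers above; the names TitleCollector and titles_from_lines are made up for illustration:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text inside every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []     # finished titles, whitespace stripped
        self._chunks = None  # text pieces of the open title, or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._chunks = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so gather them in a list.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._chunks is not None:
            self.titles.append("".join(self._chunks).strip())
            self._chunks = None


def titles_from_lines(lines):
    """Feed an iterable of text lines to the parser and return the titles."""
    parser = TitleCollector()
    for line in lines:
        parser.feed(line)
    parser.close()
    return parser.titles


# Sample modelled on the structure shown in the question.
sample_lines = [
    "content <title> title1 </title> more words \n",
    "title contents2 title more words <body> <title> title2 </title> \n",
    "<body><title>title3</title></body>\n",
]
print(titles_from_lines(sample_lines))  # ['title1', 'title2', 'title3']
```

An open file object is itself an iterable of lines, so something like titles_from_lines(open_file) inside a with open(...) block should work the same way without holding the whole file in memory.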

3 answers

Answer 1
1 2020-02-07 01:45:28

Answer 2
0 2020-02-07 01:51:39

Answer 3
0 2020-02-07 01:54:39
