How to extract text from a bytes file using Python

Question

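I am trying to write a script that fetches a website's code, saves all the HTML to a file, and then extracts some information from it.

For the first part, I have already saved all the HTML to a text file.

Now I have to extract the relevant information and save it in another text file.

But I'm having problems with encoding, and I also don't really know how to extract text in Python.

Parsing the website: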

import urllib.request

...file name used to store the data

file_name = r'D:\scripts\datos.txt'

I want to get the text after this tag and before the other tag

tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

I fetch the website code and save it to a text file

with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read() 
    out_file.write(data)

print(out_file)  # First question: how do I print the file? It gives me an error, I can't print bytes

The file now contains the HTML text, so I want to open it and process it

file_for_results = open(r'D:\scripts\datos.txt',encoding="utf8")

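Extracting information from the file

Second question: how do I process the substrings of the lines in the file and get the text between <p class="item-description"> and </p>, so that I can store it in file_for_results?

This is the pseudocode that I'm unable to write.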

for line in file_to_filter:
    if line contains word_starts_with
      copy in file_for_results until you find </p>

Thanks in advance for your help.

Answer 1

I assume this is some kind of assignment where you need to parse the HTML with a given algorithm; if not, just use Beautiful Soup.
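If a real HTML parser is allowed, the extraction becomes less fragile than matching strings line by line. Beautiful Soup is a third-party package, so as a stand-in the sketch below uses `html.parser` from the standard library, which is a different tool from the one named above. The class name `ItemDescriptionParser` and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class ItemDescriptionParser(HTMLParser):
    """Collect the text of every <p class="item-description"> element."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished texts, one per matching <p>
        self._inside = False   # True while inside a matching <p>
        self._chunks = []      # text pieces of the current <p>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("class") == "item-description":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.items.append("".join(self._chunks).strip())
            self._inside = False


# Hypothetical sample input; real input would be the decoded page text
parser = ItemDescriptionParser()
parser.feed('<p class="item-description">Red <b>bike</b></p><p>other</p>')
parser.close()
print(parser.items)
```

Nested tags such as `<b>` are dropped automatically here, because only the text data inside the matching paragraph is collected.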

The pseudocode actually translates to Python code quite easily:

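word_starts_with = '<p class="item-description">'
word_ends_with = '</p>'
file_to_filter = open("file.html", 'r', encoding="utf8")
out_file = open("text_output", 'w', encoding="utf8")
copying = False  # True while inside a matching paragraph
for line in file_to_filter:
    if word_starts_with in line:
        copying = True  # Start tag found: begin copying lines
    if copying:
        print(line, end='', file=out_file)  # Store data in another file
    if copying and word_ends_with in line:
        copying = False  # End tag found: stop until the next start tag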

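Of course, you need to close the files, make sure to strip the tags, and so on, but this is roughly what the code for that algorithm should look like.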
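To cover the parts left open here (stripping the tags, and the bytes issue raised in the question), the following is a minimal sketch rather than a definitive solution. It assumes the page is UTF-8 encoded and that each item ends at the first `</p>` after its start tag. The function name `extract_between` and the sample bytes are made up for illustration and do not come from the original post.

```python
def extract_between(html_bytes, start_tag, end_tag, encoding="utf8"):
    """Return the text found between each start_tag/end_tag pair."""
    # bytes must be decoded to str before str tags can be searched for;
    # undecodable bytes are replaced instead of raising an error
    text = html_bytes.decode(encoding, errors="replace")
    results = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break  # no further start tag
        start += len(start_tag)  # skip past the start tag itself
        end = text.find(end_tag, start)
        if end == -1:
            break  # unterminated item: stop rather than guess
        results.append(text[start:end].strip())
        pos = end + len(end_tag)
    return results


# Small in-memory sample standing in for the bytes read from the saved file
sample = (b'<div><p class="item-description">First item</p>\n'
          b'<p class="item-description">Second\nitem</p></div>')
items = extract_between(sample, '<p class="item-description">', '</p>')
print(items)
```

With the real file, the bytes could be read inside a `with open(file_name, 'rb') as f:` block, which also closes the file automatically. Searching the whole decoded text instead of going line by line means an item that spans several lines is still captured.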

Solution 1
2 Accepted 2016-02-22 00:17:46
