Python readline loops and sub-loops

Question

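I am trying to iterate over some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I am just trying to get the relevant data into an array and understand the readline() functionality in Python.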

This is what the text looks like:

Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number 
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python

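The same format is repeated for many articles in the same file. So far I have figured out how to pull out lines that contain certain text. For example, I can loop through the file and put all of the article titles in a list like this: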

a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)

Now I want to do the following:

a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach
            #    the line that includes "Subject:". Ignore the "Subject:" line,
            #    stop the "Full text:" sub-loop, and add the concatenated full
            #    text to the list array.
            # 2. Continue the for loop within which all of this sits.
            pass

As a Python beginner, I have been spinning my wheels searching Google on this topic. Any pointers would be much appreciated.

Answer 1

If you want to stick with your for loop, you are probably going to need something like this:

titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass

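(A few things: Python is odd about names. When you write list = [], you are actually rebinding the name list, which shadows the built-in list class and can cause problems for you later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a bit more precise here, given your description of the data.)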
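To make both points in that note concrete, here is a small sketch. The sample line and the function name shadowed are made up for illustration and do not come from the question's data.

```python
# A line of body text that happens to mention a marker in the middle.
line = "Full text: see the Title: field above\n"

substring_match = "Title:" in line          # True: matches anywhere in the line
prefix_match = line.startswith("Title:")    # False: the line does not begin with it

def shadowed():
    """Show what happens once the name `list` is rebound to a list object."""
    list = []               # the local name now hides the built-in list type
    try:
        list("abc")         # tries to call the empty list, not the type
    except TypeError as exc:
        return str(exc)
    return ""

message = shadowed()
print(substring_match, prefix_match)
print(message)
```

With `in`, a body line like the one above would be wrongly treated as a title, which is why the prefix check is the safer choice for this format.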

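Alternatively, you could wrap the file object in an iterator (i = iter(f), then next(i)), but that brings some hassle with catching the StopIteration exception, although it would let you use a more classic while loop for the whole thing. Myself, I would stick with the state-machine approach above and just make it robust enough to handle all the edge cases you can reasonably expect.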
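For comparison, the iterator approach could look roughly like the sketch below. It is only one possible shape, under a few assumptions: the function name parse_articles and the SAMPLE string are invented here, io.StringIO stands in for the real file, and next(it, None) is used with a default so that the end of input shows up as None rather than as a StopIteration exception.

```python
import io

SAMPLE = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
"""

def parse_articles(lines):
    """Collect title lines and concatenated full texts, in file order."""
    it = iter(lines)
    records = []
    line = next(it, None)          # None means the input is exhausted
    while line is not None:
        if line.startswith("Title:"):
            records.append(line)
        elif line.startswith("Full text:"):
            full_text = line
            line = next(it, None)
            # Sub-loop: consume lines until the Subject: line or end of input.
            while line is not None and not line.startswith("Subject:"):
                full_text += line
                line = next(it, None)
            records.append(full_text)
            # The Subject: line is dropped, as the question asks.
        line = next(it, None)
    return records

records = parse_articles(io.StringIO(SAMPLE))
for record in records:
    print(repr(record))
```

To run it on the real data, pass the open file object instead of the StringIO, for example parse_articles(f) inside the with block. The nested while loop is the "sub-loop" from the question; the state-machine version avoids it by remembering where it is in a flag instead.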

Answer 2

Since your goal is to build a DataFrame, here is a re + numpy + pandas solution:

import re
import pandas as pd
import numpy as np

# read the whole file into a single string
with open('sample.txt', encoding="utf8") as f:
    text = f.read()


keys = ['Subject', 'Title', 'Full text']

regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]
# reshape the flat [key, value, key, value, ...] list so that each article becomes one group of (key, value) pairs
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])

Output:

                      Title                                                                                                                                               Full text Subject
0       title of an article  unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
1  title of another article                                                                               again unfortunately the full text of each article,\nis on numerous lines.  Python
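To see why the reshape works, it may help to look at what re.split returns when the pattern contains a capturing group: the captured keys are kept in the result, so the list alternates key, value, key, value. The sketch below does the same grouping with the standard library only. The shortened text and the names pairs and records are illustrative, and it assumes, like the answer does, that every article has exactly the three keys.

```python
import re

# Shortened stand-in for the contents of sample.txt.
text = (
    "Title: first\n"
    "Full text: line one,\n"
    "line two.\n"
    "Subject: Python\n"
    "Title: second\n"
    "Full text: only line.\n"
    "Subject: Python\n"
)

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# The first element is the empty string before the first key, so drop it.
chunks = re.split(regex, text)[1:]

# Pair each key with the value that follows it, then group per article.
pairs = list(zip(chunks[0::2], chunks[1::2]))
records = [
    {key: value.strip() for key, value in pairs[i:i + len(keys)]}
    for i in range(0, len(pairs), len(keys))
]

print(chunks)
print(records)
```

The strip() call is there because the very last value keeps the file's trailing newline. A list of dictionaries like records can then be handed to pd.DataFrame in the same way as in the answer.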
