Python Readline Loop and Subloop

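I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I'm just trying to get the relevant data into a list and to understand how readline() works in Python.

This is what the text looks like: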

Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number 
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python

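The same format is repeated for many articles in the same file. So far I've figured out how to pull out the lines that contain certain text. For example, I can loop through the file and put all of the article titles in a list like this: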

a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        # keep every line that contains the marker
        if a in line:
            titleList.append(line)

Now I want to do the following:

a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the
            #    line that includes "Subject:". Ignore the "Subject:" line, stop the
            #    "Full text:" subloop, and add the concatenated full text to the list.
            # 2. Continue the for loop within which all of this sits.
            pass

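As a Python beginner, I'm spinning my wheels searching Google on this topic. Any pointers would be much appreciated.

If you want to stick with your for loop, you're probably going to need something like this: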

titles = []
texts = []
subjects = []

with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass

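(A couple of things: Python is weird about names. When you write list = [], you're actually overwriting the label for the built-in list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)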
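A small sketch of both points. The function name shadow_demo and the sample strings are made up for illustration; they are not part of the question's data.

```python
def shadow_demo():
    """Show what happens once a local name `list` hides the built-in type."""
    list = []                # from here on, `list` is an empty list object
    try:
        list("abc")          # calling a list object, not the built-in type
    except TypeError as exc:
        return str(exc)
    return None


message = shadow_demo()      # the built-in is untouched outside the function

# `in` matches the marker anywhere in the line; startswith only at the start.
title_line = "Title: title of an article\n"
body_line = "see the line 'Title: something' quoted inside the full text\n"

in_matches = ["Title:" in title_line, "Title:" in body_line]
startswith_matches = [title_line.startswith("Title:"),
                      body_line.startswith("Title:")]
```

With `in`, a body line that happens to quote "Title:" would be collected as a title; with startswith it would not.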

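Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that causes some headaches with catching StopIteration exceptions. It would, however, let you use a more classic while loop for the whole thing. Myself, I would stick with the state-machine approach above and make it robust enough to deal with all the edge cases you can reasonably expect.

Since your goal is to build a DataFrame, here is a re + numpy + pandas solution: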
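A minimal sketch of that iterator variant, with some assumptions: the function name parse_with_next and the in-memory SAMPLE are hypothetical stand-ins for the real file, and the two-argument form next(lines, None) is used so that the end of the input comes back as None instead of raising StopIteration.

```python
import io

# Hypothetical stand-in for sample.txt, shortened from the question's example.
SAMPLE = """Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines.
Subject: Python
Title: title of another article
Full text: again unfortunately the full text,
is on numerous lines.
Subject: Python
"""


def parse_with_next(f):
    titles, texts = [], []
    lines = iter(f)                      # a file object is already an iterator
    while True:
        line = next(lines, None)         # None signals end of input
        if line is None:
            break
        if line.startswith("Title:"):
            titles.append(line)
        elif line.startswith("Full text:"):
            full_text = line
            # Sub-loop: keep consuming lines until the "Subject:" line.
            while True:
                line = next(lines, None)
                if line is None or line.startswith("Subject:"):
                    break
                full_text += line
            # If the input ends mid-article, the partial text is still kept.
            texts.append(full_text)
    return titles, texts


titles, texts = parse_with_next(io.StringIO(SAMPLE))
```

To run it on the real file, pass the open file object instead, for example parse_with_next(f) inside the with block.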

import re
import pandas as pd
import numpy as np

# read the whole file into one string
with open('sample.txt', encoding="utf8") as f:
    text = f.read()


keys = ['Subject', 'Title', 'Full text']

regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# split text on keys
chunks = re.split(regex, text)[1:]
# reshape the flat [key, value, key, value, ...] list into one group of (key, value) pairs per article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])

Output:

                      Title                                                                                                                                               Full text Subject
0       title of an article  unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
1  title of another article                                                                               again unfortunately the full text of each article,\nis on numerous lines.  Python
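To see what the split step produces, here is a standard-library-only sketch of the same idea. It assumes, as the reshape above does, that every article contains all three keys exactly once. The inline text is a shortened, made-up stand-in for the file, and the strip() call is an addition that is not in the answer's code.

```python
import re

# Shortened stand-in for the file contents.
text = (
    "Title: title of an article\n"
    "Full text: first line,\n"
    "second line.\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: only one line here.\n"
    "Subject: Python\n"
)

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# Because the key is in a capturing group, re.split keeps it in the result:
# ['', key, value, key, value, ...]. [1:] drops the empty leading element.
chunks = re.split(regex, text)[1:]

# Pair each key with its value, then group len(keys) pairs per article.
pairs = [(k, v.strip()) for k, v in zip(chunks[0::2], chunks[1::2])]
records = [dict(pairs[i:i + len(keys)])
           for i in range(0, len(pairs), len(keys))]
```

A list of dicts like records can be passed to pd.DataFrame directly. Note that any line inside the full text that itself begins with one of the keys followed by ": " would also be split on.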
