How to read a large .txt file into a DataFrame in Python

Question

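I have a large text file containing the names of several different people and their long statements. The file is in .txt format, and I am trying to split the names and the statements into two separate DataFrame columns.

The data is in this format: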

Harvey: I’m inclined to give you a shot. But what if I decide to go the other way?

Mike: I’d say that’s fair. Sometimes I like to hang out with people who aren’t that bright, you know, just to see how the other half lives.
Mike in the club
(mike speaking to jessica.)
Jessica: How are you mike?

Mike: good!
.....
....

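and so on.

The text file has a length of 4 million.

As output, I need a DataFrame with a name column holding the speaker's name and a statement column holding that person's corresponding statement.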

Answer 1

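If the format is always "name: single line containing no colon", you can try the following (a separator longer than one character is treated as a regular expression, so the Python parsing engine is requested explicitly):
import pandas as pd
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')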
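Before relying on read_csv, it may be worth checking whether the file really follows that format. The sketch below is one possible way to do so; `find_nonconforming_lines` is a hypothetical helper name, and the sample text is a shortened version of the data in the question, not the real file.

```python
import io


def find_nonconforming_lines(lines, sep=": "):
    """Return (line_number, text) for non-blank lines that do not contain sep exactly once."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank lines are ignored in this check
        if text.count(sep) != 1:
            problems.append((number, text))
    return problems


# Shortened sample in the style of the question's data (an assumption, not the real file)
sample = io.StringIO(
    "Harvey: I'm inclined to give you a shot.\n"
    "\n"
    "Mike: I'd say that's fair.\n"
    "Mike in the club\n"
    "Jessica: How are you mike?\n"
)
print(find_nonconforming_lines(sample))
# [(4, 'Mike in the club')]
```

If the returned list is not empty, as with the sample in the question, the one-line read_csv approach is unlikely to produce clean name and statement columns.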

Or do it manually:

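import pandas as pd

file_contents = []
current_name = ""
current_dialogue = ""

# the with block closes the file even if an error occurs
# (encoding="utf-8" is an assumption; adjust it to match the file)
with open("untitled.txt", "r", encoding="utf-8") as f:
    for line in f:
        splitted_line = line.split(": ")
        if len(splitted_line) > 1:
            # this line carries a "name: " prefix
            # first save the dialogue collected so far
            if current_name:
                file_contents.append([current_name, current_dialogue.strip()])
            # then switch to the name just encountered
            current_name = splitted_line.pop(0)
            current_dialogue = ""
        # lines without "name: " are appended to the current dialogue
        current_dialogue += ": ".join(splitted_line)

# add the last dialogue, if any speaker was found
if current_name:
    file_contents.append([current_name, current_dialogue.strip()])

df = pd.DataFrame(file_contents, columns=["name", "statement"])
df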

Answer 2

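If you read the file line by line, you can use something like this to separate the speaker from the spoken text without using regular expressions.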

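def find_speaker_and_text_from_line(line):
    # split on every ": " and treat the first piece as the speaker's name
    split = line.split(": ")
    name = split.pop(0)
    # re-join the remaining pieces so any ": " inside the statement is preserved
    rest = ": ".join(split)
    return name, rest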
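As a rough sketch of how such a helper could be applied to a large file without loading it all at once, the generator below walks the lines one at a time. The names `iter_statements` and `to_dataframe` are hypothetical, and the sketch assumes that any line containing ": " starts a new speaker, so a continuation line that happens to contain ": " would be misread as a new speaker.

```python
import io


def find_speaker_and_text_from_line(line):
    # first piece is the speaker's name, the rest is the spoken text
    split = line.split(": ")
    name = split.pop(0)
    rest = ": ".join(split)
    return name, rest


def iter_statements(lines):
    """Yield (name, statement) pairs, folding lines without "name: " into the previous statement."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if ": " in line:
            # a new speaker starts: emit the statement collected so far
            if name is not None:
                yield name, " ".join(parts)
            name, text = find_speaker_and_text_from_line(line)
            parts = [text]
        elif name is not None:
            # continuation line: attach it to the current speaker
            parts.append(line)
    if name is not None:
        yield name, " ".join(parts)


def to_dataframe(rows):
    # pandas is imported here so the parsing code above runs without it
    import pandas as pd
    return pd.DataFrame(list(rows), columns=["name", "statement"])


# Shortened sample in the style of the question's data (an assumption, not the real file)
sample = io.StringIO(
    "Harvey: I'm inclined to give you a shot.\n"
    "\n"
    "Mike: I'd say that's fair.\n"
    "Mike in the club\n"
    "Jessica: How are you mike?\n"
)
for row in iter_statements(sample):
    print(row)
```

For a real file, the same generator could be fed an open file object, for example `with open("untitled.txt", encoding="utf-8") as f: df = to_dataframe(iter_statements(f))`, where the file name and encoding are assumptions.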
