
How to read a large text file into a data frame from a .txt file in Python

I have a large text file that contains names and long paragraphs of statements made by several different people. The file format is .txt. I am trying to separate the name and the statement into two different columns of a data frame.

The data is in this format:

Harvey: I’m inclined to give you a shot. But what if I decide to go the other way?

Mike: I’d say that’s fair. Sometimes I like to hang out with people who aren’t that bright, you know, just to see how the other half lives.
Mike in the club
(mike speaking to jessica.)
Jessica: How are you mike?

Mike: good!
.....
....

and so on.

The length of the text file is 4 million.

In the output I need a DataFrame with a name column holding the speaker's name and a statement column holding that person's statement.

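If the format is always "name: one-liner-no-colon", you could try:
import pandas as pd

# A multi-character separator is treated as a regex and needs the Python parsing engine.
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')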
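
If a statement can itself contain ": ", splitting on every occurrence would spread it over extra columns. A minimal sketch of one way around that, assuming the lines are already in memory, is to split only at the first ": " with `Series.str.split`. The sample lines and the column names `name` and `statement` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample lines; in practice these would be read from the .txt file.
lines = [
    "Harvey: I'm inclined to give you a shot.",
    "Mike: Here is the deal: I pass the bar, you hire me.",
]

# n=1 splits each line only at the first ": ", so any later ": "
# stays inside the statement column.
df = pd.Series(lines).str.split(": ", n=1, expand=True)
df.columns = ["name", "statement"]
```

Lines without any ": " end up with a missing value in the statement column, so this only suits files where every line starts with a name.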

Or parse it manually:

import pandas as pd

file_contents = []

current_name = ""
current_dialogue = ""

with open("untitled.txt", "r") as f:
    for line in f:
        splitted_line = line.split(": ")
        if len(splitted_line) > 1:
            # This line starts with "name: ", so a new statement begins.
            # First save the statement collected so far, if there is one.
            if current_name:
                file_contents.append([current_name, current_dialogue.strip()])
            # Then switch to the name just encountered.
            current_name = splitted_line.pop(0)
            current_dialogue = ""
        # Re-join the rest so that any ": " inside the statement is kept.
        # Lines without a name are appended to the current statement.
        current_dialogue += ": ".join(splitted_line)

# Save the last statement, unless no name was ever found.
if current_name:
    file_contents.append([current_name, current_dialogue.strip()])

df = pd.DataFrame(file_contents, columns=["name", "statement"])
df
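
Since the file is large, the same idea can also be written as a generator that yields one (name, statement) pair at a time instead of growing strings in a loop. This is only a sketch: the function name `iter_statements` is hypothetical, and unlike the loop above it skips blank lines and joins continuation lines with a space rather than a newline.

```python
def iter_statements(lines):
    """Yield (name, statement) pairs from an iterable of text lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        head, sep, tail = line.partition(": ")
        if sep:
            # A new "name: ..." line: emit the previous statement first.
            if name is not None:
                yield name, " ".join(parts)
            name, parts = head, [tail]
        elif name is not None:
            # No "name: " prefix: treat it as a continuation line.
            parts.append(line)
    # Emit the final statement once the input is exhausted.
    if name is not None:
        yield name, " ".join(parts)


# Lines shaped like the sample in the question.
sample = [
    "Harvey: I'm inclined to give you a shot.\n",
    "\n",
    "Mike: I'd say that's fair.\n",
    "Mike in the club\n",
    "Jessica: How are you mike?\n",
]
rows = list(iter_statements(sample))
```

With a real file, the open file object can be passed in place of `sample`, and the resulting list of rows can be handed to `pd.DataFrame(rows, columns=["name", "statement"])`.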

If you read the file line by line, you can use something like this to split the speaker from the spoken text without using a regex.

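def find_speaker_and_text_from_line(line):
    # Split on ": "; the first piece is the speaker's name.
    parts = line.split(": ")
    name = parts.pop(0)
    # Re-join the remaining pieces so any ": " in the text is kept.
    rest = ": ".join(parts)
    return name, rest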
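
One thing to watch for: on a line with no ": " at all, such as "Mike in the club", the helper above returns the whole line as the name and an empty text. A possible variant, sketched here under the hypothetical name `split_speaker`, splits only once and returns `None` as the name for such lines so the caller can tell them apart.

```python
def split_speaker(line):
    """Split a line at the first ': ' only; name is None if there is no speaker."""
    parts = line.rstrip("\n").split(": ", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # No ": " found: the line carries no speaker name.
    return None, parts[0]


with_speaker = split_speaker("Jessica: How are you mike?\n")
without_speaker = split_speaker("(mike speaking to jessica.)\n")
```

The caller can then decide whether a line without a speaker is appended to the previous statement or dropped.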

