More efficient way to import and apply a function to text data in a Pandas DataFrame
Basically my code reads in a text file, tokenizes it into sentences, then finds sentences where two pre-defined anchor words occur, computes the distance between them within the sentence, and appends that distance to a column whose header is the two pre-defined words, with one row per sentence in the file. If the two words don't occur in the sentence, the value is null.
i.e. if the text is 'The brown fox jumped. Over the lazy dog. Happy word.', the data frame looks like:

     | brown+jumped | jumped+over | the+dog | over+dog
  1  | 1            | null        | null    | null
  2  | null         | null        | 2       | 3
  3  | null         | null        | null    | null
The code runs fine when parsing a short paragraph, but on larger text files it takes a lot longer. I know the key to speed when working with DataFrames is to avoid for-loops and to apply functions to the whole data set.
My question is: when reading in a text file, is there a quicker way to apply a function to the text than going line by line and appending the results to a dataframe?
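To illustrate what I mean by applying a function to the whole data set rather than looping, here is a minimal sketch (the sentences are made-up examples, not my real data):

```python
import pandas as pd

df = pd.DataFrame({'text': ['The brown fox jumped.',
                            'Over the lazy dog.',
                            'Happy word.']})

# Row-by-row loop (slow on large frames):
lengths_loop = [len(t.split()) for t in df['text']]

# Whole-column string methods (vectorized at the pandas level):
lengths_vec = df['text'].str.split().str.len()
```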
Here's what the code looks like, in case it helps:
for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder+filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count = 0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws)  # anchor_words is a predefined set of words that I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs))  # sent_to_vecs vectorizes the words in the sentence, and a list of anchor words
            for sent_vec in sent_vecs:
                # Save the word that it was measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word+'+'+each_tup[0]] = 1/(each_tup[1])  # distance between the two words, keyed to the line the sentence is on
                sent_count += 1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf)  # pdf holds the column headers
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
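One note on the code above: calling `DataFrame.append` inside a loop copies every prior row on each call, which is quadratic in the number of sentences. A cheaper pattern is to collect each sentence's results as a plain dict and build the frame once at the end. A minimal sketch with hypothetical per-sentence dicts (stand-ins for what the inner loop produces, with `'null'` standing in for `null_val`):

```python
import pandas as pd

# Hypothetical per-sentence results: each dict maps "word1+word2"
# column labels to a distance value for that sentence.
rows = [
    {'brown+jumped': 1},
    {'the+dog': 2, 'over+dog': 3},
    {},  # a sentence with no anchor-word pairs
]

# Build the DataFrame once instead of appending row by row.
doc_df = pd.DataFrame(rows, index=range(1, len(rows) + 1))
doc_df = doc_df.fillna('null')
```

Only the list append happens per sentence, which is O(1), so the total cost stays linear.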
I suggest restructuring your code to reduce the number of nested loops.
Below is an example which uses TextBlob to identify words and sentences, and itertools to build the various word combinations. The results are appended to a pandas DataFrame.
import itertools
from textblob import TextBlob
import pandas as pd

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = [x for x in itertools.combinations(anchors, 2)]

df = pd.DataFrame()
for idx, sentence in enumerate(data.sentences):
    word_list = sentence.words.lower()
    row = {}
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            label = '%s+%s' % (first, second)
            row.update({label: word_list.index(second) - word_list.index(first)})
    df = df.append(pd.Series(row), ignore_index=True)
The result is:
brown+jumped over+dog over+the the+dog
0 2 NaN NaN NaN
1 NaN 3 1 2
2 NaN NaN NaN NaN
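Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. The same approach can be written by collecting plain dicts and constructing the frame once; here the tokenized sentences are hard-coded so the sketch runs without TextBlob:

```python
import itertools
import pandas as pd

# Pre-tokenized, lowercased sentences (what TextBlob would produce).
sentences = [['the', 'brown', 'fox', 'jumped'],
             ['over', 'the', 'lazy', 'dog'],
             ['happy', 'word']]
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = set(itertools.combinations(anchors, 2))

rows = []
for word_list in sentences:
    row = {}
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            row['%s+%s' % (first, second)] = word_list.index(second) - word_list.index(first)
    rows.append(row)

# One construction call replaces the per-row append.
df = pd.DataFrame(rows)
```

This produces the same distances as the loop above and also works on pandas 2.x.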