More efficient way to import and apply a function to text data in a Pandas DataFrame
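Basically, my code reads in a text file, tokenizes it into sentences, and then finds the sentences in which two predefined anchor words occur. It computes the distance between the two words within the sentence and appends that distance to the dataframe, in the column whose header is the two predefined words and in the row that corresponds to the sentence's position in the file. If the two words do not both occur in the sentence, the value is null.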
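For example, if the text is "The brown fox jumped. Over the lazy dog. Happy word.", the dataframe would look like this:
0 | brown+jumped | jumped+over | the+dog | over+dog
1 | 1            | null        | null    | null
2 | null         | null        | 2       | 3
3 | null         | null        | null    | null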
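The code works fine when parsing shorter passages, but it takes much longer on larger text files. I know that the key to speed when working with dataframes is to avoid for loops and to apply functions to the whole dataset.
My question is: is there a faster way to apply a function to a string as the text file is read in, other than going line by line and appending the result to a dataframe?
If it helps, this is what the code looks like.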
for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder+filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count=0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws) # anchor_words is a predefined set of words that I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs)) # sent_to_vecs vectorizes the words in the sentence, given a list of anchor words
            for sent_vec in sent_vecs:
                # Save the word that it was measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word+'+'+each_tup[0]]=1/(each_tup[1]) # store the distance between the two words on the row for this sentence
                sent_count+=1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf) # pdf is the column header.
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
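I would suggest restructuring your code to reduce the number of nested loops.
Below is an example that uses TextBlob to identify the words and sentences, and itertools to build the various word combinations. The results are collected into a Pandas DataFrame.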
import itertools

import pandas as pd
from textblob import TextBlob

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
# Every ordered pair of anchor words, in the order they appear in `anchors`.
anchor_pairs = list(itertools.combinations(anchors, 2))

rows = []
for sentence in data.sentences:
    word_list = sentence.words.lower()
    row = {}
    # Check every ordered pair of words in the sentence against the anchor pairs.
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            label = '%s+%s' % (first, second)
            # Distance is the difference between the two word positions.
            row[label] = word_list.index(second) - word_list.index(first)
    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame once
# from the collected rows and sort the columns alphabetically.
df = pd.DataFrame(rows).sort_index(axis=1)
The result is:
   brown+jumped  over+dog  over+the  the+dog
0             2       NaN       NaN      NaN
1           NaN         3         1        2
2           NaN       NaN       NaN      NaN
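As a possible variation on the example above, the per-sentence work can be reduced by storing the anchor pairs in a set and recording each anchor's position once, rather than calling `index` for every pair. This is only a sketch under a few assumptions: the function name `anchor_pair_rows` is hypothetical, sentences are split naively with a regular expression instead of a proper tokenizer, and only the first occurrence of each anchor in a sentence is used.

```python
import itertools
import re


def anchor_pair_rows(text, anchors):
    """Return one dict per sentence mapping 'first+second' to a word distance."""
    anchor_set = set(anchors)
    # Set of ordered anchor pairs, so each membership test is O(1).
    anchor_pairs = set(itertools.combinations(anchors, 2))
    rows = []
    # Assumption: a naive split on ., ! or ? stands in for a real sentence tokenizer.
    for sentence in re.split(r'[.!?]+', text):
        words = re.findall(r'[a-z]+', sentence.lower())
        if not words:
            continue  # skip empty fragments left over by the split
        # Record the first position of each anchor word in this sentence.
        positions = {}
        for i, word in enumerate(words):
            if word in anchor_set and word not in positions:
                positions[word] = i
        # Walk the anchors in sentence order and keep only the known pairs.
        ordered = sorted(positions, key=positions.get)
        row = {}
        for first, second in itertools.combinations(ordered, 2):
            if (first, second) in anchor_pairs:
                row[f'{first}+{second}'] = positions[second] - positions[first]
        rows.append(row)
    return rows


rows = anchor_pair_rows(
    'The brown fox jumped. Over the lazy dog. Happy word.',
    ['brown', 'jumped', 'over', 'the', 'dog'],
)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` once, after all sentences are processed, avoids growing the frame one row at a time; sentences with no anchor pair become rows of NaN.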