A more efficient way to import text data and apply a function to it in a pandas DataFrame

Question

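Basically, my code reads a text file, tokenizes it into sentences, and then looks for sentences in which two predefined anchor words both occur. It finds the distance between the two words in the sentence and stores that distance in the column whose header is the two predefined words, on the row that corresponds to the sentence in the file. If the two words do not both occur in the sentence, the value is null.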

For example, if the text is "The brown fox jumped. Over the lazy dog. Happy word.", the DataFrame would look like this:

0 | brown+jumped | jumped+over | the+dog | over+dog
1 | 1            | null        | null    | null
2 | null         | null        | 2       | 3
3 | null         | null        | null    | null
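
The per-sentence step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the asker's actual implementation: `pair_distances` is a hypothetical helper, words are taken to be lowercase runs of letters, and the distance is defined as the difference between the first positions of the two words, which may not match the numbers in the table above.

```python
import re
from itertools import combinations


def pair_distances(sentence, anchors):
    """Map 'first+second' to the positional gap between two anchor words.

    Assumptions (not from the original post): words are lowercase runs of
    letters, and only the first occurrence of each anchor word counts.
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    # Record the first position of every anchor word present in the sentence.
    first_pos = {}
    for position, word in enumerate(words):
        if word in anchors and word not in first_pos:
            first_pos[word] = position
    # Pair the anchors in order of appearance so the gap is always positive.
    present = sorted(first_pos, key=first_pos.get)
    return {
        f"{a}+{b}": first_pos[b] - first_pos[a]
        for a, b in combinations(present, 2)
    }


anchors = {"brown", "jumped", "over", "dog"}
sentences = ["The brown fox jumped.", "Over the lazy dog.", "Happy word."]
rows = [pair_distances(s, anchors) for s in sentences]
```

Each element of `rows` is a dict for one sentence, and a sentence with fewer than two anchor words yields an empty dict, which later becomes a row of nulls.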

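The code works fine when parsing shorter passages, but it takes much longer on larger text files. I know that the key to speed when working with DataFrames is to avoid for loops and to apply functions to the whole dataset at once.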

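My question is: when reading a text file, is there a faster way to apply a function to the strings than going line by line and appending each result to the DataFrame?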
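
One common way to avoid the row-by-row append is to collect one dict per sentence in an ordinary list and build the DataFrame a single time at the end. Each call to `DataFrame.append` returns a new frame, so appending inside a loop copies the accumulated data over and over. The sketch below uses made-up rows purely for illustration; the column names and values are assumptions, not output from the code in this question.

```python
import pandas as pd

# Hypothetical per-sentence results, one dict per sentence.
rows = [
    {"brown+jumped": 2},
    {"over+dog": 3, "over+the": 1},
    {},  # a sentence with no anchor pair
]

# Build the frame once; keys missing from a row become NaN automatically.
# The index starts at 1 to mirror the sentence numbering in the question.
df = pd.DataFrame(rows, index=range(1, len(rows) + 1))
```

Appending to a Python list is cheap, so the cost of building the table is paid once instead of once per sentence.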

If it helps, this is what the code looks like.

    for filename in file_list:
        doc_df = pd.DataFrame()
        doc = open(doc_folder+filename, "r")
        doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
        doc.close()
        sents = sent_detector.tokenize(doc_text)
        sent_count=0
        for sent in sents:
            sent_l = sent.lower()
            sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
            sent_anchs = anchor_words.intersection(sent_ws) #anchor_words is a predefined list of words that I'm looking for
            if sent_anchs != set():
                sent_vecs = sent_to_vecs(sent_l, list(sent_anchs)) # sent_to_vec vectorizes the words in the sentence, and a list of anchor words
                for sent_vec in sent_vecs:
                    # Save the word that it was measured from                
                    base_word = sent_vec[0]
                    df_dict = {}                    
                    for each_tup in mean_treat(sent_vec)[1]:
                        if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                            continue
                        df_dict[base_word+'+'+each_tup[0]]=1/(each_tup[1]) #store the inverse of the distance between the two words on the row for this sentence
                    sent_count+=1
                    doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
        doc_df = doc_df.append(pdf) #pdf is the column header.
        doc_df = doc_df.fillna(null_val)
        print('Saving {} to a csv.'.format(filename))
        doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
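
A side note on the file handling in the loop above, offered as a sketch rather than a fix for the speed problem: a `with` block closes the file even if an exception is raised, and `pathlib` can swap the extension without assuming it is exactly three characters long. The helper names `clean_text`, `load_document` and `csv_path_for` are hypothetical.

```python
from pathlib import Path


def clean_text(raw):
    """Apply the same three replacements as the loop above."""
    return raw.replace("-\n\n ", "").replace("\n", "").replace("\x0c", ".")


def load_document(path):
    """Read and clean a file; the context manager closes it on errors too."""
    with open(path, "r") as handle:
        return clean_text(handle.read())


def csv_path_for(path):
    """Replace the file extension with .csv, whatever its length."""
    return Path(path).with_suffix(".csv")
```

With these helpers, `filename[0:-4]+'.csv'` becomes `csv_path_for(filename)`, which also behaves sensibly for a name such as `notes.text`.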

Answer 1

I would suggest restructuring the code to reduce the number of nested loops.

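Below is an example that uses TextBlob to identify the words and sentences and itertools to build the various word combinations. The results are appended to a pandas DataFrame.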

    import itertools

    import pandas as pd
    from textblob import TextBlob

    data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')

    # Every ordered pair of anchor words that may become a column.
    anchors = ['brown', 'jumped', 'over', 'the', 'dog']
    anchor_pairs = list(itertools.combinations(anchors, 2))

    df = pd.DataFrame()
    for idx, sentence in enumerate(data.sentences):
        word_list = sentence.words.lower()
        row = {}
        # Check each pair of words in the sentence against the anchor pairs.
        for pair in itertools.combinations(word_list, 2):
            if pair in anchor_pairs:
                first, second = pair
                label = '%s+%s' % (first, second)
                # Distance is the difference between the two word positions.
                row[label] = word_list.index(second) - word_list.index(first)
        # DataFrame.append was removed in pandas 2.0; this line needs pandas < 2.0.
        df = df.append(pd.Series(row), ignore_index=True)

The result is:

        brown+jumped    over+dog    over+the    the+dog
    0   2               NaN         NaN         NaN
    1   NaN             3           1           2
    2   NaN             NaN         NaN         NaN
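
For readers on pandas 2.0 or later, where `DataFrame.append` no longer exists, a variant of the same idea can be sketched as follows. It is not the answer's original code: the regular expressions are a naive stand-in for TextBlob's tokenizer (an assumption made so the example needs only pandas), the anchor pairs are kept in a set so each lookup is a hash check instead of a list scan, and the frame is built once from a list of dicts.

```python
import itertools
import re

import pandas as pd

text = "The brown fox jumped. Over the lazy dog. Happy word."
anchors = ["brown", "jumped", "over", "the", "dog"]
# A set makes each membership test a hash lookup instead of a list scan.
anchor_pairs = set(itertools.combinations(anchors, 2))

rows = []
# Naive split after '.', '!' or '?'; a stand-in for a real sentence tokenizer.
for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
    words = re.findall(r"[a-z]+", sentence.lower())
    row = {}
    for first, second in itertools.combinations(words, 2):
        if (first, second) in anchor_pairs:
            # Distance is the difference between the two word positions.
            row[f"{first}+{second}"] = words.index(second) - words.index(first)
    rows.append(row)

# One construction at the end replaces the per-sentence append.
# Sorting the columns gives the same column order as the table above.
df = pd.DataFrame(rows).sort_index(axis=1)
```

On the sample text this yields the same distances as the table above, with one row per sentence and NaN wherever a pair does not occur.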

Answer 1
0 2015-11-16 22:01:15