More efficient way to import and apply a function to text data in a Pandas Dataframe

Basically my code reads in a text file, tokenizes it into sentences, then finds sentences where two pre-defined anchor words occur, computes the distance between them within the sentence, and appends that distance to a column whose header is the two pre-defined words, with one row per sentence in the file. If the two words don't both occur in the sentence, the cell is null.

i.e. if the sentence is 'The brown fox jumped. Over the lazy dog. Happy word.', the data frame looks like:

0 | brown+jumped | jumped+over | the+dog | over+dog
1 | 1            | null        | null    | null
2 | null         | null        | 2       | 3
3 | null         | null        | null    | null

The code runs fine when parsing a short paragraph, but on larger text files it takes much longer. I know the key to speed when working with DataFrames is to avoid for-loops and to apply functions to the whole data set.

My question is: is there a quicker way to apply a function to a string when reading in a text file, other than going line by line and appending to a dataframe?

Here's what the code looks like, in case it helps.

for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder+filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count = 0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws)  # anchor_words is a predefined set of words I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs))  # sent_to_vecs vectorizes the words in the sentence, given a list of anchor words
            for sent_vec in sent_vecs:
                # Save the word that the distance is measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word+'+'+each_tup[0]] = 1/(each_tup[1])  # store the distance between the two words on the line for this sentence
                sent_count += 1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf)  # pdf holds the column headers
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
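For reference, the main hot spot in a loop like this is usually the `doc_df.append(...)` call inside the inner loop, which copies the entire frame on every iteration. A common pattern is to collect one plain dict per sentence in a list and build the DataFrame in a single call at the end. A minimal sketch, using made-up per-sentence results rather than the real `sent_to_vecs`/`mean_treat` output:

```python
import pandas as pd

# Hypothetical stand-in for the per-sentence results: one dict of
# 'word+word' -> distance per sentence (an empty dict means no anchor pairs).
sentence_rows = [
    {'brown+jumped': 1},
    {'the+dog': 2, 'over+dog': 3},
    {},
]

# Build the frame once; keys missing from a row become NaN automatically.
doc_df = pd.DataFrame(sentence_rows)
doc_df.index = range(1, len(sentence_rows) + 1)  # 1-based, like sent_count
```

This turns O(n) frame copies into a single construction, which is typically the dominant cost for large files.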

I suggest restructuring your code to reduce the number of nested loops.

Below is an example which uses TextBlob to identify words and sentences, and itertools to build the various word combinations. The results are appended to a Pandas DataFrame.

import itertools
from textblob import TextBlob
from collections import defaultdict
import pandas as pd

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')

anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = [x for x in itertools.combinations(anchors, 2)]

df = pd.DataFrame()
for idx, sentence in enumerate(data.sentences):
    word_list = sentence.words.lower()
    row = {}
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            label = '%s+%s' % (first, second)
            row.update({label: word_list.index(second) - word_list.index(first)})
    df = df.append(pd.Series(row), ignore_index=True)

The result is:

    brown+jumped    over+dog    over+the    the+dog
0   2               NaN         NaN         NaN
1   NaN             3           1           2
2   NaN             NaN         NaN         NaN
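One caveat: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is usually written by accumulating plain dicts and constructing the frame once. A rough equivalent of the loop above without TextBlob (plain lowercase word lists stand in for `sentence.words`):

```python
import itertools
import pandas as pd

# Plain word lists stand in for TextBlob's sentence.words.lower()
sentences = [
    ['the', 'brown', 'fox', 'jumped'],
    ['over', 'the', 'lazy', 'dog'],
    ['happy', 'word'],
]
anchor_pairs = set(itertools.combinations(
    ['brown', 'jumped', 'over', 'the', 'dog'], 2))

rows = []
for words in sentences:
    row = {}
    for first, second in itertools.combinations(words, 2):
        if (first, second) in anchor_pairs:
            # distance = positional gap between the two anchor words
            row['%s+%s' % (first, second)] = (
                words.index(second) - words.index(first))
    rows.append(row)

df = pd.DataFrame(rows)  # one construction instead of per-row appends
```

This produces the same columns and distances as the output shown above, while avoiding the quadratic cost of repeated appends.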
