
More efficient way to import and apply a function to text data in a Pandas Dataframe

Basically, my code reads in a text file, tokenizes it into sentences, and finds the sentences in which two predefined anchor words both occur. It then measures the distance between the two words in the sentence and stores it in a column whose header is the pair of words, in the row for that sentence. If the two words don't both occur in the sentence, the value is null.

For example, if the text is 'The brown fox jumped. Over the lazy dog. Happy word.', the data frame looks like this:

0 | brown+jumped | jumped+over | the+dog | over+dog
1 | 1 | null | null | null
2 | null | null | 2 | 3
3 | null | null | null | null

The code runs fine on a short paragraph, but it takes much longer on larger text files. I know the key to speed when working with DataFrames is to avoid for-loops and to apply functions to the whole data set.

My question is: is there a quicker way to apply a function to a string when reading in a text file, other than going line by line and appending each result to a DataFrame?

Here's what the code looks like, if it's any help.

    for filename in file_list:
        doc_df = pd.DataFrame()
        doc = open(doc_folder+filename, "r")
        doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
        doc.close()
        sents = sent_detector.tokenize(doc_text)
        sent_count=0
        for sent in sents:
            sent_l = sent.lower()
            sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
            sent_anchs = anchor_words.intersection(sent_ws) # anchor_words is a predefined set of words that I'm looking for
            if sent_anchs != set():
                sent_vecs = sent_to_vecs(sent_l, list(sent_anchs)) # sent_to_vecs vectorizes the words in the sentence, given a list of anchor words
                for sent_vec in sent_vecs:
                    # Save the word that it was measured from                
                    base_word = sent_vec[0]
                    df_dict = {}                    
                    for each_tup in mean_treat(sent_vec)[1]:
                        if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                            continue
                        df_dict[base_word+'+'+each_tup[0]]=1/(each_tup[1]) # store the inverse distance between the two words for this sentence's row
                    sent_count+=1
                    doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
        doc_df = doc_df.append(pdf) #pdf is the column header.
        doc_df = doc_df.fillna(null_val)
        print('Saving {} to a csv.'.format(filename))
        doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')

I suggest restructuring your code to reduce the number of nested loops.

Below is an example that uses TextBlob to identify words and sentences, and itertools to build the word combinations. The results are collected into a pandas DataFrame.

import itertools
import pandas as pd
from textblob import TextBlob

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')

anchors = ['brown', 'jumped', 'over', 'the', 'dog']
# A set gives constant-time membership tests for each candidate pair.
anchor_pairs = set(itertools.combinations(anchors, 2))

rows = []
for sentence in data.sentences:
    word_list = sentence.words.lower()
    row = {}
    for first, second in itertools.combinations(word_list, 2):
        if (first, second) in anchor_pairs:
            # Distance is the difference between the words' first positions.
            label = '%s+%s' % (first, second)
            row[label] = word_list.index(second) - word_list.index(first)
    rows.append(row)

# Build the frame once; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows).sort_index(axis=1)

The result is:

    brown+jumped    over+dog    over+the    the+dog
0   2               NaN         NaN         NaN
1   NaN             3           1           2
2   NaN             NaN         NaN         NaN
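
If the sentences are long, enumerating every word pair can itself become the bottleneck. A possible variant, sketched below, loops over the anchor pairs instead and looks each word up in a dictionary of first positions. This is only a rough sketch under some assumptions: the function name pair_distances is made up for illustration, the regex sentence split is a naive stand-in for a real tokenizer, and a pair is kept only when the first occurrence of the first word precedes the first occurrence of the second, which can differ from the example above when a word repeats within a sentence.

```python
import itertools
import re


def pair_distances(text, anchors):
    """Return one dict per sentence mapping 'a+b' to the word distance."""
    anchor_pairs = list(itertools.combinations(anchors, 2))
    # Naive sentence split on ., ! or ? (an assumption; swap in your tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    rows = []
    for sentence in sentences:
        # Map each word to the position of its first occurrence.
        first_pos = {}
        for i, word in enumerate(re.findall(r'[a-z]+', sentence.lower())):
            first_pos.setdefault(word, i)
        row = {}
        for a, b in anchor_pairs:
            # Keep the pair only if both words occur and a comes before b.
            if a in first_pos and b in first_pos and first_pos[a] < first_pos[b]:
                row[a + '+' + b] = first_pos[b] - first_pos[a]
        rows.append(row)
    return rows


rows = pair_distances(
    'The brown fox jumped. Over the lazy dog. Happy word.',
    ['brown', 'jumped', 'over', 'the', 'dog'],
)
print(rows)
# With pandas installed, pd.DataFrame(rows) builds the whole frame in one call.
```

The function returns plain dictionaries, so the same list can be built up across many sentences or files and handed to pandas once at the end, rather than growing a DataFrame row by row.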
