My code reads in a text file, tokenizes it into sentences, and finds the sentences in which two predefined anchor words both occur. It measures the distance between the two words in the sentence and stores that distance in a DataFrame where each column header is a pair of anchor words and each row is a sentence in the file. If the two words don't both occur in the sentence, the value is null.
For example, if the text is 'The brown fox jumped. Over the lazy dog. Happy word.', the data frame looks like this:
0 | brown+jumped | jumped+over | the+dog | over+dog
1 | 1 | null | null | null
2 | null | null | 2 | 3
3 | null | null | null | null
The code runs fine when parsing a short paragraph, but it takes much longer on larger text files. I know the key to speed when working with DataFrames is to avoid for-loops and to apply functions to the whole data set.
My question is: is there a quicker way to apply a function to a string when reading in a text file, other than going line by line and appending to a DataFrame?
Here's what the code looks like, if it's any help.
for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder+filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count=0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws) #anchor_words is a predefined set of words that I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs)) # sent_to_vecs takes the sentence and a list of anchor words, and vectorizes the words in the sentence
            for sent_vec in sent_vecs:
                # Save the word that it was measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word+'+'+each_tup[0]]=1/(each_tup[1]) #store the distance between the two words on the row for this sentence
                sent_count+=1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf) #pdf is the column header.
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
I suggest restructuring your code to reduce the number of nested loops.
Below is an example that uses TextBlob to identify words and sentences, and itertools to build the word combinations. The results are collected into a pandas DataFrame.
import itertools

import pandas as pd
from textblob import TextBlob

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
# A set makes the membership test inside the loop a constant-time lookup.
anchor_pairs = set(itertools.combinations(anchors, 2))

rows = []
for sentence in data.sentences:
    word_list = sentence.words.lower()
    row = {}
    for first, second in itertools.combinations(word_list, 2):
        if (first, second) in anchor_pairs:
            # Distance is the gap between the first occurrences of the two words.
            label = '%s+%s' % (first, second)
            row[label] = word_list.index(second) - word_list.index(first)
    rows.append(row)

# Build the frame once at the end; DataFrame.append was removed in pandas 2.0.
# Sorting the columns keeps the labels in alphabetical order.
df = pd.DataFrame(rows).sort_index(axis=1)
The result is:
   brown+jumped  over+dog  over+the  the+dog
0             2       NaN       NaN      NaN
1           NaN         3         1        2
2           NaN       NaN       NaN      NaN
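For long sentences, checking every pair of words can itself become the bottleneck. One possible variation, offered as a rough sketch and not as a drop-in replacement, records the first position of each anchor word in a single pass and then pairs up only the anchors that are actually present. The function name `sentence_rows` and the regex-based sentence and word splitting are assumptions made to keep the example free of third-party dependencies; a real tokenizer would replace them. On the example text it produces the same distances as the code above, though it may differ when an anchor word is repeated within a sentence.

```python
import itertools
import re


def sentence_rows(text, anchors):
    """Return one dict per sentence mapping 'first+second' to the index gap.

    Hypothetical helper: the naive regex splitting below stands in for a
    real sentence/word tokenizer such as TextBlob or NLTK.
    """
    anchor_set = set(anchors)
    anchor_pairs = set(itertools.combinations(anchors, 2))
    rows = []
    for sentence in re.split(r'[.!?]+', text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue  # skip the empty fragment after the final full stop
        # Single pass: remember where each anchor word first appears.
        first_pos = {}
        for i, word in enumerate(words):
            if word in anchor_set and word not in first_pos:
                first_pos[word] = i
        # Pair only the anchors that are present, in order of appearance.
        present = sorted(first_pos, key=first_pos.get)
        row = {}
        for first, second in itertools.combinations(present, 2):
            if (first, second) in anchor_pairs:
                row[first + '+' + second] = first_pos[second] - first_pos[first]
        rows.append(row)
    return rows


rows = sentence_rows(
    'The brown fox jumped. Over the lazy dog. Happy word.',
    ['brown', 'jumped', 'over', 'the', 'dog'],
)
```

The list of dicts can then be passed to `pd.DataFrame(rows)` once per file, so the DataFrame is built a single time instead of being grown row by row.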