
Applying function to pandas dataframe: is there a more memory-efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:

Author Title Date Category Text url
0 Amira Charfeddine Wild Fadhila 01 2019-01-01 novel الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ... NaN
1 Amira Charfeddine Wild Fadhila 02 2019-01-01 novel في التزغريت، والعياط و الزمامر، ليوم نتيجة الب... NaN
2 253826 1515368_7636953 2010-12-28 /forums/forums/91/ هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا... https://www.tunisia-sat.com/forums/threads/151...
3 250442 1504416_7580403 2010-12-21 /forums/sports/ \\n\\n\\n\\n\\n\\nاعلنت الجامعة التونسية لكرة اليد ا... https://www.tunisia-sat.com/forums/threads/150...
4 312628 1504416_7580433 2010-12-21 /forums/sports/ quel est le résultat final\\n,,,,???? https://www.tunisia-sat.com/forums/threads/150...

The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).

I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
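That part isn't shown here, but it roughly does something like the following (the file name and the cleaning step below are just placeholders, not my actual pipeline):

import pandas as pd

# placeholder example: the real code loops over many .txt and .json corpus files
rows = [
    {'Author': 'Amira Charfeddine', 'Title': 'Wild Fadhila 01', 'Date': '2019-01-01',
     'Category': 'novel',
     'Text': open('wild_fadhila_01.txt', encoding='utf-8').read(),  # placeholder file name
     'url': None},
]
df = pd.DataFrame(rows)
df['Text'] = df['Text'].str.strip()  # stand-in for the actual text cleaning
df.to_pickle('1_raw_df.pkl')         # this is the pickle file loaded in the code below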

I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). The code references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as its values (like {'color': ['color', 'colour']}, except not in English).
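To illustrate the measure with made-up numbers (not from my data): if one variant of a lemma appears 90 times and the other 10 times, the proportions are 0.9 and 0.1, so the Gini value is 1 - (0.9**2 + 0.1**2) = 0.18; if both variants appear 50 times each, it is 1 - (0.5**2 + 0.5**2) = 0.5, the maximum possible with two variants.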

This code works, but it uses a lot of memory. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).

Is there a way to do this so that it's less memory intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).

Here's the code:

import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
    }

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
    spelling_df[w] = gini

I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?

gini_lst = []
for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
    #spelling_df[w] = gini  # having this inside the loop creates a new full-length column for every lemma, which is what eats up your memory/CPU allowance
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
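If what you ultimately need is one Gini value per lemma for the whole corpus (rather than one per row of text), you could cut memory further by collapsing each variant's counts to a single number with .sum() before computing the Gini, so you never keep a 900K-element Series per variant around. A rough sketch of that idea, reusing your compute_gini (count_word_total is just a renamed variant of your count_word, and it assumes every lemma occurs at least once so the denominator isn't zero):

import re

def count_word_total(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern).sum()  # one scalar per variant instead of a 900K-row Series

gini_lst = []
for w, var in spelling_var.items():
    totals = [count_word_total(spelling_df, v) for v in var]  # small list of numbers
    gini_lst.append(compute_gini(totals))

df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})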
