
Applying function to pandas dataframe: is there a more memory-efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:

Author Title Date Category Text url
0 Amira Charfeddine Wild Fadhila 01 2019-01-01 novel الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ... NaN
1 Amira Charfeddine Wild Fadhila 02 2019-01-01 novel في التزغريت، والعياط و الزمامر، ليوم نتيجة الب... NaN
2 253826 1515368_7636953 2010-12-28 /forums/forums/91/ هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا... https://www.tunisia-sat.com/forums/threads/151...
3 250442 1504416_7580403 2010-12-21 /forums/sports/ \\n\\n\\n\\n\\n\\nاعلنت الجامعة التونسية لكرة اليد ا... https://www.tunisia-sat.com/forums/threads/150...
4 312628 1504416_7580433 2010-12-21 /forums/sports/ quel est le résultat final\\n,,,,???? https://www.tunisia-sat.com/forums/threads/150...

The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).

I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
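That part isn't shown here, but it roughly does something like the following (the file name and the cleaning step below are just placeholders, not my actual pipeline):

import pandas as pd

# placeholder example: the real code loops over many .txt and .json corpus files
rows = [
    {'Author': 'Amira Charfeddine', 'Title': 'Wild Fadhila 01', 'Date': '2019-01-01',
     'Category': 'novel',
     'Text': open('wild_fadhila_01.txt', encoding='utf-8').read(),  # placeholder file name
     'url': None},
]
df = pd.DataFrame(rows)
df['Text'] = df['Text'].str.strip()  # stand-in for the actual text cleaning
df.to_pickle('1_raw_df.pkl')         # this is the pickle file loaded in the code below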

I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). The code references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as its values (like {'color': ['color', 'colour']}, except not in English).
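To illustrate the measure with made-up numbers (not from my data): if one variant of a lemma appears 90 times and the other 10 times, the proportions are 0.9 and 0.1, so the Gini value is 1 - (0.9**2 + 0.1**2) = 0.18; if both variants appear 50 times each, it is 1 - (0.5**2 + 0.5**2) = 0.5, the maximum possible with two variants.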

This code works, but it uses a lot of memory. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).

Is there a way to do this so that it's less memory intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).

Here's the code:

import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
    }

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
    spelling_df[w] = gini

I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?

gini_lst = []
for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
    #spelling_df[w] = gini  # having this inside the loop creates a new full-length column for every lemma, which is what eats up your memory/CPU allowance
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
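If what you ultimately need is one Gini value per lemma for the whole corpus (rather than one per row of text), you could cut memory further by collapsing each variant's counts to a single number with .sum() before computing the Gini, so you never keep a 900K-element Series per variant around. A rough sketch of that idea, reusing your compute_gini (count_word_total is just a renamed variant of your count_word, and it assumes every lemma occurs at least once so the denominator isn't zero):

import re

def count_word_total(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern).sum()  # one scalar per variant instead of a 900K-row Series

gini_lst = []
for w, var in spelling_var.items():
    totals = [count_word_total(spelling_df, v) for v in var]  # small list of numbers
    gini_lst.append(compute_gini(totals))

df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})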
