
Optimizing string manipulations in pandas

I have a dataset of 10M records. The first step is to clean the text and cap each record at 400 words, if it runs longer than that. Can this be made faster in plain pandas/NumPy, without using numba, dask, or other multiprocessing libraries?

import numpy as np
from cleantext import clean

def func_vect(val):
    # Normalize the text (strip line breaks, URLs, emails; lowercase), then split into words.
    temp = clean(val, no_line_breaks=True, no_urls=True, no_emails=True, lower=True).split()

    if len(temp) <= 400:
        # Keep only words of at most 15 characters.
        return " ".join(u for u in temp if len(u) <= 15)
    else:
        # Over the cap: keep the first 175 and last 175 words, with the same length filter.
        return " ".join(u for u in temp[:175] + temp[-175:] if len(u) <= 15)

ufunc_vec = np.vectorize(func_vect, otypes=[str])
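
For reference, the wrapper would be applied to the column like this (a minimal sketch using the func_vect/ufunc_vec definitions above and a hypothetical column named string). Note that np.vectorize is documented as essentially a Python-level loop, so it provides convenience rather than a speedup:

import pandas as pd

# Hypothetical sample; the real frame has ~10M rows.
df = pd.DataFrame({'string': ['Visit http://example.com for more INFO', 'short entry']})

# Apply the cleaning function element-wise (still a Python loop under the hood).
df['string_clean'] = ufunc_vec(df['string'].to_numpy())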

This might work, though note that .str[:400] truncates to the first 400 characters, not the first 400 words:

df['truncated_string'] = df['string'].str[:400]
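
If the 400 cap is meant in words rather than characters, a word-based version can stay vectorized with pandas string methods (a minimal sketch on hypothetical data; it skips the cleantext normalization and the 15-character word filter from the question):

import pandas as pd

df = pd.DataFrame({'string': ['alpha beta gamma delta', 'two words']})  # hypothetical sample

# Split each row into words, keep at most the first 400, and rejoin with spaces.
df['truncated_string'] = df['string'].str.split().str[:400].str.join(' ')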
