
Optimizing string manipulations in pandas

I have a dataset of 10M records. The first step is to clean the text and cap each record at 400 words, if it runs longer than that. Can this be made faster in plain pandas/NumPy, without using numba, dask, or other multiprocessing libraries?

import numpy as np
from cleantext import clean

def func_vect(val):
    # Normalize the text (strip line breaks, URLs, emails; lowercase), then split into words.
    temp = clean(val, no_line_breaks=True, no_urls=True, no_emails=True, lower=True).split()

    if len(temp) <= 400:
        # Keep only words of at most 15 characters.
        return " ".join(u for u in temp if len(u) <= 15)
    else:
        # Over the cap: keep the first 175 and last 175 words, with the same length filter.
        return " ".join(u for u in temp[:175] + temp[-175:] if len(u) <= 15)

ufunc_vec = np.vectorize(func_vect, otypes=[str])
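
For reference, the wrapper would be applied to the column like this (a minimal sketch using the func_vect/ufunc_vec definitions above and a hypothetical column named string). Note that np.vectorize is documented as essentially a Python-level loop, so it provides convenience rather than a speedup:

import pandas as pd

# Hypothetical sample; the real frame has ~10M rows.
df = pd.DataFrame({'string': ['Visit http://example.com for more INFO', 'short entry']})

# Apply the cleaning function element-wise (still a Python loop under the hood).
df['string_clean'] = ufunc_vec(df['string'].to_numpy())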

This might work, though note that .str[:400] truncates to the first 400 characters, not the first 400 words:

df['truncated_string'] = df['string'].str[:400]
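
If the 400 cap is meant in words rather than characters, a word-based version can stay vectorized with pandas string methods (a minimal sketch on hypothetical data; it skips the cleantext normalization and the 15-character word filter from the question):

import pandas as pd

df = pd.DataFrame({'string': ['alpha beta gamma delta', 'two words']})  # hypothetical sample

# Split each row into words, keep at most the first 400, and rejoin with spaces.
df['truncated_string'] = df['string'].str.split().str[:400].str.join(' ')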
