
Large dataset on Jupyter Notebook

I am trying to extract sentiment for a very large dataset of more than 606912 instances on Jupyter Notebook, but it takes several days and gets interrupted. This is my code:

from camel_tools.sentiment import SentimentAnalyzer
import pandas as pd

# load the pretrained Arabic sentiment model
sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

full_text = dataset['clean_text'].tolist()
sentiments = []
iter_len = len(full_text)
for e in range(iter_len):
    print("Iterate through list:", full_text[e])
    # score one text at a time and store the result
    s = sa.predict(full_text[e])
    sentiments.insert(e, s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = pd.DataFrame(sentiments)

Can someone help me solve this issue or speed up the operation?

It is not very efficient to process one big source dataset in a single Python instance. My recommendations are:

Version 1 - use your own parallelization

  • split the big source dataset into smaller parts
  • run the same code in more instances (processes) to increase parallelism, with each instance focused on a smaller part of the original dataset
  • run this code directly from the command line (see the sketch after this list)
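
A minimal sketch of Version 1, assuming the cleaned texts are stored in a CSV file named dataset.csv and the chunk index is passed as a command-line argument; the script name run_chunk.py and the chunk size are illustrative, not part of the original answer:

# run_chunk.py - process one slice of the dataset; start several of these in parallel,
# e.g.  python run_chunk.py 0 & python run_chunk.py 1 & python run_chunk.py 2 &
import sys
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

CHUNK_SIZE = 50000                      # illustrative chunk size
chunk_id = int(sys.argv[1])             # which slice this process handles

dataset = pd.read_csv("dataset.csv")    # assumed file name
start = chunk_id * CHUNK_SIZE
end = start + CHUNK_SIZE
texts = dataset["clean_text"].iloc[start:end].tolist()

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
sentiments = sa.predict(texts)          # predict() accepts a list of sentences

out = dataset.iloc[start:end].copy()
out["sentiments"] = sentiments
out.to_csv(f"sentiments_chunk_{chunk_id}.csv", index=False)

The per-chunk CSV files can then be concatenated back into one result once all processes have finished.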

Version 2 - use an existing solution for parallelization
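
The answer does not name a specific framework, so as one possible existing solution this sketch uses Python's built-in multiprocessing.Pool; the file name dataset.csv, the chunk size, and the helper names are illustrative assumptions:

# parallel_sentiment.py - sketch using multiprocessing.Pool as the existing solution
import pandas as pd
from multiprocessing import Pool
from camel_tools.sentiment import SentimentAnalyzer

sa = None  # one model instance per worker process

def init_worker():
    # load the model once per worker instead of once per text
    global sa
    sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

def score_chunk(texts):
    # predict() accepts a list of sentences and returns a list of labels
    return sa.predict(texts)

if __name__ == "__main__":
    dataset = pd.read_csv("dataset.csv")          # assumed file name
    texts = dataset["clean_text"].tolist()
    chunks = [texts[i:i + 10000] for i in range(0, len(texts), 10000)]

    # note: each worker loads its own copy of the model, which costs memory
    with Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(score_chunk, chunks)

    dataset["sentiments"] = [label for chunk in results for label in chunk]
    dataset.to_csv("dataset_with_sentiments.csv", index=False)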
