
Large dataset on Jupyter Notebook

I am trying to extract sentiment for a very large dataset of more than 606912 instances on Jupyter Notebook, but it takes several days and gets interrupted. This is my code:

from camel_tools.sentiment import SentimentAnalyzer
import pandas as pd

# load the pretrained Arabic sentiment model
sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

full_text = dataset['clean_text'].tolist()
sentiments = []
iter_len = len(full_text)
for e in range(iter_len):
    print("Iterate through list:", full_text[e])
    # score one text at a time and store the result
    s = sa.predict(full_text[e])
    sentiments.insert(e, s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = pd.DataFrame(sentiments)

Can someone help me solve this issue or speed up the operation?

It is not very efficient to process one big source dataset in a single Python instance. My recommendations are:

Version 1 - use your own parallelization

  • split the big source dataset into smaller parts
  • run the same code in more instances (processes) to increase parallelism, with each instance focused on a smaller part of the original dataset
  • run this code directly from the command line (see the sketch after this list)
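
A minimal sketch of Version 1, assuming the cleaned texts are stored in a CSV file named dataset.csv and the chunk index is passed as a command-line argument; the script name run_chunk.py and the chunk size are illustrative, not part of the original answer:

# run_chunk.py - process one slice of the dataset; start several of these in parallel,
# e.g.  python run_chunk.py 0 & python run_chunk.py 1 & python run_chunk.py 2 &
import sys
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

CHUNK_SIZE = 50000                      # illustrative chunk size
chunk_id = int(sys.argv[1])             # which slice this process handles

dataset = pd.read_csv("dataset.csv")    # assumed file name
start = chunk_id * CHUNK_SIZE
end = start + CHUNK_SIZE
texts = dataset["clean_text"].iloc[start:end].tolist()

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
sentiments = sa.predict(texts)          # predict() accepts a list of sentences

out = dataset.iloc[start:end].copy()
out["sentiments"] = sentiments
out.to_csv(f"sentiments_chunk_{chunk_id}.csv", index=False)

The per-chunk CSV files can then be concatenated back into one result once all processes have finished.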

Version 2 - use an existing solution for parallelization
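
The answer does not name a specific framework, so as one possible existing solution this sketch uses Python's built-in multiprocessing.Pool; the file name dataset.csv, the chunk size, and the helper names are illustrative assumptions:

# parallel_sentiment.py - sketch using multiprocessing.Pool as the existing solution
import pandas as pd
from multiprocessing import Pool
from camel_tools.sentiment import SentimentAnalyzer

sa = None  # one model instance per worker process

def init_worker():
    # load the model once per worker instead of once per text
    global sa
    sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

def score_chunk(texts):
    # predict() accepts a list of sentences and returns a list of labels
    return sa.predict(texts)

if __name__ == "__main__":
    dataset = pd.read_csv("dataset.csv")          # assumed file name
    texts = dataset["clean_text"].tolist()
    chunks = [texts[i:i + 10000] for i in range(0, len(texts), 10000)]

    # note: each worker loads its own copy of the model, which costs memory
    with Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(score_chunk, chunks)

    dataset["sentiments"] = [label for chunk in results for label in chunk]
    dataset.to_csv("dataset_with_sentiments.csv", index=False)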
