I have a dataset of more than 300k files that I need to read and append to a list of DataFrames.
import os
import pandas as pd

corpus_path = "data"
article_paths = [os.path.join(corpus_path, p) for p in os.listdir(corpus_path)]
doc = []
for path in article_paths:
    # on_bad_lines='skip' replaces the deprecated error_bad_lines=False (removed in pandas 2.0)
    dp = pd.read_table(path, header=None, encoding='utf-8', quoting=3, on_bad_lines='skip')
    doc.append(dp)
Is there a faster way to do this, as the current method takes more than an hour.
You can use the multiprocessing module to read the files in parallel across several worker processes:

from multiprocessing import Pool

def readFile(path):
    # on_bad_lines='skip' replaces the deprecated error_bad_lines=False (removed in pandas 2.0)
    return pd.read_table(path, header=None, encoding='utf-8', quoting=3, on_bad_lines='skip')

if __name__ == '__main__':
    nprocs = os.cpu_count()  # number of worker processes
    with Pool(processes=nprocs) as pool:
        result = list(pool.imap(readFile, article_paths))
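Once the per-file frames are back, you will usually want a single DataFrame rather than a list of 300k small ones; `pd.concat` does that in one call. A minimal sketch (the tiny frames and the `text` column name here are hypothetical stand-ins for the per-file results):

```python
import pandas as pd

# Hypothetical small frames standing in for the per-file read results
frames = [pd.DataFrame({"text": ["a", "b"]}), pd.DataFrame({"text": ["c"]})]

# Stack all frames vertically; ignore_index renumbers rows 0..n-1
combined = pd.concat(frames, ignore_index=True)
```

Calling `pd.concat` once on the whole list is much faster than appending to a DataFrame inside the loop, which reallocates on every iteration.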