
Are numpy string arrays faster than python strings

I am creating a string that is about 30 million words long. As you can imagine, this takes absolutely forever to create with a for-loop that grows the string by about 100 words at a time. Is there a way to represent the string in a more memory-friendly way, like a numpy array? I have very little experience with numpy.

bigStr = ''
for tweet in df['text']:
  bigStr = bigStr + ' ' + tweet
len(bigStr)

It looks like you're trying to get the total length of all the data. For that you don't need to concatenate the strings at all. (Note that you also add a whitespace for each element.)

Just get the length of each tweet and add it to an integer counter (+1 for each whitespace):

total_length = 0
for tweet in df['text']:
  total_length += 1 + len(tweet)

print(total_length)
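The loop above can also be collapsed into a single sum() over a generator, which avoids building any intermediate string. A minimal sketch, using a small hardcoded list as a stand-in for df['text']:

```python
# Stand-in for df['text']: any iterable of strings works the same way.
tweets = ["hello world", "numpy is fast", "python strings"]

# Total character count plus one leading space per tweet,
# computed without ever materializing the big string.
total_length = sum(len(tweet) + 1 for tweet in tweets)
print(total_length)  # 11+1 + 13+1 + 14+1 = 41
```

This runs in O(n) time and O(1) extra memory, since the generator never stores more than one length at a time.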

If you do need to build the string, use ' '.join, which creates the final string in O(n) time, rather than concatenating one piece at a time, which takes O(n^2) time.

bigStr = ' '.join(df['text'])
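To see that join produces the same result as the original loop, here is a small self-contained comparison, again with a hardcoded list standing in for df['text'] (note the original loop leaves a leading space, which we reproduce explicitly):

```python
tweets = ["hello world", "numpy is fast", "python strings"]

# O(n^2): every += copies the entire string built so far.
slow = ''
for tweet in tweets:
  slow = slow + ' ' + tweet

# O(n): join measures all pieces first, then allocates the
# final string exactly once. The leading ' ' matches the loop.
fast = ' ' + ' '.join(tweets)

print(slow == fast)  # True
```

For 30 million words the difference is dramatic: the loop copies an ever-growing string on each iteration, while join does a single allocation.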
