
Are numpy string arrays faster than python strings

I am creating a string that is about 30 million words long. As you can imagine, this takes absolutely forever to create with a for-loop that grows the string by about 100 words at a time. Is there a way to represent the string in a more memory-friendly way, like a numpy array? I have very little experience with numpy.

bigStr = ''
for tweet in df['text']:
  bigStr = bigStr + ' ' + tweet
len(bigStr)

It looks like you're trying to get the total length of all the data. For that you don't need to concatenate the strings at all. (Note that you also add a whitespace for each element.)

Just get the length of each tweet and add it to an integer counter (+1 for each whitespace):

total_length = 0
for tweet in df['text']:
  total_length += 1 + len(tweet)

print(total_length)
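The loop above can also be collapsed into a single sum() over a generator, which avoids building any intermediate string. A minimal sketch, using a small hardcoded list as a stand-in for df['text']:

```python
# Stand-in for df['text']: any iterable of strings works the same way.
tweets = ["hello world", "numpy is fast", "python strings"]

# Total character count plus one leading space per tweet,
# computed without ever materializing the big string.
total_length = sum(len(tweet) + 1 for tweet in tweets)
print(total_length)  # 11+1 + 13+1 + 14+1 = 41
```

This runs in O(n) time and O(1) extra memory, since the generator never stores more than one length at a time.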

If you do need to build the string, use ' '.join, which creates the final string in O(n) time, rather than concatenating one piece at a time, which takes O(n^2) time.

bigStr = ' '.join(df['text'])
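To see that join produces the same result as the original loop, here is a small self-contained comparison, again with a hardcoded list standing in for df['text'] (note the original loop leaves a leading space, which we reproduce explicitly):

```python
tweets = ["hello world", "numpy is fast", "python strings"]

# O(n^2): every += copies the entire string built so far.
slow = ''
for tweet in tweets:
  slow = slow + ' ' + tweet

# O(n): join measures all pieces first, then allocates the
# final string exactly once. The leading ' ' matches the loop.
fast = ' ' + ' '.join(tweets)

print(slow == fast)  # True
```

For 30 million words the difference is dramatic: the loop copies an ever-growing string on each iteration, while join does a single allocation.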
