How to split a set of strings into substrings in Python, making shorter substrings more likely?
I have a set of strings, each a few million characters long. I want to split them into substrings of random length, which I can do without any particular issue.
However, my question is: how can I apply some sort of weight to the substring length choice? My code runs in python3, so I would like to find a Pythonic solution. In detail, my aim is to:
Thanks for the help!
NumPy supplies lots of random sampling functions. Have a look through the various distributions available.
If you're looking for something that is weighted towards the lower end of the scale, maybe the exponential distribution would work?
With matplotlib you can plot a histogram of the values, so you can get a better idea of whether the distribution fits what you want.
So something like this:
import numpy as np
import matplotlib.pyplot as plt
# desired range of values
mn = 1e04
mx = 8e06
# random values following exp distribution
values = np.random.exponential(scale=1, size=2000)
# scale the values to the desired range
values = ((mx-mn)*values/np.max(values)) + mn
# plot the distribution of values
plt.hist(values)
plt.grid()
plt.show()
plt.close()
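To turn such draws into actual substring lengths, one option is to shift each draw by the minimum and cap it at the maximum instead of rescaling by the sample maximum. The sketch below is only an illustration of that idea: the helper name exponential_length and the mean_extra parameter are made up for this example, and it uses the standard library's random.expovariate, which samples the same exponential distribution (its argument is the rate, i.e. 1 / scale).

```python
import random

def exponential_length(mn, mx, mean_extra, rng=random):
    """Return an integer length in [mn, mx], biased toward mn.

    mean_extra is a hypothetical tuning knob: the average amount
    added on top of mn before the cap at mx is applied.
    """
    # expovariate expects the rate (1 / mean), not the mean itself
    extra = rng.expovariate(1.0 / mean_extra)
    # shift by the minimum, then cap at the maximum
    return min(mx, mn + int(extra))

# seeded generator so the run is repeatable
rng = random.Random(0)
lengths = [exponential_length(10_000, 8_000_000, 500_000, rng)
           for _ in range(2000)]
```

A smaller mean_extra pushes more lengths toward the minimum. Capping at mx piles the rare large draws onto exactly mx, which may or may not matter for your use case.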
There are probably many ways to do this. I would do it as follows:
1. Take a random number rand in the interval [0,1]:
   import random
   rand = random.random()
2. Apply some operation to that number that makes smaller values more likely but keeps the result in the range [0,1]. Which operation you use depends on how you want your likelihood distribution to look. For example:
   rand = rand**2
3. Scale the range [0,1] up to [1e04, 8e06] and round to the nearest integer:
   subStringLen = round(rand*(8e06-1e04)+1e04)
4. Take a substring of length subStringLen from your string and check how many characters are left.
   - If more than 8e06 characters are left, go to step 1.
   - If between 1e04 and 8e06 characters are left, use them as your last substring.
   - If fewer than 1e04 characters are left, you need to decide whether to throw the rest away or allow substrings smaller than 1e04 in this special case.
I'm sure there are a lot of improvements possible in terms of efficiency; this is just to give you an idea of my method.
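The steps above can be sketched as a single function. This is a minimal sketch, not a tuned implementation: the name split_weighted and the power and keep_short_tail parameters are assumptions made for this example. It also checks the remaining length before each cut, so a cut never runs past the end of the string.

```python
import random

def split_weighted(text, mn=10_000, mx=8_000_000, power=2,
                   keep_short_tail=True, rng=random):
    """Split text into chunks; shorter chunk lengths are more likely."""
    chunks = []
    pos = 0
    # keep cutting while more than mx characters remain
    while len(text) - pos > mx:
        # steps 1 and 2: uniform draw in [0, 1), skewed toward 0
        r = rng.random() ** power
        # step 3: scale to [mn, mx] and round to an integer
        length = round(r * (mx - mn) + mn)
        # step 4: take the substring and advance
        chunks.append(text[pos:pos + length])
        pos += length
    # the remainder is at most mx characters long
    tail = text[pos:]
    # a tail shorter than mn is kept only if the caller allows it
    if tail and (len(tail) >= mn or keep_short_tail):
        chunks.append(tail)
    return chunks

# small numbers so the example runs quickly
chunks = split_weighted("x" * 10_000, mn=10, mx=80, rng=random.Random(1))
```

Raising power skews the lengths more strongly toward mn. With keep_short_tail=False, a remainder shorter than mn is discarded, which is the other option mentioned in step 4.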