How to split a set of strings into substrings in Python, making shorter substrings more likely?

I have a set of strings, each some millions of characters long. I want to split them into substrings of random length, which I can do without any particular issue.

However, my question is: how can I apply some sort of weight to the choice of substring length? My code runs in Python 3, so I would like a Pythonic solution. In detail, my aim is to:

  • split the strings into substrings ranging in length from 1e04 to 8e06 characters.
  • make the script choose a short length (near 1e04) more often than a long one (near 8e06) for the newly generated substrings, like a descending likelihood gradient over length.

Thanks for the help!

NumPy supplies lots of random sampling functions. Have a look through the various distributions available.

If you're looking for something that is weighted towards the lower end of the scale, maybe the exponential distribution would work?

With matplotlib you can plot a histogram of the values, which gives you a better idea of whether the distribution fits what you want.

So something like this:

import numpy as np
import matplotlib.pyplot as plt

# desired range of values
mn = 1e04
mx = 8e06

# random values following exp distribution
values = np.random.exponential(scale=1, size=2000)

# scale the values to the desired range
values = ((mx-mn)*values/np.max(values)) + mn

# plot the distribution of values
plt.hist(values)
plt.grid()
plt.show()
plt.close()
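
Note that the snippet above rescales by the sample maximum, so the largest draw always lands exactly on mx and the shape of the result depends on the particular sample. As a rough alternative sketch (not part of the original answer), an exponential distribution truncated to the desired range can be sampled directly with the inverse CDF, one integer length at a time. The function name `truncated_exp_length` and the `mean` parameter are hypothetical choices for illustration; a smaller `mean` favours short lengths more strongly.

```python
import math
import random


def truncated_exp_length(mn=10_000, mx=8_000_000, mean=1_000_000, rng=random):
    """Draw an integer length in [mn, mx]; short lengths are more likely.

    Uses inverse-CDF sampling of an exponential distribution truncated
    to [0, mx - mn], then shifts the result by mn. `mean` is the mean of
    the untruncated exponential (an assumed tuning knob).
    """
    lam = 1.0 / mean
    span = mx - mn
    u = rng.random()  # uniform in [0, 1)
    # Inverse CDF of the truncated exponential; x falls in [0, span).
    x = -math.log(1.0 - u * (1.0 - math.exp(-lam * span))) / lam
    return mn + min(span, int(x))


if __name__ == "__main__":
    random.seed(0)
    lengths = [truncated_exp_length() for _ in range(5)]
    print(lengths)
```

Because every value is drawn independently, lengths can be generated on demand while walking through a string instead of being produced as one batch up front.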

There are probably many ways to do this. I would do it as follows:

  1. Take a random number rand in the interval [0,1) :
     import random
     rand = random.random()
  2. Apply an operation to that number that makes smaller numbers more likely but keeps the result in the range [0,1] . Which operation you use depends on what you want your likelihood distribution to look like. A simple choice would be squaring.
     rand = rand**2 
  3. Scale the interval [0,1] up to [1e04, 8e06] and round to the nearest integer:
     subStringLen = round(rand*(8e06-1e04)+1e04) 
  4. Get the substring of length subStringLen from your string and check how many characters are left.
    • If there are more than 8e06 characters left, go to step 1.
    • If there are between 1e04 and 8e06 characters left, use them as your last substring.
    • If there are fewer than 1e04 characters left, you need to decide whether to throw the rest away or to allow a substring shorter than 1e04 in this special case.
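
Put together, the steps above might look like the following minimal sketch. The function name `split_weighted` and the `keep_short_tail` flag are hypothetical; the flag stands for the decision in the last bullet. One assumption beyond the steps as written: the remaining length is checked before each cut rather than after, so a string that is already no longer than the maximum is returned as a single piece.

```python
import random


def split_weighted(text, mn=10_000, mx=8_000_000, keep_short_tail=True, rng=random):
    """Split text into pieces of random length, favouring short pieces.

    Every piece except possibly the last has a length in [mn, mx].
    """
    pieces = []
    pos = 0
    # Keep cutting while more than mx characters remain.
    while len(text) - pos > mx:
        r = rng.random() ** 2               # squaring skews towards 0
        length = round(r * (mx - mn) + mn)  # scale [0, 1) into [mn, mx]
        pieces.append(text[pos:pos + length])
        pos += length
    tail = text[pos:]
    # The remainder becomes the last piece; a remainder shorter than mn
    # is kept only if keep_short_tail is set.
    if len(tail) >= mn or (tail and keep_short_tail):
        pieces.append(tail)
    return pieces


if __name__ == "__main__":
    random.seed(0)
    # Small bounds so the demo runs quickly.
    demo = split_weighted("x" * 100_000, mn=10, mx=500)
    print(len(demo), "pieces, shortest", min(map(len, demo)), "longest", max(map(len, demo)))
```

Replacing the exponent 2 with a larger value pushes the distribution further towards short pieces; any other function that maps [0,1] onto [0,1] can be swapped in the same place.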

I'm sure there are a lot of improvements possible in terms of efficiency; this is just to give you an idea of my method.
