
How to split a set of strings into substrings in Python, making shorter substrings more likely?

I have a set of strings which are some millions of characters each. I want to split them into substrings of random length, and this I can do with no particular issue.

However, my question is: how can I apply some sort of weight to the choice of substring length? My code runs in Python 3, so I would like to find a Pythonic solution. In detail, my aim is to:

  • split the strings into substrings that range in length between 1e04 and 8e06 characters.
  • make the script choose a short length (near 1e04) more often than a long length (near 8e06) for the newly generated substrings, like a descending likelihood gradient over length.

Thanks for the help!

NumPy supplies lots of random sampling functions. Have a look through the various distributions available.

If you're looking for something that is weighted towards the lower end of the scale, maybe the exponential distribution would work?

With matplotlib you can plot the histogram of the values, so you can get a better idea if the distribution fits what you want.

So something like this:

import numpy as np
import matplotlib.pyplot as plt

# desired range of values
mn = 1e04
mx = 8e06

# random values following exp distribution
values = np.random.exponential(scale=1, size=2000)

# scale the values to the desired range
values = ((mx-mn)*values/np.max(values)) + mn

# plot the distribution of values
plt.hist(values)
plt.grid()
plt.show()
plt.close()
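One thing to note about the snippet above: dividing by np.max(values) means the largest draw always maps to exactly mx, so the result depends on the particular sample. A possible alternative, sketched below, is to sample from an exponential distribution truncated to the desired range via inverse-transform sampling and return integer lengths directly. The function name sample_lengths and the scale parameter (here 1e06) are assumptions made for illustration, not part of the original answer; a smaller scale favours short lengths more strongly.

```python
import numpy as np

def sample_lengths(n, mn=10_000, mx=8_000_000, scale=1_000_000, seed=None):
    """Draw n integer lengths in [mn, mx], with short lengths more likely.

    Hypothetical helper: inverse-transform sampling of an exponential
    distribution truncated to the interval [0, mx - mn].
    """
    rng = np.random.default_rng(seed)
    width = mx - mn
    u = rng.random(n)  # uniform values in [0, 1)
    # Inverse CDF of the truncated exponential; the result lies in [0, width)
    x = -scale * np.log1p(-u * (1.0 - np.exp(-width / scale)))
    lengths = (mn + np.floor(x)).astype(np.int64)
    # Clip as a guard against floating-point edge cases
    return np.clip(lengths, mn, mx)

lengths = sample_lengths(2000, seed=0)
print(lengths.min(), lengths.max())
```

The histogram call from the snippet above (plt.hist) can be applied to lengths in the same way to check whether the shape fits.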

There are probably many ways to do this. I would do it as follows:

  1. Take a random number rand in the interval [0, 1):
     import random
     rand = random.random()
  2. Use an operation on that number that makes smaller numbers more likely but keeps the result in the range [0, 1). Which operation you use depends on what you want your likelihood distribution to look like. A simple choice would be the square.
     rand = rand**2
  3. Scale the number space [0, 1) up to [1e04, 8e06] and round to the nearest integer:
     subStringLen = round(rand*(8e06-1e04)+1e04)
  4. Get the substring of length subStringLen from your string and check how many characters are left.
    • If there are more than 8e06 characters left, go to step 1.
    • If there are between 1e04 and 8e06, use them as your last substring.
    • If there are fewer than 1e04, you need to decide whether to throw the rest away or allow substrings smaller than 1e04 in this special case.
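The steps above could be put together roughly as follows. This is a minimal sketch, not a tuned implementation: the names random_length and split_weighted are made up for illustration, and it assumes that a leftover shorter than 1e04 is kept as the last substring rather than discarded.

```python
import random

MIN_LEN = 10_000      # 1e04
MAX_LEN = 8_000_000   # 8e06

def random_length(rng, min_len=MIN_LEN, max_len=MAX_LEN):
    """Draw an integer length in [min_len, max_len], biased toward min_len."""
    r = rng.random() ** 2  # squaring pushes values in [0, 1) toward 0
    return round(r * (max_len - min_len) + min_len)

def split_weighted(text, rng=None, min_len=MIN_LEN, max_len=MAX_LEN):
    """Split text into consecutive substrings of weighted random length."""
    if rng is None:
        rng = random.Random()
    pieces = []
    pos = 0
    # Keep drawing lengths while more than max_len characters remain
    while len(text) - pos > max_len:
        n = random_length(rng, min_len, max_len)
        pieces.append(text[pos:pos + n])
        pos += n
    # Assumption: the remainder is kept even if it is shorter than min_len
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces

# Small bounds here only to keep the demonstration fast
demo = split_weighted("ACGT" * 250, rng=random.Random(0), min_len=10, max_len=80)
print([len(p) for p in demo])
```

Replacing the square with a higher power (for example rand**3) would skew the lengths further toward the short end.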

I'm sure many improvements are possible in terms of efficiency; this is just to give you an idea of my method.


 