
Resampling (upsampling, interpolating) a series of numbers

I have a comma-separated series of integer values that I'd like to resample so that I have twice as many, with a new value added halfway between each pair of existing values. For example, if this is my source:

1,5,11,9,13,21

the result would be:

1,3,5,8,11,10,9,11,13,17,21

In case that's not clear, I'm trying to add a number between each of the values in my source series, like this:

1   5   11    9    13    21
1 3 5 8 11 10 9 11 13 17 21

I've searched quite a bit and it seems that something like scipy.signal.resample or pandas should work, but I'm completely new to this and I haven't been able to get it working. For example, here's one of my attempts with scipy:

import numpy as np
from scipy import signal
InputFileName = "sample.raw"
DATA250  = np.loadtxt(InputFileName, delimiter=',', dtype=int);
print(DATA250)
DATA500 = signal.resample(DATA250, 11)
print(DATA500)

Which outputs:

[ 1  5 11  9 13 21]
[ 1.         -0.28829461  6.12324489 10.43251996 10.9108191   9.84503237
  8.40293529 10.7641676  18.44182898 21.68506897 12.68267746]

Obviously I'm using signal.resample incorrectly. Is there a way I can do this with signal.resample or pandas? Should I be using some other method?

Also, in my example every pair of adjacent source numbers has an integer halfway between them. In my actual data, that won't be the case. So if two of the numbers are 10 and 15, the new number would be 12.5. However, I'd like all of the resulting numbers to be integers, so the new number that gets inserted would need to be either 12 or 13 (it doesn't matter to me which).

Note that once I get this working, the source file will actually be a comma-separated list of 2,000 numbers and the output should be 4,000 numbers (or technically 3,999, since there won't be one added to the end). Also, this is going to be used to process something similar to an ECG recording. Currently the ECG is sampled at 250 Hz for 8 seconds and then passed to a separate process that analyzes the recording. However, that separate process needs the recording to be sampled at 500 Hz. So the workflow will be that I take a 250 Hz recording every 8 seconds, upsample it to 500 Hz, and then pass the resulting output to the analysis process.

Thanks for any guidance you can provide.

Since the interpolation is simple, you can do it by hand:

import numpy as np
a = np.array([1,5,11,9,13,21])
b = np.zeros(2*len(a)-1, dtype=np.uint32)
b[0::2] = a
b[1::2] = (a[:-1] + a[1:]) // 2
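The same idea can be wrapped in a small function. This is only a sketch: the name upsample_midpoints is made up for illustration, and it assumes at least one input sample. It uses a signed dtype instead of uint32, on the assumption that ECG-like samples may be negative. With floor division, the pair 10, 15 gets 12 inserted between them.

```python
import numpy as np

def upsample_midpoints(values):
    """Insert the floored integer midpoint between each pair of neighbours.

    Hypothetical helper; assumes `values` holds at least one integer.
    """
    a = np.asarray(values, dtype=np.int64)   # signed, so negative samples survive
    out = np.empty(2 * len(a) - 1, dtype=np.int64)
    out[0::2] = a                            # original samples at even indices
    out[1::2] = (a[:-1] + a[1:]) // 2        # floored midpoints at odd indices
    return out

print(upsample_midpoints([1, 5, 11, 9, 13, 21]))
print(upsample_midpoints([10, 15]))
```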

You can also use scipy.signal.resample this way:

import numpy as np
from scipy import signal
a = np.array([1,5,11,9,13,21])
b = signal.resample(a, len(a) * 2)
b_int = b.astype(int)

The trick is to ask for exactly twice the number of elements, so that every other output point matches one of your initial points. Also, I think that the Fourier interpolation done by scipy.signal.resample is better for your ECG signal than the linear interpolation you're asking for.

Since you suggested a pandas solution, here is one possibility:

import pandas as pd
import numpy as np

l = [1,4,11,9,14,21]
n = len(l)

df = pd.DataFrame(l, columns = ["l"]).reindex(np.linspace(0, n-1, 2*n-1)).interpolate().astype(int)

print(df)

It feels unnecessarily complicated, though. I've added the pandas tag so that people more familiar with pandas functionality will see it.

Although I would probably just use NumPy here, much as in J. Martinot-Lagarde's answer, you don't actually have to.


First, you can read a single row of comma-separated numbers with just the csv module:

import csv

with open(path) as f:
    # next() reads the first (and only) row while the file is still open
    numbers = map(int, next(csv.reader(f)))

… or just string operations:

with open(path) as f:
    numbers = map(int, next(f).split(','))

And then you can interpolate that easily:

def interpolate(numbers):
    last = None
    for number in numbers:
        if last is not None:
            yield (last+number)//2
        yield number
        last=number

If you want it to be fully general and reusable, just take a function argument and yield function(last, number), and replace None with sentinel = object() so that None can itself be a legitimate value in the input.
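A minimal sketch of that generalization might look like the following. The names interpolate_with, floor_mid and ceil_mid are illustrative only and not part of any library.

```python
_SENTINEL = object()  # unique marker meaning "no previous value seen yet"

def interpolate_with(function, values):
    """Yield each value, preceded by function(previous, value) when there is a previous one."""
    last = _SENTINEL
    for value in values:
        if last is not _SENTINEL:
            yield function(last, value)
        yield value
        last = value

def floor_mid(a, b):
    return (a + b) // 2        # 10, 15 -> 12

def ceil_mid(a, b):
    return (a + b + 1) // 2    # 10, 15 -> 13

print(list(interpolate_with(floor_mid, [1, 5, 11, 9, 13, 21])))
print(list(interpolate_with(ceil_mid, [10, 15])))
```

Because the sentinel is a private object rather than None, the generator still works if the input itself contains None.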


And now, all you need to do is join the results and write them:

with open(outpath, 'w') as f:
    f.write(','.join(map(str, interpolate(numbers))))

Are there any advantages to this solution? Well, other than the read/split and join/write, it's purely lazy. And we can write lazy split and join functions pretty easily (or just do it manually). So if you ever had to deal with a billion comma-separated numbers instead of a thousand, that's all you'd have to change.

Here's a lazy split:

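def isplit(s, sep):
    # Lazily yield the pieces of s between occurrences of sep
    start = 0
    while True:
        nextpos = s.find(sep, start)
        if nextpos == -1:
            yield s[start:]
            return
        yield s[start:nextpos]
        # skip the whole separator, not just one character
        start = nextpos + len(sep)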
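As a quick check of how such a generator behaves, here is a small self-contained sketch. The function is restated so the snippet runs on its own, and it advances by len(sep), which is an assumption made here so that multi-character separators also work.

```python
def isplit(s, sep):
    """Lazily yield the pieces of s between occurrences of sep."""
    start = 0
    while True:
        nextpos = s.find(sep, start)
        if nextpos == -1:
            yield s[start:]
            return
        yield s[start:nextpos]
        start = nextpos + len(sep)

pieces = isplit("1,5,11,9", ",")
first = next(pieces)       # only the first field has been scanned so far
rest = list(pieces)        # the remaining fields are produced on demand
print(first, rest)

# The same code works on bytes, and int() accepts ASCII digit bytes directly.
print([int(x) for x in isplit(b"13,21", b",")])
```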

And you can use an mmap as a lazily-read string (well, bytes, but our data are pure ASCII, so that's fine):

import mmap

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # numbers is lazy, so consume it before the mmap is closed
        numbers = map(int, isplit(mm, b','))

And let's use a different solution for lazy writing, just for variety:

def icsvwrite(f, seq, sep=','):
    seq = iter(seq)  # accept any iterable, not only iterators
    first = next(seq, None)
    if first is None:  # empty input: write nothing
        return
    f.write(first)
    for value in seq:
        f.write(sep)
        f.write(value)

So, putting it all together:

with open(inpath, 'rb') as inf, open(outpath, 'w') as outf:
    with mmap.mmap(inf.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        numbers = map(int, isplit(mm, b','))
        icsvwrite(outf, map(str, interpolate(numbers)))

But, even though I was able to slap this together pretty quickly, and all of the pieces are nicely reusable, I'd still probably use NumPy for your specific problem. You're not going to read a row of a billion numbers. You already have NumPy installed on the only machine that's ever going to run this script. The only cost is importing it every 8 seconds, which you can avoid by keeping the script running and having it sleep between runs. So, it's hard to beat an elegant 3-line solution.
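For the file-to-file workflow described in the question, the NumPy route might look roughly like this. It is a sketch only: upsample_file is a hypothetical name, and it assumes the input file holds a single row of comma-separated integers with at least one value.

```python
import numpy as np

def upsample_file(inpath, outpath):
    """Read one comma-separated row, insert floored midpoints, write one row back."""
    # ndmin=1 keeps the result one-dimensional even if the file holds a single value
    a = np.loadtxt(inpath, delimiter=",", dtype=np.int64, ndmin=1)
    b = np.empty(2 * len(a) - 1, dtype=np.int64)
    b[0::2] = a                          # original samples
    b[1::2] = (a[:-1] + a[1:]) // 2      # floored midpoints between neighbours
    with open(outpath, "w") as f:
        f.write(",".join(map(str, b.tolist())))

# Example call (file names are placeholders):
# upsample_file("sample.raw", "sample_500hz.raw")
```

A 2,000-value input produces 3,999 values, matching the count mentioned in the question.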

