How to transform a time series into a two-column dataframe showing the count for each element of the time series, using Python

I have data in a file in the form of a list of arrays: each line corresponds to an array of integers, and the first element of each array is an index (the data is a time series). Here is an example:

1 101 103 238 156 48 78
2 238 420 156 103 26
3 220 103 154 48 101 238 156 26 420
4 26 54 43 103 156 238 48

The lines do not all contain the same number of elements, and some elements appear in more than one line while others do not.

Using Python, I would like to transform the data into two columns: the first lists every integer that appears in the original dataset, and the second gives the number of occurrences of each. For the example above, that is:

26 3
43 1
48 3
54 1
78 1
101 2
103 4
154 1
156 4
220 1
238 4
420 2

Could anyone tell me how to do this? Is there a straightforward way using Pandas or NumPy, for example? Many thanks in advance!

import pandas as pd
array1 = [1, 101, 103, 238, 156, 48, 78]
array2 = [2, 238, 420, 156, 103, 26]
array3 = [3, 220, 103, 154, 48, 101, 238, 156, 26, 420]
array4 = [4, 26, 54, 43, 103, 156, 238, 48]
# Drop the leading index of each row, then count how often each value occurs.
values = array1[1:] + array2[1:] + array3[1:] + array4[1:]
pd.Series(values).value_counts().sort_index()
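
Since the question says the data lives in a file, here is a minimal sketch of the same counting idea applied to lines of text rather than hand-written lists. It assumes whitespace-separated integers with the index as the first field on each line. The `io.StringIO` object only stands in for the real file, and the column names `value` and `count` are my own choice, not something from the question.

```python
import io

import pandas as pd

# Stand-in for the real file. With an actual file you would iterate over
# open("data.txt") instead ("data.txt" is a hypothetical name).
sample = io.StringIO(
    "1 101 103 238 156 48 78\n"
    "2 238 420 156 103 26\n"
    "3 220 103 154 48 101 238 156 26 420\n"
    "4 26 54 43 103 156 238 48\n"
)

values = []
for line in sample:
    fields = line.split()
    # Skip blank lines; drop the first field, which is the row index.
    if fields:
        values.extend(int(x) for x in fields[1:])

# value_counts() tallies each integer; sort_index() orders the rows by value.
counts = pd.Series(values).value_counts().sort_index()

# Turn the Series into a two-column DataFrame.
result = counts.rename_axis("value").reset_index(name="count")
print(result)
```

For the four sample lines this should give the twelve rows listed in the question, sorted by value.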

What you are asking is how to create a frequency distribution from multiple arrays. There are many solutions to this problem; one is to use numpy. Say you have the following data, with rows of different lengths:

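import numpy
time_series = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]  # plain nested list; recent NumPy versions refuse to build an array from rows of unequal length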

You can then concatenate the rows into a one-dimensional array and use numpy.unique to find the frequency distribution. With return_counts=True, numpy.unique returns two arrays, unique and counts, which are stacked with vstack and transposed into two columns.

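import pandas
flat = numpy.concatenate(time_series)  # flatten the rows into one 1-D array
unique, counts = numpy.unique(flat, return_counts=True)  # sorted unique values and their counts
distribution = pandas.DataFrame(numpy.vstack([unique, counts]).transpose(), columns=["value", "count"])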
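
To connect this back to the question, here is a sketch of the numpy.unique approach run on the sample data from the question. It assumes the first element of each row is the index and should not be counted. The names `rows`, `flat`, and the column labels are only illustrative.

```python
import numpy
import pandas

# Sample rows from the question; the first element of each row is the index.
rows = [
    [1, 101, 103, 238, 156, 48, 78],
    [2, 238, 420, 156, 103, 26],
    [3, 220, 103, 154, 48, 101, 238, 156, 26, 420],
    [4, 26, 54, 43, 103, 156, 238, 48],
]

# Drop the leading index of each row, then flatten into one 1-D array.
flat = numpy.concatenate([row[1:] for row in rows])

# numpy.unique returns the sorted unique values and, with
# return_counts=True, how many times each one occurs.
values, counts = numpy.unique(flat, return_counts=True)

distribution = pandas.DataFrame({"value": values, "count": counts})
print(distribution)
```

Building the DataFrame from a dict keeps the two arrays as separate named columns, which avoids the vstack and transpose step. Either way works.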
