How to split an integer value from one column into two columns in a text file using pandas or NumPy (Python)

I have a text file that contains a number of integer values, like this:

  20180701  20180707  52  11   1   2   4   1   0   0  10   7   1   3   1   0   4   5   2
  20180708  20180714 266   8  19   3   2   9   7  25  20  17  12   9   9  27  34  54  11
  20180715  20180721 654  52  34  31  20  16  12  25  84  31  38  37  38  69  66  87  14
  20180722  201807281017 110  72  46  52  29  29  22 204  41  46  51  57  67  82  92  17
  20180729  201808041106 276  37  11  87  20  10   8 284  54  54  72  38  49  41  53  12
  20180805  20180811 624  78  19  15  55  16   8   9 172  15  31  35  38  47  29  36  21
  20180812  20180818 488  63  17   7  26  10   9   7 116  17  14  39  31  34  27  64   7
  20180819  20180825  91   4   7   0   4   5   1   3  16   3   4   5  10  10   7  11   1
  20180826  20180901  49   2   2   1   0   4   0   1   2   0   1   4   8   2   6   6  10

I need to merge several files like this into one, but there is a problem with the data. In lines 4 and 5, the first values (1017 and 1106) are four digits wide, so they run directly into the preceding period index with no whitespace in between.
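One way to locate the affected lines before parsing is to count whitespace-separated fields: a normal line has 19 (two dates plus 17 values), while a fused line has only 18. A minimal sketch, using two lines from the sample above; the helper name `find_fused_lines` and its `expected` parameter are illustrative and not part of pandas or NumPy:

```python
def find_fused_lines(lines, expected=19):
    """Return 1-based numbers of non-blank lines with fewer fields than expected."""
    return [n for n, line in enumerate(lines, start=1)
            if line.strip() and len(line.split()) < expected]


sample = [
    "  20180715  20180721 654  52  34  31  20  16  12  25  84  31  38  37  38  69  66  87  14",
    "  20180722  201807281017 110  72  46  52  29  29  22 204  41  46  51  57  67  82  92  17",
]
print(find_fused_lines(sample))  # prints [2]
```

This only reports where the problem occurs; it does not repair anything.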

When I read these two lines, I always get the result below. The first value after the index columns is not recognized as a separate value; it is absorbed into the index instead.

In [14]: fw.iloc[80,:]
Out[14]: 
3      72.0
4      46.0
5      52.0
6      29.0
7      29.0
8      22.0
9     204.0
10     41.0
11     46.0
12     51.0
13     57.0
14     67.0
15     82.0
16     92.0
17     17.0
18      NaN
Name: (20180722, 201807281017), dtype: float64

I tried to correct this with indexing but failed. The desired result is:

In [14]: fw.iloc[80,:]
Out[14]: 
2    1017.0
3     110.0
4      72.0
5      46.0
6      52.0
7      29.0
8      29.0
9      22.0
10    204.0
11     41.0
12     46.0
13     51.0
14     57.0
15     67.0
16     82.0
17     92.0
18     17.0
Name: (20180722, 201807281017), dtype: float64

How can I solve this problem?

I used this code to read the file:

fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)

A better fit for this would be pandas.read_fwf, which reads fixed-width formatted lines. For your example:

df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])

I don't know whether the column widths can be inferred for all of your data or whether they need to be hardcoded.
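To try the hardcoded widths without touching the real file, the sketch below rebuilds two records from the question in memory and parses them. It assumes every line uses the layout visible in the sample (two 10-character date fields followed by 17 four-character values); `make_line` and `WIDTHS` are illustrative names, not pandas API.

```python
import io

import pandas as pd

WIDTHS = 2 * [10] + 17 * [4]  # two date fields, then 17 four-character values


def make_line(start, end, values):
    """Format one record with right-aligned fields, as in the sample file."""
    return f"{start:>10}{end:>10}" + "".join(f"{v:>4}" for v in values)


sample = "\n".join([
    make_line(20180715, 20180721,
              [654, 52, 34, 31, 20, 16, 12, 25, 84, 31, 38, 37, 38, 69, 66, 87, 14]),
    make_line(20180722, 20180728,
              [1017, 110, 72, 46, 52, 29, 29, 22, 204, 41, 46, 51, 57, 67, 82, 92, 17]),
])

# Sanity check: each line should be exactly as long as the widths add up to (88).
assert all(len(line) == sum(WIDTHS) for line in sample.splitlines())

df = pd.read_fwf(io.StringIO(sample), index_col=[0, 1], header=None, widths=WIDTHS)
print(df.iloc[1].tolist())
```

If the widths match the file, the second row should start with 1017 and 110 as separate values. Running the same line-length check over each real file before merging is a cheap way to find out whether the hardcoded widths hold everywhere.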

One possibility is to construct the DataFrame manually. That way, the text after the two index fields can be parsed by splitting it every 4 characters.

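import pandas as pd


def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f:
            # Drop any leading indent and the trailing newline so offsets start at the first digit
            line = line.strip()
            if not line:
                continue  # skip blank lines
            idx1 = line[0:8]    # period start, kept as a string
            idx2 = line[10:18]  # period end, kept as a string
            rest = line[18:]
            # Slice the remainder into 4-character fields; int() ignores the padding spaces
            points = [int(rest[i:i + 4]) for i in range(0, len(rest), 4)]
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])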
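Since the question also mentions NumPy, the same fixed-width slicing can be sketched with `numpy.genfromtxt`, which treats a sequence of integers passed as `delimiter` as field widths. This assumes the layout seen in the sample (10 + 10 characters for the dates, then 4 per value); `make_line` is an illustrative helper used only to build test input.

```python
import io

import numpy as np


def make_line(start, end, values):
    """Format one record with right-aligned fields, as in the sample file."""
    return f"{start:>10}{end:>10}" + "".join(f"{v:>4}" for v in values)


sample = "\n".join([
    make_line(20180715, 20180721,
              [654, 52, 34, 31, 20, 16, 12, 25, 84, 31, 38, 37, 38, 69, 66, 87, 14]),
    make_line(20180722, 20180728,
              [1017, 110, 72, 46, 52, 29, 29, 22, 204, 41, 46, 51, 57, 67, 82, 92, 17]),
])

# A sequence of integers as `delimiter` means fixed field widths;
# autostrip=True removes the padding spaces from each field before conversion.
arr = np.genfromtxt(io.StringIO(sample), delimiter=[10, 10] + [4] * 17,
                    dtype=np.int64, autostrip=True)
print(arr.shape)
print(arr[1, :4])
```

The result is a plain 2-D integer array with the two dates in the first two columns, so it can be wrapped in a DataFrame afterwards if an index is needed.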

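This could be made somewhat more efficient (particularly for a very long text file), but here's one solution: read the file as before, then repair each row whose last column is NaN by splitting the fused value and shifting the remaining values one column to the right.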

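import pandas as pd

# sep=r'\s+' splits on runs of whitespace (equivalent to delim_whitespace=True)
fw = pd.read_csv('test.txt', header=None, sep=r'\s+')
# Rows with a fused value come out one column short, so their last column is NaN
for i in fw[fw.iloc[:, -1].isna()].index:
    num_str = str(fw.iat[i, 1])
    # Split e.g. '201807281017' into the date 20180728 and the 4-digit value 1017
    a, b = map(int, [num_str[:-4], num_str[-4:]])
    # Shift the remaining values one column right; to_numpy() avoids label alignment
    fw.iloc[i, 3:] = fw.iloc[i, 2:-1].to_numpy()
    fw.iloc[i, :3] = [fw.iat[i, 0], a, b]
fw = fw.set_index([0, 1])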

The result of print(fw) after that is:

                     2    3   4   5   6   7   8   9    10  11  12  13  14  15  \
0        1                                                                      
20180701 20180707    52   11   1   2   4   1   0   0   10   7   1   3   1   0   
20180708 20180714   266    8  19   3   2   9   7  25   20  17  12   9   9  27   
20180715 20180721   654   52  34  31  20  16  12  25   84  31  38  37  38  69   
20180722 20180728  1017  110  72  46  52  29  29  22  204  41  46  51  57  67   
20180729 20180804  1106  276  37  11  87  20  10   8  284  54  54  72  38  49   
20180805 20180811   624   78  19  15  55  16   8   9  172  15  31  35  38  47   
20180812 20180818   488   63  17   7  26  10   9   7  116  17  14  39  31  34   
20180819 20180825    91    4   7   0   4   5   1   3   16   3   4   5  10  10   
20180826 20180901    49    2   2   1   0   4   0   1    2   0   1   4   8   2   

                   16  17    18  
0        1                       
20180701 20180707   4   5   2.0  
20180708 20180714  34  54  11.0  
20180715 20180721  66  87  14.0  
20180722 20180728  82  92  17.0  
20180729 20180804  41  53  12.0  
20180805 20180811  29  36  21.0  
20180812 20180818  27  64   7.0  
20180819 20180825   7  11   1.0  
20180826 20180901   6   6  10.0  

For comparison, here's the printed result of your initial approach, fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True):

                        2   3   4   5   6   7   8    9    10  11  12  13  14  \
0        1                                                                     
20180701 20180707       52  11   1   2   4   1   0    0   10   7   1   3   1   
20180708 20180714      266   8  19   3   2   9   7   25   20  17  12   9   9   
20180715 20180721      654  52  34  31  20  16  12   25   84  31  38  37  38   
20180722 201807281017  110  72  46  52  29  29  22  204   41  46  51  57  67   
20180729 201808041106  276  37  11  87  20  10   8  284   54  54  72  38  49   
20180805 20180811      624  78  19  15  55  16   8    9  172  15  31  35  38   
20180812 20180818      488  63  17   7  26  10   9    7  116  17  14  39  31   
20180819 20180825       91   4   7   0   4   5   1    3   16   3   4   5  10   
20180826 20180901       49   2   2   1   0   4   0    1    2   0   1   4   8   

                       15  16  17    18  
0        1                               
20180701 20180707       0   4   5   2.0  
20180708 20180714      27  34  54  11.0  
20180715 20180721      69  66  87  14.0  
20180722 201807281017  82  92  17   NaN  
20180729 201808041106  41  53  12   NaN  
20180805 20180811      47  29  36  21.0  
20180812 20180818      34  27  64   7.0  
20180819 20180825      10   7  11   1.0  
20180826 20180901       2   6   6  10.0  

