
Having trouble with grouping data from a txt file

I'm a beginner coder, and my project requires me to categorize the data in a text file. The txt file I'm opening looks something like this. (This isn't exactly how the file looks; it looked too messy when I copied and pasted it. The real file also has another column that, for some reason, is filled with the word 'map'.)

MAG    UTC DATE-TIME          LAT       LON        DEPTH   Region
4.3    2014/03/12 20:16:59    25.423    -109.730   10.0    GULF OF CALIFORNIA
5.2    2014/03/12 20:09:55    36.747    144.050    24.2    JAPAN
5.0    2014/03/12 20:08:25    35.775    141.893    24.5    JAPAN
4.8    2014/03/12 19:59:01    38.101    142.840    17.6    Japan
4.6    2014/03/12 19:55:28    37.400    142.384    24.7    JAPAN
5.0    2014/03/12 19:45:19    -6.187    154.385    62.0    GUINEA

I want the output to be something like this:

[[japan,'4.3','5.2','5.0','4.8','4.6'],[Gulf of California,4.3],[Guinea,5.0]]

My current code (the vlist[7:] in the first for loop gives me the region name, and the j[1] in the second for loop gives me the magnitude):

def myOpen(filepointer):
    header = filepointer.readline()  # skip the header line
    regions = []    # region names (as lists of words), without repeats
    maglist = []    # region names matched with their magnitudes
    filelines = []  # every data line of the txt file, split into fields

    for aline in filepointer:  # read each line
        vlist = aline.split()  # turn the line into a list of fields
        filelines.append(vlist)
        if vlist[7:] not in regions:  # build the list of names without repeats
            regions.append(vlist[7:])
            regions.sort()

    for j in filelines:  # each line of the file
        for names in regions:  # each region name
            if names == j[7:]:
                num = j[1]
                names.append(float(num))
                maglist.append(names)
    return maglist


def main():
    myFile = open('earthquakes.txt', 'r')
    quakes = myOpen(myFile)
    myFile.close()
    print(quakes)


main()

It gives this output:

[[japan,'4.3'],[Gulf of California,4.3],[Guinea,5.0]]

I'm wondering why it only gets the first magnitude that appears for a region and not the rest.
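One likely explanation, assuming the posted loop is what actually runs: names.append(...) modifies the region list itself, so after the first match that list is no longer equal to j[7:] and every later line for the same region is skipped. The sketch below reproduces the pattern on made-up, shortened rows (magnitude followed by the region words); the sample rows are an assumption for illustration, not the real file layout.

```python
# Minimal reproduction of the pattern used in the second loop.
# Each sample row is shortened to [magnitude, region words...].
rows = [["5.2", "JAPAN"], ["5.0", "JAPAN"], ["4.6", "JAPAN"]]
regions = [["JAPAN"]]
maglist = []

for row in rows:
    for names in regions:
        if names == row[1:]:             # True only while names is still ['JAPAN']
            names.append(float(row[0]))  # names becomes ['JAPAN', 5.2]
            maglist.append(names)

print(maglist)  # [['JAPAN', 5.2]]
```

Only the first magnitude is kept because the second and third rows are compared against ['JAPAN', 5.2], which never equals ['JAPAN'].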

Here you go, using itertools.groupby, lambda, map, str.split, str.lower and str.join.

If your file looks like this:

MAG     UTC DATE-TIME             LAT         LON      DEPTH    Region
4.3    2014/03/12 20:16:59       25.423     -109.730   10.0     GULF OF CALIFORNIA
5.2    2014/03/12 20:09:55       36.747      144.050   24.2     JAPAN
5.0    2014/03/12 20:08:25       35.775      141.893   24.5     JAPAN
4.8    2014/03/12 19:59:01       38.101      142.840   17.6     Japan
4.6    2014/03/12 19:55:28       37.400      142.384   24.7     JAPAN
5.0    2014/03/12 19:45:19       -6.187      154.385   62.0     GUINEA

Here is the working code:

>>> import itertools
>>> f = open('file.txt')
>>> [[" ".join(x),list(map(lambda z:z[0],list(y)))] for x,y in itertools.groupby(sorted(list(map(str.split,map(str.lower,list(f)[1:]))),key=lambda x:" ".join(x[6:])),key=lambda x:x[6:])]
[['guinea', ['5.0']], ['gulf of california', ['4.3']], ['japan', ['5.2', '5.0', '4.8', '4.6']]]

Let me explain, step by step:

>>> f = open('file.txt')
>>> k = list(map(str.lower,list(f)[1:]))  # lowercase every line and skip the 1st (header) line
>>> k
['4.3    2014/03/12 20:16:59       25.423     -109.730   10.0     gulf of california\n', '5.2    2014/03/12 20:09:55       36.747      144.050   24.2     japan\n', '5.0    2014/03/12 20:08:25       35.775      141.893   24.5     japan\n', '4.8    2014/03/12 19:59:01       38.101      142.840   17.6     japan\n', '4.6    2014/03/12 19:55:28       37.400      142.384   24.7     japan\n', '5.0    2014/03/12 19:45:19       -6.187      154.385   62.0     guinea\n']
>>> k = list(map(str.split,k))   # split each line on whitespace
>>> k
[['4.3', '2014/03/12', '20:16:59', '25.423', '-109.730', '10.0', 'gulf', 'of', 'california'], ['5.2', '2014/03/12', '20:09:55', '36.747', '144.050', '24.2', 'japan'], ['5.0', '2014/03/12', '20:08:25', '35.775', '141.893', '24.5', 'japan'], ['4.8', '2014/03/12', '19:59:01', '38.101', '142.840', '17.6', 'japan'], ['4.6', '2014/03/12', '19:55:28', '37.400', '142.384', '24.7', 'japan'], ['5.0', '2014/03/12', '19:45:19', '-6.187', '154.385', '62.0', 'guinea']] 
>>> k = sorted(k,key = lambda x:" ".join(x[6:]))  # sort k by Region, which groupby needs to form one group per region
>>> k
[['5.0', '2014/03/12', '19:45:19', '-6.187', '154.385', '62.0', 'guinea'], ['4.3', '2014/03/12', '20:16:59', '25.423', '-109.730', '10.0', 'gulf', 'of', 'california'], ['5.2', '2014/03/12', '20:09:55', '36.747', '144.050', '24.2', 'japan'], ['5.0', '2014/03/12', '20:08:25', '35.775', '141.893', '24.5', 'japan'], ['4.8', '2014/03/12', '19:59:01', '38.101', '142.840', '17.6', 'japan'], ['4.6', '2014/03/12', '19:55:28', '37.400', '142.384', '24.7', 'japan']]
>>> [[" ".join(x),list(map(lambda z:z[0],list(y)))] for x,y in itertools.groupby(k,key = lambda x:x[6:])]
[['guinea', ['5.0']], ['gulf of california', ['4.3']], ['japan', ['5.2', '5.0', '4.8', '4.6']]]
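
If the one-liner is hard to follow, the same grouping can probably be done with a dictionary and no sorting step. This is only a sketch: group_magnitudes, mag_index and region_start are hypothetical names introduced here, and the defaults assume the column layout shown in this answer (magnitude in field 0, region from field 6 on). For the file in the question, which reportedly has an extra 'map' column, mag_index=1 and region_start=7 would match the indexes the question uses.

```python
from collections import defaultdict


def group_magnitudes(lines, mag_index=0, region_start=6):
    """Group magnitudes by lowercased region name (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) <= region_start:
            continue  # skip blank or incomplete lines
        region = " ".join(fields[region_start:]).lower()
        groups[region].append(float(fields[mag_index]))
    # Return [region, magnitudes] pairs sorted by region name.
    return [[region, mags] for region, mags in sorted(groups.items())]


# Data rows copied from the sample file above (header line already skipped).
sample = [
    "4.3    2014/03/12 20:16:59       25.423     -109.730   10.0     GULF OF CALIFORNIA",
    "5.2    2014/03/12 20:09:55       36.747      144.050   24.2     JAPAN",
    "5.0    2014/03/12 20:08:25       35.775      141.893   24.5     JAPAN",
    "4.8    2014/03/12 19:59:01       38.101      142.840   17.6     Japan",
    "4.6    2014/03/12 19:55:28       37.400      142.384   24.7     JAPAN",
    "5.0    2014/03/12 19:45:19       -6.187      154.385   62.0     GUINEA",
]

result = group_magnitudes(sample)
print(result)
# [['guinea', [5.0]], ['gulf of california', [4.3]], ['japan', [5.2, 5.0, 4.8, 4.6]]]
```

When reading from a real file, skip the header first (for example with next(f)) and pass the open file object as lines. Unlike the groupby version, this keeps the magnitudes as floats and does not depend on the rows being sorted by region.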


 