
Quickly accessing/querying large delimited text file in python

After searching for a while, I have found many related questions/answers to this problem, but nothing that really addresses what I am looking for. Basically, I am implementing code in Python to be able to query information from a star catalogue (in particular the Tycho 2 star catalogue).

This data is stored in a largish (~0.5 gigabyte) text file where each row corresponds to an entry for a star.

A few example rows are

0001 00008 1| |  2.31750494|  2.23184345|  -16.3|   -9.0| 68| 73| 1.7| 1.8|1958.89|1951.94| 4|1.0|1.0|0.9|1.0|12.146|0.158|12.146|0.223|999| |         |  2.31754222|  2.23186444|1.67|1.54| 88.0|100.8| |-0.2
0001 00013 1| |  1.12558209|  2.26739400|   27.7|   -0.5|  9| 12| 1.2| 1.2|1990.76|1989.25| 8|1.0|0.8|1.0|0.7|10.488|0.038| 8.670|0.015|999|T|         |  1.12551889|  2.26739556|1.81|1.52|  9.3| 12.7| |-0.2
0001 00016 1| |  1.05686490|  1.89782870|  -25.9|  -44.4| 85| 99| 2.1| 2.4|1959.29|1945.16| 3|0.4|0.5|0.4|0.5|12.921|0.335|12.100|0.243|999| |         |  1.05692417|  1.89793306|1.81|1.54|108.5|150.2| |-0.1
0001 00017 1|P|  0.05059802|  1.77144349|   32.1|  -20.7| 21| 31| 1.6| 1.6|1989.29|1985.38| 5|1.4|0.6|1.4|0.6|11.318|0.070|10.521|0.051| 18|T|         |  0.05086583|  1.77151389|1.78|1.55| 30.0| 45.6|D|-0.2

The information is both delimited and fixed width. Each column contains a different piece of information about the star. Now, for my Python utility I would like to be able to quickly search through this information and retrieve the entries for stars that match a set of criteria specified by the user.

For instance, I would like to be able to efficiently find all stars with magnitude brighter than 5.5 (col 18 or 19) that have a right ascension between 0 and 30 degrees (col 3) and a declination between -45 and -35 degrees (col 4). Now, if I could store all this information in memory it would be easy to read the file into a numpy structured array or pandas dataframe and retrieve the stars I want using logical indexing. Unfortunately, the machine I am working on does not have enough memory to do this (I only ever have about 0.5 gigabytes of memory free at any given time, and the rest of the program I am using takes up a good chunk of memory).
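For reference, the logical-indexing approach described above would look something like this if the data did fit in memory. This is a minimal sketch using a hypothetical four-column subset (id, RA, dec, V magnitude) rather than the full Tycho 2 record layout:

```python
import numpy as np
from io import StringIO

# Hypothetical 4-column subset of the catalogue: id | ra | dec | vmag
sample = StringIO("1|10.5|-40.2|4.9\n"
                  "2|200.0|-41.0|6.1\n"
                  "3|25.0|-38.5|5.2\n")
dform = [('id', int), ('ra', float), ('dec', float), ('vmag', float)]
stars = np.genfromtxt(sample, dtype=dform, delimiter='|')

# Logical indexing combines all criteria in one vectorized pass.
mask = (stars['vmag'] < 5.5) \
       & (0 <= stars['ra']) & (stars['ra'] <= 30) \
       & (-45 <= stars['dec']) & (stars['dec'] <= -35)
bright = stars[mask]  # rows 1 and 3 match
```

The problem, of course, is that `genfromtxt` on the full file materializes the whole catalogue in memory at once.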

My current solution involves walking through each line of the text file, interpreting the data, and storing the entry in memory only if it matches the specified criteria. The method I use to do this is

def getallwithcriteria(self, min_vmag=1., max_vmag=17., min_bmag=1., max_bmag=17., min_ra=0., max_ra=360.,
                       min_dc=-90., max_dc=90., min_prox=3, search_center=None, search_radius=None):
    """
    This method returns entire star records for each star that meets the specified criteria.  The defaults for each
    criterion span the entire range of the catalogue.  Do not call this without changing the defaults, as doing so
    will likely overflow memory and cause your system to drastically slow down or crash!

    Note that not all of the keyword arguments need to be specified.  For instance, we could run

        import tychopy as tp

        tyc = tp.Tycho('/path/to/catalogue')

        star_records = tyc.getallwithcriteria(min_vmag=3, max_vmag=4)

    to return all stars that have a visual magnitude between 3 and 4.

    This method returns a numpy structured array where each element contains the complete record for a star that
    matches the criterion specified by the user.  The output array has the following dtype:

            [('starid', 'U12'),
             ('pflag', 'U1'),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', 'U1'),
             ('hipparcosNumber', 'U9'),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', 'U1'),
             ('correlation', float)]

    see the readme of the Tycho 2 catalogue for a more formal description of each field.

    If no stars are found that match the specified input then an empty numpy array with the above dtype is returned.

    Note that both a rectangular and a circular area can be specified.  The rectangular search area is specified
    using the min_ra/dc max_ra/dc keyword arguments while the circular search area is specified using the
    search_center and search_radius keyword arguments where the search_center is a tuple, list, numpy array, or
    other array like object which contains the center right ascension in element 0 and the center declination in
    element 1.  It is not recommended to specify both the circular and rectangular search areas.  If the search
    areas do not overlap then no stars will be returned.

    :param min_vmag:  the minimum (brightest) visual magnitude to return
    :param max_vmag:  the maximum (dimmest) visual magnitude to return
    :param min_bmag:  the minimum (brightest) blue magnitude to return
    :param max_bmag:  the maximum (dimmest) blue magnitude to return
    :param min_ra:  the minimum right ascension to return
    :param max_ra:  the maximum right ascension to return
    :param min_dc:  the minimum declination to return
    :param max_dc:  the maximum declination to return
    :param min_prox:  the closest proximity to a star to return
    :param search_center: An array like object containing the center point from which to search radially for stars.
    :param search_radius: A float specifying the radial search distance to use
    :return: A numpy structure array containing the star records for stars that meet the specified criteria
    """

    # form the dtype list that genfromtxt will use to interpret the star records
    dform = [('starid', 'U12'),
             ('pflag', 'U1'),
             ('starBearing', [('rightAscension', float), ('declination', float)]),
             ('properMotion', [('rightAscension', float), ('declination', float)]),
             ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
             ('meanEpoch', [('rightAscension', float), ('declination', float)]),
             ('numPos', int),
             ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
             ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
             ('starProximity', int),
             ('tycho1flag', 'U1'),
             ('hipparcosNumber', 'U9'),
             ('observedPos', [('rightAscension', float), ('declination', float)]),
             ('observedEpoch', [('rightAscension', float), ('declination', float)]),
             ('observedError', [('rightAscension', float), ('declination', float)]),
             ('solutionType', 'U1'),
             ('correlation', float)]

    # initialize a list which will contain the star record strings for stars that match the input criteria
    records = []

    # loop through each record in the Tycho 2 catalogue
    for record in self._catalogueFile:

        # interpret the record as simply as we can
        split_record = record.split(sep="|")

        # check that we are examining a good star, that it falls within the bearing bounds, and that it is far
        # enough away from other stars
        if ("X" not in split_record[1]) and min_ra <= float(split_record[2]) <= max_ra \
                and min_dc <= float(split_record[3]) <= max_dc and int(split_record[21]) >= min_prox:

            # perform the radial search if the user has specified a center and radius
            if search_center is None or pow(pow(float(split_record[2])-search_center[0], 2) +
                                            pow(float(split_record[3])-search_center[1], 2), 1/2.) < search_radius:

                # Check to see if we have values for both blue and visual magnitudes, and check to see if these
                # magnitudes fall within the specified magnitude bounds
                # We need to split this up like this in order to make sure that either the bmag or the vmag exist
                if bool(split_record[17].strip()) and bool(split_record[19].strip()) \
                        and min_bmag <= float(split_record[17]) <= max_bmag \
                        and min_vmag <= float(split_record[19]) <= max_vmag:

                    records.append(record+'\n')

                # if only the visual magnitude exists then check its bounds - also check if the user has specified
                # its bounds
                elif not bool(split_record[17].strip()) and bool(split_record[19].strip()) \
                        and min_vmag <= float(split_record[19]) <= max_vmag and (max_vmag != 17. or min_vmag != 1.):

                    records.append(record+'\n')

                # if only the blue magnitude exists the check its bounds - also check if the user has specified its
                # bounds
                elif not bool(split_record[19].strip()) and bool(split_record[17].strip()) \
                        and min_bmag <= float(split_record[17]) <= max_bmag and (max_bmag != 17. or min_bmag != 1.):

                    records.append(record+'\n')

                # otherwise check to see if the user has changed the defaults.  If they haven't, then store the star
                elif max_bmag == 17. and max_vmag == 17. and min_bmag == 1. and min_vmag == 1.:

                    records.append(record+'\n')

    # check to see if any stars met the criteria.  If they didn't, return an empty array (note the zero length --
    # np.empty((1,), ...) would return one uninitialized garbage record).  If they did, use genfromtxt to
    # interpret the string of star records
    if not records:
        nprecords = np.empty((0,), dtype=dform)

        warnings.warn('No stars were found meeting your criteria.  Please try again.')
    else:
        nprecords = np.genfromtxt(BytesIO("".join(records).encode()), dtype=dform, delimiter='|', converters={
            0: lambda s: s.strip(),
            1: lambda s: s.strip(),
            22: lambda s: s.strip(),
            23: lambda s: s.strip(),
            30: lambda s: s.strip()})

        if self._includeProperMotion:
            applypropermotion(nprecords, self.newEpoch, copy=False)

    # reset the catalogue back to the beginning for future searches
    self._catalogueFile.seek(0, os.SEEK_SET)

    return nprecords

This is still very slow (although faster than using up all of the memory and pushing everything else into swap). For comparison, this takes about 2-3 minutes each time I need to retrieve stars, and I need to retrieve stars 40 or so times (with different criteria each time) in the program I am writing this for. The rest of the program takes about 5 seconds total.

My question is now: what is the best way to speed up this process (outside of getting a better computer with more memory)? I am open to any suggestions as long as they are explained well and won't take me months to implement. I am even willing to write a function that goes through and converts the original catalogue file into a better format (a fixed-width binary file sorted by a specific column) in order to speed things up.
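That conversion idea is workable: a one-time pass that writes a fixed-width binary file sorted by one column lets later queries binary-search that column through a memory map instead of scanning every line. A minimal sketch with a hypothetical three-field record (the real record would carry all 32+ catalogue fields):

```python
import numpy as np

# One-time conversion: sort by RA, then write fixed-width binary records.
dform = [('ra', float), ('dec', float), ('vmag', float)]
stars = np.array([(200.0, -41.0, 6.1),
                  (10.5, -40.2, 4.9),
                  (25.0, -38.5, 5.2)], dtype=dform)
stars.sort(order='ra')
stars.tofile('tycho_sorted.bin')

# Later queries: memory-map the sorted file; records are only read on access.
cat = np.memmap('tycho_sorted.bin', dtype=dform, mode='r')

# Because the file is sorted by RA, binary search finds the RA window in O(log n).
lo = np.searchsorted(cat['ra'], 0.0, side='left')
hi = np.searchsorted(cat['ra'], 30.0, side='right')
window = cat[lo:hi]

# Apply the remaining criteria only to the (small) RA slice.
result = window[(window['vmag'] < 5.5)
                & (window['dec'] >= -45) & (window['dec'] <= -35)]
```

Only the RA window ever touches memory, so the 0.5 GB limit stops mattering for queries that are selective in the sorted column.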

So far I have considered memmap'ing the file, but decided against it because I didn't really think it would help with what I need to do. I have also considered creating a database from the data and then using sqlalchemy or something similar to query the data that way; however, I am not super familiar with databases and don't know if that would offer any real speed improvement.

As @wflynny has already mentioned, PyTables (HDF5 store) is much more efficient compared to text/CSV/etc. files. Beside that, you can read from PyTables conditionally using .read_hdf(where='<where condition>').
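For example, something like the following sketch (hypothetical column names, random placeholder data; requires the `tables` package for pandas' HDF5 support):

```python
import numpy as np
import pandas as pd

# Placeholder frame standing in for the parsed catalogue.
np.random.seed(0)
df = pd.DataFrame({
    'ra': np.random.uniform(0, 360, 1000),
    'dec': np.random.uniform(-90, 90, 1000),
    'vmag': np.random.uniform(1, 17, 1000),
})

# One-time conversion to a queryable HDF5 "table"; data_columns makes
# these columns usable inside a `where` condition.
df.to_hdf('tycho.h5', key='stars', format='table',
          data_columns=['ra', 'dec', 'vmag'])

# Only matching rows are read from disk; the full frame never enters memory.
subset = pd.read_hdf('tycho.h5', 'stars',
                     where='(ra < 30) & (dec > -45) & (dec < -35) & (vmag < 5.5)')
```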

You may want to check this comparison. If your machine is UNIX or Linux, you may want to check the Feather format, which should be extremely fast.

Beside that, I would check whether using some RDBMS (MySQL/PostgreSQL/SQLite) plus proper indexes would speed it up. But it might be problematic if you have only 0.5 GB RAM free and want to use both Pandas and an RDBMS.
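The SQLite route needs nothing beyond the standard library. A sketch with a hypothetical column subset (in practice the table would live on disk and be loaded once, not in memory):

```python
import sqlite3

# One-time load into SQLite (use a file path instead of :memory: in practice).
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE stars (id INTEGER, ra REAL, dec REAL, vmag REAL)')
con.executemany('INSERT INTO stars VALUES (?, ?, ?, ?)',
                [(1, 10.5, -40.2, 4.9),
                 (2, 200.0, -41.0, 6.1),
                 (3, 25.0, -38.5, 5.2)])

# Indexes let the 40-odd repeated queries avoid full table scans.
con.execute('CREATE INDEX idx_ra ON stars (ra)')
con.execute('CREATE INDEX idx_vmag ON stars (vmag)')

rows = con.execute('SELECT * FROM stars '
                   'WHERE vmag < 5.5 AND ra BETWEEN 0 AND 30 '
                   'AND dec BETWEEN -45 AND -35').fetchall()
```

Each query then returns only the matching rows, so peak memory stays proportional to the result set, not the catalogue.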
