
Dynamically parsing research data in python

The long (winded) version: I'm gathering research data using Python. My initial parsing code is ugly but functional: it gives me some basic information and turns my raw data into a format suitable for heavy-duty statistical analysis in SPSS. However, every time I modify the experiment, I have to dive back into the analysis code.

For a typical experiment, I'll have 30 files, one per user. The field count is fixed within an experiment but can vary from one experiment to another (10-20 fields). Files are typically 700-1000 records long, with a header row. The record format is tab-separated (see the sample below, which is 4 integers, 3 strings, and a run of floats).

I need to sort my records into categories. In a 1000-line file, I could have anywhere from 4 to 256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate which category the float values in that row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.

Once the records are in categories, the number crunching begins: I compute statistics (mean, standard deviation, etc.) for each category in each file.
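As an illustration of that number-crunching step, per-category mean and standard deviation could be computed with the standard `statistics` module (a minimal sketch; the category keys and float values below are made up for illustration, not taken from the real data):

```python
import statistics

def category_stats(values):
    """Return (n, mean, sample standard deviation) for one category's floats."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if n > 1 else 0.0
    return n, mean, sd

# Illustrative grouped data: category key -> float values seen for that key.
categories = {
    "10,0,2,1,Right,Right,Right": [5.76, 0.03, 0.05],
    "3,1,3,0,Left,Left,Right": [8.00, 0.03],
}

for key, values in categories.items():
    n, mean, sd = category_stats(values)
    print("%-30s n=%d mean=%.4f sd=%.4f" % (key, n, mean, sd))
```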

The essentials: I need to parse data like the sample below into categories, where a category is a combination of the non-float values in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories; I'll make a new post for that.

I'm looking for suggestions on how to do both.

    # data is a list of tab separated records
    # fields is a list of my field names

    # get a list of field types by running gettype on the first data row
    # (data[0] is the header row); gettype infers a type from a string
    # without changing the data
    fieldtype = [gettype(n) for n in data[1].split('\t')]

    # get the indexes for fields that aren't floats
    mask =  [i for i, field in enumerate(fieldtype) if field!="float"]

    # for each row of data[skipping first and last empty lists] we split(on tabs)
    # and take the ith element of that split where i is taken from the list mask
    # which tells us which fields are not floats
    records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]

    # we now get the unique set of combos
    # since set doesn't happily take a list of lists, we join each row of values
    # into a comma-separated string, so we end up with a set of strings
    uniquerecs = set([",".join(row) for row in records])


    print len(uniquerecs)
    quit()

def gettype(s):
    """Return 'int', 'float', or 'string' for s, without altering the data."""
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"

Sample Data:

field0  field1  field2  field3  field4  field5  field6  field7  field8  field9  field10 field11 field12 field13 field14 field15
10  0   2   1   Right   Right   Right   5.76765674196   0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3   1   3   0   Left    Left    Right   8.00982745764   0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5   19  1   0   Right   Left    Left    4.69440026591   0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3   1   4   2   Left    Right   Left    9.58648184552   0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9   0   0   7   Left    Left    Left    7.65374257547   0.030318719717  0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397

Not sure if I understand your question, but here are a few thoughts:

For parsing the data files, you would usually use the Python csv module.

For categorizing the data you could use a defaultdict, with the non-float fields joined together as the dict key. Example:

from collections import defaultdict
import csv

reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]
# lines[0] is the header row, so use lines[1] (the first data row)
# to find the non-float columns
mask = [i for i, n in enumerate(lines[1]) if gettype(n) != "float"]
for line in lines[1:]:
    category = ','.join([line[i] for i in mask])
    data_of_category[category].append(line)

This way you don't have to calculate the categories first, and you can process the data in one pass.

And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".

For at least part of your question, have a look at Named Tuples.
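For instance, a namedtuple built from the sample's field names makes each record addressable by name rather than by index (a minimal sketch; only the first eight fields of the sample are shown, and values are kept as strings):

```python
from collections import namedtuple

# Field names taken from the sample header; only the first eight shown here.
Record = namedtuple('Record', ['field0', 'field1', 'field2', 'field3',
                               'field4', 'field5', 'field6', 'field7'])

line = "10\t0\t2\t1\tRight\tRight\tRight\t5.76765674196"
rec = Record(*line.split('\t'))
print(rec.field4)  # fields are now addressable by name instead of index
```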

Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.

Step 2: Turn that into a dict of first entry: rest of entries.

with open("...", "rb") as data_file:
    lines = csv.reader(data_file, some_custom_dialect)
    categories = {line[0]: line[1:] for line in lines}

Step 3: Iterate over the items() of the data and do something with each line.

for category, line in categories.items():
    do_stats_to_line(line)

Some useful answers already, but I'll throw mine in as well. Key points:

  1. Use the csv module
  2. Use collections.namedtuple for each row
  3. Group the rows using a tuple of int field values as the key

If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby. This would likely reduce memory consumption. Given your example data, and the fact that your files contain <= 1000 rows, this is probably not an issue to worry about.
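A rough sketch of that groupby approach, assuming rows are already parsed into tuples and using the first four (integer) columns as the grouping key (the row data below is illustrative):

```python
from itertools import groupby

# Parsed rows: the first four values form the grouping key (made-up data).
rows = [
    (3, 1, 3, 0, 'Left', 'Left', 'Right', 8.009),
    (3, 1, 4, 2, 'Left', 'Right', 'Left', 9.586),
    (3, 1, 4, 2, 'Left', 'Right', 'Left', 7.100),
    (10, 0, 2, 1, 'Right', 'Right', 'Right', 5.767),
]

def key(row):
    return row[:4]          # the leading integer columns

rows.sort(key=key)          # groupby needs sorted input
groups = {k: list(g) for k, g in groupby(rows, key=key)}
for k, g in groups.items():
    print(k, len(g))
```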

import csv
from collections import defaultdict, namedtuple

def coerce_to_type(value):
    """Return value as an int or float if possible, else the original string."""
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

with open(datafile) as srcfile:
    data    = csv.reader(srcfile, delimiter='\t')

    ## Read headers, create namedtuple
    headers = srcfile.next().strip().split('\t')
    datarow = namedtuple('datarow', headers)

    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)

    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

    ## DO SOMETHING INTERESTING WITH THE GROUPS
    import pprint
    pprint.pprint(dict(grouped_rows))

EDIT: You may find the code at https://gist.github.com/985882 useful.


 