
Parsing data with non-uniform rows in Python

I have a dataset that I would like to parse so that I can analyze it. I want to pull out specific columns and then separate them into the rows before and after a block of non-uniform rows. Here is an example of what my data looks like. Note the three rows in the middle that do not match the format of the other rows:

1386865618963   1   M   subject_avatar  3.636229    1.000000    5.422941    30.200327   0.000000    0.000000
1386865618965   1   M   subject_avatar  3.631835    1.000000    5.415390    30.200327   0.000000    0.000000
1386865618966   2   M   subject_avatar  3.627432    1.000000    5.407826    30.200327   0.000000    0.000000
1386865618968   1   M   subject_avatar  3.625223    1.000000    5.404030    30.200327   0.000000    0.000000
1386865618970   1   M   subject_avatar  3.620788    1.000000    5.396411    30.200327   0.000000    0.000000
1386865618970   0   D   4345048336
1386865618970   0   D   4345763672
1386865618971   0   I   BOXGEOM (45.0, 0.0, -45.0, 19.0, 3.5, 19.0) {'callback': <bound method YCEnvironment.dropoff of <navigate.YCEnvironment instance at 0x103065440>>, 'cbargs': (0, {'width': 1.75, 'image': <pyepl.display.Image object at 0x102f9da90>, 'height': 4.75, 'volbitSize': (0.5, 0.71999999999999997), 'name': 'Julia'}, {'width': 0.69999999999999996, 'name': 'Flower Patch', 'realpos': (45.0, 0.0, -45.0), 'image': <pyepl.display.Image object at 0x102fc3f50>, 'realsize': (7.0, 3.5, 7.0), 'type': 'store', 'volbitSize': (0.5, 0.5), 'height': 0.34999999999999998}), 'permiable': True}  4926595152
1386865618972   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618992   2   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618996   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865618998   2   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619002   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619005   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000
1386865619008   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000

I previously asked a question (Parsing specific columns from a dataset in python) about parsing this data into columns. However, the resulting columns only display the number of items in each column, not the items themselves.

I realize these are two different questions (separating into columns, and separating before and after the non-uniform rows), but any help with the parsing would be appreciated!

A straightforward idea:

You can preprocess the raw file to skip all irrelevant lines, for example:

with open('raw.txt', 'r') as infile:
    f = infile.readlines()
    with open('filtered.txt', 'w') as outfile:
        for line in f:
            if 'subject_avatar' in line: # or other better rules
                outfile.write(line)

Then process the clean data in filtered.txt using pandas or another tool.
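As one possible way to handle the filtered data without pandas, here is a minimal standard-library sketch that transposes the rows into columns. The helper name `to_columns` and the choice of columns 0 and 4 are only illustrative, and the sketch assumes every input row already has the same number of fields (as in filtered.txt); `zip` would silently truncate to the shortest row otherwise.

```python
def to_columns(lines):
    """Transpose whitespace-separated rows into a list of columns.

    Assumes all non-blank rows have the same number of fields.
    """
    rows = [line.split() for line in lines if line.strip()]
    return [list(col) for col in zip(*rows)]


# Two rows copied from the sample data in the question.
filtered = [
    "1386865618963   1   M   subject_avatar  3.636229    1.000000    5.422941    30.200327   0.000000    0.000000",
    "1386865618965   1   M   subject_avatar  3.631835    1.000000    5.415390    30.200327   0.000000    0.000000",
]

columns = to_columns(filtered)

# Pull out specific columns and convert them from strings to numbers.
timestamps = [int(value) for value in columns[0]]
column_4 = [float(value) for value in columns[4]]
```

With a real file, the same helper could be fed the open file object, e.g. `with open('filtered.txt') as infile: columns = to_columns(infile)`. Each entry of `columns` holds the actual items of that column as strings.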


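with open('d.txt', 'r') as infile:
    lines = infile.readlines()

def is_event(line):
    # Non-uniform rows have '0' in the second whitespace-separated field.
    fields = line.split()
    return len(fields) > 1 and fields[1] == '0'

# Index of the first non-uniform row (len(lines) if there is none).
start = next((i for i, line in enumerate(lines) if is_event(line)), len(lines))

# Skip the consecutive run of non-uniform rows.
end = start
while end < len(lines) and is_event(lines[end]):
    end += 1

with open('filtered_part1.txt', 'w') as outfile:
    outfile.writelines(lines[:start])
with open('filtered_part2.txt', 'w') as outfile:
    outfile.writelines(lines[end:])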

The code above is an ugly yet workable way to separate the data: it finds the rows whose second field is 0 and skips them, writing the rows before and after them to separate files.

If you would like to omit the non-uniform lines, you can simply check the number of fields in each row:

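# 'raw.txt' is a placeholder name for the input file.
with open('raw.txt') as infile:
    lines = infile.readlines()

rows = []
for line in lines:
    row = line.split()
    # Uniform rows have exactly 10 whitespace-separated fields.
    if len(row) == 10:
        rows.append(row)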
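The same field-count check can also be used to separate the data before and after the non-uniform rows. The following is a rough sketch using `itertools.groupby`; the function name `split_segments` and the sample lines are illustrative assumptions, and any row without exactly 10 fields (including a blank line) is treated as a separator.

```python
from itertools import groupby


def split_segments(lines, n_fields=10):
    """Group consecutive uniform rows; non-uniform rows act as separators."""
    rows = (line.split() for line in lines)
    return [
        list(group)
        for is_uniform, group in groupby(rows, key=lambda r: len(r) == n_fields)
        if is_uniform
    ]


# Shortened sample based on the data in the question.
sample = [
    "1386865618963   1   M   subject_avatar  3.636229    1.000000    5.422941    30.200327   0.000000    0.000000",
    "1386865618965   1   M   subject_avatar  3.631835    1.000000    5.415390    30.200327   0.000000    0.000000",
    "1386865618970   0   D   4345048336",
    "1386865618970   0   D   4345763672",
    "1386865618972   1   M   subject_avatar  3.621182    1.000000    5.396492    30.200327   0.000000    0.000000",
]

segments = split_segments(sample)
before, after = segments  # valid here because the sample has exactly two segments

# A specific column from the rows before the non-uniform block.
before_column_4 = [float(row[4]) for row in before]
```

If the file contains several non-uniform blocks, `segments` will simply hold more than two lists, so the two-name unpacking above only applies to data shaped like the sample.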
