Read unstructured data in pandas

I have the following unstructured data in a text file, which is message log data from Discord.

[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545

Essentially, my goal is to parse the data, extract all the join and leave messages, and then plot the growth of members in the server. Some of the messages are irrelevant, and each message block has a varying number of lines. The result I am after looks something like this:

Date        Member-change
4/24/2020   2
4/25/2020   -1
4/26/2020   3

I've tried parsing the data in a loop, but because the data is unstructured and the blocks have varying numbers of lines, I'm confused about how to set it up. Is there a way to ignore all blocks without "Member Joined" or "Member Left"?

It is structured text, just not in the way you are expecting. A file can be structured if the text is written in a consistent format, even though we normally think of structured text as field-based.

The records are separated by a date-based header, followed by the {Embed} keyword, followed by the command you are interested in.
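To make that record layout concrete, here is a minimal sketch of how a header pattern can cut the text into records. The `sample` string and the `HEADER` name are illustrative assumptions, not part of the original log or answer; the pattern assumes every header looks like `[DD-Mon-YY HH:MM AM/PM]` at the start of a line.

```python
import re

# A shortened sample in the same shape as the log in the question (illustrative only)
sample = (
    "[12-Nov-19 01:35 PM] Dyno#0000\n\n{Embed}\nMember Left\n@Unknown User\n"
    "ID: 171111183099756545\n\n"
    "[16-Nov-19 11:25 PM] Dyno#0000\n\n{Embed}\nMember Joined\n@User\n"
    "ID: 171111183099756545\n"
)

# Assumed header shape: "[", a date such as 12-Nov-19, a time, "]" at the start of a line
HEADER = re.compile(r"^\[(\d{2}-\w{3}-\d{2}) \d{2}:\d{2} [AP]M\]", re.MULTILINE)

# Because the pattern has a capture group, re.split keeps the captured dates
parts = HEADER.split(sample)

# parts[0] is the text before the first header (empty here);
# after that the list alternates date, body, date, body, ...
dates = parts[1::2]
bodies = parts[2::2]

for date, body in zip(dates, bodies):
    # Non-empty lines of the body: author, "{Embed}", command, details...
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    print(date, lines[2])
```

The third non-empty line of each body is the command ("Member Left", "Member Joined", and so on), which is the only part needed to decide whether a record is relevant.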

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

# Read the whole log into one string
# ("messages.txt" is a placeholder; use the path of your own export)
with open("messages.txt", encoding="utf-8") as f:
    message_log = f.read()

# Get rid of the newlines for convenience (runs of whitespace become one space)
message = " ".join(message_log.split())

# Use a regular expression to split the log file into records.
# The capture group makes re.split keep each date in the result.
rx = r"\[(\d{2}-\w{3}-\d{2})"
parts = re.split(rx, message)

# re.split leaves the text before the first date (normally an empty string)
# as the first entry
parts.pop(0)

# The list now alternates date, body, date, body, ...
# so pair each date with the record body it refers to
records = list(zip(parts[0::2], parts[1::2]))

# Now that a nice clean record list exists, it is possible to get the user count
n = 0
for log_date, body in records:
    # Work out whether the count should be incremented or decremented
    if "{Embed} Member Joined" in body:
        n += 1
    elif "{Embed} Member Left" in body:
        n -= 1
    else:
        # Ignore records that are neither joins nor leaves
        continue
    # Print the date of the record and the running member count
    print(log_date + " " + str(n))
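
The code above prints a running total, whereas the table in the question shows the net change per day. A minimal sketch of that aggregation follows, using only the standard library. The function name `daily_member_change` is hypothetical, and the sketch assumes the dates always use English month abbreviations in the `DD-Mon-YY` form seen in the sample.

```python
import re
from collections import Counter
from datetime import datetime


def daily_member_change(log_text):
    """Return {date: joins minus leaves} for every day with a join or leave."""
    # Flatten whitespace, then split on the "[DD-Mon-YY" header;
    # the capture group keeps each date in the result
    parts = re.split(r"\[(\d{2}-\w{3}-\d{2})", " ".join(log_text.split()))
    changes = Counter()
    # parts[0] is the text before the first header; then date, body, date, body, ...
    for raw_date, body in zip(parts[1::2], parts[2::2]):
        if "{Embed} Member Joined" in body:
            delta = 1
        elif "{Embed} Member Left" in body:
            delta = -1
        else:
            continue  # irrelevant record, e.g. a deleted-message notice
        # %b matches abbreviated month names such as "Nov" (locale dependent)
        day = datetime.strptime(raw_date, "%d-%b-%y").date()
        changes[day] += delta
    return dict(sorted(changes.items()))


# The three records from the question
sample = """[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545
"""

for day, delta in daily_member_change(sample).items():
    print(day.isoformat(), delta)
```

If pandas is available, the returned mapping could be wrapped in `pd.Series(...)` and accumulated with `.cumsum()` to get the growth curve for plotting; that step is left out here to keep the sketch dependency-free.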
