Read unstructured data in pandas

I have the following unstructured data in a text file, which is message log data from Discord.

[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545

My goal is to parse the data, extract all the join and leave messages, and then plot the growth of members in the server. Some of the messages are irrelevant, and the message blocks vary in length. The output I am after looks like this:

Date        Member-change
4/24/2020   2
4/25/2020   -1
4/26/2020   3

I've tried parsing the data in a loop, but because the data is unstructured and the blocks have varying numbers of rows, I'm not sure how to set it up. Is there a way to ignore all blocks that contain neither "Member Joined" nor "Member Left"?

It is structured text, just not in the way you are expecting. A file can be structured if the text is written in a consistent format, even though we normally think of structured text as field-based.

Each record starts with a date-based header, followed by the {Embed} keyword, followed by the event you are interested in (Member Joined or Member Left).
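To see why splitting on the date header works, here is a small sketch. The `sample` string is a shortened, made-up stand-in for the log with its newlines already replaced by spaces. Because the pattern is wrapped in a capturing group, `re.split` keeps the matched header starts, so dates and record bodies alternate in the result.

```python
import re

# Two shortened records in the same layout as the log in the question
# (illustrative data, newlines already replaced by spaces).
sample = (
    "[12-Nov-19 01:35 PM] Dyno#0000 {Embed} Member Left "
    "[16-Nov-19 11:25 PM] Dyno#0000 {Embed} Member Joined"
)

# The parentheses form a capturing group, so the matched text is kept
# in the output list instead of being thrown away.
parts = re.split(r"(\[\d{2}-\w{3}-\d{2})", sample)

# parts[0] is the (empty) text before the first header; after that,
# header starts and record bodies alternate.
for part in parts:
    print(repr(part))
```

This is why the code has to drop the first entry and then glue each date back onto the record that follows it.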

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import re

# Read the whole log into one string.
# "discord_log.txt" is a placeholder name; use the path to your own export.
with open("discord_log.txt", encoding="utf-8") as f:
    message_log = f.read()

# Get rid of the newlines for convenience
message = message_log.replace("\n", " ")

# Use a regular expression to split the log file into records.
# The pattern is a capturing group, so re.split keeps each matched date.
rx = r"(\[\d{2}-\w{3}-\d{2})"
replaced = re.split(rx, message)

# re.split leaves the text before the first date (normally an empty string)
# as the first entry
replaced.pop(0)

# The date and the rest of its record are now in alternating list entries,
# so merge each pair back together into one record
merge_list = [date + rest for date, rest in zip(replaced[0::2], replaced[1::2])]

# Now a nice clean record list exists, it is possible to get the user count
n = 0
for z in merge_list:
    # Split the record into date and context (raw string avoids invalid-escape warnings)
    log_date = re.split(r"(\d{2}-\w{3}-\d{2})", z)
    # Work out whether the count should be incremented or decremented
    if "{Embed} Member Joined" in z:
        n = n + 1
    elif "{Embed} Member Left" in z:
        n = n - 1
    else:
        continue
    # log_date[1] is needed to get the date from the record
    print(log_date[1] + " " + str(n))
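
The loop above prints a running total after each event. The question asks for the net change per day, so here is a minimal sketch of that variant. It is not part of the original answer: `daily_member_change` is a name made up for illustration, it assumes every header looks like `[DD-Mon-YY HH:MM AM/PM]`, and it assumes the phrases "Member Joined" and "Member Left" only occur in join and leave embeds. Month names are parsed with `%b`, which depends on an English locale.

```python
import re
from collections import Counter
from datetime import datetime


def daily_member_change(log_text):
    """Return {date: joins minus leaves} for each day with a join or leave."""
    # Assumed header layout: [06-Nov-19 03:36 PM]
    header = re.compile(r"\[(\d{2}-\w{3}-\d{2}) \d{2}:\d{2} [AP]M\]")
    matches = list(header.finditer(log_text))
    changes = Counter()
    for i, m in enumerate(matches):
        # A record body runs from the end of its header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        body = log_text[m.end():end]
        if "Member Joined" in body:
            delta = 1
        elif "Member Left" in body:
            delta = -1
        else:
            continue  # ignore every other kind of message
        day = datetime.strptime(m.group(1), "%d-%b-%y").date()
        changes[day] += delta
    return dict(sorted(changes.items()))


# The three records from the question, used as test input.
sample = """[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545
"""

result = daily_member_change(sample)
for day, change in result.items():
    print(day.isoformat(), change)
```

If pandas is available, the returned dictionary should be usable directly as `pandas.Series(result)`, and calling `.cumsum()` on that series would give the running membership change to plot.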
