Read unstructured data in pandas
I have the following unstructured data in a text file. It is message log data from Discord.
[06-Nov-19 03:36 PM] Dyno#0000
{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545
[12-Nov-19 01:35 PM] Dyno#0000
{Embed}
Member Left
@Unknown User
ID: 171111183099756545
[16-Nov-19 11:25 PM] Dyno#0000
{Embed}
Member Joined
@User
ID: 171111183099756545
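Basically, my goal is to parse the data, extract all the join and leave messages, and then plot the growth of members in the server. Some of the messages are irrelevant, and the message blocks have different numbers of lines. Ideally I would end up with something like this: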
Date Member-change
4/24/2020 2
4/25/2020 -1
4/26/2020 3
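I tried parsing the data in a loop, but because the data is unstructured and the blocks vary in length, I'm confused about how to set it up. Is there a way to ignore all blocks that don't contain "Member Joined" or "Member Left"?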
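It is structured text, just not in the way you expect. A file can be structured as long as the text is written in a consistent format, even though we usually think of structured text as field-based.
Here the records are separated by a date-based header, followed by the {Embed} keyword, and then the command you are interested in.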
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Assumption: the log was saved to a text file; "discord_log.txt" is a placeholder name
with open("discord_log.txt", encoding="utf-8") as f:
    message_log = f.read()

# Get rid of the newlines for convenience
message = message_log.replace("\n", " ")

# Use a regular expression to split the log file into records
rx = r"(\[\d{2}-\w{3}-\d{2})"
replaced = re.split(rx, message)

# re.split leaves whatever precedes the first date (normally an empty string) as the first entry
replaced.pop(0)

# Each record will be a separate entry in a list.
# Unfortunately the date component gets put in a different entry of the list
# from the record it refers to, so the two need to be merged back together
merge_list = [date + body for date, body in zip(replaced[0::2], replaced[1::2])]

# Now that a nice clean record list exists, it is possible to get the user count
n = 0
for z in merge_list:
    # Split the record into date and context
    log_date = re.split(r"(\d{2}-\w{3}-\d{2})", z)
    # Work out whether the count should be incremented or decremented
    if "{Embed} Member Joined" in z:
        n += 1
    elif "{Embed} Member Left" in z:
        n -= 1
    else:
        # Ignore every record that is neither a join nor a leave
        continue
    # log_date[1] is needed to get the date from the record
    print(log_date[1] + " " + str(n))
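The script above prints a running total per event. To get closer to the Date / Member-change table from the question, the same idea can be sketched line by line with a per-day tally. This is only a sketch under some assumptions: the function name `daily_member_change` and the `HEADER` pattern are made up for illustration, every record is assumed to start with a header like `[16-Nov-19 11:25 PM]`, the event is assumed to sit on a line of its own, and the month abbreviations are assumed to be English (`%b` is locale-dependent).

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical pattern for the header that opens each record,
# e.g. "[16-Nov-19 11:25 PM] Dyno#0000"
HEADER = re.compile(r"^\[(\d{2}-\w{3}-\d{2}) \d{2}:\d{2} [AP]M\]")


def daily_member_change(log_text):
    """Return {date: net member change}, skipping records that are neither joins nor leaves."""
    changes = Counter()
    current_date = None
    for line in log_text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            # A new record starts here; remember its date
            current_date = datetime.strptime(header.group(1), "%d-%b-%y").date()
        elif current_date is not None and line == "Member Joined":
            changes[current_date] += 1
        elif current_date is not None and line == "Member Left":
            changes[current_date] -= 1
    # Sort by date so the result reads chronologically
    return dict(sorted(changes.items()))


# The sample log from the question
sample = """[06-Nov-19 03:36 PM] Dyno#0000
{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545
[12-Nov-19 01:35 PM] Dyno#0000
{Embed}
Member Left
@Unknown User
ID: 171111183099756545
[16-Nov-19 11:25 PM] Dyno#0000
{Embed}
Member Joined
@User
ID: 171111183099756545"""

daily = daily_member_change(sample)
for day, change in daily.items():
    print(day.strftime("%m/%d/%Y"), change)
```

If pandas is available, the resulting dictionary can be wrapped in a `pandas.Series` (with the index converted by `pandas.to_datetime`), after which `.cumsum()` gives the member growth over time and `.plot()` draws it. Note that this sketch does not check that the event line directly follows `{Embed}`, so an ordinary message consisting only of "Member Joined" would also be counted.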