
Read unstructured data in pandas

I have the following unstructured data in a text file. It is message log data from Discord.

[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545

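Basically, my goal is to parse the data, extract all the join and leave messages, and then plot the growth of members in the server. Some of the messages are irrelevant, and each message block has a different number of lines.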

Date        Member-change
4/24/2020   2
4/25/2020   -1
4/26/2020   3

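I have tried parsing the data in a loop, but because the data is unstructured and the blocks have different numbers of lines, I am confused about how to set it up. Is there a way to ignore all the blocks that do not contain "Member Joined" or "Member Left"?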

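It is structured text, just not structured in the way you expect. A file can be structured as long as its text is written in a consistent format, even though we usually think of structured text as being field-based.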

The fields here are separated by a date-based header, followed by the {Embed} keyword and then the command you are interested in.
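As a rough illustration of that layout, one regular expression can describe a whole record: the bracketed timestamp and author, the {Embed} marker, and the line that follows it. This is only a sketch that assumes every record follows the sample in the question; the names RECORD and parse_records are made up for the example.

```python
import re

# Assumed layout, taken from the sample in the question:
#   "[DD-Mon-YY HH:MM AM] author", a blank line, "{Embed}", then the event line
RECORD = re.compile(
    r"^\[(\d{2}-\w{3}-\d{2}) [^\]]*\] .*\n"  # header line; group 1 is the date
    r"\s*\{Embed\}\n"                        # blank line(s), then the marker
    r"(.*)$",                                # group 2 is the event line
    re.MULTILINE,
)

def parse_records(text):
    """Return a (date, event) tuple for every record found in the log text."""
    return [(m.group(1), m.group(2)) for m in RECORD.finditer(text)]

sample = """[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**
Author: ? | Message ID: 171111183099756545

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User
ID: 171111183099756545

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
ID: 171111183099756545
"""

records = parse_records(sample)
# Keep only the join and leave events; every other block is ignored
membership = [r for r in records if r[1] in ("Member Joined", "Member Left")]
print(membership)
```

Filtering on the captured event line is what lets the irrelevant blocks drop out, however many lines each block has.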

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

# Assumption: the log was exported to a text file; "messages.txt" is a placeholder name
with open("messages.txt", encoding="utf-8") as f:
    message_log = f.read()

# Get rid of the newlines for convenience
message = message_log.replace("\n", " ")

# Use a regular expression to split the log file into records.
# Because the pattern is a capture group, re.split keeps each matched
# date header as its own entry in the result.
rx = r"(\[\d{2}-\w{3}-\d{2})"
replaced = re.split(rx, message)

# The first entry is whatever precedes the first header (normally blank)
replaced.pop(0)

# The date header ends up in a separate list entry from the record it
# refers to, so pair each header with its record and merge them back together
merge_list = [header + body for header, body in zip(replaced[0::2], replaced[1::2])]

# Now that a clean record list exists, it is possible to get the user count
n = 0
for record in merge_list:
    # Work out whether the count should be incremented or decremented
    if "{Embed} Member Joined" in record:
        n += 1
    elif "{Embed} Member Left" in record:
        n -= 1
    else:
        # Not a join or leave message, so skip the whole block
        continue
    # Every record starts with its header, so the first date-like match is the record's date
    log_date = re.search(r"\d{2}-\w{3}-\d{2}", record).group(0)
    # Print the date together with the running member count
    print(log_date + " " + str(n))
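
The script above prints a running total. To get the per-day Date / Member-change table described in the question, the parsed events could be loaded into pandas and grouped by date. The following is a sketch rather than a tested solution for real logs: the function name daily_member_change and the shortened sample_log are invented for the example, and it assumes the event line always comes directly after {Embed}.

```python
import re
import pandas as pd

# Shortened sample in the question's format; a real log would be read from a file
sample_log = """[06-Nov-19 03:36 PM] Dyno#0000

{Embed}
Server
**Message deleted in #reddit-feed**

[12-Nov-19 01:35 PM] Dyno#0000

{Embed}
Member Left
@Unknown User

[16-Nov-19 11:25 PM] Dyno#0000

{Embed}
Member Joined
@User
"""

HEADER = re.compile(r"^\[(\d{2}-\w{3}-\d{2}) ")      # "[DD-Mon-YY ..." header line
CHANGES = {"Member Joined": 1, "Member Left": -1}    # every other event is ignored

def daily_member_change(text):
    """Return a Series of net member change per day, indexed by date."""
    rows, date, prev = [], None, ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        header = HEADER.match(line)
        if header:
            date = header.group(1)  # remember the date of the current block
        elif prev == "{Embed}" and line in CHANGES:
            rows.append((date, CHANGES[line]))
        prev = line
    df = pd.DataFrame(rows, columns=["Date", "Member-change"])
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
    return df.groupby("Date")["Member-change"].sum()

daily = daily_member_change(sample_log)
growth = daily.cumsum()  # running member count relative to the start of the log
print(daily)
```

With matplotlib installed, growth.plot() would then draw the member growth over time.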

