
How can I extract data from multiple text lines and save the result to JSON?

Suppose I have text from a log file in the following format:

DEBUG: {\"id\":12311,\"pool_num\":\"4125441212441893\",\"full_name\":\"john doe\",\"mobile\":\"000000\","image_1\":\"upload\\/d7379280d549499dd9c948341298703ee.jpeg\",\"image_2\":\"upload\\/4a190fb8941a3d746cff01aa945b.jpeg\",\"image_3\":\"upload\\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg\"}
DEBUG: {\"id\":12312,\"pool_num\":\"89451222214511221\",\"full_name\":\"jane doe\",\"mobile\":\"000000\","image_1\":\"upload\\/d7379280d5494asdasd9c948341298123.jpeg\",\"image_2\":\"upload\\/4a190fb89asd123746cff01aa945b.jpeg\",\"image_3\":\"upload\\/3afd55aebb4dadasd15b9ac335dd9236661.jpeg\"}
DEBUG: {\"id\":12313,\"pool_num\":\"12312345612312312\",\"full_name\":\"smith doe\",\"mobile\":\"000000\","image_1\":\"upload\\/d7379280d549499dd9c948341298701551.jpeg\",\"image_2\":\"upload\\/123easfdsdagdfhdf213432123123.jpeg\",\"image_3\":\"upload\\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg\"}
DEBUG: {\"id\":12314,\"pool_num\":\"82123423444112345\",\"full_name\":\"adam doe\",\"mobile\":\"000000\","image_1\":\"upload\\/d7379280d549499dd9c9483412987666.jpeg\",\"image_2\":\"upload\\/asfda1234235we3rtsdasdasdah456.jpeg\",\"image_3\":\"upload\\/3afd55aebb4d1461a4e15b9ac335dd94216.jpeg\"}

Currently I can extract some of the data with this regular expression:

\b(?:pool_num|full_name|image_1|image_2|image_3)\\\":\\\"([^\"]+)

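Demo: https://regex101.com/r/ZmXaVl/1

However, the captured text still contains "\\", so it is not clean yet.

Question

I want to extract clean values for pool_num, full_name, image_1, image_2 and image_3, and save them in JSON format to a .txt file.

My expected output is: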

[
    {
        "pool_num" : 4125441212441893,
        "full_name" : "john doe",
        "image_1" : "d7379280d549499dd9c948341298703ee.jpeg",
        "image_2" : "4a190fb8941a3d746cff01aa945b.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
    },
    {
        "pool_num" : 89451222214511221,
        "full_name" : "jane doe",
        "image_1" : "d7379280d5494asdasd9c948341298123.jpeg",
        "image_2" : "4a190fb89asd123746cff01aa945b.jpeg",
        "image_3" : "3afd55aebb4dadasd15b9ac335dd9236661.jpeg"
    },
    {
        "pool_num" : 12312345612312312,
        "full_name" : "smith doe",
        "image_1" : "d7379280d549499dd9c948341298701551.jpeg",
        "image_2" : "123easfdsdagdfhdf213432123123.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
    },
    {
        "pool_num" : 82123423444112345,
        "full_name" : "adam doe",
        "image_1" : "d7379280d549499dd9c9483412987666.jpeg",
        "image_2" : "asfda1234235we3rtsdasdasdah456.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd94216.jpeg"
    }
]

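What is the best way to get the desired output in Python?

Here is one possible solution. It picks out the log lines that start with 'DEBUG: ', takes the JSON part of each line, and loads it with the json module, as suggested in the comment by @Tomerikoo.

This produces the expected output format listed in the question.

This solution depends on each line being prefixed with "DEBUG:". It could also be adapted to parse lines with additional prefixes.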
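One way to make that adjustment, sketched here under assumptions, is to ignore whatever precedes the payload and parse from the first `{` on the line. The helper name `parse_payload` and the sample lines are hypothetical, and the sketch assumes each line holds at most one JSON object that runs to the end of the line.

```python
import json


def parse_payload(line):
    """Return the JSON object embedded in a log line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None  # no JSON payload on this line
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # text after "{" is not valid JSON


# Made-up lines with different prefixes (assumption, not from the original log).
lines = [
    'DEBUG: {"id": 1, "full_name": "john doe"}',
    '2024-01-01 12:00:00 DEBUG: {"id": 2, "full_name": "jane doe"}',
    'INFO: nothing to parse here',
]
parsed = [p for p in map(parse_payload, lines) if p is not None]
```

Lines without a parsable object are skipped silently here; logging them instead may be preferable when malformed input should be noticed.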

If this approach solves the problem, it should be more resilient than a regex-based solution.

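import json
import pprint

pp = pprint.PrettyPrinter(indent=4)

# `log` is assumed to hold the full log text as a single string.
mydata = []
for line in log.splitlines():
    if line.startswith("DEBUG: {"):
        # Keep everything after the first "DEBUG: " prefix and parse it as JSON.
        json_string = line.split("DEBUG: ", 1)[1]
        mydata.append(json.loads(json_string))

pp.pprint(mydata)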
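The snippet above prints every key of each record. To get closer to the output requested in the question (only the five fields, bare file names, and a file on disk), a sketch along the following lines might work. It assumes the log text is already available as a Python string with plain `"` quotes. The helper name `extract_records`, the shortened sample line, and the output file name `result.txt` are illustrative assumptions, not part of the original answer.

```python
import json
import os
import tempfile

WANTED = ("pool_num", "full_name", "image_1", "image_2", "image_3")


def extract_records(log_text):
    """Parse 'DEBUG: {...}' lines and keep only the wanted, cleaned fields."""
    records = []
    for line in log_text.splitlines():
        if not line.startswith("DEBUG: {"):
            continue
        raw = json.loads(line.split("DEBUG: ", 1)[1])
        record = {}
        for key in WANTED:
            if key not in raw:
                continue  # tolerate records that lack a field
            value = raw[key]
            if key == "pool_num":
                value = int(value)  # the expected output shows a number
            elif key.startswith("image_"):
                value = value.rsplit("/", 1)[-1]  # drop the "upload/" directory
            record[key] = value
        records.append(record)
    return records


# Shortened, made-up line in the same shape as the log in the question.
sample_log = (
    'DEBUG: {"id":1,"pool_num":"42","full_name":"john doe","mobile":"000000",'
    '"image_1":"upload\\/a.jpeg","image_2":"upload\\/b.jpeg","image_3":"upload\\/c.jpeg"}\n'
    'INFO: something else\n'
)

records = extract_records(sample_log)

# Written to the temp directory so the sketch leaves the working directory alone.
out_path = os.path.join(tempfile.gettempdir(), "result.txt")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(records, fh, indent=4)
```

The `\/` sequences need no special handling because `json.loads` already decodes them to `/`. Converting `pool_num` with `int` is safe in Python, but other JSON consumers may lose precision on integers that long, so keeping it as a string is a reasonable alternative.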
