Python collections.Counter and excluding content from JSON

Question

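I want to create a visualization of the words most commonly used between me and my girlfriend on Facebook. I downloaded all of our messages directly from FB as a JSON file, and the counter itself works fine.

But:

Counter also counts element names from the JSON, such as "sender_name", and the 13-digit timestamps
The JSON file is missing proper UTF encoding: I get escape sequences hard-coded into the words (see the "content" field in the sample below)

How do I exclude short, meaningless words like "you, I, a, but", etc.?

For the first problem, I tried creating a dictionary of words to exclude, but I don't even know how to go about excluding them. Removing the timestamp numbers is a further problem, because they are not constant.

For the second problem, I tried opening the file in a text editor and replacing the symbol codes, but because of the file's size (over 1.5 million lines) it crashed every time.

Here is the code I use to print the most common words: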

import re
import collections
import json

file = open('message.json', encoding="utf8")
a = file.read()

words = re.findall(r'\w+', a)

most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)

The structure of the JSON file looks like this:

{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },

Answer 1

The problem is that you are running findall over the whole file. Do the following instead:

import re
import collections
import json


def words(s):
    # For str patterns in Python 3, \w+ already matches Unicode word characters
    return re.findall(r'\w+', s)

# json.load parses the file, so keys and timestamps never reach the regex
with open('message.json', encoding="utf8") as file:
    data = json.load(file)

counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))
most_common = counts.most_common(50)
print(most_common)

Output

[('siä', 1), ('ci', 1), ('podobajä', 1)]

This output is for a file with the following content (a list of JSON objects):

[{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
}]

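Explanation

json.load loads the file contents into data, a list of dictionaries. The code then iterates over those dictionaries and uses the words function together with Counter to count the words in each 'content' field.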
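To make the difference concrete, here is a small sketch comparing the two approaches on an in-memory string instead of message.json. The sample record and the variable names are made up for illustration.

```python
import json
import re

# Hypothetical one-message sample, shaped like the records in the question.
raw = ('[{"sender_name": "xxxxxx", "timestamp_ms": 1540327935616, '
       '"content": "aaa bbb", "type": "Generic"}]')

# Regex over the raw text: keys, names and timestamps all become "words".
tokens_from_raw = re.findall(r"\w+", raw)

# Parse first, then run the regex over the 'content' values only.
tokens_from_content = [
    word
    for entry in json.loads(raw)
    for word in re.findall(r"\w+", entry.get("content", ""))
]

print(tokens_from_raw)
print(tokens_from_content)
```

Only the second list is limited to what was actually typed in the messages.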

Further

To remove words such as I, a and but, see this
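As a rough sketch of that idea, one option is to filter against a set of stop words while counting. The STOP_WORDS set, the count_meaningful helper and the min_length parameter below are hypothetical names chosen for this example; a real list would be much longer and would depend on the language of the messages.

```python
import collections
import re

# Hypothetical, deliberately tiny stop-word list; extend it as needed.
STOP_WORDS = {"you", "i", "a", "but"}

def count_meaningful(texts, stop_words=STOP_WORDS, min_length=1):
    """Count lowercased words, skipping stop words and words shorter than min_length."""
    counts = collections.Counter()
    for text in texts:
        for word in re.findall(r"\w+", text.lower()):
            if word not in stop_words and len(word) >= min_length:
                counts[word] += 1
    return counts

counts = count_meaningful(["You and I", "but a cat and a dog"])
print(counts.most_common(3))
```

Using a set keeps each membership check cheap, which matters for a file with over a million lines.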

Update

Given your file format, you need to change the line data = json.load(file) to data = json.load(file)["messages"], for content like the following:

{
  "participants":[],
  "messages": [
    {
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329382942,
      "content": "aaa",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329262248,
      "content": "aaa",
      "type": "Generic"
    }
  ]
}

The output is:

[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
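The 'siä' and 'podobajä' entries in this output are still mis-decoded, which is the second problem from the question. The sample is consistent with UTF-8 bytes having been written out as individual \u00XX escapes. Assuming that is what happened throughout the file, a possible repair is to turn each character back into a byte via Latin-1 and decode the result as UTF-8. The fix_mojibake helper is a hypothetical name, and the fallback for strings that do not fit the pattern is a design choice of this sketch.

```python
import json

def fix_mojibake(text):
    """Re-decode text whose UTF-8 bytes were stored as separate code points.

    Assumes every affected character is in the range U+0000..U+00FF;
    anything that does not fit is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# The 'content' value from the question, parsed the way json.load would parse it.
content = json.loads('"Podobaj\\u00c4\\u0085 ci si\\u00c4\\u0099"')
fixed = fix_mojibake(content)
print(fixed)
```

Applied to each 'content' value before the words are extracted, this should make the counter see the accented words rather than the broken ones.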

Answer 2

Have you tried reading the JSON as a dictionary and checking the types? You can also look for the unwanted words afterwards and remove them.

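import json
from collections import Counter

# Example placeholder: list every token you want dropped from the result.
UNWANTED_WORDS = {"sender_name", "timestamp_ms", "content", "type"}

def get_words(string):
    # Split on whitespace and lowercase each token.
    return [word.lower() for word in string.split()]

def count_words(json_item):
    # Recursively collect words; numbers (e.g. timestamps) yield nothing.
    if isinstance(json_item, dict):
        # Keys are strings, so they are collected too and filtered out below.
        return [word
                for key, value in json_item.items()
                for word in count_words(key) + count_words(value)]
    elif isinstance(json_item, str):
        return get_words(json_item)
    elif isinstance(json_item, list):
        return [word for item in json_item for word in count_words(item)]
    else:
        return []

with open('message.json', encoding="utf-8") as f:
    json_input = json.load(f)
counter = Counter(count_words(json_input))
result = {key: value for key, value in counter.items() if key not in UNWANTED_WORDS}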
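The after-the-fact removal could also return a Counter rather than a plain dict, so that most_common stays available. The sketch below is one way to do it; the drop_unwanted helper is a hypothetical name, and dropping all-digit tokens is an extra assumption added here to deal with timestamps that end up in the counts.

```python
from collections import Counter

def drop_unwanted(counter, unwanted):
    """Return a new Counter without the unwanted words and without all-digit tokens."""
    return Counter({
        word: count
        for word, count in counter.items()
        if word not in unwanted and not word.isdigit()
    })

# Made-up tokens standing in for the output of a word-collecting function.
raw_counts = Counter(["aaa", "aaa", "sender_name", "1540327935616", "ci"])
cleaned = drop_unwanted(raw_counts, {"sender_name"})
print(cleaned.most_common(2))
```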
