How to use pandas Series / DataFrame to extract data from dict-like objects

Question

This is a homework assignment from school that I was working on...

Basically, I was asked to scan a given directory, find all the .py files in it, and count given attributes for each file: the classes and functions (including methods in classes) defined in the file, plus the file's total lines and characters. Then print all of the data in a table on the terminal.

To print the table, my lecturer suggested using a package called prettytable, although to me it's not pretty at all.

I want to use pandas.
The reason is simple: each file gets counts for its 4 attributes, so a nested dict comes naturally here. And pandas.DataFrame is a perfect fit for recording a nested dict.

Scanning and summarizing are the easy part; what actually got me stuck is how to make the data container flexible and scalable.

The built-in dict doesn't start out with the 4 existing key-value pairs in it, so I built a class CounterAttr(MutableMapping) and use another class FileCounter to create and count every attribute for every file.

However, pandas.DataFrame only recognizes the first layer of this dict-like object. I have read the source files of DataFrame and Series, but I still can't figure out how to solve this.

So my question is:
how can I make pandas.DataFrame/Series extract the data from a dictionary whose values are dict-like objects?

P.S. I'm open to any advice on the following code: coding style, implementation, everything. Much appreciated!

from collections.abc import MutableMapping
from collections import defaultdict
import pandas as pd
import os

class CounterAttr(MutableMapping):
    """ Initialize a dictionary with 4 keys whose values are all 0,

    keys:value
    - 'class': 0
    - 'function': 0
    - 'line': 0
    - 'char': 0

    interfaces to get and set these attributes """

    def __init__(self):
        """ Initially there are 4 attributes in the storage"""
        # key: counted attributes | value: counting number
        self.__dict__ = {'class': 0, 'function': 0, 'line': 0, 'char': 0}

    def __getitem__(self, key):
        if key in self.__dict__:
            return self.__dict__[key]
        else:
            raise KeyError

    def get(self, key, default = None):
        if key in self.__dict__:
            return self.__dict__[key]
        else:
            return default

    def __setitem__(self, key, value):
        self.__dict__[key] = value

    def __delitem__(self, key):
        del self.__dict__[key]

    def __len__(self):
        return len(self.__dict__)

    def __iter__(self):
        return iter(self.__dict__)

    def get_all(self):
        """ return a copy of the internal dict, so the internal data can't get polluted"""
        copy = self.__dict__.copy()
        return copy

    def to_dict(self):
        return self.__dict__

    def __repr__(self):
        return '{0.__class__.__name__}()'.format(self)

class FileCounter(MutableMapping):
    """ Describe the object that stores all the counters for all .py files

    Attributes:
    - _storage: dict mapping each file name to its CounterAttr
    """
    def __init__(self):
        self._storage = dict()

    def __setitem__(self, key, value):
        if key not in self._storage:
            self._storage[key] = value
        else:
            print("Attribute exists!")

    def __getitem__(self, key):
        # create a fresh counter on first access, then always return it
        if key not in self._storage:
            self._storage[key] = CounterAttr()
        return self._storage[key]

    def __delitem__(self, key):
        del self._storage[key]

    def __len__(self):
        return len(self._storage)

    def __iter__(self):
        return iter(self._storage)






def scan_summerize_pyfile(directory, give_me_dict = False):
    """ Scan the given directory, find all .py files, count the classes, funcs, lines, chars in each file
    and print out with a table
    """
    file_counter = FileCounter()


    if os.path.isdir(directory):                                            # if the given directory is a valid one

        os.chdir(directory)                                                 # change the CWD
        print("\nThe current working directory is {}\n".format(os.getcwd()))

        file_lst = os.listdir(directory)                                    # get all files in the CWD

        for a_file in file_lst:                                             # traverse the list and find all pyfiles
            if a_file.endswith(".py"):

                file_counter[a_file] 

                try:
                    open_file = open(a_file, 'r')
                except FileNotFoundError:
                    print("File {0} can't be opened!".format(a_file))

                else:

                    with open_file:
                        for line in open_file:

                            if line.lstrip().startswith("class"):           # count the classes
                                file_counter[a_file]['class'] += 1

                            if line.lstrip().startswith("def"):             # count the functions
                                file_counter[a_file]['function'] += 1

                            file_counter[a_file]['line'] += 1               # count the lines

                            file_counter[a_file]['char'] += len(line)       # count the chars (whitespace included)

    else:
        print("The directory", directory, "does not exist.\nI'm sorry, program ends.")


    return file_counter

# Haven't had the pandas codes part yet

Answer 1

I don't know why you would need something like what you wrote; it all seems over-engineered to me.

Assuming read_file() returns the 4 attributes you want (class, function, line, chars) and you have a list of Python files in list_of_files, you can just do this:

result = []
for file in list_of_files:
    c, f, l, num_c = read_file(file)
    curr_dict = {'class':c, 'function':f, 'line':l, 'chars':num_c}
    result.append(curr_dict)
your_table = pd.DataFrame(result)

That's all you need.

You should generate the list of files and write the function that reads them separately; each distinct task should live in its own function. It definitely helps to separate the logic.
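As a minimal sketch of that separation, the two helpers might look like the following. The name list_py_files is hypothetical, and read_file here is only one possible implementation of the function assumed above; the counting rules (lines starting with "class " or "def ", characters counted with whitespace) are assumptions borrowed from the question's code.

```python
import os


def list_py_files(directory):
    """Return the sorted names of the .py files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".py") and os.path.isfile(os.path.join(directory, name))
    )


def read_file(path):
    """Return (classes, functions, lines, chars) counted for one file."""
    classes = functions = lines = chars = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.lstrip()
            if stripped.startswith("class "):   # a class definition line
                classes += 1
            if stripped.startswith("def "):     # a function or method definition line
                functions += 1
            lines += 1
            chars += len(line)                  # whitespace and newline included
    return classes, functions, lines, chars
```

Because list_py_files returns bare file names, each name would need to be joined with the directory (os.path.join) before being passed to read_file.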

Answer 2

So this is my solution to the question. Instead of fighting with what pandas does, I tried to figure out how to adjust my solution to make it easy for pandas to read my data. Thanks for the advice from @RockyLi.

import os
import pandas as pd

class FileCounter(object):
    """ A class that contains the .py files counted
        - .py files that are found in the given directory
        - attributes counted for each .py file
        - methods that scan and summarize .py files
    """
    def __init__(self, directory):
        self._directory = directory
        self._data = dict()        # key: file name | value: dict of counted attributes
        self._update_data()

    def _read_file(self, filename):
        """ return a dictionary of attributes statistical data

            return type: dictionary
                - key: attributes' name
                - value: counting number of attributes

            it's not available to add a counting attributes interactively
        """

        class_, function_, line_, char_ = 0, 0, 0, 0
        try:
            open_file = open(filename, 'r')
        except FileNotFoundError:
            print("File {0} can't be opened!".format(filename))
        else:

            with open_file:
                for line in open_file:

                    if line.lstrip().startswith("class "):           # count the classes
                        class_ += 1

                    if line.lstrip().startswith("def "):             # count the functions
                        function_ += 1

                    line_ += 1                                       # count the lines

                    char_ += len(line)                               # count the chars (whitespace included)
        return {'class': class_, 'function': function_, 'line': line_, 'char': char_}

    def _scan_dir(self):
        """ return all of the files in the directory
            if the directory is not valid, raise an OSError
        """
        if os.path.isdir(self._directory):
            os.chdir(self._directory)
            return os.listdir(self._directory)

        else:
            raise OSError("The directory doesn't exist!")

    def _find_py(self, lst_of_file):
        """ find all of the .py files in the directory"""
        lst_of_pyfile = list()

        for filename in lst_of_file:
            if filename.endswith('.py'):
                lst_of_pyfile.append(filename)

        return lst_of_pyfile

    def _update_data(self):
        """ manipulate the _data
            this is the ONLY method that manipulates _data
        """
        lst_of_pyfile = self._find_py(self._scan_dir())

        for filename in lst_of_pyfile:
            self._data[filename] = self._read_file(filename)        # only place that manipulates _data

    def pretty_print(self):
        """ Print the data!"""

        df_prettyprint = pd.DataFrame.from_dict(self._data, orient = 'index')

        if not df_prettyprint.empty:
            print(df_prettyprint)
        else:
            print("Oops, seems like you don't get any .py file.\n You must be Java people :p")

    def get_data(self):
        return self._data.copy()                                    # never give them the original data!

This class exposes two interfaces: (A) print the table, and (B) get the data for further use. It also protects the data from being accessed and modified directly.
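For the original question (values that are dict-like objects rather than plain dicts), one possible approach is to convert each value to a built-in dict before handing the data to pandas. The sketch below is only an illustration: to_frame and the Attrs class are hypothetical names, and Attrs merely stands in for a custom mapping such as CounterAttr.

```python
from collections.abc import Mapping

import pandas as pd


def to_frame(counters):
    """Build a DataFrame with one row per outer key (e.g. per file name)."""
    # dict(value) works for any object that implements the mapping protocol,
    # so pandas only ever sees plain nested dicts.
    plain = {name: dict(attrs) for name, attrs in counters.items()}
    return pd.DataFrame.from_dict(plain, orient="index")


class Attrs(Mapping):
    """Hypothetical read-only dict-like wrapper, used only for this demo."""

    def __init__(self, values):
        self._values = dict(values)

    def __getitem__(self, key):
        return self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


counters = {
    "a.py": Attrs({"class": 1, "function": 2, "line": 10, "char": 200}),
    "b.py": {"class": 0, "function": 1, "line": 3, "char": 40},
}
table = to_frame(counters)
print(table)
```

The conversion copies the inner data, so later changes to the dict-like objects are not reflected in the table; that matches the get_data() design above, which also hands out a copy.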
