Pandas/Python: Reading a TXT file that is structured like a dictionary and sorting it into dataframes

Original title: Pandas/Python Drop columns based on syntax ':' existence in cell

Since the solution took a totally different approach, I changed the title.

============ =============

Consider a dataframe read from a txt file.

Data txt file: txt

I removed the unwanted lines that contain "Iteration":

import pandas as pd
data = pd.read_csv('results.txt', sep=" ", header=None)
data = data[~data[1].astype(str).str.startswith('Iteration')]

Now it's like this: pd

I wanted to remove all the columns that contain ":" to get something like: {2, 7}{2, 7}{2, 7}{2, 7}{1, 2, 7}{1, 2, 6, 7} {1, 2, 6, 7}{1, 2, 3, 6, 7}{1, 2, 3, 6, 7} {1, 3, 4, 6, 7}{1, 3, 4, 5, 6} {3, 4, 5, 6, 7}{1, 3, 4, 5, 7} {1, 4, 5, 6, 7}, ....... etc.

and to merge columns such as "{2," and " 7}," into one column "{2,7}". Then I can replace all the column names with timestamps (6:30, 7:00 etc.) to make it a proper table like:

6:30   7:00   7:30   8:00   8:30     9:00
{2,7}  {2,7}  {2,7}  {2,7}  {1,2,7}  {1,2,6,7}
{5,6}  {5,6}  {5,6}  {5,6}  {1,5,6}  {1,4,5,6}
....

My first step now is to remove the columns with ":". I tried these:

data.loc[:, ~(data[0:122].astype(str).apply(lambda x: x.str.contains(':'))).any()]
data.loc[:, ~(data.contains(':')).any()]

Obviously they won't work, since contains() cannot be used on data frames; I am not very familiar with pandas. I searched through many other answers and wasn't able to get it right. What is the correct way to remove columns that have cells containing ":"?
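For reference, a minimal sketch of the column filter attempted above. str.contains is a Series method, so one way is to apply it column by column and then keep the columns where no cell matched. The toy frame here is hypothetical and only stands in for the parsed results.txt; the real column layout may differ.

```python
import pandas as pd

# Hypothetical toy frame standing in for the space-split results.txt
toy = pd.DataFrame({
    0: ["'6:30':", "'6:30':"],
    1: ["{2,", "{5,"],
    2: ["7},", "6},"],
})

# str.contains works on a Series, so apply it to each column in turn
has_colon = toy.astype(str).apply(lambda col: col.str.contains(":", regex=False))

# any() collapses each column to one boolean; keep columns with no ':' at all
cleaned = toy.loc[:, ~has_colon.any()]
print(cleaned)
```

This only answers the literal "drop the columns" step; merging the split "{2," and "7}," pieces back together would still be needed afterwards.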

Or, if my approach is full of unnecessary steps, is there a better way to turn the txt file into a dataframe like this:

6:30   7:00   7:30   8:00   8:30     9:00
{2,7}  {2,7}  {2,7}  {2,7}  {1,2,7}  {1,2,6,7}
{5,6}  {5,6}  {5,6}  {5,6}  {1,5,6}  {1,4,5,6}

For example, by sorting it out when first reading the txt file: loop through the lines and build dictionaries instead? '6:30': {2, 7} ....

Preprocess your file before creating the dataframe:

import pandas as pd
import ast

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(ast.literal_eval(line.split(':', maxsplit=1)[1].strip()))
df = pd.DataFrame.from_dict(data)

Important update: BUG FIX

ast.literal_eval works well here, but I can't guarantee the order of the sets after evaluation:

>>> ast.literal_eval('{8, 2, 3, 7}')
{2, 3, 7, 8}  # We don't want that!

Maybe you can use re.findall to get the same output:

import pandas as pd
import re

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            data.append(dict(re.findall(r"'([^']+)': (\{[^\}]+\})", line[1:-1])))
df = pd.DataFrame.from_dict(data)

The output is still the same, but the cell values are str instead of set:

>>> df
       6:30    7:00    7:30    8:00       8:30          9:00          9:30  ...            15:30            16:00         16:30   17:00   17:30   18:00   18:30
0    {2, 7}  {2, 7}  {2, 7}  {2, 7}  {1, 2, 7}  {1, 2, 6, 7}  {1, 2, 6, 7}  ...  {1, 3, 4, 5, 8}  {1, 3, 4, 5, 8}  {8, 3, 4, 5}  {8, 5}  {8, 5}  {8, 5}  {8, 5}
1    {5, 6}  {5, 6}  {5, 6}  {5, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {1, 2, 3, 7, 8}  {1, 2, 3, 7, 8}  {8, 2, 3, 7}  {8, 3}  {8, 3}  {8, 3}  {8, 3}
2    {8, 1}  {8, 1}  {8, 1}  {8, 1}  {8, 1, 7}  {8, 1, 2, 7}  {8, 1, 2, 7}  ...  {3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}  {3, 4, 5, 6}  {4, 5}  {4, 5}  {4, 5}  {4, 5}
3    {1, 6}  {1, 6}  {1, 6}  {1, 6}  {1, 5, 6}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {2, 3, 5, 7, 8}  {2, 3, 5, 7, 8}  {8, 2, 3, 7}  {2, 7}  {2, 7}  {2, 7}  {2, 7}
4    {3, 4}  {3, 4}  {3, 4}  {3, 4}  {3, 4, 5}  {2, 3, 4, 5}  {2, 3, 4, 5}  ...  {1, 5, 6, 7, 8}  {1, 5, 6, 7, 8}  {8, 1, 6, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 7}
..      ...     ...     ...     ...        ...           ...           ...  ...              ...              ...           ...     ...     ...     ...     ...
142  {5, 6}  {5, 6}  {5, 6}  {5, 6}  {5, 6, 7}  {2, 5, 6, 7}  {2, 5, 6, 7}  ...  {1, 3, 4, 7, 8}  {1, 3, 4, 7, 8}  {8, 1, 3, 4}  {1, 4}  {1, 4}  {1, 4}  {1, 4}
143  {5, 6}  {5, 6}  {5, 6}  {5, 6}  {3, 5, 6}  {2, 3, 5, 6}  {2, 3, 5, 6}  ...  {1, 3, 4, 7, 8}  {1, 3, 4, 7, 8}  {8, 1, 4, 7}  {8, 1}  {8, 1}  {8, 1}  {8, 1}
144  {1, 2}  {1, 2}  {1, 2}  {1, 2}  {1, 2, 5}  {8, 1, 2, 5}  {8, 1, 2, 5}  ...  {3, 4, 5, 6, 7}  {3, 4, 5, 6, 7}  {3, 4, 6, 7}  {4, 7}  {4, 7}  {4, 7}  {4, 7}
145  {8, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 5, 7}  {8, 5, 6, 7}  {8, 5, 6, 7}  ...  {1, 2, 3, 4, 5}  {1, 2, 3, 4, 5}  {1, 2, 3, 4}  {2, 3}  {2, 3}  {2, 3}  {2, 3}
146  {1, 4}  {1, 4}  {1, 4}  {1, 4}  {1, 4, 5}  {1, 4, 5, 6}  {1, 4, 5, 6}  ...  {2, 3, 5, 7, 8}  {2, 3, 5, 7, 8}  {8, 2, 3, 7}  {8, 7}  {8, 7}  {8, 7}  {8, 7}

[147 rows x 25 columns]
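If the string cells need to become real Python objects while keeping the order in which the numbers were written, one option is to pull the integers out with a second regex and store them as lists. This is only a sketch: the sample line below is an assumed format inferred from the regex above, not copied from the actual file.

```python
import re

# Assumed line format, inferred from the regex in the answer (hypothetical sample)
line = "Shifts: {'6:30': {8, 2}, '7:00': {8, 2, 3, 7}}"

# Same pattern as above: capture the time key and the raw '{...}' text
pairs = re.findall(r"'([^']+)': (\{[^\}]+\})", line)

# Convert each '{...}' string into a list of ints, preserving the written order
row = {time: [int(n) for n in re.findall(r"\d+", cell)] for time, cell in pairs}
print(row)
```

Appending each such row to data and calling pd.DataFrame(data) would then give list cells instead of str cells.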

Update 2:

The order of a set after evaluating its string representation is not guaranteed, except (???) for numbers between -5 and 256 (CPython only). Even if the numbers are in this interval, set is an unordered collection:

>>> print(eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}

>>> print(ast.literal_eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}

>>> print(eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}

>>> print(ast.literal_eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}

>>> print(ast.literal_eval('{8, 2000, 3000, 7}'))
{8, 7, 3000, 2000}

>>> print(eval('{8, 2000, 3000, 7}'))
{8, 2000, 3000, 7}
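A small sketch of why the written order cannot be recovered from a set, using only behaviour the language guarantees (the printed order of a set is deliberately not shown, since it is an implementation detail):

```python
import ast

a = ast.literal_eval("{8, 2, 3, 7}")
b = ast.literal_eval("{2, 3, 7, 8}")

# Two sets with the same members are equal, whatever order they were written in
print(a == b)  # True

# A list literal keeps the written order
as_list = ast.literal_eval("[8, 2, 3, 7]")
print(as_list)  # [8, 2, 3, 7]

# If only a stable order is needed (not the original one), sort explicitly
print(sorted(a))  # [2, 3, 7, 8]
```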

A set may not maintain order, but a list will, so let's change all the sets to lists before evaluating them:

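import ast
import pandas as pd

data = []
with open('results.txt') as f:
    for line in f:
        if line.startswith('Shifts: '):
            # drop the leading "Shifts: {" and the trailing "}"
            line = line.strip()[9:-1]
            # turn every inner set literal into a list literal
            line = line.replace('{', '[').replace('}', ']')
            # restore the outer braces so the line is a dict literal again
            line = '{' + line + '}'
            data.append(ast.literal_eval(line))

df = pd.DataFrame(data)
print(df.head())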

Output:

       6:30    7:00    7:30    8:00       8:30          9:00          9:30  ...            15:30            16:00         16:30   17:00   17:30   18:00   18:30
0    [2, 7]  [2, 7]  [2, 7]  [2, 7]  [1, 2, 7]  [1, 2, 6, 7]  [1, 2, 6, 7]  ...  [1, 3, 4, 5, 8]  [1, 3, 4, 5, 8]  [8, 3, 4, 5]  [8, 5]  [8, 5]  [8, 5]  [8, 5]
1    [5, 6]  [5, 6]  [5, 6]  [5, 6]  [1, 5, 6]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [1, 2, 3, 7, 8]  [1, 2, 3, 7, 8]  [8, 2, 3, 7]  [8, 3]  [8, 3]  [8, 3]  [8, 3]
2    [8, 1]  [8, 1]  [8, 1]  [8, 1]  [8, 1, 7]  [8, 1, 2, 7]  [8, 1, 2, 7]  ...  [3, 4, 5, 6, 7]  [3, 4, 5, 6, 7]  [3, 4, 5, 6]  [4, 5]  [4, 5]  [4, 5]  [4, 5]
3    [1, 6]  [1, 6]  [1, 6]  [1, 6]  [1, 5, 6]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [2, 3, 5, 7, 8]  [2, 3, 5, 7, 8]  [8, 2, 3, 7]  [2, 7]  [2, 7]  [2, 7]  [2, 7]
4    [3, 4]  [3, 4]  [3, 4]  [3, 4]  [3, 4, 5]  [2, 3, 4, 5]  [2, 3, 4, 5]  ...  [1, 5, 6, 7, 8]  [1, 5, 6, 7, 8]  [8, 1, 6, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 7]
..      ...     ...     ...     ...        ...           ...           ...  ...              ...              ...           ...     ...     ...     ...     ...
142  [5, 6]  [5, 6]  [5, 6]  [5, 6]  [5, 6, 7]  [2, 5, 6, 7]  [2, 5, 6, 7]  ...  [1, 3, 4, 7, 8]  [1, 3, 4, 7, 8]  [8, 1, 3, 4]  [1, 4]  [1, 4]  [1, 4]  [1, 4]
143  [5, 6]  [5, 6]  [5, 6]  [5, 6]  [3, 5, 6]  [2, 3, 5, 6]  [2, 3, 5, 6]  ...  [1, 3, 4, 7, 8]  [1, 3, 4, 7, 8]  [8, 1, 4, 7]  [8, 1]  [8, 1]  [8, 1]  [8, 1]
144  [1, 2]  [1, 2]  [1, 2]  [1, 2]  [1, 2, 5]  [8, 1, 2, 5]  [8, 1, 2, 5]  ...  [3, 4, 5, 6, 7]  [3, 4, 5, 6, 7]  [3, 4, 6, 7]  [4, 7]  [4, 7]  [4, 7]  [4, 7]
145  [8, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 5, 7]  [8, 5, 6, 7]  [8, 5, 6, 7]  ...  [1, 2, 3, 4, 5]  [1, 2, 3, 4, 5]  [1, 2, 3, 4]  [2, 3]  [2, 3]  [2, 3]  [2, 3]
146  [1, 4]  [1, 4]  [1, 4]  [1, 4]  [1, 4, 5]  [1, 4, 5, 6]  [1, 4, 5, 6]  ...  [2, 3, 5, 7, 8]  [2, 3, 5, 7, 8]  [8, 2, 3, 7]  [8, 7]  [8, 7]  [8, 7]  [8, 7]

[147 rows x 25 columns]
