Pandas/Python: Reading a TXT file that is structured like a dictionary and sorting it into dataframes
Original title: Pandas/Python Drop columns based on syntax ':' existence in cell
Since the solution took a totally different approach, I changed the title.
============ =============
Consider a dataframe built from a txt file.
Data txt file: [image: txt]
I removed the unwanted lines that contain "Iteration":
import pandas as pd
data = pd.read_csv('results.txt', sep=" ", header=None)
data = data[~data[1].astype(str).str.startswith('Iteration')]
Now it looks like this: [image: pd]
I wanted to remove all the columns that contain ":" to get something like: {2, 7} {2, 7} {2, 7} {2, 7} {1, 2, 7} {1, 2, 6, 7} {1, 2, 6, 7} {1, 2, 3, 6, 7} {1, 2, 3, 6, 7} {1, 3, 4, 6, 7} {1, 3, 4, 5, 6} {3, 4, 5, 6, 7} {1, 3, 4, 5, 7} {1, 4, 5, 6, 7}, ... etc.
and to merge columns such as "{2," and "7}," into one column "{2,7}". Then I can replace all the column names with timestamps (6:30, 7:00, etc.) to make it a proper table like:
6:30   7:00   7:30   8:00   8:30     9:00
{2,7}  {2,7}  {2,7}  {2,7}  {1,2,7}  {1,2,6,7}
{5,6}  {5,6}  {5,6}  {5,6}  {1,5,6}  {1,4,5,6}
....
My first step now is to remove the columns with ":". I tried these:
data.loc[:, ~(data[0:122].astype(str).apply(lambda x: x.str.contains(':'))).any()]
data.loc[:, ~(data.contains(':')).any()]
Obviously they won't work, since contains() cannot be used on dataframes; I am not very familiar with pandas. I searched many other answers and wasn't able to get it right. What is the correct way to remove columns with cells that contain ":"?
Or, if my approach is full of unnecessary steps, is there a better way to turn the txt file into a dataframe of
6:30   7:00   7:30   8:00   8:30     9:00
{2,7}  {2,7}  {2,7}  {2,7}  {1,2,7}  {1,2,6,7}
{5,6}  {5,6}  {5,6}  {5,6}  {1,5,6}  {1,4,5,6}
For example, by sorting it out while reading the txt file in the first place: looping through the lines and making dictionaries instead, like '6:30': {2, 7} ....?
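For reference, the column-dropping step asked about above could be sketched as follows. This is only a minimal illustration on a made-up toy frame (the cell contents are assumptions, not the real file): `str.contains` is a Series method, so it is applied column by column and the result is reduced with `any()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the space-split file: some columns
# hold the timestamp fragments (with ':'), the others hold pieces of the sets.
data = pd.DataFrame({
    0: ["{'6:30':", "{'6:30':"],
    1: ["{2,", "{5,"],
    2: ["7},", "6},"],
    3: ["'7:00':", "'7:00':"],
    4: ["{2,", "{5,"],
    5: ["7}}", "6}}"],
})

# Element-wise test: True where the cell, viewed as a string, contains ':'.
has_colon = data.astype(str).apply(lambda col: col.str.contains(":", regex=False))

# Keep only the columns in which no cell contains ':'.
cleaned = data.loc[:, ~has_colon.any()]
print(cleaned.columns.tolist())
```

This only answers the literal question; the fragments such as "{2," and "7}," would still have to be merged afterwards, which is why parsing the lines before building the dataframe tends to be simpler.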
Preprocess your file before creating the dataframe:
import pandas as pd
import ast

data = []
with open('results.txt') as fp:
    for line in fp:
        # Only the lines starting with 'Shifts:' carry the dictionary
        if line.startswith('Shifts:'):
            # Evaluate everything after the first ':' as a Python literal
            data.append(ast.literal_eval(line.split(':', maxsplit=1)[1].strip()))
df = pd.DataFrame.from_dict(data)
Important update: BUG FIX
ast.literal_eval works well here, but I can't guarantee the order of the sets after evaluation:
>>> ast.literal_eval('{8, 2, 3, 7}')
{2, 3, 7, 8} # We don't want that!
Maybe you can use re.findall to get the same output:
import pandas as pd
import re

data = []
with open('results.txt') as fp:
    for line in fp:
        if line.startswith('Shifts:'):
            # Capture each 'time': {...} pair as (key, text of the set)
            data.append(dict(re.findall(r"'([^']+)': (\{[^\}]+\})", line[1:-1])))
df = pd.DataFrame.from_dict(data)
The output is still the same, but the cell values are not set but str:
>>> df
6:30 7:00 7:30 8:00 8:30 9:00 9:30 ... 15:30 16:00 16:30 17:00 17:30 18:00 18:30
0 {2, 7} {2, 7} {2, 7} {2, 7} {1, 2, 7} {1, 2, 6, 7} {1, 2, 6, 7} ... {1, 3, 4, 5, 8} {1, 3, 4, 5, 8} {8, 3, 4, 5} {8, 5} {8, 5} {8, 5} {8, 5}
1 {5, 6} {5, 6} {5, 6} {5, 6} {1, 5, 6} {1, 4, 5, 6} {1, 4, 5, 6} ... {1, 2, 3, 7, 8} {1, 2, 3, 7, 8} {8, 2, 3, 7} {8, 3} {8, 3} {8, 3} {8, 3}
2 {8, 1} {8, 1} {8, 1} {8, 1} {8, 1, 7} {8, 1, 2, 7} {8, 1, 2, 7} ... {3, 4, 5, 6, 7} {3, 4, 5, 6, 7} {3, 4, 5, 6} {4, 5} {4, 5} {4, 5} {4, 5}
3 {1, 6} {1, 6} {1, 6} {1, 6} {1, 5, 6} {1, 4, 5, 6} {1, 4, 5, 6} ... {2, 3, 5, 7, 8} {2, 3, 5, 7, 8} {8, 2, 3, 7} {2, 7} {2, 7} {2, 7} {2, 7}
4 {3, 4} {3, 4} {3, 4} {3, 4} {3, 4, 5} {2, 3, 4, 5} {2, 3, 4, 5} ... {1, 5, 6, 7, 8} {1, 5, 6, 7, 8} {8, 1, 6, 7} {8, 7} {8, 7} {8, 7} {8, 7}
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142 {5, 6} {5, 6} {5, 6} {5, 6} {5, 6, 7} {2, 5, 6, 7} {2, 5, 6, 7} ... {1, 3, 4, 7, 8} {1, 3, 4, 7, 8} {8, 1, 3, 4} {1, 4} {1, 4} {1, 4} {1, 4}
143 {5, 6} {5, 6} {5, 6} {5, 6} {3, 5, 6} {2, 3, 5, 6} {2, 3, 5, 6} ... {1, 3, 4, 7, 8} {1, 3, 4, 7, 8} {8, 1, 4, 7} {8, 1} {8, 1} {8, 1} {8, 1}
144 {1, 2} {1, 2} {1, 2} {1, 2} {1, 2, 5} {8, 1, 2, 5} {8, 1, 2, 5} ... {3, 4, 5, 6, 7} {3, 4, 5, 6, 7} {3, 4, 6, 7} {4, 7} {4, 7} {4, 7} {4, 7}
145 {8, 7} {8, 7} {8, 7} {8, 7} {8, 5, 7} {8, 5, 6, 7} {8, 5, 6, 7} ... {1, 2, 3, 4, 5} {1, 2, 3, 4, 5} {1, 2, 3, 4} {2, 3} {2, 3} {2, 3} {2, 3}
146 {1, 4} {1, 4} {1, 4} {1, 4} {1, 4, 5} {1, 4, 5, 6} {1, 4, 5, 6} ... {2, 3, 5, 7, 8} {2, 3, 5, 7, 8} {8, 2, 3, 7} {8, 7} {8, 7} {8, 7} {8, 7}
[147 rows x 25 columns]
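If the string cells need to become real Python objects while keeping the order in which the numbers were written, one option is to parse each cell into a list. The following is a minimal sketch on a made-up two-row frame; the helper name `braces_to_list` is hypothetical, and it assumes every cell contains only integers.

```python
import re
import pandas as pd

def braces_to_list(cell):
    """Turn a string such as '{8, 3, 4, 5}' into [8, 3, 4, 5], keeping the written order."""
    return [int(n) for n in re.findall(r"-?\d+", cell)]

# Hypothetical frame with string cells, shaped like the output above.
df = pd.DataFrame({
    "16:30": ["{8, 3, 4, 5}", "{8, 2, 3, 7}"],
    "17:00": ["{8, 5}", "{8, 3}"],
})

# Apply the parser to every cell, one column at a time.
parsed = df.apply(lambda col: col.map(braces_to_list))
print(parsed.loc[0, "16:30"])
```

Lists are used here rather than sets because a list keeps insertion order, whereas rebuilding a set would discard it again.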
Update 2:
The order of a set after evaluating a string representation of a set is not guaranteed, except (???) for numbers between -5 and 256 (CPython only). Even if the numbers are in this interval, set is an unordered collection:
>>> print(eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}
>>> print(ast.literal_eval('{8000, 2000, 3000, 7000}'))
{8000, 7000, 3000, 2000}
>>> print(eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}
>>> print(ast.literal_eval('{8, 2, 3, 7}'))
{8, 2, 3, 7}
>>> print(ast.literal_eval('{8, 2000, 3000, 7}'))
{8, 7, 3000, 2000}
>>> print(eval('{8, 2000, 3000, 7}'))
{8, 2000, 3000, 7}
A set may not maintain order, but a list will, so let's change all the sets to lists before evaluating them:
import ast
import pandas as pd

data = []
with open('results.txt') as f:
    for line in f:
        if line.startswith('Shifts: '):
            # Drop the leading "Shifts: {" and the trailing "}"
            line = line.strip()[9:-1]
            # Turn every inner set literal into a list literal
            line = line.replace('{', '[').replace('}', ']')
            # Put the outer dict braces back
            line = '{' + line + '}'
            data.append(ast.literal_eval(line))
df = pd.DataFrame(data)
print(df.head())
Output:
6:30 7:00 7:30 8:00 8:30 9:00 9:30 ... 15:30 16:00 16:30 17:00 17:30 18:00 18:30
0 [2, 7] [2, 7] [2, 7] [2, 7] [1, 2, 7] [1, 2, 6, 7] [1, 2, 6, 7] ... [1, 3, 4, 5, 8] [1, 3, 4, 5, 8] [8, 3, 4, 5] [8, 5] [8, 5] [8, 5] [8, 5]
1 [5, 6] [5, 6] [5, 6] [5, 6] [1, 5, 6] [1, 4, 5, 6] [1, 4, 5, 6] ... [1, 2, 3, 7, 8] [1, 2, 3, 7, 8] [8, 2, 3, 7] [8, 3] [8, 3] [8, 3] [8, 3]
2 [8, 1] [8, 1] [8, 1] [8, 1] [8, 1, 7] [8, 1, 2, 7] [8, 1, 2, 7] ... [3, 4, 5, 6, 7] [3, 4, 5, 6, 7] [3, 4, 5, 6] [4, 5] [4, 5] [4, 5] [4, 5]
3 [1, 6] [1, 6] [1, 6] [1, 6] [1, 5, 6] [1, 4, 5, 6] [1, 4, 5, 6] ... [2, 3, 5, 7, 8] [2, 3, 5, 7, 8] [8, 2, 3, 7] [2, 7] [2, 7] [2, 7] [2, 7]
4 [3, 4] [3, 4] [3, 4] [3, 4] [3, 4, 5] [2, 3, 4, 5] [2, 3, 4, 5] ... [1, 5, 6, 7, 8] [1, 5, 6, 7, 8] [8, 1, 6, 7] [8, 7] [8, 7] [8, 7] [8, 7]
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142 [5, 6] [5, 6] [5, 6] [5, 6] [5, 6, 7] [2, 5, 6, 7] [2, 5, 6, 7] ... [1, 3, 4, 7, 8] [1, 3, 4, 7, 8] [8, 1, 3, 4] [1, 4] [1, 4] [1, 4] [1, 4]
143 [5, 6] [5, 6] [5, 6] [5, 6] [3, 5, 6] [2, 3, 5, 6] [2, 3, 5, 6] ... [1, 3, 4, 7, 8] [1, 3, 4, 7, 8] [8, 1, 4, 7] [8, 1] [8, 1] [8, 1] [8, 1]
144 [1, 2] [1, 2] [1, 2] [1, 2] [1, 2, 5] [8, 1, 2, 5] [8, 1, 2, 5] ... [3, 4, 5, 6, 7] [3, 4, 5, 6, 7] [3, 4, 6, 7] [4, 7] [4, 7] [4, 7] [4, 7]
145 [8, 7] [8, 7] [8, 7] [8, 7] [8, 5, 7] [8, 5, 6, 7] [8, 5, 6, 7] ... [1, 2, 3, 4, 5] [1, 2, 3, 4, 5] [1, 2, 3, 4] [2, 3] [2, 3] [2, 3] [2, 3]
146 [1, 4] [1, 4] [1, 4] [1, 4] [1, 4, 5] [1, 4, 5, 6] [1, 4, 5, 6] ... [2, 3, 5, 7, 8] [2, 3, 5, 7, 8] [8, 2, 3, 7] [8, 7] [8, 7] [8, 7] [8, 7]
[147 rows x 25 columns]
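To try the list-based approach without the original file, here is a small self-contained sketch that reads from an in-memory sample instead. The file layout (noise lines starting with "Iteration", data lines of the form "Shifts: {...}") is an assumption inferred from the parsing code, and the sample values are invented.

```python
import ast
import io
import pandas as pd

# Assumed layout: "Iteration" lines are noise, and each "Shifts: " line
# holds a dict literal mapping a time string to a set of numbers.
sample = io.StringIO(
    "Iteration 1\n"
    "Shifts: {'6:30': {2, 7}, '7:00': {8, 2, 3, 7}}\n"
    "Iteration 2\n"
    "Shifts: {'6:30': {5, 6}, '7:00': {8, 3}}\n"
)

rows = []
for line in sample:
    if line.startswith("Shifts: "):
        text = line[len("Shifts: "):].strip()
        # Swap the inner set braces for list brackets, then restore the outer dict braces.
        inner = text[1:-1].replace("{", "[").replace("}", "]")
        rows.append(ast.literal_eval("{" + inner + "}"))

df = pd.DataFrame(rows)
print(df)
```

Each row of the resulting frame corresponds to one "Shifts:" line, and every cell is a list in the order the numbers appeared in the text.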