Regex text to pandas dataframe
I have a text file that contains multiple lines in the format given below:
real 0m0.020s
user 0m0.000s
sys 0m0.000s
Round 1 completed. with matrix size of 1200 x 1200 with threads 8
real 0m0.022s
user 0m0.000s
sys 0m0.001s
Round 2 completed. with matrix size of 1200 x 1200 with threads 8
There are about 500 entries of this sort (the above shows two of them). I can't seem to figure out how to get them into a pandas dataframe that might look something like this:
Matrix Size  Threads  Round  Real    User    Sys
1200 x 1200  8        1      0.0020  0.0000  0.0000
1200 x 1200  8        2      0.0022  0.0000  0.0001
Is there a way, using regex or some other method, to convert the test output into a dataframe? Additionally, I don't know whether I interpreted the times correctly: I read 0m as 0 minutes and 0.02 as 0.02 seconds.
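On the interpretation of the times: output of this shape comes from the shell's `time` command, where `0m0.020s` reads as 0 minutes and 0.020 seconds, so the reading in the question is right. A minimal sketch of turning such a string into seconds follows; `time_to_seconds` is a hypothetical helper name, not something from pandas or the standard library.

```python
import re

# Matches the "XmY.YYYs" format, e.g. '0m0.020s' (minutes, then seconds).
_TIME_RE = re.compile(r'(\d+)m(\d+(?:\.\d+)?)s')

def time_to_seconds(text):
    """Hypothetical helper: convert a string such as '0m0.020s' to float seconds."""
    match = _TIME_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised time string: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

print(time_to_seconds('0m0.020s'))  # 0.02
print(time_to_seconds('1m2.500s'))  # 62.5
```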
You can use a regex:
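import re
import pandas as pd
# data is assumed to be the whole file read into a single string
# \d+ (rather than \d) also matches values of 10 minutes or 10 seconds and above
regex = re.compile(r'real +(\d+m\d+\.\d+s)\nuser +(\d+m\d+\.\d+s)\nsys +(\d+m\d+\.\d+s)\nRound +(\d+).+of +(\d+ x \d+).+threads (\d+)')
df = pd.DataFrame(regex.findall(data), columns=['real', 'user', 'sys', 'round', 'matrix size', 'threads'])
print(df)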
Output:
       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8
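Every column in that frame is still a string. If numeric columns are wanted, as in the table sketched in the question, one possible follow-up is shown below. It is only a sketch: the sample `data` string is copied from the question, and the pattern is a loosened variant of the regex (using `\s+` so that tabs or blank lines between fields are tolerated), which is an assumption about the real file rather than something the question confirms.

```python
import re
import pandas as pd

# Sample input copied from the question.
data = """real 0m0.020s
user 0m0.000s
sys 0m0.000s
Round 1 completed. with matrix size of 1200 x 1200 with threads 8
real 0m0.022s
user 0m0.000s
sys 0m0.001s
Round 2 completed. with matrix size of 1200 x 1200 with threads 8"""

# Loosened variant of the regex: \s+ accepts spaces, tabs and newlines.
regex = re.compile(
    r'real\s+(\d+m[\d.]+s)\s+user\s+(\d+m[\d.]+s)\s+sys\s+(\d+m[\d.]+s)\s+'
    r'Round\s+(\d+).+of\s+(\d+ x \d+).+threads\s+(\d+)'
)
columns = ['real', 'user', 'sys', 'round', 'matrix size', 'threads']
df = pd.DataFrame(regex.findall(data), columns=columns)

# Convert 'XmY.YYYs' strings to float seconds (minutes * 60 + seconds).
for col in ['real', 'user', 'sys']:
    parts = df[col].str.extract(r'(\d+)m([\d.]+)s').astype(float)
    df[col] = parts[0] * 60 + parts[1]

# Round and thread counts become integers; 'matrix size' stays a string.
df = df.astype({'round': int, 'threads': int})
print(df.dtypes)
```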
If you want to solve the problem using only pandas, you can use str.split():
import pandas as pd
# data (records are assumed to be separated by a blank line)
s = """real 0m0.020s
user 0m0.000s
sys 0m0.000s
Round 1 completed. with matrix size of 1200 x 1200 with threads 8

real 0m0.022s
user 0m0.000s
sys 0m0.001s
Round 2 completed. with matrix size of 1200 x 1200 with threads 8"""
# split on blank lines to get one record per row, then split each record on the label text
df = pd.DataFrame(s.split('\n\n'))[0].str.split(r'real | with |user |sys |matrix size of |threads |\n')\
    .apply(lambda parts: [p for p in parts if p]).apply(pd.Series)
# column 3 holds 'Round N completed.'; keep only the round number
df[3] = df[3].str.extract(r'(\d+)', expand=False)
# rename columns
df.columns = ['real', 'user', 'sys', 'round', 'matrix size', 'threads']
Output:
       real      user       sys round  matrix size threads
0  0m0.020s  0m0.000s  0m0.000s     1  1200 x 1200       8
1  0m0.022s  0m0.000s  0m0.001s     2  1200 x 1200       8
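Another pandas-only route, offered here as a sketch rather than as part of the answer above, is `Series.str.extractall` with named groups. It does not depend on how the records are separated, since it scans the whole text for matches. The variable name `text` and the group name `matrix_size` are choices made for this example.

```python
import pandas as pd

# Sample input copied from the question.
text = """real 0m0.020s
user 0m0.000s
sys 0m0.000s
Round 1 completed. with matrix size of 1200 x 1200 with threads 8
real 0m0.022s
user 0m0.000s
sys 0m0.001s
Round 2 completed. with matrix size of 1200 x 1200 with threads 8"""

# Named groups become the column names of the resulting frame.
pattern = (
    r'real\s+(?P<real>\S+)\s+'
    r'user\s+(?P<user>\S+)\s+'
    r'sys\s+(?P<sys>\S+)\s+'
    r'Round\s+(?P<round>\d+).*?size of\s+(?P<matrix_size>\d+ x \d+)'
    r'.*?threads\s+(?P<threads>\d+)'
)

# extractall returns one row per match; drop its (row, match) MultiIndex.
df = pd.Series([text]).str.extractall(pattern).reset_index(drop=True)
print(df)
```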
Note that this will be slower than gmds' example:
1000 loops, best of 3: 4.42 ms per loop
vs 1000 loops, best of 3: 1.84 ms per loop
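For anyone who wants to repeat the comparison outside IPython, a rough sketch with the standard `timeit` module is given below. The function names `with_regex` and `with_str_split` are made up for this example, the sample text assumes blank-line-separated records, and the numbers it prints will depend on the machine and on the pandas version, so they should not be expected to match the figures quoted above.

```python
import re
import timeit
import pandas as pd

text = """real 0m0.020s
user 0m0.000s
sys 0m0.000s
Round 1 completed. with matrix size of 1200 x 1200 with threads 8

real 0m0.022s
user 0m0.000s
sys 0m0.001s
Round 2 completed. with matrix size of 1200 x 1200 with threads 8"""

COLUMNS = ['real', 'user', 'sys', 'round', 'matrix size', 'threads']
REGEX = re.compile(
    r'real +(\d+m\d+\.\d+s)\nuser +(\d+m\d+\.\d+s)\nsys +(\d+m\d+\.\d+s)\n'
    r'Round +(\d+).+of +(\d+ x \d+).+threads (\d+)'
)

def with_regex(text):
    # One findall over the whole text, then build the frame.
    return pd.DataFrame(REGEX.findall(text), columns=COLUMNS)

def with_str_split(text):
    # Split into records, then split each record on the label text.
    df = (pd.DataFrame(text.split('\n\n'))[0]
          .str.split(r'real | with |user |sys |matrix size of |threads |\n')
          .apply(lambda parts: [p for p in parts if p])
          .apply(pd.Series))
    df[3] = df[3].str.extract(r'(\d+)', expand=False)
    df.columns = COLUMNS
    return df

for func in (with_regex, with_str_split):
    # Best of 3 runs of 20 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(text), number=20, repeat=3)) / 20
    print(f'{func.__name__}: {best * 1e3:.2f} ms per call')
```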