I have some messy sensor-reading data that looks like this. Records (not all the same length) are separated by a "----" line and stacked on top of each other. Is there a way to flatten it into a dataframe in which every row is one record?
test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})
test
Messy
0 21/12/2017 11:12:48
1 Port:4
2 Reading 1: 1
3 ----
4 21/12/2017 11:13:48
5 Port:4
6 Reading 1: 2
7 Reading 2: 2.5
8 ----
What I want to have is something like this:
target = pd.DataFrame({"Time":["21/12/2017 11:12:48","21/12/2017 11:13:48"],"Port":["Port:4","Port:4"],"Field1":['Reading 1: 1','Reading 1: 2'],"Field2":['','Reading 2: 2.5']})
target
         Field1          Field2    Port                 Time
0  Reading 1: 1                  Port:4  21/12/2017 11:12:48
1  Reading 1: 2  Reading 2: 2.5  Port:4  21/12/2017 11:13:48
Obviously this is data dependent, but you can try:
# flag the separator rows
m = test['Messy'].str.startswith('----')
# build a group id per record via cumulative sum
test['g'] = m.cumsum()
# drop the separator rows
df = test[~m].copy()
# number the lines within each group
df['c'] = df.groupby('g').cumcount()
print(df)
Messy g c
0 21/12/2017 11:12:48 0 0
1 Port:4 0 1
2 Reading 1: 1 0 2
4 21/12/2017 11:13:48 1 0
5 Port:4 1 1
6 Reading 1: 2 1 2
7 Reading 2: 2.5 1 3
# pivot: groups become rows, within-group positions become columns
df = df.pivot(index='g', columns='c', values='Messy')
print(df)
c 0 1 2 3
g
0 21/12/2017 11:12:48 Port:4 Reading 1: 1 NaN
1 21/12/2017 11:13:48 Port:4 Reading 1: 2 Reading 2: 2.5
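To finish matching the target frame, the positional column labels from the pivot can be renamed and the missing cell filled. A sketch building on the steps above; the column names `Time`/`Port`/`Field1`/`Field2` are taken from the question:

```python
import pandas as pd

test = pd.DataFrame({"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
                               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
                               "Reading 2: 2.5", "----"]})

# same steps as above: flag separators, group, drop them, number within group
m = test['Messy'].str.startswith('----')
test['g'] = m.cumsum()
df = test[~m].copy()
df['c'] = df.groupby('g').cumcount()

# pivot, then replace the positional labels with meaningful names
df = df.pivot(index='g', columns='c', values='Messy')
df.columns = ['Time', 'Port', 'Field1', 'Field2'][:len(df.columns)]
df = df.fillna('').reset_index(drop=True)
print(df)
```

Slicing the name list with `[:len(df.columns)]` keeps the rename valid even if no record ever has a second reading.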
Below is one solution. Your data is messy, and this method assumes each record fits into a block of exactly four rows of the flat column (padded by the "----" separator when a record is shorter).
import numpy as np, pandas as pd
test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})
# take each block of four rows and flatten it into one record
lst = [np.hstack(test.iloc[4*i:4*i + 4].values)
       for i in range(len(test.index) // 4)]
df = pd.DataFrame(lst, columns=['Date', 'Port', 'Field1', 'Field2']).replace({'----': ''})
# Date Port Field1 Field2
# 0 21/12/2017 11:12:48 Port:4 Reading 1: 1
# 1 21/12/2017 11:13:48 Port:4 Reading 1: 2 Reading 2: 2.5
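If records are not always exactly four rows, a more defensive variant splits on the separator itself instead of relying on fixed offsets. A sketch (not from the original answer) using `itertools.groupby`, padding short records to the widest one:

```python
import itertools
import pandas as pd

test = pd.DataFrame({"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
                               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
                               "Reading 2: 2.5", "----"]})

# split the flat column into records wherever a "----" line appears
records = [list(grp) for is_sep, grp in
           itertools.groupby(test['Messy'], key=lambda s: s.startswith('----'))
           if not is_sep]

# pad shorter records so every row has the same number of fields
width = max(len(r) for r in records)
records = [r + [''] * (width - len(r)) for r in records]
cols = ['Date', 'Port', 'Field1', 'Field2'][:width]
df = pd.DataFrame(records, columns=cols)
print(df)
```

This version never reads past a record boundary, so a record with two or five lines would not corrupt its neighbours.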
Assuming you have a maximum of 4 columns and all records come in the same order, here is another solution using re, io and pandas:
import pandas as pd
import io
import re
d = {"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----",
"21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5",
"----"]}
test = pd.read_csv(io.StringIO(re.sub(r',----,?','\n', ','.join(d['Messy']))),
names=['Time','Port','Field1','Field2'])
print(test)
Time Port Field1 Field2
0 21/12/2017 11:12:48 Port:4 Reading 1: 1 NaN
1 21/12/2017 11:13:48 Port:4 Reading 1: 2 Reading 2: 2.5
You can scale this solution by adding more entries to the names list passed to pd.read_csv(); e.g. if a record in your data has at most 10 fields, just supply 10 column names.
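Sketching that scaling out, with a hypothetical maximum of 10 fields per record (the extra `Field3`…`Field8` names are illustrative, not from the original data):

```python
import io
import re
import pandas as pd

d = {"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
               "Reading 2: 2.5", "----"]}

# generate column names up to an assumed maximum record width of 10
names = ['Time', 'Port'] + ['Field%d' % i for i in range(1, 9)]

# same trick as above: turn separators into newlines, then parse as CSV
test = pd.read_csv(io.StringIO(re.sub(r',----,?', '\n', ','.join(d['Messy']))),
                   names=names)
print(test.shape)
```

read_csv pads each record with NaN up to the full names list, so shorter records are handled automatically.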