I have a pandas like so:
df =
cid pid purl tid turl sid surl
c1 p1 urlp1 t1 urlt1 s1 urls1
c1 p1 urlp1 t1 urlt1 s2 urls2
c1 p1 urlp1 t1 urlt1 s3 urls3
c2 p2 urlp2 t2 urlt2 s5 urls5
c2 p2 urlp2 t2 urlt2 s6 urls6
What I want is a result something like this:
cid uid url
c1 p1 urlp1
c1 t1 urlt1
c1 s1 urls1
c1 s2 urls2
c1 s3 urls3
c2 p2 urlp2
c2 t2 urlt2
c2 s5 urls5
c2 s6 urls6
I tried using pd.melt
to achieve this but I could only do it partially like so:
df2 = pd.melt(df, id_vars = 'cid', value_vars = ['pid','tid','purl'], value_name = 'userid')
How to get more columns? I need more columns for value_name
? How can I achieve this?
One manual solution is to use a list comprehension. The steps are:
cid
. concat
with your list of dataframes and rename columns via pipe
[add other formatting if required]. reset_index
(to elevate the index to a series) and drop_duplicates
. Here's a demo:
df = df.set_index('cid')
def formatter(df):
df.columns = ['uid', 'url']
return df
n = int(len(df.columns) / 2)
L = [df.iloc[:, 2*i:2*(i+1)].pipe(formatter) for i in range(n)]
res = pd.concat(L, axis=0).reset_index().drop_duplicates()
print(res)
cid uid url
0 c1 p1 urlp1
3 c2 p2 urlp2
5 c1 t1 urlt1
8 c2 t2 urlt2
10 c1 s1 urls1
11 c1 s2 urls2
12 c1 s3 urls3
13 c2 s5 urls5
14 c2 s6 urls6
Not sure this is the most straightforward way to do it but this is what I can think of:
import pandas as pd
from io import StringIO
s = """cid pid purl tid turl sid surl
c1 p1 urlp1 t1 urlt1 s1 urls1
c1 p1 urlp1 t1 urlt1 s2 urls2
c1 p1 urlp1 t1 urlt1 s3 urls3
c2 p2 urlp2 t2 urlt2 s5 urls5
c2 p2 urlp2 t2 urlt2 s6 urls6"""
df = pd.read_table(StringIO(s), sep='\\s+', header=0)
df2 = df.set_index('cid')
df3 = pd.concat([df2[['pid', 'purl']].rename(columns={'pid': 'uid', 'purl': 'url'}),
df2[['tid', 'turl']].rename(columns={'tid': 'uid', 'turl': 'url'}),
df2[['sid', 'surl']].rename(columns={'sid': 'uid', 'surl': 'url'})],
axis=0)
result = df3.drop_duplicates().sort_index().reset_index()
print(result)
Output:
cid uid url
0 c1 p1 urlp1
1 c1 t1 urlt1
2 c1 s1 urls1
3 c1 s2 urls2
4 c1 s3 urls3
5 c2 p2 urlp2
6 c2 t2 urlt2
7 c2 s5 urls5
8 c2 s6 urls6
The columns have a pattern to them -> some end in id
, some with url
.
We can use this pattern to reshape the data, and remove duplicates.
This is only an alternative, and relies on pivot_longer from pyjanitor to help in the reshaping process:
# pip install pyjanitor
import pandas as pd
import janitor as jn
(df.pivot_longer(index = 'cid',
names_to = ('uid', 'url'),
names_pattern = (r".+id", r".+url"),
sort_by_appearance = True)
.drop_duplicates(ignore_index = True)
)
cid uid url
0 c1 p1 urlp1
1 c1 t1 urlt1
2 c1 s1 urls1
3 c1 s2 urls2
4 c1 s3 urls3
5 c2 p2 urlp2
6 c2 t2 urlt2
7 c2 s5 urls5
8 c2 s6 urls6
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.