
Pair each item with the last non-null value in a row

I'm trying to write a function that takes a list of URLs that go through 301 redirect hops and flattens the chains for me. I want to save the resulting list as a CSV so I can hand it to the developers, who can then implement the direct redirects and get rid of the 301 hops.

For example, my crawler will produce this list of 301 hops:

    URL1          | URL2              | URL3              | URL4
example.com/url1  | example.com/url2  |                   | 
example.com/url3  | example.com/url4  | example.com/url5  | 
example.com/url6  | example.com/url7  | example.com/url8  | example.com/10
example.com/url9  | example.com/url7  | example.com/url8  | 
example.com/url23 | example.com/url10 |                   | 
example.com/url24 | example.com/url45 | example.com/url46 | 
example.com/url25 | example.com/url45 | example.com/url46 | 
example.com/url26 | example.com/url45 | example.com/url46 | 
example.com/url27 | example.com/url45 | example.com/url46 | 
example.com/url28 | example.com/url45 | example.com/url46 | 
example.com/url29 | example.com/url45 | example.com/url46 | 
example.com/url30 | example.com/url45 | example.com/url46 | 

The output I'm trying to get is:

URL1              | URL2 
example.com/url1  | example.com/url2
example.com/url3  | example.com/url5
example.com/url4  | example.com/url5
example.com/url6  | example.com/10
example.com/url7  | example.com/10
example.com/url8  | example.com/10
example.com/url23 | example.com/url10
...

I've converted the Pandas DataFrame to a list of lists using the code below:

import pandas as pd

csv1 = pd.read_csv('Example_301_sheet.csv', header=None)

def link_flat(csv):
    # collect each DataFrame row as a plain Python list
    outlist = []
    for index, data in csv.iterrows():
        outlist.append(data.tolist())
    return outlist

outlist = link_flat(csv1)

This returns each row as a list, all nested together in an outer list, like below:

[['example.com/url1', 'example.com/url2', nan, nan],
 ['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
 ['example.com/url6',
  'example.com/url7',
  'example.com/url8',
  'example.com/10'],
 ['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
 ['example.com/url23', 'example.com/url10', nan, nan],
 ['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]

How do I pair each URL in each nested list with the last URL in that same list to produce the output above?

You'll need to determine the last valid item per row (stack drops the NaNs, so groupby(level=0).last() returns the last non-null value in each row), then reshape your DataFrame into a two-column mapping using melt.

df.columns = range(len(df.columns))
df = (
    df.assign(URL2=df.stack().groupby(level=0).last())
      .melt('URL2', value_name='URL1')
      .drop(columns='variable')
      .dropna()
      .drop_duplicates()
      .query('URL1 != URL2')
      .sort_index(axis=1)
      .reset_index(drop=True)
)

df
                 URL1               URL2
0    example.com/url1   example.com/url2
1    example.com/url3   example.com/url5
2    example.com/url6     example.com/10
3    example.com/url9   example.com/url8
4   example.com/url23  example.com/url10
5   example.com/url24  example.com/url46
6   example.com/url25  example.com/url46
7   example.com/url26  example.com/url46
8   example.com/url27  example.com/url46
9   example.com/url28  example.com/url46
10  example.com/url29  example.com/url46
11  example.com/url30  example.com/url46
12   example.com/url4   example.com/url5
13   example.com/url7     example.com/10
14   example.com/url7   example.com/url8
15  example.com/url45  example.com/url46
16   example.com/url8     example.com/10
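
Since the end goal is a CSV to hand to the developers, the two-column frame can be written out with to_csv. A minimal sketch, where the frame contents and the filename redirect_map.csv are placeholder assumptions:

```python
import pandas as pd

# hypothetical two-column mapping, as produced by the melt/dropna pipeline above
df = pd.DataFrame({
    'URL1': ['example.com/url1', 'example.com/url3'],
    'URL2': ['example.com/url2', 'example.com/url5'],
})

# index=False keeps the CSV to just the two URL columns
df.to_csv('redirect_map.csv', index=False)
```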

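If you would rather stay with the list of lists you already built, the same pairing can be done without pandas. A minimal sketch (the sample rows and the helper name flatten_hops are made up for illustration; float('nan') stands in for the NaN padding in each row):

```python
rows = [
    ['example.com/url1', 'example.com/url2', float('nan'), float('nan')],
    ['example.com/url3', 'example.com/url4', 'example.com/url5', float('nan')],
]

def flatten_hops(rows):
    pairs, seen = [], set()
    for row in rows:
        # keep only the real URL strings, dropping the NaN padding
        urls = [u for u in row if isinstance(u, str)]
        if len(urls) < 2:
            continue
        final = urls[-1]            # last non-null value in the row
        for url in urls[:-1]:       # pair every earlier hop with it
            if url != final and (url, final) not in seen:
                seen.add((url, final))
                pairs.append([url, final])
    return pairs

print(flatten_hops(rows))
```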