
Python - Web-Scraping - Parsing HTML Table - Concat multiple href into one column

I am extracting a table from my customer's website and parsing the HTML into a Pandas dataframe. In addition to the cell text, I want to store all the hrefs from the table in my dataframe. The HTML has the following schema:

<table>
    <tr>
         <th>Col_1</th>
         <th>Col_2</th>
         <th>Col_3</th>
         <th>Col_4</th>
         <th>Col_5</th>
         <th>Col_6</th>
         <th>Col_7</th>
         <th>Col_8</th>
         <th>Col_9</th>
    </tr>
    <tr>
         <td>Office</td>
         <td>Office2</td>
         <td>Customer</td>
         <td></td>
         <td><a href="test12345_163">New Doc</a><br><a href="test12345_163">my_work.jpg</a></td>
         <td><a href="test12345_163">Person_2</a><br><a href="test12345_163">Person_3</a><br><a href="test12345_163">Person 3</a></td>
         <td><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a></td>
         <td>STATUS</td>
         <td>9030303</td>
    </tr>
</table>

I have this code:

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(page.content, "html.parser")

html_table = soup.find('table')

df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]

I am trying to create a column with all the links for each row (if a row has more than one link, group them into a list). But when I run this code I get:

Length of values (1102) does not match length of index (435)

What am I doing wrong?

Thanks!
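The mismatch happens because find_all('a') returns every anchor in the table as one flat list (1102 links), while the dataframe has one row per tr (435 rows). Grouping the hrefs per row makes the lengths agree; a minimal sketch with toy HTML, not the customer page:

```python
from bs4 import BeautifulSoup

# Toy table: 2 data rows, 3 links in total
html = """<table>
<tr><th>Col</th></tr>
<tr><td><a href="u1">a</a><a href="u2">b</a></td></tr>
<tr><td><a href="u3">c</a></td></tr>
</table>"""
table = BeautifulSoup(html, "html.parser").find("table")

flat = [a.get("href") for a in table.find_all("a")]   # one flat list: 3 entries
rows = table.find_all("tr")[1:]                       # 2 data rows
grouped = [[a.get("href") for a in tr.find_all("a")] for tr in rows]

print(len(flat), len(rows), len(grouped))  # 3 2 2 -- only 'grouped' matches the row count
```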

You don't need read_html; the DataFrame can be built like this:

html_table = soup.find('table')
hyperlinks = html_table.find_all("a")

# One [text, href] pair per anchor tag
rows = [[a.text, a.get("href")] for a in hyperlinks]
df = pd.DataFrame(rows, columns=["Names", "Links"])


Update:

# Get the headers from the first row:
html_table = soup.find('table')
trs = html_table.find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]

# An empty dataframe with all headers as columns and one row:
df = pd.DataFrame(columns=headers, index=[0])

# Fill each cell with the list of hrefs found in the matching <td>:
for i, td in enumerate(trs[1].find_all("td")):
    hyperlinks = td.find_all("a")
    df.iloc[0, i] = [a.get("href") for a in hyperlinks]
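The same per-cell grouping can be written more compactly with zip and a dict comprehension; a sketch using a small fragment of the question's table in place of the real page:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample fragment standing in for the parsed customer page
html = """<table>
<tr><th>Col_4</th><th>Col_5</th></tr>
<tr><td></td><td><a href="test12345_163">New Doc</a><br><a href="test12345_163">my_work.jpg</a></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

trs = soup.find("table").find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]

# One dict: header -> list of hrefs in the matching <td>
row = {h: [a.get("href") for a in td.find_all("a")]
       for h, td in zip(headers, trs[1].find_all("td"))}
df = pd.DataFrame([row])
print(df)
```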


You could grab the links before looping over the tds, using a list comprehension to collect all hrefs for a given row; then gather all the td text into a list and extend that list with a nested one-item list, which is the list of hrefs you previously collected:

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''<table>
    <tr>
         <th>Col_1</th>
         <th>Col_2</th>
         <th>Col_3</th>
         <th>Col_4</th>
         <th>Col_5</th>
         <th>Col_6</th>
         <th>Col_7</th>
         <th>Col_8</th>
         <th>Col_9</th>
    </tr>
    <tr>
         <td>Office</td>
         <td>Office2</td>
         <td>Customer</td>
         <td></td>
         <td><a href="test12345_163">New Doc</a><br><a href="test12345_163">my_work.jpg</a></td>
         <td><a href="test12345_163">Person_2</a><br><a href="test12345_163">Person_3</a><br><a href="test12345_163">Person 3</a></td>
         <td><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a><br><a href="test12345_163">Person_1</a></td>
         <td>STATUS</td>
         <td>9030303</td>
    </tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')

for _row in soup.select('table tr')[1:]:
    links = [i['href'] for i in _row.select('a')]
    row = [_td.text for _td in _row.select('td')]
    row.append(links)
    results.append(row)

df = pd.DataFrame(results, columns = headers)
df
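Once the hrefs are stored as lists, pandas' explode can flatten them back to one link per row if that is ever needed downstream; a sketch with made-up values:

```python
import pandas as pd

# Made-up values standing in for one scraped row with two links
df = pd.DataFrame({"Col_8": ["STATUS"], "Links": [["test12345_163", "test12345_164"]]})

# Each list element becomes its own row; the other columns repeat
flat = df.explode("Links", ignore_index=True)
print(flat)
```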
