
Obtaining \r\n\r\n while scraping from web in Python

I am working on scraping text with Python from this link: tournament link

Here is my code to get the tabular data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows

Now, the goal is to obtain the data as a dataframe.

listnew=[]
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text() ##obtain text part
    listnew.append(cleantext) ## append to list

df = pd.DataFrame(listnew)
df.head(10)

Then we get the following output:

0   []
1   [Finishers:, 577]
2   [Male:, 414]
3   [Female:, 163]
4   []
5   [1, 814, \r\n\r\n JARED WIL...
6   [2, 573, \r\n\r\n NATHAN A ...
7   [3, 687, \r\n\r\n FRANCISCO...
8   [4, 623, \r\n\r\n PAUL MORR...
9   [5, 569, \r\n\r\n DEREK G O..

I don't know why there are newline and carriage-return characters (\r\n\r\n) in the output. How can I remove them and get a dataframe in the proper format? Thanks in advance.

It seems that some cells in the HTML have a lot of leading and trailing spaces and newlines:

<td>

                    JARED WILSON

                </td>

Use str.strip to remove all leading and trailing whitespace from the text of each cell, e.g. td.get_text().strip() for each td in row_td.
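
One caveat worth illustrating: in the question's code, str_cells is the string form of the whole list of cells, so a single .strip() on its text only trims the two ends of the row and leaves the whitespace around the name in the middle. Below is a minimal sketch of stripping cell by cell instead. The sample markup is a hypothetical stand-in for one row of the page (nothing is fetched), and "html.parser" is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row imitating the page's markup (not fetched from the site).
sample_html = (
    "<table><tr><td>1</td><td>814</td>"
    "<td>\r\n\r\n        JARED WILSON\r\n\r\n    </td></tr></table>"
)

soup = BeautifulSoup(sample_html, "html.parser")
row = soup.find("tr")

# Strip every cell's text on its own, rather than the text of the whole row.
cells = [td.get_text().strip() for td in row.find_all("td")]
print(cells)  # ['1', '814', 'JARED WILSON']
```

BeautifulSoup's get_text also accepts strip=True, so td.get_text(strip=True) should give the same result for cells like these.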

Well, looking at the URL you provided, you can see the newlines in the HTML source:

...
<td>814</td>
<td>
JARED WILSON
</td>
...

So that's what you get when you scrape. These can easily be removed with the very convenient .strip() string method.

Your DataFrame is not formatted correctly because you are giving it a list of lists that are not all the same size (see the first 4 lines); the short ones come from another table located at the top right of the page. One easy fix is to drop the first 4 lines, though it would be far more robust to select the table you want by its id ("individualResults").

df = pd.DataFrame(listnew[4:])
df.head(10)
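
The more robust route mentioned above, selecting the table by its id and building the frame from per-cell text, might look roughly like this. It is a sketch under assumptions: the markup is a hypothetical miniature of the page (a summary table followed by the results table, with only three of the real columns), and "html.parser" is used so it runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: a summary table, then the results table.
sample_html = """
<table>
  <tr><td>Finishers:</td><td>577</td></tr>
</table>
<table id="individualResults">
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>1</td><td>814</td><td>

      JARED WILSON

  </td></tr>
  <tr><td>2</td><td>573</td><td>

      NATHAN A SUSTERSIC

  </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only look inside the results table, so rows of the summary table are ignored.
table = soup.find("table", id="individualResults")

# Column names come from the <th> cells of the header row.
header = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        records.append(cells)

df = pd.DataFrame(records, columns=header)
print(df)
```

Because every kept row comes from the same table, all rows have the same number of cells and no slicing by position is needed. Note that the values are still strings at this point.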

Have a look here: BeautifulSoup table to dataframe

Pandas can parse HTML tables. Give this a try:

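from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# Select the results table by its id
table_1_html = soup.find('table', attrs={'id': 'individualResults'})

# read_html returns a list of DataFrames; the markup is wrapped in StringIO
# because recent pandas versions deprecate passing a literal HTML string
t_1 = pd.read_html(StringIO(table_1_html.prettify()))[0]

print(t_1)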

Output:

     Place  Bib                Name  ... Chip Pace Gun Time          Team
0        1  814        JARED WILSON  ...      5:51    36:24           NaN
1        2  573  NATHAN A SUSTERSIC  ...      5:55    36:45  INTEL TEAM F
2        3  687      FRANCISCO MAYA  ...      6:05    37:48           NaN
3        4  623         PAUL MORROW  ...      6:13    38:37           NaN
4        5  569     DEREK G OSBORNE  ...      6:20    39:24  INTEL TEAM F
..     ...  ...                 ...  ...       ...      ...           ...
572    573  273      RACHEL L VANEY  ...     15:51  1:38:34           NaN
573    574  467      ROHIT B DSOUZA  ...     15:53  1:40:32  INTEL TEAM I
574    575  471      CENITA D'SOUZA  ...     15:53  1:40:34           NaN
575    576  338      PRANAVI APPANA  ...     16:15  1:42:01           NaN
576    577  443    LIBBY B MITCHELL  ...     16:20  1:42:10           NaN

[577 rows x 10 columns]


 