I am working on scraping text with Python from this page: tournament link
Here is my code to get the tabular data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows
Now, the goal is to obtain the data as a dataframe.
listnew=[]
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()  ## obtain text part
    listnew.append(cleantext)  ## append to list
df = pd.DataFrame(listnew)
df.head(10)
This gives the following output:
0 []
1 [Finishers:, 577]
2 [Male:, 414]
3 [Female:, 163]
4 []
5 [1, 814, \r\n\r\n JARED WIL...
6 [2, 573, \r\n\r\n NATHAN A ...
7 [3, 687, \r\n\r\n FRANCISCO...
8 [4, 623, \r\n\r\n PAUL MORR...
9 [5, 569, \r\n\r\n DEREK G O..
I don't know why there are newline and carriage return characters (\r\n\r\n) in the output. How can I remove them and get a dataframe in the proper format? Thanks in advance.
It seems that some cells in the HTML have a lot of leading and trailing spaces and newlines:
<td>
JARED WILSON
</td>
Use str.strip to remove all leading and trailing whitespace, like this: BeautifulSoup(str_cells, "lxml").get_text().strip()
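One caveat worth illustrating: in the question's loop, str_cells is the string form of the whole row, so a single .strip() only trims the two ends of that string. The sketch below uses hypothetical cell texts modelled on the output in the question (not scraped live) to show the difference between stripping the row and stripping each cell. With BeautifulSoup, the per-cell version would presumably look like [td.get_text(strip=True) for td in row_td], which is an assumption about how you might restructure the loop rather than the only way to do it.

```python
# Hypothetical cell texts, modelled on the first result row in the question
cells = ["1", "814", "\r\n\r\n                    JARED WILSON\r\n\r\n                "]

# Stripping the string form of the whole row only trims its two ends,
# so the whitespace around the name in the middle survives
row_text = "[" + ", ".join(cells) + "]"
stripped_row = row_text.strip()
print(repr(stripped_row))

# Stripping each cell separately removes the padding inside the row too
clean_cells = [cell.strip() for cell in cells]
print(clean_cells)  # ['1', '814', 'JARED WILSON']
```

Stripping per cell also leaves you with a real list of values for each row, which is easier to turn into DataFrame columns than one long string.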
Looking at the URL you provided, you can see the newlines in the HTML source:
...
<td>814</td>
<td>
JARED WILSON
</td>
...
So that's what you get when you scrape. These can easily be removed with the very convenient .strip() string method.
Your DataFrame is not formatted correctly because you are giving it a list of lists which are not all the same size (see the first 4 lines); the short ones come from another table located at the top right of the page. One easy fix is to drop the first 4 lines, though it would be far more robust to select the table you want by its id ("individualResults").
df = pd.DataFrame(listnew[4:])
df.head(10)
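If hard-coding the slice feels fragile, one alternative is to keep only the rows that have the full number of cells. This is a minimal sketch on hypothetical, already-cleaned rows modelled on the question's output; it assumes the results table is the widest table on the page, which holds for the rows shown here but is not guaranteed in general.

```python
# Hypothetical cell lists, modelled on the first rows of the question's output
rows = [
    [],
    ["Finishers:", "577"],
    ["Male:", "414"],
    ["Female:", "163"],
    [],
    ["1", "814", "JARED WILSON"],
    ["2", "573", "NATHAN A SUSTERSIC"],
    ["3", "687", "FRANCISCO MAYA"],
]

# Assumption: the table we want has more cells per row than any other table
width = max(len(row) for row in rows)

# Keep only full-width rows; empty header rows and summary rows drop out
results = [row for row in rows if len(row) == width]
print(results)
```

The filtered list can then be passed to pd.DataFrame(results) in place of listnew[4:]. Selecting the table by its id remains the more robust option when the id is known.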
Have a look here: BeautifulSoup table to dataframe
Pandas can parse HTML tables. Give this a try:
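from io import StringIO
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
# select the results table by its id
table_1_html = soup.find('table', attrs={'id': 'individualResults'})
# wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
t_1 = pd.read_html(StringIO(table_1_html.prettify()))[0]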
print(t_1)
Output:
Place Bib Name ... Chip Pace Gun Time Team
0 1 814 JARED WILSON ... 5:51 36:24 NaN
1 2 573 NATHAN A SUSTERSIC ... 5:55 36:45 INTEL TEAM F
2 3 687 FRANCISCO MAYA ... 6:05 37:48 NaN
3 4 623 PAUL MORROW ... 6:13 38:37 NaN
4 5 569 DEREK G OSBORNE ... 6:20 39:24 INTEL TEAM F
.. ... ... ... ... ... ... ...
572 573 273 RACHEL L VANEY ... 15:51 1:38:34 NaN
573 574 467 ROHIT B DSOUZA ... 15:53 1:40:32 INTEL TEAM I
574 575 471 CENITA D'SOUZA ... 15:53 1:40:34 NaN
575 576 338 PRANAVI APPANA ... 16:15 1:42:01 NaN
576 577 443 LIBBY B MITCHELL ... 16:20 1:42:10 NaN
[577 rows x 10 columns]