Obtaining \r\n\r\n while scraping from web in Python
I am working on scraping text using Python from the link: tournament link
Here is my code to get the tabular data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows
Now, the goal is to obtain the data as a dataframe.
listnew=[]
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()  # obtain text part
    listnew.append(cleantext)  # append to list
df = pd.DataFrame(listnew)
df.head(10)
Then we get the following output:
0 []
1 [Finishers:, 577]
2 [Male:, 414]
3 [Female:, 163]
4 []
5 [1, 814, \r\n\r\n JARED WIL...
6 [2, 573, \r\n\r\n NATHAN A ...
7 [3, 687, \r\n\r\n FRANCISCO...
8 [4, 623, \r\n\r\n PAUL MORR...
9 [5, 569, \r\n\r\n DEREK G O..
I don't know why there are newline and carriage return characters (\r\n\r\n) in the output. How can I remove them and get a dataframe in the proper format? Thanks in advance.
It seems that some cells in the HTML code have a lot of leading and trailing spaces and newlines:
<td>
JARED WILSON
</td>
Use str.strip to remove all leading and trailing whitespace, like this: BeautifulSoup(str_cells, "lxml").get_text().strip()
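Note that str_cells in the question is the string form of the whole row, so calling .strip() on its text only trims the two ends of that string; whitespace inside the name cell stays. Below is a minimal sketch of stripping each cell separately instead. The sample_html string is a made-up stand-in for one row of the results page, and "html.parser" is used only so the snippet runs without lxml; neither comes from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row that mimics the whitespace seen on the results page.
sample_html = """
<table>
  <tr>
    <td>1</td>
    <td>814</td>
    <td>\r\n\r\n                    JARED WILSON\r\n\r\n                </td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
row = soup.find("tr")

# get_text(strip=True) trims leading and trailing whitespace of each cell's text.
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(cells)
```

Keeping the cells as a list of separate strings, rather than one string per row, also lets pandas place each value in its own column.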
Well, looking at the URL you provided, you can see the newlines in the HTML:
...
<td>814</td>
<td>
JARED WILSON
</td>
...
So that's what you get when you scrape. These can easily be removed with the very convenient .strip() string method.
Your DataFrame is not formatted correctly because you are giving it a list of lists that are not all the same size (see the first 4 lines); those come from another table located at the top right of the page. One easy fix is to remove the first 4 lines, though it would be far more robust to select the table you want based on its id ("individualResults").
df = pd.DataFrame(listnew[4:])
df.head(10)
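The more robust option mentioned above, selecting the table by its id, could look roughly like the sketch below. The sample_html string is a hypothetical, shortened stand-in for the real page (the "summary" id and the three-column layout are assumptions for illustration), and "html.parser" is used so the snippet does not depend on lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second one holds the results.
sample_html = """
<table id="summary">
  <tr><td>Finishers:</td><td>577</td></tr>
</table>
<table id="individualResults">
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>1</td><td>814</td><td>
      JARED WILSON
  </td></tr>
  <tr><td>2</td><td>573</td><td>
      NATHAN A SUSTERSIC
  </td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Restrict the search to the results table so rows of other tables are ignored.
table = soup.find("table", id="individualResults")

header = [th.get_text(strip=True) for th in table.find_all("th")]
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    if tds:  # skip the header row, which has <th> cells only
        records.append([td.get_text(strip=True) for td in tds])

df = pd.DataFrame(records, columns=header)
print(df)
```

Because every row now comes from the same table, all rows have the same number of cells, and no slicing such as listnew[4:] is needed.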
Have a look here: BeautifulSoup table to dataframe
Pandas can parse HTML tables. Give this a try:
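from io import StringIO
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
# select only the results table by its id
table_1_html = soup.find('table', attrs={'id': 'individualResults'})
# wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML strings
t_1 = pd.read_html(StringIO(table_1_html.prettify()))[0]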
print(t_1)
Output:
Place Bib Name ... Chip Pace Gun Time Team
0 1 814 JARED WILSON ... 5:51 36:24 NaN
1 2 573 NATHAN A SUSTERSIC ... 5:55 36:45 INTEL TEAM F
2 3 687 FRANCISCO MAYA ... 6:05 37:48 NaN
3 4 623 PAUL MORROW ... 6:13 38:37 NaN
4 5 569 DEREK G OSBORNE ... 6:20 39:24 INTEL TEAM F
.. ... ... ... ... ... ... ...
572 573 273 RACHEL L VANEY ... 15:51 1:38:34 NaN
573 574 467 ROHIT B DSOUZA ... 15:53 1:40:32 INTEL TEAM I
574 575 471 CENITA D'SOUZA ... 15:53 1:40:34 NaN
575 576 338 PRANAVI APPANA ... 16:15 1:42:01 NaN
576 577 443 LIBBY B MITCHELL ... 16:20 1:42:10 NaN
[577 rows x 10 columns]