
Obtaining \r\n\r\n while scraping from web in Python

I am working on scraping text with Python from this link: tournament link

Here is my code to get the tabular data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr') ## find the table rows

Now, the goal is to obtain the data as a dataframe.

listnew=[]
for row in rows:
    row_td = row.find_all('td')
    str_cells = str(row_td)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text() ##obtain text part
    listnew.append(cleantext) ## append to list

df = pd.DataFrame(listnew)
df.head(10)

Then we get the following output:

0   []
1   [Finishers:, 577]
2   [Male:, 414]
3   [Female:, 163]
4   []
5   [1, 814, \r\n\r\n JARED WIL...
6   [2, 573, \r\n\r\n NATHAN A ...
7   [3, 687, \r\n\r\n FRANCISCO...
8   [4, 623, \r\n\r\n PAUL MORR...
9   [5, 569, \r\n\r\n DEREK G O..

I don't know why there are newline and carriage return characters (\r\n\r\n) in the output. How can I remove them and get a dataframe in the proper format? Thanks in advance.

It seems that some cells in the HTML code have a lot of leading and trailing spaces and new lines:

<td>

                    JARED WILSON

                </td>

Use str.strip to remove all leading and trailing whitespace, like this: BeautifulSoup(str_cells, "lxml").get_text().strip()
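One detail worth illustrating: str.strip only trims the two ends of a string, so it may help to strip each cell's text separately rather than the text of a whole row. The following is a minimal sketch using plain strings; cell_texts is a hypothetical stand-in for what get_text() would return for the cells of one row, not data taken from the page.

```python
# Hypothetical cell texts for one table row (an assumption for illustration),
# with the name cell padded by carriage returns, newlines and spaces.
cell_texts = [
    "1",
    "814",
    "\r\n\r\n                    JARED WILSON\r\n\r\n                ",
    "M",
]

# Stripping each cell on its own removes the padding around every value.
clean_cells = [text.strip() for text in cell_texts]

# Joining first and stripping afterwards only trims the outer ends,
# so the whitespace around the name survives inside the string.
joined = ", ".join(cell_texts).strip()

print(clean_cells)
print(repr(joined))
```

With real BeautifulSoup cells, the same per-cell idea could be written as td.get_text(strip=True) for each td in the row.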

Looking at the URL you provided, you can see the new lines in the HTML source:

...
<td>814</td>
<td>
JARED WILSON
</td>
...

So that is what you get when you scrape. These can easily be removed with the very convenient .strip() string method.

Your DataFrame is not formatted correctly because you are giving it a list of lists that are not all the same size (see the first 4 lines); those lines come from another table located at the top right of the page. One easy fix is to remove the first 4 lines, though it would be far more robust to select the table you want by its id ("individualResults").

df = pd.DataFrame(listnew[4:])
df.head(10)
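The more robust option, selecting the table by its id, could look roughly like the sketch below. It runs on a small inline HTML snippet that is only a hypothetical stand-in for the real page, and it uses the built-in "html.parser" instead of "lxml" purely to keep the example self-contained. It also stores each row as a list of stripped cell texts, since the question's code appends str(row_td) as one string per row, which gives the DataFrame a single column.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the results page: a summary table
# followed by the results table with id "individualResults".
html = """
<table>
  <tr><td>Finishers:</td><td>577</td></tr>
</table>
<table id="individualResults">
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>1</td><td>814</td><td>

                    JARED WILSON

                </td></tr>
  <tr><td>2</td><td>573</td><td>

                    NATHAN A SUSTERSIC

                </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick only the results table, so rows of the summary table never appear.
table = soup.find("table", attrs={"id": "individualResults"})

# Column names come from the <th> cells of the header row.
header = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row has only <th> cells, so it is skipped
        # One list per row, one stripped string per cell.
        records.append([td.get_text(strip=True) for td in cells])

print(header)
print(records)
```

From there, pd.DataFrame(records, columns=header) should give one column per cell instead of one string per row.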

Have a look here: BeautifulSoup table to dataframe

Pandas can parse HTML tables. Give this a try:

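from io import StringIO
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# Select only the results table by its id, skipping the summary table
table_1_html = soup.find('table', attrs={'id': 'individualResults'})

# read_html returns a list of DataFrames; recent pandas versions expect
# literal HTML to be wrapped in a file-like object such as StringIO
t_1 = pd.read_html(StringIO(table_1_html.prettify()))[0]

print(t_1)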

Output:

     Place  Bib                Name  ... Chip Pace Gun Time          Team
0        1  814        JARED WILSON  ...      5:51    36:24           NaN
1        2  573  NATHAN A SUSTERSIC  ...      5:55    36:45  INTEL TEAM F
2        3  687      FRANCISCO MAYA  ...      6:05    37:48           NaN
3        4  623         PAUL MORROW  ...      6:13    38:37           NaN
4        5  569     DEREK G OSBORNE  ...      6:20    39:24  INTEL TEAM F
..     ...  ...                 ...  ...       ...      ...           ...
572    573  273      RACHEL L VANEY  ...     15:51  1:38:34           NaN
573    574  467      ROHIT B DSOUZA  ...     15:53  1:40:32  INTEL TEAM I
574    575  471      CENITA D'SOUZA  ...     15:53  1:40:34           NaN
575    576  338      PRANAVI APPANA  ...     16:15  1:42:01           NaN
576    577  443    LIBBY B MITCHELL  ...     16:20  1:42:10           NaN

[577 rows x 10 columns]
