
Parsing an HTML URL into a pandas table

I have the following URL link.
Is there a simple way to create a pandas table in a Jupyter notebook directly from the URL?
The first column should correspond to the word (e.g. Eachwhere), the second column to what is inside the parentheses (e.g. adv.), and the third column to what follows the parentheses (e.g. Everywhere).

From the link:

E () The fifth letter of the English alphabet.
E () E is the third tone of the model diatonic scale. E/ (E flat) is a tone which is intermediate between D and E.
E- () A Latin prefix meaning out, out of, from; also, without. See Ex-.
Each (a. / a. pron.) Every one of the two or more individuals composing a number of objects, considered separately from the rest. It is used either with or without a following noun; as, each of you or each one of you.
Each (a. / a. pron.) Every; -- sometimes used interchangeably with every.
Eachwhere (adv.) Everywhere.
Eadish (n.) See Eddish.

As mentioned in the comments, you can use a library like beautifulsoup or lxml to get the job done. There are several ways to approach it. Here's one, using beautifulsoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "http://www.mso.anu.edu.au/%7Eralph/OPTED/v003/wb1913_e.html"
req = requests.get(url)
req.raise_for_status()  # stop early on an HTTP error

soup = bs(req.text, 'lxml')
columns = ['word', 'part', 'meaning']
entries = []
for p in soup.select('p'):
    word = p.select_one('b')  # the headword is in <b>
    part = p.select_one('i')  # the part of speech is in <i>
    if word is None:
        continue  # skip paragraphs that are not dictionary entries
    # an empty or missing <i> tag means no part of speech is given
    prt = part.text if part is not None and part.text else "na"
    # split only on the first ') ' so parentheses inside the definition are kept
    meaning = p.text.split(') ', 1)[-1]
    entries.append([word.text, prt, meaning])
pd.DataFrame(entries, columns=columns)

Output:

   word  part           meaning
0  E     na             The fifth letter of the English alphabet.
1  E     na             E is the third tone of the model diatonic scal...
2  E-    na             A Latin prefix meaning out, out of, from; also...
3  Each  a. / a. pron.  Every one of the two or more individuals compo...

etc.
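If you would rather not depend on where the `) ` separator falls, one alternative is to match each entry's text against a regular expression. This is only a sketch: `ENTRY_RE` and `parse_entry` are names made up for illustration, and the pattern assumes every entry looks like `Word (part) definition`, with no `)` inside the part of speech. It runs on a few sample lines from the question instead of fetching the page.

```python
import re

# Hypothetical pattern for one entry: "Word (part of speech) definition".
# Assumption: the part of speech itself never contains ")".
ENTRY_RE = re.compile(
    r"^(?P<word>.+?) \((?P<part>[^)]*)\)\s*(?P<meaning>.*)$", re.S
)

def parse_entry(text, missing="na"):
    """Split one entry into (word, part, meaning); return None if it does not match."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        return None
    # An empty pair of parentheses means no part of speech was given.
    part = match.group("part").strip() or missing
    return match.group("word"), part, match.group("meaning").strip()

# Sample lines copied from the question.
sample = [
    "E () The fifth letter of the English alphabet.",
    "E () E is the third tone of the model diatonic scale. "
    "E/ (E flat) is a tone which is intermediate between D and E.",
    "Eachwhere (adv.) Everywhere.",
]
rows = [row for row in (parse_entry(line) for line in sample) if row is not None]
for row in rows:
    print(row)
```

With the real page you could feed it `p.text` for each paragraph and pass the resulting rows to `pd.DataFrame(rows, columns=['word', 'part', 'meaning'])`.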
