
Parsing html URL into pandas table

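I have the following URL link.
Is there a simple way to create a pandas table directly from the URL in a Jupyter notebook?
The first column should hold the word (e.g. Eachwhere), the second column the content inside the parentheses (e.g. adv.), and the third column the content after the parentheses (e.g. Everywhere).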

From the link:

E () The fifth letter of the English alphabet.
E () E is the third tone of the model diatonic scale. E/ (E flat) is a tone which is intermediate between D and E.
E- () A Latin prefix meaning out, out of, from; also, without. See Ex-.
Each (a. / a. pron.) Every one of the two or more individuals composing a number of objects, considered separately from the rest. It is used either with or without a following noun; as, each of you or each one of you.
Each (a. / a. pron.) Every; -- sometimes used interchangeably with every.
Eachwhere (adv.) Everywhere.
Eadish (n.) See Eddish.

As mentioned in the comments, you can use a library such as beautifulsoup or lxml to get the job done. There are several ways to approach it. Here is one, using beautifulsoup:

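import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "http://www.mso.anu.edu.au/%7Eralph/OPTED/v003/wb1913_e.html"
req = requests.get(url)

soup = bs(req.text, 'lxml')
columns = ['word', 'part', 'meaning']
entries = []
# Each dictionary entry is a <p> holding <b>word</b> (<i>part</i>) meaning
for p in soup.select('p'):
    word, part = p.select_one('b'), p.select_one('i')
    if word is None:
        continue  # skip paragraphs that are not dictionary entries
    # Use "na" when the part of speech is missing or empty
    prt = part.text if part is not None and part.text else "na"
    # Keep the text after the last ') '; a meaning that itself contains ') '
    # is cut short, as in row 1 of the output
    entries.append([word.text, prt, p.text.split(') ')[-1]])
pd.DataFrame(entries, columns=columns)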

Output:

   word           part                                            meaning
0     E             na          The fifth letter of the English alphabet.
1     E             na   is a tone which is intermediate between D and E.
2    E-             na  A Latin prefix meaning out, out of, from; also...
3  Each  a. / a. pron.  Every one of the two or more individuals compo...

And so on.
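Row 1 of the output above loses the start of its definition, because that meaning itself contains ") ". One possible way around this is to match each entry's text with a regular expression that stops at the first closing parenthesis. The following is only a sketch: the names `SAMPLE`, `ENTRY_RE`, `parse_entry`, `parse_entries` and `to_frame` are hypothetical, and the sample strings are copied from the question rather than fetched from the URL, on the assumption that the text of each paragraph follows the same "word (part) meaning" layout.

```python
import re

# Sample entries copied from the question. In the scraping loop, each string
# would instead be the text of one <p> element (assumption: same layout).
SAMPLE = [
    "E () The fifth letter of the English alphabet.",
    "E () E is the third tone of the model diatonic scale. "
    "E/ (E flat) is a tone which is intermediate between D and E.",
    "E- () A Latin prefix meaning out, out of, from; also, without. See Ex-.",
    "Each (a. / a. pron.) Every; -- sometimes used interchangeably with every.",
    "Eachwhere (adv.) Everywhere.",
]

# word, then "(part)", then everything after the first closing parenthesis.
ENTRY_RE = re.compile(
    r"^(?P<word>.+?) \((?P<part>[^)]*)\)\s*(?P<meaning>.*)$", re.DOTALL
)


def parse_entry(text):
    """Return (word, part, meaning), or None if the text does not match."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        return None
    # Mirror the answer's convention of "na" for an empty part of speech.
    part = match.group("part").strip() or "na"
    return (match.group("word").strip(), part, match.group("meaning").strip())


def parse_entries(texts):
    """Parse an iterable of entry strings, skipping any that do not match."""
    parsed = (parse_entry(t) for t in texts)
    return [row for row in parsed if row is not None]


def to_frame(rows):
    """Build the three-column table from parsed rows."""
    # Imported here so the parsing helpers above work without pandas.
    import pandas as pd

    return pd.DataFrame(rows, columns=["word", "part", "meaning"])


rows = parse_entries(SAMPLE)
```

In the notebook, something like `to_frame(parse_entries(p.get_text() for p in soup.select('p')))` should then give the same three columns, with the full definition kept for entries such as the second "E".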

