
Parsing html URL into pandas table

I have the following URL link.
Is there a simple way to create a pandas table in a Jupyter notebook directly from the URL?
The first column should correspond to the word (e.g. Eachwhere), the second to what's inside the parentheses (e.g. adv.), and the third to what follows the parentheses (e.g. Everywhere).

From the link:

E () The fifth letter of the English alphabet.
E () E is the third tone of the model diatonic scale. E/ (E flat) is a tone which is intermediate between D and E.
E- () A Latin prefix meaning out, out of, from; also, without. See Ex-.
Each (a. / a. pron.) Every one of the two or more individuals composing a number of objects, considered separately from the rest. It is used either with or without a following noun; as, each of you or each one of you.
Each (a. / a. pron.) Every; -- sometimes used interchangeably with every.
Eachwhere (adv.) Everywhere.
Eadish (n.) See Eddish.

As mentioned in the comments, you can use a library like BeautifulSoup or lxml to get the job done. There are several ways to approach it; here's one using BeautifulSoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "http://www.mso.anu.edu.au/%7Eralph/OPTED/v003/wb1913_e.html"
req = requests.get(url)
req.raise_for_status()  # stop early on an HTTP error

soup = bs(req.text, 'lxml')
columns = ['word', 'part', 'meaning']
entries = []
for p in soup.select('p'):
    word = p.select_one('b')
    if word is None:
        continue  # skip paragraphs that are not dictionary entries
    part = p.select_one('i')
    # an empty or missing <i> tag means no part of speech is given
    prt = part.text if part is not None and part.text else "na"
    # split only on the first ') ' so parentheses inside the definition are kept
    meaning = p.text.split(') ', 1)[-1]
    entries.append([word.text, prt, meaning])
pd.DataFrame(entries, columns=columns)

Output:

   word  part           meaning
0  E     na             The fifth letter of the English alphabet.
1  E     na             E is the third tone of the model diatonic scal...
2  E-    na             A Latin prefix meaning out, out of, from; also...
3  Each  a. / a. pron.  Every one of the two or more individuals compo...

etc.
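
If you want to check the splitting logic without fetching the page, here is a minimal offline sketch. It uses only the standard library and a regular expression instead of BeautifulSoup, which is a swapped-in technique and not what the code above does. The names SAMPLE, ENTRY_RE, and parse_entry are made up for illustration, and the pattern assumes every entry looks like "word (part) meaning", as in the lines quoted in the question.

```python
import re

# Sample lines copied from the question; on the real page each one
# would be the text of a single <p> element.
SAMPLE = [
    "E () The fifth letter of the English alphabet.",
    "E () E is the third tone of the model diatonic scale. "
    "E/ (E flat) is a tone which is intermediate between D and E.",
    "Each (a. / a. pron.) Every; -- sometimes used interchangeably with every.",
    "Eachwhere (adv.) Everywhere.",
]

# word: shortest text before the first " (";
# part: whatever sits inside that first pair of parentheses (may be empty);
# meaning: everything after the closing ") ".
ENTRY_RE = re.compile(r"^(?P<word>.+?) \((?P<part>[^)]*)\) (?P<meaning>.*)$")


def parse_entry(line):
    """Return (word, part, meaning), or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    # An empty "()" means no part of speech; use "na" as in the answer above.
    return (m.group("word"), m.group("part") or "na", m.group("meaning"))


rows = [r for r in map(parse_entry, SAMPLE) if r is not None]
for row in rows:
    print(row)
```

The resulting rows can be passed straight to pd.DataFrame(rows, columns=['word', 'part', 'meaning']). Because the pattern anchors on the first pair of parentheses, a definition that itself contains parentheses, like the second sample line, stays intact.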
