Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?

Original 2011-01-03 23:20:38 4 1 python/ html-content-extraction/ text-extraction

I'd like to figure out a way to extract only the links that appear in the body text of an article.

1.) I use readability in Python: https://github.com/gfxmonk/python-readability

2.) I'd like to somehow compare the extracted text with the original HTML, so that I can pull out just the links that sit in the actual body of the article.

1 answer

Well, it looks like python-readability returns a BeautifulSoup tree, so you should be able to do something like:

article = page.summary()        # Extract the article using readability
links = article.findAll("a")    # List of all <a> tags in the article
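If the extraction step gives you an HTML string or plain text rather than a BeautifulSoup tree, the same idea can be sketched with only the Python 3 standard library. This is a rough sketch, not readability's own API: `LinkCollector`, `extract_links` and `links_in_text` are hypothetical names, and the anchor-text matching in `links_in_text` is a simple heuristic that assumes the link text survives extraction unchanged.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, anchor text) pair found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def links_in_text(original_html, article_text):
    """Heuristic: keep links from the original page whose anchor text
    also appears in the extracted article text."""
    return [
        (href, text)
        for href, text in extract_links(original_html)
        if text and text in article_text
    ]
```

Use `extract_links` on the article HTML when the extractor returns markup, and `links_in_text` on the full page when it returns only text. Short or generic anchor text (for example "here") can match by accident, so treat the second function as a starting point.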
