
Extracting text from multiple URLs

I'm very green when it comes to Python but I see how powerful it is. I'd like to try a few things with it but I'm pretty much teaching myself so please, feel free to explain things in their most basic terms. :/

I tried the goose extraction tool to pull some text from a URL and it worked pretty well. It was pretty simple...

from goose import Goose

url = 'http://example.com'
g = Goose()
article = g.extract(url=url)

article.cleaned_text

I'd like to replicate the process so I can extract text from hundreds of URLs. Is there a way to set this up so I can enter a list of URLs, extract text, and then (my guess) I could join them together for NLP or whatever else I want to do? Thanks in advance...

Put all the URLs in a text file, one per line:

http://example1.com
http://example2.com
http://example3.com

Then read that file into a list and loop over it:

from goose import Goose

# Read the URLs from a file, one per line, skipping blank lines
with open("url_list.txt", "r") as url_file:
    url_list = [line.strip() for line in url_file if line.strip()]

# Create one extractor and reuse it for every URL
g = Goose()

for url in url_list:
    article = g.extract(url=url)

    # process/store ...
    text = article.cleaned_text

Once you have the text you need for analysis, store it, then do the processing in a separate block of code.
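One way to handle the storing step, as a minimal sketch: keep each article's text next to its URL, then join everything into one string for NLP. The names `collect_texts`, `join_texts`, and `extract_text` are made up for illustration and are not part of goose. `extract_text` stands for any function that takes a URL and returns text.

```python
def collect_texts(urls, extract_text):
    """Return a list of (url, text) pairs, in the same order as urls.

    extract_text is a hypothetical callable: it takes a URL, returns a string.
    """
    pairs = []
    for url in urls:
        pairs.append((url, extract_text(url)))
    return pairs


def join_texts(pairs, separator="\n\n"):
    """Join the extracted texts into one string, skipping empty results."""
    return separator.join(text for _, text in pairs if text)
```

With goose, assuming `g = Goose()` as in the loop above, the call would look like `pairs = collect_texts(url_list, lambda url: g.extract(url=url).cleaned_text)`, followed by `corpus = join_texts(pairs)`.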

Yes. You can either iterate over a list (a built-in Python object) of URLs, or read the URLs from a file:

Get URLs from a list:

from goose import Goose

list_of_urls = ['url1', 'url2', 'url1000']  # etc.
g = Goose()
for url in list_of_urls:
    article = g.extract(url=url)
    text = article.cleaned_text
    # do more stuff with text

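Read URLs from a file:

from goose import Goose

url_filename = 'url_list.txt'  # placeholder: path to your file, one URL per line
g = Goose()
with open(url_filename) as url_file:
    lines = url_file.readlines()

for line in lines:
    url = line.strip()  # drop the trailing newline
    if url:  # skip blank lines
        article = g.extract(url=url)
        # do more stuff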
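With hundreds of URLs, some will probably fail (dead links, timeouts, pages that can't be parsed), and an unhandled exception would stop the whole loop. Here is a rough sketch of one way around that, not something goose provides: `extract_many` and `extract_text` are hypothetical names, and `extract_text` is whatever function you use to turn a URL into text.

```python
def extract_many(urls, extract_text):
    """Try every URL; return (results, failures) instead of stopping on an error.

    extract_text is a hypothetical callable: it takes a URL, returns a string.
    """
    results = {}   # url -> extracted text
    failures = {}  # url -> error message
    for url in urls:
        try:
            results[url] = extract_text(url)
        except Exception as error:
            # Broad on purpose: network and parsing errors come in many types.
            failures[url] = str(error)
    return results, failures
```

With goose, assuming `g = Goose()`, you could call it as `results, failures = extract_many(urls, lambda url: g.extract(url=url).cleaned_text)` and then check `failures` to see which URLs need another try.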
