
Extract anything that looks like a link from a large amount of data in Python

I have around 5 GB of HTML data which I want to process to find links to a set of websites and then perform some additional filtering. Right now I use a simple regexp for each site and iterate over them, searching for matches. In my case, links can appear outside of "a" tags and be malformed in many ways (like having "\n" in the middle of the link), so I try to grab as many "links" as I can and check them later in other scripts (hence no BeautifulSoup/lxml/etc.). The problem is that my script is pretty slow, so I am thinking about ways to speed it up. I am writing a set of tests to compare different approaches, but I hope to get some advice :)
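For reference, a minimal sketch of what such a per-site loop typically looks like (the patterns here are placeholders, not the ones actually used):

```python
import re

# Placeholder per-site patterns; the real ones match whatever sites are of interest.
site_patterns = [
    re.compile(r'https?://(?:www\.)?example\.com/\S+'),
    re.compile(r'https?://(?:www\.)?another-site\.org/\S+'),
]

def find_links(text):
    # Current approach: run every site's regexp over the whole text, one after another.
    matches = []
    for pattern in site_patterns:
        matches.extend(pattern.findall(text))
    return matches
```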

Right now I am thinking about first getting all the links without filtering (maybe using a C module or a standalone app that doesn't use regexps but a simple search to find the start and end of every link) and then using regexps to match the ones I need.
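A rough sketch of that two-pass idea, assuming the data is read as bytes and that a hypothetical site filter is applied in the second step:

```python
import re

# Deliberately loose candidate pattern (an assumption about what counts as a "link"):
# grab from "http" up to characters that can't plausibly be part of a URL, and allow
# one embedded newline since links in the data may be broken across lines.
CANDIDATE = re.compile(rb'https?://[^\s"\'<>]+(?:\n[^\s"\'<>]+)?')

# Hypothetical second-stage filter for the sites of interest.
SITES = re.compile(rb'example\.com|another-site\.org')

def extract_links(chunk: bytes):
    # First pass: collect everything link-like; second pass: keep only target sites.
    candidates = CANDIDATE.findall(chunk)
    return [c.replace(b'\n', b'') for c in candidates if SITES.search(c)]
```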

A few ways out:

  • Parallelise the work across processes (see the sketch after this list).
  • Profile your code to see where the bottleneck is. The results are often surprising.
  • Use a single regexp (concatenate the patterns with |) rather than multiple ones, as shown below.
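A sketch combining the first and last points, with placeholder patterns joined into one alternation and each file of the dump scanned by a separate worker process (the directory layout is an assumption):

```python
import re
import glob
from multiprocessing import Pool

# Placeholder per-site patterns; joining them with "|" means one pass over the data
# instead of one pass per site.
patterns = [
    r'https?://(?:www\.)?example\.com/\S+',
    r'https?://(?:www\.)?another-site\.org/\S+',
]
combined = re.compile('|'.join('(?:%s)' % p for p in patterns))

def scan_file(path):
    # Each worker process scans one file of the dump with the combined regexp.
    with open(path, encoding='utf-8', errors='replace') as f:
        return combined.findall(f.read())

if __name__ == '__main__':
    files = glob.glob('html_dump/*.html')  # hypothetical layout of the 5 GB dump
    with Pool() as pool:
        for links in pool.imap_unordered(scan_file, files):
            for link in links:
                print(link)
```

For the profiling point, running the script under cProfile (e.g. `python -m cProfile -s cumtime yourscript.py`) is usually enough to see whether the regexps or the I/O dominate.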
