Python - Using pandas and enumerate for web crawling
So I recently found this code segment online. It is in Python and uses the built-in enumerate together with pandas.
import pandas as pd

url = 'http://myurl.com/mypage/'
# pd.read_html returns a list of DataFrames, one per <table> on the page
for i, df in enumerate(pd.read_html(url)):
    df.to_csv('myfile_%s.csv' % i)
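For context, pd.read_html parses every <table> element it finds into its own DataFrame and returns them as a list, which is why enumerate is used to number the output files. A quick illustration that runs without a network call, using an inline HTML string as a hypothetical stand-in for the page (read_html needs lxml or html5lib/bs4 installed):

import io
import pandas as pd

# a tiny page with one table, standing in for the real URL
html = '<table><tr><th>a</th></tr><tr><td>1</td></tr></table>'
tables = pd.read_html(io.StringIO(html))
print(len(tables))   # 1 -> one DataFrame per <table> found
print(tables[0])     # the single table, with column 'a'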
Is there a way to rewrite this so it can go through a list of webpages rather than a single URL, and put all the information from each page's tables into a single .csv file? My main guess is something like a for loop.
url_base = 'http://myurl.com/mypage/'
for page in range(1, 5):
    url = '%s%s' % (url_base, page)
    for i, df in enumerate(pd.read_html(url)):
        # include the page number so files from one page do not
        # overwrite those from another
        df.to_csv('myfile_%s_%s.csv' % (page, i))
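One way to rewrite that loop so everything lands in a single file is to append each table to one CSV and write the header only once; a minimal sketch, assuming the pages are numbered 1 through 4 as above and all tables share the same columns:

import pandas as pd

url_base = 'http://myurl.com/mypage/'
first = True
for page in range(1, 5):
    for df in pd.read_html('%s%s' % (url_base, page)):
        # append to a single file; only the first write includes the header
        df.to_csv('myfile.csv', mode='a', header=first, index=False)
        first = False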
If all the tables have the same columns, you can concatenate everything into one DataFrame. Note that pd.read_html returns a list of DataFrames per page, so the lists need to be flattened first:
pd.concat([df for url in urls for df in pd.read_html(url)], ignore_index=True)
If your URLs share a common base like the ones in your example, you would do:
url_base = 'http://myurl.com/mypage/{}'
# num_pages: how many pages to fetch; flatten the per-page lists of tables
df = pd.concat([table for i in range(num_pages)
                      for table in pd.read_html(url_base.format(i))],
               ignore_index=True)
df.to_csv('alldata.csv')
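One caveat: read_html raises ValueError when a page contains no tables, which would abort the whole comprehension. A slightly more defensive sketch, with a hypothetical tables_or_empty helper that skips such pages:

import pandas as pd

def tables_or_empty(url):
    # hypothetical helper: return an empty list instead of raising
    try:
        return pd.read_html(url)
    except ValueError:  # read_html raises ValueError when it finds no tables
        return []

df = pd.concat([table for i in range(num_pages)
                      for table in tables_or_empty(url_base.format(i))],
               ignore_index=True)
df.to_csv('alldata.csv')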
How about this?
import pandas as pd
from concurrent import futures

urls = [your list of urls]

def read_html(url):
    # pd.read_html returns a list of tables; merge them into one frame per page
    return pd.concat(pd.read_html(url), ignore_index=True)

with futures.ThreadPoolExecutor(max_workers=6) as executor:
    fetched_urls = dict((executor.submit(read_html, url), url)
                        for url in urls)
    for num, future in enumerate(futures.as_completed(fetched_urls), 1):
        # check for a failure first: calling result() on a failed future
        # would re-raise its exception
        if future.exception():
            print('{} yielded no results'.format(fetched_urls[future]))
        else:
            future.result().to_csv('myfile_{}.csv'.format(num), index=False)
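If the goal is the single .csv the question asks for, the same threaded approach can collect the per-page frames and concatenate them at the end; a minimal sketch reusing the read_html helper and urls list above:

with futures.ThreadPoolExecutor(max_workers=6) as executor:
    fetched_urls = dict((executor.submit(read_html, url), url)
                        for url in urls)
    # keep only the pages that completed without an exception
    frames = [f.result() for f in futures.as_completed(fetched_urls)
              if f.exception() is None]

pd.concat(frames, ignore_index=True).to_csv('alldata.csv', index=False)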