
Create new variables/class instances inside for-loop? Python web scraping

I am currently working on a web scraper that takes urls as inputs, finds each page, scrapes it, then returns the results in a CSV. The scraper works well for a single URL at a time. But unfortunately, whenever it writes a new line to the results CSV, it also appends the previous url's scrape results in each column. I need a loop that essentially creates a new class instance on each iteration so that this doesn't happen. Something like this: take the list of urls, then create a unique class instance for each one:

links = ['www.SomeLink1.com','www.Somelink2.com','www.SomeLink3.com']


person1 = Person('www.SomeLink1.com', driver = driver, close_on_complete = False)
person2 = Person('www.Somelink2.com', driver = driver, close_on_complete = False)
person3 = Person('www.SomeLink3.com', driver = driver, close_on_complete = False) 
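
In loop form, that would look something like the sketch below (a sketch only, assuming each Person() call returns an independent object, which is exactly what doesn't seem to be happening):

people = []
for link in links:
    # each iteration should construct a fresh, independent Person instance
    people.append(Person(link, driver=driver, close_on_complete=False))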

I do not have access to the source code, so I cannot add a new method like person1.reset() or similar.

Here is also the original code I was using to scrape multiple pages:

# Import libraries
from linkedin_scraper import Person, actions
from selenium import webdriver
import csv
import os
import pandas as pd
import numpy as np
import smtplib

# Read-in list of contacts:
contacts = pd.read_csv("/Users/Desktop/ContactList.csv")
names = contacts['contact_name'].tolist()
urls = contacts['contact_url'].tolist()
# turn contacts list into dictionary just in case
contact_dict = {names[i]: urls[i] for i in range(len(names))}
print(contact_dict)
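# note: dict(zip(names, urls)) builds the same mapping more concisely
# (assumes each contact name is unique, since duplicate keys would collide)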

# automatically login to LinkedIn
driver = webdriver.Chrome('/Users/Downloads/chromedriver')
email = os.environ.get('LINKEDIN_USER')
password = os.environ.get('LINKEDIN_PASS')
actions.login(driver, email, password)

# create general field names
fields = ['name', 'about', 'job_title', 'location','company',
          'education','accomplishments','linkedin_url']

# write the header row once; newline='' avoids blank lines in the CSV
with open('ScrapeResults.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(fields)
# no explicit f.close() needed: the with-block closes the file automatically

# Loop-through urls to scrape multiple pages at once
for individual,link in contact_dict.items():

    ## assign ##
    the_name = individual
    the_link = link
    # scrape peoples url:
    person = Person(the_link, driver=driver, close_on_complete=False)

    # row to be written (nested in an outer list because writerows expects a list of rows)
    rows = [[person.name, person.about, person.job_title, person.location, person.company,
             person.educations, person.accomplishments, person.linkedin_url]]
    # append this person's row; the with-block closes the file, so no f.close()
    with open('ScrapeResults.csv', 'a', newline='') as f:
        write = csv.writer(f)
        write.writerows(rows)
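
For what it's worth, the repeated open/append/close per iteration can be avoided by keeping the file open for the whole loop. A minimal restructuring (same fields and Person API as above; this is not itself a fix for the accumulation problem) would be:

with open('ScrapeResults.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerow(fields)
    for individual, link in contact_dict.items():
        # scrape one profile per iteration and write its row immediately
        person = Person(link, driver=driver, close_on_complete=False)
        write.writerow([person.name, person.about, person.job_title,
                        person.location, person.company, person.educations,
                        person.accomplishments, person.linkedin_url])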

Could you try instantiating a new driver each time? That should reset the counters in driver for you.

for individual, link in contact_dict.items():
    the_name = individual
    the_link = link
    driver = Driver()  # I don't know how to instantiate this
    person = Person(the_link, driver=driver, close_on_complete=False)

Without access to the driver documentation, I cannot say how to properly instantiate it. It might even have a helper such as clear() or reset() for its internal variables, which would be preferable to recreating the driver from scratch. In any case, the scraper should have straightforward documentation for this.
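
Since the question already builds the driver with Selenium's webdriver.Chrome and logs in via actions.login, a fresh session per iteration might look like the sketch below. This is an assumption on my part, and it will be slower, because each new session has to log in again:

for individual, link in contact_dict.items():
    # brand-new browser session, so no state can carry over between profiles
    driver = webdriver.Chrome('/Users/Downloads/chromedriver')
    actions.login(driver, email, password)  # re-authenticate the new session
    person = Person(link, driver=driver, close_on_complete=False)
    # ... write this person's fields to the CSV as before ...
    driver.quit()  # tear the session down before the next iteration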

I got in touch with the creator of the "linkedin_scraper" library. He fixed a bug that cached the previous LinkedIn profile's values and accumulated them when scraping multiple profiles at once.

Issue resolved in version 2.7.5.

Please see: https://github.com/joeyism/linkedin_scraper/issues/84
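
Upgrading the package with pip should pick up the fixed version:

pip install --upgrade linkedin_scraper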

Thanks all!
