Create new variables/class instances inside for-loop? Python web scraping
I am currently working on a web scraper that takes URLs as input, finds the page, scrapes it, then returns results in a CSV. The scraper works well for a single URL at a time. But unfortunately, whenever it writes a new row to the scrape-results CSV, it also appends the previous URL's scrape results in each column. I need a loop that essentially creates new class variables inside the loop so that this doesn't happen. Something that does this: takes a list of URLs, then also creates a unique class instance for each one.
links = ['www.SomeLink1.com','www.Somelink2.com','www.SomeLink3.com']
person1 = Person('www.SomeLink1.com', driver = driver, close_on_complete = False)
person2 = Person('www.Somelink2.com', driver = driver, close_on_complete = False)
person3 = Person('www.SomeLink3.com', driver = driver, close_on_complete = False)
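For illustration, a minimal sketch of that pattern: instead of numbered variables (person1, person2, ...), bind a fresh Person on each iteration and collect the instances in a list (this assumes driver is already a logged-in WebDriver, as in the full code below):

links = ['www.SomeLink1.com', 'www.Somelink2.com', 'www.SomeLink3.com']

people = []
for link in links:
    # each pass through the loop creates a brand-new Person instance
    people.append(Person(link, driver=driver, close_on_complete=False))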
I do not have access to the source code to add a new method like person1.reset() or similar.
Here is the original code I was using to scrape multiple pages:
# Import libraries
from linkedin_scraper import Person, actions
from selenium import webdriver
import csv
import os
import pandas as pd
import numpy as np
import smtplib

# Read in list of contacts:
contacts = pd.read_csv("/Users/Desktop/ContactList.csv")
names = contacts['contact_name'].tolist()
urls = contacts['contact_url'].tolist()

# Turn the contacts list into a dictionary, just in case
contact_dict = {names[i]: urls[i] for i in range(len(names))}
print(contact_dict)

# Automatically log in to LinkedIn
driver = webdriver.Chrome('/Users/Downloads/chromedriver')
email = os.environ.get('LINKEDIN_USER')
password = os.environ.get('LINKEDIN_PASS')
actions.login(driver, email, password)

# Create general field names and write the header row
fields = ['name', 'about', 'job_title', 'location', 'company',
          'education', 'accomplishments', 'linkedin_url']
with open('ScrapeResults.csv', 'w') as f:
    # the with block closes the file automatically
    write = csv.writer(f)
    write.writerow(fields)

# Loop through urls to scrape multiple pages at once
for individual, link in contact_dict.items():
    the_name = individual
    the_link = link
    # scrape the person's url:
    person = Person(the_link, driver=driver, close_on_complete=False)
    # rows to be written
    rows = [[person.name, person.about, person.job_title, person.location,
             person.company, person.educations, person.accomplishments,
             person.linkedin_url]]
    # append this person's row to the results file
    with open('ScrapeResults.csv', 'a') as f:
        write = csv.writer(f)
        write.writerows(rows)
Could you try instantiating a new driver each time? That should reset counters in driver for you.
for individual, link in contact_dict.items():
    the_name = individual
    the_link = link
    driver = Driver()  # I don't know how to instantiate this
    person = Person(the_link, driver=driver, close_on_complete=False)
Without access to driver documentation, I cannot speak to how to properly instantiate it. As well, it might even have a helper to clear() or reset() internal variables, which would be preferable to recreating the driver from scratch. In any case, the scraper should have straightforward documentation for this.
Got in touch with the creator of the "linkedin_scraper" library. He fixed a bug that cached previous LinkedIn profile values and accumulated them when scraping multiple profiles at once.
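As an illustration only (this is an assumption about the failure mode, not the library's actual code), that kind of accumulation is exactly what happens in Python when a mutable list is defined at class level and ends up shared by every instance:

class Scraped:
    values = []                 # class-level list, shared by ALL instances

    def add(self, v):
        self.values.append(v)   # mutates the one shared list

a = Scraped()
a.add('profile 1 data')
b = Scraped()                   # a "new" instance...
print(b.values)                 # ['profile 1 data'] -- carries a's results

# The per-instance fix:
class ScrapedFixed:
    def __init__(self):
        self.values = []        # fresh list for every instance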
Issue resolved in version 2.7.5.
Please see: https://github.com/joeyism/linkedin_scraper/issues/84
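To pick up the fix, upgrading the package should be enough (assuming the PyPI name matches the import name): pip install --upgrade linkedin_scraper, or pin linkedin_scraper>=2.7.5 in your requirements.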
Thanks all!