Advice on the best way to find intersection of two dictionaries
I have posted a similar question before, but after reworking the project, I've gotten here:
With two csv files (new.csv, scrapers.csv) -
new.csv contains a single column:
'urls' = whole URLs
scrapers.csv contains two columns:
'scraper_dom' = A simplification of specific URL domains
'scraper_id' = An associated scraper_id that is used to import URLs to a separately managed database
My goal here is to iterate through new.csv (parsing out fnetloc using urlparse) and perform a lookup on scrapers.csv to return the set of matching 'scraper_id' values for a given set of 'urls' (the way a VLOOKUP would work, or a JOIN in SQL), once urlparse does its thing to isolate the netloc within the URL (the result of fnetloc).
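For reference, this is the kind of isolation I mean (illustrative URL, not from my data):

from urllib.parse import urlparse

p = urlparse('https://www.example.com/path?query=1')
print(p.netloc)  # prints: www.example.com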
My next big issue is that urlparse does not parse the URLs (from new.csv) down to the exact simplification found in the scrapers.csv file, so I'd be reliant on a sort of partial match until I can figure out the regular expressions to use for that part of it.
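Something like this suffix check is the sort of partial match I have in mind (a sketch only; the domain here is made up):

from urllib.parse import urlparse

def partial_match(url: str, scraper_dom: str) -> bool:
    # Count it as a match when the URL's netloc ends with the
    # simplified domain, e.g. 'www.example.com' vs 'example.com'.
    return urlparse(url).netloc.endswith(scraper_dom)

print(partial_match('https://www.example.com/page/1', 'example.com'))  # True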
I've imported pandas because previous attempts found me creating DataFrames and performing a pd.merge, but I couldn't get that to work either...
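Roughly, the merge I was attempting looked like this (a sketch; it assumes the parsed netloc matches 'scraper_dom' exactly, which is the part that fails for me):

import pandas as pd
from urllib.parse import urlparse

new_df = pd.read_csv('new.csv')            # column: 'urls'
scrapers_df = pd.read_csv('scrapers.csv')  # columns: 'scraper_dom', 'scraper_id'

# Derive the netloc so the two frames share a join key.
new_df['scraper_dom'] = new_df['urls'].apply(lambda u: urlparse(u).netloc)

# An inner join keeps only the URLs whose netloc matches a scraper domain exactly.
matched = new_df.merge(scrapers_df, on='scraper_dom', how='inner')
print(matched[['urls', 'scraper_id']])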
Current code; the commented-out bits at the bottom are failed attempts, I just thought I'd include what I've tried thus far. (The ## lines are just intermediate print statements I put in to check the output of the program.)
import pandas as pd, re
from urllib.parse import urlparse
import csv

sd = {}
sid = {}
#INT = []

def fnetloc(any):
    try:
        p = urlparse(any)
        return p.netloc
    except IndexError:
        return 'Error'

def dom(any):
    try:
        r = any.split(',')
        return r[0]
    except IndexError:
        return 'Error'

def ids(any):
    try:
        e = any.split(',')
        return e[0]
    except IndexError:
        return 'Error'

with open('scrapers.csv', encoding='utf-8', newline='') as s:
    reader = enumerate(csv.reader(s))
    s.readline()
    for j, row in reader:
        dict1 = dict({'scraper_dom': dom(row[0]), 'scraper_id': ids(row[1])})
        sid[j + 1] = dict1
    for di in sid.keys():
        id = di
        ##print(sid[di]['scraper_dom'], sid[di]['scraper_id'])

with open('new.csv', encoding='UTF-8', newline='') as f:
    reader = enumerate(csv.reader(f))
    f.readline()
    for i, row in reader:
        dict2 = dict({'scraper_domain': fnetloc(row[0])})
        sd[i + 1] = dict2
    for d in sd.keys():
        id = d
        ##print(sd[d]['scraper_domain'])

#def tryme( ):
#    return filter(sd.has_key, sid)

#print(list(filter(sid, sd.keys())))
Sample of desired output.
You just need a procedure that can take a fnetloc and a list of scrapers and check to see if there is a scraper that matches that fnetloc:
def fnetloc_to_scraperid(fnetloc: str, scrapers: List[Scraper]) -> str:
    try:
        return next(x.scraper_id for x in scrapers if x.matches(fnetloc))
    except StopIteration:
        return "[no scraper id found]"
I also recommend that you use some classes instead of keeping everything in csv row objects--it reduces errors in your code, in the long run, and greatly advances your sanity.
This script worked on the sample data I fed it:
import csv
from urllib.parse import urlparse
from typing import List

def fnetloc(any) -> str:
    try:
        p = urlparse(any)
        return p.netloc
    except IndexError:
        return 'Error'

class Scraper:
    def __init__(self, scraper_dom: str, scraper_id: str):
        self.scraper_dom = scraper_dom
        self.scraper_id = scraper_id

    def matches(self, fnetloc: str) -> bool:
        # A URL matches this scraper when its netloc ends with the
        # scraper's simplified domain.
        return fnetloc.endswith(self.scraper_dom)

class Site:
    def __init__(self, url: str):
        self.url = url
        self.fnetloc = fnetloc(url)

    def get_scraperid(self, scrapers: List[Scraper]) -> str:
        try:
            return next(x.scraper_id for x in scrapers if x.matches(self.fnetloc))
        except StopIteration:
            return "[no scraper id found]"

sites = [Site(row[0]) for row in csv.reader(open("new.csv"))]
scrapers = [Scraper(row[0], row[1]) for row in csv.reader(open("scrapers.csv"))]

for site in sites:
    print(site.url, site.get_scraperid(scrapers), sep="\t")