
Advice on the best way to find intersection of two dictionaries

I have posted a similar question before, but after reworking the project, I've gotten here:

I have two CSV files (new.csv, scrapers.csv):

new.csv contains a single column:
'urls' = whole URLs

[image: sample input of new.csv]

scrapers.csv contains two columns:
'scraper_dom' = A simplification of specific URL domains
'scraper_id' = An associated scraper_id that is used to import URLs to a separately managed database

[image: sample of scrapers.csv]
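For illustration only, hypothetical rows in the same shape as the two files (not the real data):

urls
https://www.example.com/some/page
https://blog.example.org/post/123

scraper_dom,scraper_id
example.com,101
example.org,102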

Question

My goal here is to iterate through new.csv (parsing out fnetloc using urlparse) and perform a lookup on scrapers.csv to return the set of matching 'scraper_id' values for a given set of 'urls' (the way a VLOOKUP would work, or a JOIN in SQL), once urlparse does its thing to isolate the netloc within each URL (the result of fnetloc).

My next big issue is that urlparse does not parse the URLs (from new.csv) down to the exact simplification found in the scrapers.csv file, so I'd be reliant on a sort of partial match until I can figure out the regular expressions to use for that part of it.
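A plain suffix check might stand in for the regex in the meantime; here is a minimal sketch, assuming the 'scraper_dom' values are bare domain suffixes (the helper name domain_matches is just illustrative):

def domain_matches(netloc: str, scraper_dom: str) -> bool:
    # True when the parsed netloc is the domain itself or a subdomain of it
    return netloc == scraper_dom or netloc.endswith('.' + scraper_dom)

# e.g. domain_matches('www.example.com', 'example.com') -> True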

I've imported pandas because previous attempts had me creating DataFrames and performing a pd.merge, but I couldn't get that to work either...
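For reference, a merge only lines up rows whose keys are exactly equal, which is presumably why it failed on these partially-matching domains. A sketch of how such a merge would look, assuming both files have header rows:

import pandas as pd
from urllib.parse import urlparse

new_df = pd.read_csv('new.csv')            # column: 'urls'
scrapers_df = pd.read_csv('scrapers.csv')  # columns: 'scraper_dom', 'scraper_id'

# Derive the netloc so both frames share a joinable key
new_df['fnetloc'] = new_df['urls'].apply(lambda u: urlparse(u).netloc)

# Left join keeps every URL; scraper_id is filled only on an exact domain match
merged = pd.merge(new_df, scrapers_df, how='left',
                  left_on='fnetloc', right_on='scraper_dom')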

Current code; the commented-out bits at the bottom are failed attempts, I just thought I'd include what I've tried thus far.
(The ## lines are intermediate print statements I put in to check the output of the program.)

import pandas as pd, re
from urllib.parse import urlparse
import csv

sd = {}   # rows from new.csv, keyed by row number
sid = {}  # rows from scrapers.csv, keyed by row number
#INT = []

def fnetloc(url):
    """Return the netloc (domain) portion of a URL."""
    try:
        p = urlparse(url)
        return p.netloc
    except IndexError:
        return 'Error'

def dom(text):
    """Return the first comma-separated field (the scraper_dom column)."""
    try:
        r = text.split(',')
        return r[0]
    except IndexError:
        return 'Error'

def ids(text):
    """Return the first comma-separated field (the scraper_id column)."""
    try:
        e = text.split(',')
        return e[0]
    except IndexError:
        return 'Error'

with open('scrapers.csv', encoding='utf-8', newline='') as s:
    reader = enumerate(csv.reader(s))
    s.readline()  # skip the header row before the reader starts
    for j, row in reader:
        dict1 = {'scraper_dom': dom(row[0]), 'scraper_id': ids(row[1])}
        sid[j + 1] = dict1
for di in sid.keys():
    id = di
    ##print(sid[di]['scraper_dom'],sid[di]['scraper_id'])

with open('new.csv', encoding='utf-8', newline='') as f:
    reader = enumerate(csv.reader(f))
    f.readline()  # skip the header row before the reader starts
    for i, row in reader:
        dict2 = {'scraper_domain': fnetloc(row[0])}
        sd[i + 1] = dict2
for d in sd.keys():
    id = d
    ##print(sd[d]['scraper_domain'])

    #def tryme(  ):
        #return filter(sd.has_key, sid)
    #print(list(filter(sid, sd.keys())))
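The kind of "intersection" those commented-out lines were reaching for might look like this sketch, again using a plain suffix match as a placeholder for the eventual regex:

# For each parsed URL domain, collect every scraper_id whose scraper_dom
# is a suffix of that domain (placeholder for the regex-based match)
matches = {}
for d in sd:
    netloc = sd[d]['scraper_domain']
    matches[netloc] = [v['scraper_id'] for v in sid.values()
                       if netloc.endswith(v['scraper_dom'])]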

Sample of desired output.

[image: sample of desired output]

Answer

You just need a procedure that can take a fnetloc and a list of scrapers and check to see if there is a scraper that matches that fnetloc:

def fnetloc_to_scraperid(fnetloc: str, scrapers: List[Scraper]) -> str:
    # next() raises StopIteration when no scraper matches
    try:
        return next(x.scraper_id for x in scrapers if x.matches(fnetloc))
    except StopIteration:
        return "[no scraper id found]"

I also recommend that you use some classes instead of keeping everything in csv row objects -- it reduces errors in your code in the long run, and greatly advances your sanity.

This script worked on the sample data I fed it:

import csv
from urllib.parse import urlparse
from typing import List

def fnetloc(url) -> str:
    """Return the netloc (domain) portion of a URL."""
    try:
        p = urlparse(url)
        return p.netloc
    except IndexError:
        return 'Error'

class Scraper:
    def __init__(self, scraper_dom: str, scraper_id: str):
        self.scraper_dom = scraper_dom
        self.scraper_id = scraper_id
    def matches(self, fnetloc: str) -> bool:
        # suffix match: 'www.example.com' matches a scraper_dom of 'example.com'
        return fnetloc.endswith(self.scraper_dom)


class Site:
    def __init__(self, url: str):
        self.url = url
        self.fnetloc = fnetloc(url)
    def get_scraperid(self, scrapers: List[Scraper]) -> str:
        # next() raises StopIteration when no scraper matches
        try:
            return next(x.scraper_id for x in scrapers if x.matches(self.fnetloc))
        except StopIteration:
            return "[no scraper id found]"

# Note: both files are read without skipping a header row
sites = [Site(row[0]) for row in csv.reader(open("new.csv"))]
scrapers = [Scraper(row[0], row[1]) for row in csv.reader(open("scrapers.csv"))]

for site in sites:
    print(site.url, site.get_scraperid(scrapers), sep="\t")
