
Filter dataframe with dictionary values while assigning dictionary keys to matching rows?

I have a dataframe with a column 'Links' that contains the URLs to a few thousand online articles. There is one URL for each observation.

urls_list = ['http://www.ajc.com/news/world/atlan...',
             'http://www.seattletimes.com/sports/...',
             'https://www.cjr.org/q_and_a/washing...',
             'https://www.washingtonpost.com/grap...',
             'https://www.nytimes.com/2017/09/01/...',
             'http://www.oregonlive.com/silicon-f...']

import pandas as pd

df = pd.DataFrame(urls_list, columns=['Links'])

I additionally have a dictionary that contains publication names as keys and domain names as values.

urls_dict = dict({'Atlanta Journal-Constitution':'ajc.com',
                  'The Washington Post':'washingtonpost.com',
                  'The New York Times':'nytimes.com'})

I'd like to filter the dataframe to get only those observations where the 'Links' column contains one of the domains in the dictionary values, while at the same time assigning the publication name from the dictionary keys to a new column 'Publication'. What I envisioned is using the code below to create the 'Publication' column, then dropping the None values from that column to filter the dataframe after the fact.

pub_list = []

for row in df['Links']:
    for k,v in urls_dict.items():
        if row.find(v) > -1:
            publication = k
        else:
            publication = None
        pub_list.append(publication)

However, the list pub_list that I get back, while appearing to do what I intended, is three times as long as my dataframe. Can someone suggest how to fix the above code? Or, alternatively, suggest a cleaner solution that can both (1) filter the 'Links' column of my dataframe using the dictionary values (domain names) and (2) create a new 'Publication' column from the dictionary keys (publication names)? (Please note that df is created here with only one column for brevity; the actual file has many columns, so I need to be able to specify which column to filter on.)

EDIT: I wanted to give some clarification in light of RagingRoosevelt's answer. I'd like to avoid merging, as some of the domains may not be exact matches. For example, with ajc.com I'd also like to capture myajc.com, and with washingtonpost.com I'd want to get sub-domains like live.washingtonpost.com as well. Hence, I was hoping for a type of "find substring in string" solution with str.contains(), find(), or the in operator.
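
To illustrate, this is the kind of substring check I mean (just a sketch using the in operator and pandas' str.contains; the mask variable here is only for illustration):

'ajc.com' in 'http://www.myajc.com/news/world/atlan...'               # True
mask = df['Links'].str.contains('washingtonpost.com', regex=False)    # boolean mask over the column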

Here's what I'd do:

  1. Use DataFrame.apply to add a new column to your dataframe that contains just the domain.

  2. Use DataFrame.merge (with the how='inner' option) to merge your two data frames on your domain field.

It's a bit dirty to use loops on dataframes when they're just iterating over columns or rows; generally there's a DataFrame method that does the same thing more cleanly.

If you want, I can expand this with examples.

Edit: Here's what that would look like. Note that I'm using a rather terrible regex for domain capture.

import re

def domain_extract(row):
    # Pull the bare domain out of the URL in the row's 'Links' column
    s = row['Links']
    p = r'(?:(?:\w+)?(?::\/\/)(?:www\.)?)?([A-Za-z0-9.]+)\/.*'
    m = re.match(p, s)
    if m is not None:
        return m.group(1)
    else:
        return None

df['Domain'] = df.apply(domain_extract, axis=1)

dfo = pd.DataFrame({'Name': ['Atlanta Journal-Constitution', 'The Washington Post', 'The New York Times'], 'Domain': ['ajc.com', 'washingtonpost.com', 'nytimes.com']})

df.merge(dfo, on='Domain', how='inner')[['Links', 'Domain', 'Name']]
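
Given the edit about sub-domains, one possible tweak (just a sketch, not part of the answer above) is to normalize the extracted domain onto one of the known domains before merging; canonical_domain below is a hypothetical helper that reuses the substring check the question asks about:

def canonical_domain(domain):
    # Hypothetical helper: fold e.g. myajc.com or live.washingtonpost.com
    # onto the known parent domain so the merge can still match it
    if domain is None:
        return None
    for known in urls_dict.values():
        if known in domain:
            return known
    return None

df['Domain'] = df['Domain'].apply(canonical_domain)
# ...then merge on 'Domain' with how='inner' as above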

I was able to figure it out using a nested dictionary comprehension (and alternatively, using a nested list comprehension) with some additional dataframe manipulation to clean up the columns and drop blank rows.

Using a nested dictionary comprehension (or, more specifically, a dictionary comprehension nested inside a list comprehension):

df['Publication'] = [{k: k for k,v in urls_dict.items() if v in row} for row in df['Links']]

# Each row now holds a dict like {'The New York Times': 'The New York Times'}
# (or {} when nothing matched); clean it up so only one publication name remains
df['Publication'] = df['Publication'].astype(str).str.strip('{}').str.split(':',expand=True)[0]

# Remove blank rows from 'Publication' column
df = df[df['Publication'] != '']

Similarly, using a nested list comprehension:

# First converting dict to a list of lists 
urls_list_of_lists = list(map(list,urls_dict.items()))

# Nested list comprehension using 'in' operator 
df['Publication'] = [[item[0] for item in urls_list_of_lists if item[1] in row] for row in df['Links']]

# Each row now holds a list like ['The New York Times'] (or [] when nothing
# matched); cast to string and strip the brackets
df['Publication'] = df['Publication'].astype(str).str.strip('[]')

# Remove blank rows from 'Publication' column
df = df[df['Publication'] != '']
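
A slightly more direct variant (just a sketch along the same lines, not what I used above) pulls the matching key out with next(), so no string clean-up is needed afterwards:

df['Publication'] = [next((k for k, v in urls_dict.items() if v in row), None)
                     for row in df['Links']]

# Rows with no matching domain get None, so dropping them filters the dataframe
df = df.dropna(subset=['Publication'])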
