
How do I extract the Url data from my string

I have the following string, which contains many Url values. How do I extract the Url that follows each DataUrl key in this string, so that I get a list of Urls, for example: americanexpress.com, vice.com, chegg.com?

{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}

I have tried various findall expressions.

Python has a built-in package called json, which can be used to work with JSON data.

If your string is valid JSON, you can parse it into Python objects with json.loads and then read DataUrl from each dictionary.

Please refer to https://www.w3schools.com/python/python_json.asp
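
A minimal sketch of this approach, assuming the data is valid JSON (double-quoted strings, enclosed in a list). The sample is a shortened, hypothetical version of the question's data. The string in the question uses single quotes, which json.loads rejects, so the second half of the sketch shows that failure.

```python
import json

# Shortened, hypothetical sample in valid JSON (double quotes, enclosed in a list).
valid_text = '[{"DataUrl": "americanexpress.com", "Global": {"Rank": "362"}}, {"DataUrl": "vice.com", "Global": {"Rank": "208"}}]'

records = json.loads(valid_text)  # parses into a list of dicts
urls = [record["DataUrl"] for record in records]
print(urls)

# The single-quoted form used in the question is not valid JSON.
single_quoted = "[{'DataUrl': 'vice.com'}]"
try:
    json.loads(single_quoted)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```

If json.loads raises json.JSONDecodeError on your string, the data probably needs one of the other approaches in this thread.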

It looks like part of JSON data, so if you have the complete JSON data you could use the json module to load it and look up DataUrl in each dictionary.

If you have incomplete JSON data, you can use a regex instead.

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

import re

urls = re.findall("'DataUrl': '([^']*)'", text)

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
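
If the spacing or quote style around DataUrl varies, a slightly more tolerant pattern may help. This is only a sketch: the sample string below is hypothetical and mixes both quote styles to show the idea, and it assumes the Url values themselves contain no quote characters.

```python
import re

# Hypothetical sample mixing quote styles and spacing; the last record is cut off.
sample = """{'DataUrl': 'vice.com', 'Global': {'Rank': '208'}}, {"DataUrl":"chegg.com"}, {'DataUrl' :  'mlb.com'"""

# (['"]) captures the opening quote of the value; \1 requires the same quote to close it.
pattern = re.compile(r"""['"]DataUrl['"]\s*:\s*(['"])(.*?)\1""")

urls = [match.group(2) for match in pattern.finditer(sample)]
print(urls)
```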

You can also do it with .split("{'DataUrl': '") and .split("',")

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

urls = text.split("{'DataUrl': '")
urls = [item.split("',")[0] for item in urls if item]
print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']

If you had complete and correctly formatted JSON (with " instead of ') then you could use the json module.

Here I use the complete data, enclosed in [ ], and replace ' with " before loading it.

text = '''[{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}}]'''
text = text.replace("'", '"')

import json

data = json.loads(text)
urls = [item['DataUrl'] for item in data]

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
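
Replacing every ' with " breaks as soon as a value contains an apostrophe. As an alternative to the json approach (not part of the original answer), the standard library's ast.literal_eval can read the single-quoted text directly, because it is already a valid Python literal once enclosed in [ ]. A minimal sketch, with a shortened sample and a hypothetical 'Title' field added to show the apostrophe case:

```python
import ast

# Shortened sample in the question's single-quoted style, enclosed in [ ].
# The 'Title' field is hypothetical; it contains an apostrophe on purpose.
text = """[{'DataUrl': 'americanexpress.com', 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Title': "Vice's site"}]"""

data = ast.literal_eval(text)  # parses Python literals only; it does not execute code
urls = [item['DataUrl'] for item in data]
print(urls)
```

Like json.loads, this needs the complete structure; for a truncated string, the regex or split approaches are the safer choice.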
