
How to fetch URLs from a text file using Python?

I want to extract every hostPageDisplayUrl value from a text file I have. A few lines of the file are given below:

{"instrumentation": {"pageLoadPingUrl": "https://www.bingapis.com/api/ping/pageload?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&Type=Event.CPT&DATA=0"}, "_type": "Images", "displayRecipeSourcesBadges": true, "value": [{"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=QWSSSaNP6OdarVmpdZ2TGGupNBCF0-Ue_w2zKVqczwk&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f02%2f91%2f36%2f73_big.jpg&p=DevEx,5008.1", "accentColor": "2B3C71", "height": 375, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D02E302F166C57BC6DA26", "imageInsightsToken": "ccid_CUojXAsn*mid_5464C96913992D44983D02E302F166C57BC6DA26*simid_608054236795568956", "datePublished": "2010-02-21T22:19:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=Fbz9jxTPMT44aF3aWlDgNwU7Zhr3qYbOco653N9vnIc&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D02E302F166C57BC6DA26%26simid%3d608054236795568956&p=DevEx,5006.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5007.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.CUojXAsnV5KRBVF6-RIlLwEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "38571 B"}, {"contentUrl": 
"http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=yrOFma0zG8eUzUVY0l7jt_KfBAXPyuTyuXa9jJjeFR0&v=1&r=http%3a%2f%2fstatic.panoramio.com%2fphotos%2flarge%2f84118355.jpg&p=DevEx,5014.1", "accentColor": "A36728", "height": 768, "hostPageDisplayUrl": "panoramio.com/photo/84118355", "name": "Panoramio - Photo of Bahawalpur railway station", "width": 1024, "imageId": "FE04EA82163F27DC0A8449CF2086E4DA4F359DF7", "imageInsightsToken": "ccid_1683LeSg*mid_FE04EA82163F27DC0A8449CF2086E4DA4F359DF7*simid_608010054465029867", "datePublished": "2013-01-01T12:00:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=0NEX-sC8BaLrZ9HDkSbA_7kztZ1BoVoihkkvnL2tGiQ&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3dFE04EA82163F27DC0A8449CF2086E4DA4F359DF7%26simid%3d608010054465029867&p=DevEx,5012.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=l9wqPINQPoe9u5N_qiFUtBQ6PrxdwEPiwObrCwBTQ2U&v=1&r=http%3a%2f%2fpanoramio.com%2fphoto%2f84118355&p=DevEx,5013.1", "thumbnailUrl": "https://tse2.mm.bing.net/th?id=OIP.1683LeSgJHoFhxX-tKhGSAEsDh&pid=Api", "thumbnail": {"width": 300, "height": 225}, "contentSize": "125011 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=1OS0LXGeQJbC9gOsRy00e-ae0535j7iNl4qiaNTTG0I&v=1&r=http%3a%2f%2fphotos.wikimapia.org%2fp%2f00%2f05%2f21%2f47%2f89_big.jpg&p=DevEx,5020.1", "accentColor": "5B4F36", "height": 361, "hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station", "name": "Bahawalpur Railway Station - Bahawalpur (\u0628\u06c1\u0627\u0648\u0644\u067e\u0648\u0631)", "width": 500, "imageId": "5464C96913992D44983D6D8CBD36CB6E679FEA3C", "imageInsightsToken": 
"ccid_JhLSwAc0*mid_5464C96913992D44983D6D8CBD36CB6E679FEA3C*simid_607998234704153808", "datePublished": "2016-12-09T20:58:00", "encodingFormat": "jpeg", "webSearchUrl": "https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=IJTtTeRFNBA0xr1DyZcz6AMb43pJFV25m3WrDfLhQls&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d5464C96913992D44983D6D8CBD36CB6E679FEA3C%26simid%3d607998234704153808&p=DevEx,5018.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=MVElDiTqkKkcRJKEQxgr1yxRbwh-DpMNfT7lA6g1ivg&v=1&r=http%3a%2f%2fwikimapia.org%2f1649944%2fBahawalpur-Railway-Station&p=DevEx,5019.1", "thumbnailUrl": "https://tse1.mm.bing.net/th?id=OIP.JhLSwAc0HwFeWsHjAUYStgEsDY&pid=Api", "thumbnail": {"width": 300, "height": 216}, "contentSize": "28945 B"}, {"contentUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=t6oOsr-23sNP-TFFzn39BVuagjYmXknVGiIWYD_tJv0&v=1&r=http%3a%2f%2fnativepakistan.com%2fwp-content%2fuploads%2fPhoto-of-Bahawalpur-RailwayS-tation-Photos-of-Bahawalpur.jpg&p=DevEx,5026.1", "accentColor": "49418A", "height": 347, "hostPageDisplayUrl": "nativepakistan.com/photos-of-bahawalpur", "name": "Photo of Bahawalpur Railway Station - Photos of Bahawalpur", "width": 500, "imageId": "7A05E50C94144666BFEB7BEECE6FB3DFC3313E18", "imageInsightsToken": "ccid_wS0pep46*mid_7A05E50C94144666BFEB7BEECE6FB3DFC3313E18*simid_607992170213084482", "datePublished": "2012-09-21T23:07:00", "encodingFormat": "jpeg", "webSearchUrl": 
"https://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=2kFu0Xn07bcJKuZI03iY3Ihq99ZiKFOvd0PXvVWqt94&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dBahawalpurRailwayStation%26id%3d7A05E50C94144666BFEB7BEECE6FB3DFC3313E18%26simid%3d607992170213084482&p=DevEx,5024.1", "hostPageUrl": "http://www.bing.com/cr?IG=D4C9291ACBE848AE9C3C4C6EAED1E6AC&CID=0FD7B016F5926C780285BA44F4806D9E&rd=1&h=ht8SkbUIRgMkFq4yXvbHpmsINok4VTcxu0FiwMayk9A&v=1&r=http%3a%2f%2fnativepakistan.com%2fphotos-of-bahawalpur%2f&p=DevEx,5025.1", "thumbnailUrl": "https://tse3.mm.bing.net/th?id=OIP.wS0pep46eEsGSSY39RNxLQEsDQ&pid=Api", "thumbnail": {"width": 300, "height": 20

I am using this code, but it does not give accurate results:

start = 0
while True:                                                       
  p = data[start:].find('hostPageDisplayUrl')                         
  if p == -1: buffer                                            
  q = data[start+p+12:].find('hostPageDisplayUrl')                           
  r = data[start+p+q+12:].find('.')                             
  print (data[start+p+q+12:start+p+q+r+12] , file = log)        
  start = start+p+q+r+12

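As noted above, your data looks like a JSON file, but the sample as posted is not complete, valid JSON. Once you have checked that the file really is valid JSON (for example with a JSON validator), you can do this: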

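import json

def _finditems(obj, key):  # adapted from http://stackoverflow.com/a/14962509/2585092
    """Recursively yield every value stored under `key` in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            else:
                yield from _finditems(v, key)
    elif isinstance(obj, list):  # the entries sit in the list under "value"
        for item in obj:
            yield from _finditems(item, key)

def get_urls(file_name):
    try:
        with open(file_name, encoding='utf-8') as file:
            data = json.load(file)
    except FileNotFoundError:
        return None

    return list(_finditems(data, 'hostPageDisplayUrl'))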
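
The validity check can also be done in Python itself. This is a minimal sketch, assuming the file content has already been read into a string; `is_valid_json` is a hypothetical helper name and the two sample strings are shortened stand-ins for the data in the question. `json.loads` raises `json.JSONDecodeError` on malformed or truncated input, which is what happens with a sample that is cut off mid-object.

```python
import json

def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

# Shortened stand-ins for the data in the question (assumed, not the full file)
complete = '{"value": [{"hostPageDisplayUrl": "panoramio.com/photo/84118355"}]}'
truncated = ('{"value": [{"hostPageDisplayUrl": "panoramio.com/photo/84118355", '
             '"thumbnail": {"width": 300, "height": 20')

print(is_valid_json(complete))   # True
print(is_valid_json(truncated))  # False
```

If the check fails, the JSON approach cannot be used on the text as it stands and a text-based approach such as a regular expression is the fallback.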

Or use a regular expression:

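import re

def find_urls(text):
    # Capture the quoted string that follows the "hostPageDisplayUrl" key
    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

# `test` holds the sample text from the question
print(find_urls(test))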

Result for your example:
['wikimapia.org/1649944/Bahawalpur-Railway-Station', 'panoramio.com/photo/84118355', 'wikimapia.org/1649944/Bahawalpur-Railway-Station', 'nativepakistan.com/photos-of-bahawalpur']

Warning: this only works if your URLs do not contain (escaped) double quotes (").
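
If escaped double quotes can occur, one possible workaround is a pattern that treats a backslash plus the following character as a unit, and then lets `json.loads` decode the captured string literal. This is only a sketch: `find_urls_escaped`, `PATTERN`, and the `example.org` sample are made up for illustration and are not part of the answer's code.

```python
import json
import re

# (?:[^"\\]|\\.)* matches ordinary characters or backslash escape pairs
# such as \" and \\, so an escaped quote no longer ends the match.
# The capture group includes the surrounding quotes on purpose.
PATTERN = re.compile(r'"hostPageDisplayUrl":\s*("(?:[^"\\]|\\.)*")')

def find_urls_escaped(text):
    # Each captured group is a JSON string literal; json.loads resolves its escapes
    return [json.loads(literal) for literal in PATTERN.findall(text)]

# Hypothetical input containing an escaped double quote inside the value
sample = r'{"hostPageDisplayUrl": "example.org/a\"b", "name": "x"}'
print(find_urls_escaped(sample))  # ['example.org/a"b']
```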


Edit: for the base URL:

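import re

def find_urls(text):
    pattern = r'\"hostPageDisplayUrl\":\s*"([^"]*)"'
    return re.findall(pattern, text)

def base_url(url):
    # Group 3 is the host: everything before the first "/",
    # after an optional scheme and an optional "www."
    return re.search(r'(https?://)?(www\.)?([^/]*)', url).group(3)

# `test` holds the sample text from the question
print([base_url(u) for u in find_urls(test)])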

Result for your example:
['wikimapia.org', 'panoramio.com', 'wikimapia.org', 'nativepakistan.com']

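Regular expressions explained

\"hostPageDisplayUrl\":\s*"([^"]*)"

We search for a string with a leading and a trailing " and capture its contents in a group: "([^"]*)"
Before it, separated by any amount of whitespace \s*, we require the exact string "hostPageDisplayUrl":

(https?://)?(www\.)?([^/]*)

Ignoring any leading http(s):// and www., we want the part of the URL before the first / and capture it in a group: ([^/]*)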
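
As an alternative to the second regular expression, the host can be taken with the standard library's `urllib.parse.urlsplit`. This is a sketch rather than a drop-in replacement: `base_url_split` is a hypothetical name, and the function assumes that any value without an http(s) scheme is a bare host plus path, as in the sample data.

```python
from urllib.parse import urlsplit

def base_url_split(url):
    # urlsplit only recognises a host when the URL has a scheme or starts
    # with "//", so scheme-less display URLs get a "//" prefix first.
    if not url.startswith(("http://", "https://")):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Drop a leading "www." to mirror the regular expression
    return host[4:] if host.startswith("www.") else host

print(base_url_split("wikimapia.org/1649944/Bahawalpur-Railway-Station"))  # wikimapia.org
print(base_url_split("https://www.bing.com/images/search?view=detailv2"))  # bing.com
```

Unlike the regular expression, `hostname` also strips a port number and lowercases the host.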

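From your comment I understand that the file contains JSON saved as a text file. You can therefore load the JSON data directly from the text file and read the values. Your code should look like this: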

import json
with open("json_file.txt") as f:
    json_data = json.load(f)
for data in json_data["value"]:  # the image entries are in the top-level "value" list
    print(data["hostPageDisplayUrl"])  # this will print all the URLs

I am posting this because it is more efficient and takes fewer lines of code.
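
A slightly more defensive variant of the same idea, as a sketch only: it assumes the entries sit in a top-level "value" list as in the sample, skips entries that lack the key, and drops repeated URLs while keeping their order. `unique_display_urls` and the shortened `sample` string are illustrative names, not part of the original answer.

```python
import json

def unique_display_urls(json_text):
    data = json.loads(json_text)
    # .get avoids a KeyError when "value" or "hostPageDisplayUrl" is missing
    urls = (entry.get("hostPageDisplayUrl") for entry in data.get("value", []))
    # dict.fromkeys keeps the first occurrence of each URL in original order
    return list(dict.fromkeys(u for u in urls if u is not None))

# Shortened stand-in for the file content (assumed structure)
sample = json.dumps({"value": [
    {"hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station"},
    {"hostPageDisplayUrl": "panoramio.com/photo/84118355"},
    {"hostPageDisplayUrl": "wikimapia.org/1649944/Bahawalpur-Railway-Station"},
    {"name": "entry without a display URL"},
]})

print(unique_display_urls(sample))
# ['wikimapia.org/1649944/Bahawalpur-Railway-Station', 'panoramio.com/photo/84118355']
```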
