How to extract URLs from a text file in Python?

I have a text file full of URLs and other text. I want to extract the URLs that are preceded by:

thumbnailUrl\": \

I used this code:

def get_net_target(page):
    start_link=page.find("thumbnailUrl")
    start_quote=page.find('"',start_link)
    end_quote=page.find('"',start_quote+1)
    url=page[start_quote+1:end_quote]
    print url

my_file = open("data.txt")
page = my_file.read()

print(get_net_target(page))

I want output like this:

https://tse3.mm.bing.net///th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\
https:\\/\\/tse1.mm.bing.net\\/th?id=OIP.M7ff1f4e880bac2c244c0b6a286cee669o2&pid=Api\

....

but I get only:

None

A few lines of the data are:

webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=RUc0BARkL2P78A5CI7XPWqhCYAA2XaQLP-fHGdfODEY&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F%26simid%3d607996336242885612&p=DevEx,5006.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api\", \"datePublished\": \"2011-07-08T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=gA9S9qCIF1jvD5yA4V9VOqfrJUxdW2_wyacSDR15Yc8&v=1&r=http%3a%2f%2fwww.forumpakistan.com%2fimages%2fcelebrity-profiles%2fShoaib-Malik-1.jpg&p=DevEx,5008.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=IODAmtxi3pYzDGhiJcJgCv0fWHEq8hlJauGxRW5o2c4&v=1&r=http%3a%2f%2fok-khan.blogspot.com%2f2011%2f07%2fshoaib-malik.html&p=DevEx,5007.1\", \"contentSize\": \"48445 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"ok-khan.blogspot.com\\/2011\\/07\\/shoaib-malik.html\", \"width\": 500, \"height\": 647, \"thumbnail\": {\"width\": 231, \"height\": 300}, \"imageInsightsToken\": \"ccid_4Zggq2i0*mid_97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F*simid_607996336242885612\", \"imageId\": \"97C5A1ECB43BCDC1B5739F49555CE0C75CEDF83F\", \"accentColor\": \"3A6491\"}, {\"name\": \"Pakistani Crickert Player: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=4qc04BUbtNDwiCHco5m3IY_YFqKVaY2q8ZWhX-DvFQs&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dF690295FD18526BA8225367169A0664405923A09%26simid%3d608039315980946676&p=DevEx,5012.1\", \"thumbnailUrl\": 
\"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api\", \"datePublished\": \"2012-12-24T12:00:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=9psh5pXKn2R_2Zn4-iMzpjDFePVuLSNVJhbVjf2uTI0&v=1&r=http%3a%2f%2fi1.tribune.com.pk%2fwp-content%2fuploads%2f2010%2f10%2fshoaib-malik-640x480.jpg&p=DevEx,5014.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=-cUvEUoDmZ1OAI-PVQc4MOfS-ELdt5Im521SJ2ZP4j8&v=1&r=http%3a%2f%2fpakistanicricketplayr44410.blogspot.com%2f2012%2f12%2fshoaib-malik.html&p=DevEx,5013.1\", \"contentSize\": \"51986 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"pakistanicricketplayr44410.blogspot.com\\/2012\\/12\\/shoaib-malik.html\", \"width\": 640, \"height\": 480, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_y7VohZKB*mid_F690295FD18526BA8225367169A0664405923A09*simid_608039315980946676\", \"imageId\": \"F690295FD18526BA8225367169A0664405923A09\", \"accentColor\": \"98AE1D\"}, {\"name\": \"Pakistani Cricket Players: Shoaib Malik\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=n2Lkz5bg7h-AgbmZE4SnL-_AFBcCgc-_vaiVeAuC84s&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2%26simid%3d608028569977424814&p=DevEx,5018.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api\", \"datePublished\": \"2011-04-17T12:00:00\", \"contentUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=TwpcQHy-RdAJUStMisg6zBtjt_j60EStRFRAJS1D69Q&v=1&r=http%3a%2f%2fimages.teamtalk.com%2f08%2f10%2f800x600%2fShoaib-Malik_1264846.jpg&p=DevEx,5020.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=xICbhyFdmUBblBavcA3pXPdpbOa-1bJuBvP5H6Z0kms&v=1&r=http%3a%2f%2fcricketplayerspk.blogspot.com%2f2011%2f04%2fshoaib-malik.html&p=DevEx,5019.1\", \"contentSize\": \"51243 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"cricketplayerspk.blogspot.com\\/2011\\/04\\/shoaib-malik.html\", \"width\": 800, \"height\": 600, \"thumbnail\": {\"width\": 300, \"height\": 225}, \"imageInsightsToken\": \"ccid_tspl7aV4*mid_320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2*simid_608028569977424814\", \"imageId\": \"320A83F8A63DED3BD4B4EF926CAA3BE901F9DEA2\", \"accentColor\": \"416838\"}, {\"name\": \"Shoaib Malik in line for Test comeback after 5 years - Sports\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7CIa0gvwncEquihLMmMIvtYAAUYZutf8EQr57d8EDO0&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3d8045A5C7203C2203C8238D9E00905FCB328BD4D9%26simid%3d608033376034882300&p=DevEx,5024.1\", \"thumbnailUrl\": \"https:\\/\\/tse2.mm.bing.net\\/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api\", \"datePublished\": \"2015-10-06T04:07:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=F2RLPPSfrErnxq7OZt_3mbKbvpJITet7f_kGd90aKlg&v=1&r=http%3a%2f%2fimages.mid-day.com%2fimages%2f2015%2foct%2f6Shoaib-Malik-1.jpg&p=DevEx,5026.1\", \"hostPageUrl\": 
\"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=3V02TER99J6fm2eshh_cv4NCdJELV1DpI1pOmALtDMQ&v=1&r=http%3a%2f%2fwww.mid-day.com%2farticles%2fshoaib-malik-in-line-for-test-comeback-after-5-years%2f16586181&p=DevEx,5025.1\", \"contentSize\": \"119997 B\", \"encodingFormat\": \"jpeg\", \"hostPageDisplayUrl\": \"www.mid-day.com\\/articles\\/shoaib-malik-in-line-for-test-comeback...\", \"width\": 670, \"height\": 746, \"thumbnail\": {\"width\": 269, \"height\": 300}, \"imageInsightsToken\": \"ccid_Zf5b8WKD*mid_8045A5C7203C2203C8238D9E00905FCB328BD4D9*simid_608033376034882300\", \"imageId\": \"8045A5C7203C2203C8238D9E00905FCB328BD4D9\", \"accentColor\": \"304987\"}, {\"name\": \"Gallery > Cricketers > Shoaib Malik > Shoaib Malik high quality! Free ...\", \"webSearchUrl\": \"https:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=A9FD1ucKtYszoNQZ2KEhYMvgMwvJ6AA5d-DFInyr9I4&v=1&r=https%3a%2f%2fwww.bing.com%2fimages%2fsearch%3fview%3ddetailv2%26FORM%3dOIIRPO%26q%3dshoaibmalik%26id%3dB7AD00B57D67FD1664C7BBA404FF6E2679019517%26simid%3d608007657767896024&p=DevEx,5030.1\", \"thumbnailUrl\": \"https:\\/\\/tse3.mm.bing.net\\/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api\", \"datePublished\": \"2013-05-18T00:44:00\", \"contentUrl\": \"http:\\/\\/www.bing.com\\/cr?IG=4588890DDF1744A79DAEC3DB4C5C87D0&CID=3C16AFB87BB96F70283EA5B77A886E24&rd=1&h=7jwPNSK-kjHNAXQmqBqznMWCB3u4YPz0uHDFoJizw1U&v=1&r=http%3a%2f%2fpak101.com%2fgallery%2fCricketers%2fShoaib_Malik%2f2011%2f9%2f22%2fShoaib_Malik_Picture_9_xmnqf.jpg&p=DevEx,5032.1\", \"hostPageUrl\": \"http:\\/\\/www.bing.com\

This code demonstrates two approaches. The first parallels yours, and the second shows an easier way that uses regular expressions.

The first way is worth learning; the trick is to keep track of your position in the string that you're parsing.

data = '''webSearchUrl\": \"https:\\/\\/w ... p:\\/\\/www.bing.com"'''
# Turn every escaped slash (backslash + slash) into a plain slash.
data = data.replace('\\/', '/')

print ('Using roughly your approach ...')

start = 0
while True:
    p = data[start:].find('thumbnailUrl')
    if p == -1: break
    q = data[start+p+12:].find('http')
    r = data[start+p+q+12:].find('"')
    print (data[start+p+q+12:start+p+q+r+12])
    start = start+p+q+r+12

print ('Using a regular expression ...')

import re

thumbNailRE = re.compile(r'thumbnailUrl":\s+"([^"]+)')
for match in thumbNailRE.findall(data):
    print (match)

The outputs are identical:

Using roughly your approach ...
https://tse2.mm.bing.net/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api
https://tse2.mm.bing.net/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api
Using a regular expression ...
https://tse2.mm.bing.net/th?id=OIP.Me19820ab68b4bcc7ec82756b2b5ecffbo1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.Mcbb568859281f5bc7a7f64d8c58d4895H1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.Mb6ca65eda578c80e71f4c3b3193c5b67H1&pid=Api
https://tse2.mm.bing.net/th?id=OIP.M65fe5bf16283dc466e93650fbaef1205o1&pid=Api
https://tse3.mm.bing.net/th?id=OIP.M5d9fb4d528228cb5c8b9748bff10365bo1&pid=Api
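
The answer's code works on a string literal, where Python has already turned \" into " before any searching happens. Text read from data.txt may still contain those backslashes literally, as the sample data suggests. The sketch below is one possible way to handle that case, not part of the original answer: the names extract_thumbnail_urls and read_thumbnail_urls are hypothetical, and the pattern assumes every URL ends at the next double quote. Unlike the function in the question, which has no return statement (so printing its result shows None), it returns the URLs as a list.

```python
import re

# Assumption: quotes and slashes in the text may or may not be preceded by
# backslashes, e.g.  thumbnailUrl\": \"https:\\/\\/tse2...  or the unescaped
# form  thumbnailUrl": "https://tse2...
THUMBNAIL_RE = re.compile(r'thumbnailUrl\\*":\s*\\*"([^"]+)')


def extract_thumbnail_urls(text):
    """Return every thumbnailUrl value in text with the escapes removed."""
    urls = []
    for raw in THUMBNAIL_RE.findall(text):
        # Drop the backslash left in front of the closing quote, then turn
        # each run of backslashes followed by a slash into a plain slash.
        urls.append(re.sub(r'\\+/', '/', raw.rstrip('\\')))
    return urls


def read_thumbnail_urls(path):
    """Read the whole file at path and return its thumbnail URLs."""
    with open(path) as source:
        return extract_thumbnail_urls(source.read())
```

Calling read_thumbnail_urls("data.txt") and printing each item of the result should then list the thumbnail URLs one per line, provided the file matches the assumptions above.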
