Python Replace/Remove all urls in string

I am trying to replace all URLs in a string with an empty string. The JSON below is a string, not an object. However, I am having difficulty capturing the various permutations.

This is my Python script. However, if you look at https://regex101.com/r/r6tQ3B/2/ you will notice that the regex also removes the closing " and does not really capture the shorthand "t.co" or the URLs in the middle of a string.

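import re

# dataFiles is a list of file names under data/ (defined elsewhere)
for filename in dataFiles:
    path = 'data/' + filename
    with open(path) as r:
        text = r.read()  # load the whole file as one string
        text = re.sub(r'https?:\/\/\S*', '"', text, flags=re.MULTILINE)
    with open(path, "w") as w:
        w.write(text)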

Test: https://regex101.com/r/r6tQ3B/1/

{
   "created_at":"Fri Aug 12 10:04:00 +0000 2016",
   "id":764039724818272256,
   "text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",
   "in_reply_to_screen_name":"theblaze",
   "source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
   "user":{
      "id":366636488,
      "id_str":"366636488",
      "name":"GIL DUPUY",
      "screen_name":"DUPUY77",
      "location":"Miami",
      "url":"http://ggm-dupuy.com",
      "description":"Fashion photographer, love action and adventure, care for the less fortunate, don't tolerate any kind of racism regardless of race or religion",
      "verified":false,
      "followers_count":186,
      "friends_count":446,
      "utc_offset":null,
      "time_zone":null,
      "lang":"en",
      "default_profile_image":false,
      "following":null,
      "notifications":null
   },
   "geo":null,
   "coordinates":null,
   "place":{
      "name":"Frontenac",
      "full_name":"Frontenac, MO",
      "country_code":"US",
      "country":"United States",
      "attributes":{
         
      }
   },
   "retweet_count":0,
   "favorite_count":0,
   "extended_entities":{
      "media":[
         {
            "id":764039718237409281,
            "id_str":"764039718237409281",
            "indices":[
               27,
               50
            ],
            "media_url":"http://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
            "media_url_https":"https://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
            "url":"https://t.com/TY9DlZ584c",
            "display_url":"pic.twitter.com/TY9DlZ584c",
            "expanded_url":"http://twitter.com/DUPUY77/status/764039724818272256/photo/1",
            "type":"photo",
            "sizes":{
               "medium":{
                  "w":640,
                  "h":1136,
                  "resize":"fit"
               },
               "large":{
                  "w":640,
                  "h":1136,
                  "resize":"fit"
               },
               "thumb":{
                  "w":150,
                  "h":150,
                  "resize":"crop"
               },
               "small":{
                  "w":383,
                  "h":680,
                  "resize":"fit"
               }
            }
         }
      ]
   },
   "favorited":false,
   "retweeted":false,
   "possibly_sensitive":false,
   "lang":"und"
}
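
The behaviour described in the question can be reproduced on a single line of the sample. This is a minimal sketch: the helper name `strip_urls_original` is hypothetical, and an empty replacement string is used here so that the effect of the pattern alone is visible. Because `\S*` matches any run of non-whitespace characters, it continues past the end of the URL and also consumes the closing quote and the trailing comma.

```python
import re

# The pattern from the question. \S* is greedy and matches every
# non-whitespace character, including '"' and ','.
ORIGINAL_PATTERN = re.compile(r'https?://\S*')


def strip_urls_original(text):
    """Remove matches of the question's pattern (hypothetical helper)."""
    return ORIGINAL_PATTERN.sub('', text)


line = '"url":"http://ggm-dupuy.com",'
print(strip_urls_original(line))  # prints: "url":"
```

The closing quote is lost, so the line no longer forms a valid JSON fragment.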

Try this pattern: \s?(https?:\/\/[^\\\s"]*)

Not very clean, but it works for your example.
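
A minimal sketch of how this pattern might be applied with `re.sub`, assuming the whole file has already been read into a string. The function name `remove_urls` is hypothetical. The character class stops the match at a backslash, whitespace, or a double quote, so the closing quote of a JSON value is left in place, and the optional leading `\s?` also removes the space in front of a URL.

```python
import re

# Stop at a backslash, whitespace, or a double quote so that the
# closing '"' (or an escaped \") after the URL is preserved.
URL_PATTERN = re.compile(r'\s?(https?://[^\\\s"]*)')


def remove_urls(text):
    """Delete every http(s) URL from text (hypothetical helper)."""
    return URL_PATTERN.sub('', text)


line = '"text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",'
print(remove_urls(line))  # prints: "text":"@theblaze @realDonaldTrump",
```

Note that this pattern only matches URLs that start with http:// or https://, so a value such as pic.twitter.com/TY9DlZ584c is left untouched.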

Removes all URLs inside "", all URLs without "", and URLs starting with pic.twitter (these seem to be the only ones without http(s)).
Assumes there is no whitespace or " in the URL: r"(?:https?:\/\/|pic\.twitter)[^\s\"\\]*"
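
As a rough illustration, this second pattern could be used as follows. The function name `remove_urls_and_pics` and the shortened sample record are assumptions made for the example; they are not part of the original answer. Parsing the result with `json.loads` is one way to check that the quotes around each value survived.

```python
import json
import re

# Match either an http(s) URL or a value starting with "pic.twitter",
# then consume everything up to whitespace, a double quote, or a backslash.
URL_OR_PIC = re.compile(r'(?:https?://|pic\.twitter)[^\s"\\]*')


def remove_urls_and_pics(text):
    """Delete http(s) URLs and pic.twitter links (hypothetical helper)."""
    return URL_OR_PIC.sub('', text)


# A shortened record in the same shape as the sample data (assumed).
record = '{"url":"https://t.com/TY9DlZ584c","display_url":"pic.twitter.com/TY9DlZ584c"}'
cleaned = remove_urls_and_pics(record)

# The cleaned text should still parse as JSON.
print(json.loads(cleaned))  # prints: {'url': '', 'display_url': ''}
```

Unlike the previous pattern, this one does not consume the whitespace in front of a URL, so a URL in the middle of a sentence leaves a double space behind.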
