Python Replace/Remove all urls in string

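I am trying to replace every URL in a string with an empty string. The JSON below is a string, not an object. However, I am struggling to capture all the permutations.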

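Here is my Python script. However, if you look at https://regex101.com/r/r6tQ3B/2/, you will notice that the regex also removes the closing " and does not really capture the shorthand "t.co" or URLs in the middle of a string.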

import re

for filename in dataFiles:
    path = 'data/' + filename
    # Read the whole file into a string before substituting.
    with open(path) as r:
        text = r.read()
    text = re.sub(r'https?:\/\/\S*', '"', text, flags=re.MULTILINE)
    with open(path, "w") as w:
        w.write(text)

Test: https://regex101.com/r/r6tQ3B/1/

{
   "created_at":"Fri Aug 12 10:04:00 +0000 2016",
   "id":764039724818272256,
   "text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",
   "in_reply_to_screen_name":"theblaze",
   "source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
   "user":{
      "id":366636488,
      "id_str":"366636488",
      "name":"GIL DUPUY",
      "screen_name":"DUPUY77",
      "location":"Miami",
      "url":"http://ggm-dupuy.com",
      "description":"Fashion photographer, love action and adventure, care for the less fortunate, don't tolerate any kind of racism regardless of race or religion",
      "verified":false,
      "followers_count":186,
      "friends_count":446,
      "utc_offset":null,
      "time_zone":null,
      "lang":"en",
      "default_profile_image":false,
      "following":null,
      "notifications":null
   },
   "geo":null,
   "coordinates":null,
   "place":{
      "name":"Frontenac",
      "full_name":"Frontenac, MO",
      "country_code":"US",
      "country":"United States",
      "attributes":{
         
      }
   },
   "retweet_count":0,
   "favorite_count":0,
   "extended_entities":{
      "media":[
         {
            "id":764039718237409281,
            "id_str":"764039718237409281",
            "indices":[
               27,
               50
            ],
            "media_url":"http://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
            "media_url_https":"https://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
            "url":"https://t.com/TY9DlZ584c",
            "display_url":"pic.twitter.com/TY9DlZ584c",
            "expanded_url":"http://twitter.com/DUPUY77/status/764039724818272256/photo/1",
            "type":"photo",
            "sizes":{
               "medium":{
                  "w":640,
                  "h":1136,
                  "resize":"fit"
               },
               "large":{
                  "w":640,
                  "h":1136,
                  "resize":"fit"
               },
               "thumb":{
                  "w":150,
                  "h":150,
                  "resize":"crop"
               },
               "small":{
                  "w":383,
                  "h":680,
                  "resize":"fit"
               }
            }
         }
      ]
   },
   "favorited":false,
   "retweeted":false,
   "possibly_sensitive":false,
   "lang":"und"
}

Try this pattern: \s?(https?:\/\/[^\\\s"]*)

It is not very clean, but it works for your example.
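
A minimal sketch of how this pattern could be applied with re.sub, assuming the tweet JSON is held in a plain string as in the question. The helper name strip_urls and the shortened sample line are illustrative and not part of the original post.

```python
import re

# Pattern from the answer above: one optional leading whitespace character,
# then http:// or https://, then everything up to the next backslash,
# whitespace character or double quote.
URL_PATTERN = re.compile(r'\s?(https?:\/\/[^\\\s"]*)')


def strip_urls(text):
    """Hypothetical helper: delete every match of URL_PATTERN from text."""
    return URL_PATTERN.sub('', text)


# One line taken from the sample JSON in the question.
sample = '"text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",'
print(strip_urls(sample))
# "text":"@theblaze @realDonaldTrump",
```

Because the character class stops at the double quote, the closing " survives. Note that this pattern only matches URLs that start with http:// or https://, so values such as pic.twitter.com/TY9DlZ584c are left untouched.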

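This removes all URLs inside "", all URLs that are not inside "", and URLs that start with pic.twitter (these seem to be the only ones without http(s)).
It assumes that a URL contains no whitespace and no ": r"(?:https?:\/\/|pic\.twitter)[^\s\"\\]*"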
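
As a rough sketch, the pattern above could be used like this, again treating the JSON as plain text. The helper name remove_urls and the sample line (assembled from two fields of the question's JSON) are assumptions made for illustration.

```python
import re

# Pattern from the answer above: a URL starts with http://, https:// or
# "pic.twitter" and runs until whitespace, a double quote or a backslash.
URL_PATTERN = re.compile(r"(?:https?:\/\/|pic\.twitter)[^\s\"\\]*")


def remove_urls(text):
    """Hypothetical helper: replace every matched URL with an empty string."""
    return URL_PATTERN.sub("", text)


# Two fields from the sample JSON, joined on one line for brevity.
sample = '"url":"https://t.com/TY9DlZ584c","display_url":"pic.twitter.com/TY9DlZ584c",'
print(remove_urls(sample))
# "url":"","display_url":"",
```

In the script from the question, a call such as remove_urls(text) could take the place of the re.sub line once the file contents have been read into text.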
