Python Replace/Remove all urls in string
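I am trying to replace every URL in a string with an empty string. The JSON that follows is a string, not an object, and I am having trouble capturing all the permutations.
If you look at https://regex101.com/r/r6tQ3B/2/, you will notice that the regex also removes the closing ", and that it does not really capture the shorthand "t.co" links or URLs in the middle of a value.
Here is my Python script: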
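import re

for filename in dataFiles:
    path = 'data/' + filename
    with open(path) as r:
        text = r.read()  # load the whole file as a single string
    # \S* runs up to the next whitespace, so it also consumes the closing quote;
    # the '"' replacement is an attempt to put that quote back
    text = re.sub(r'https?:\/\/\S*', '"', text, flags=re.MULTILINE)
    with open(path, "w") as w:
        w.write(text)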
Test: https://regex101.com/r/r6tQ3B/1/
{
"created_at":"Fri Aug 12 10:04:00 +0000 2016",
"id":764039724818272256,
"text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",
"in_reply_to_screen_name":"theblaze",
"source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
"user":{
"id":366636488,
"id_str":"366636488",
"name":"GIL DUPUY",
"screen_name":"DUPUY77",
"location":"Miami",
"url":"http://ggm-dupuy.com",
"description":"Fashion photographer, love action and adventure, care for the less fortunate, don't tolerate any kind of racism regardless of race or religion",
"verified":false,
"followers_count":186,
"friends_count":446,
"utc_offset":null,
"time_zone":null,
"lang":"en",
"default_profile_image":false,
"following":null,
"notifications":null
},
"geo":null,
"coordinates":null,
"place":{
"name":"Frontenac",
"full_name":"Frontenac, MO",
"country_code":"US",
"country":"United States",
"attributes":{
}
},
"retweet_count":0,
"favorite_count":0,
"extended_entities":{
"media":[
{
"id":764039718237409281,
"id_str":"764039718237409281",
"indices":[
27,
50
],
"media_url":"http://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
"media_url_https":"https://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg",
"url":"https://t.com/TY9DlZ584c",
"display_url":"pic.twitter.com/TY9DlZ584c",
"expanded_url":"http://twitter.com/DUPUY77/status/764039724818272256/photo/1",
"type":"photo",
"sizes":{
"medium":{
"w":640,
"h":1136,
"resize":"fit"
},
"large":{
"w":640,
"h":1136,
"resize":"fit"
},
"thumb":{
"w":150,
"h":150,
"resize":"crop"
},
"small":{
"w":383,
"h":680,
"resize":"fit"
}
}
}
]
},
"favorited":false,
"retweeted":false,
"possibly_sensitive":false,
"lang":"und"
}
Try this pattern: \s?(https?:\/\/[^\\\s"]*)
It is not very clean, but it works for your example.
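As a minimal sketch of how that pattern could be applied with `re.sub`, assuming it is meant to be read as the raw string in the code below. The helper name `strip_urls` and the sample line are illustrative only and are not part of the original answer.

```python
import re

# Optional leading whitespace character, then http(s):// followed by any run of
# characters that are not a backslash, whitespace, or a double quote.
URL_PATTERN = re.compile(r'\s?(https?:\/\/[^\\\s"]*)')


def strip_urls(text: str) -> str:
    """Replace every match with an empty string (hypothetical helper)."""
    return URL_PATTERN.sub('', text)


sample = '"text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c",'
print(strip_urls(sample))
print(strip_urls('"url":"http://ggm-dupuy.com",'))
```

Because the character class stops at `"` and at `\`, the closing quote of each JSON value is left in place, and so are escaped quotes such as the `\"` inside the `source` field. URLs without a scheme, such as `pic.twitter.com/...`, are not matched by this pattern.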
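This removes all URLs inside "", all URLs not inside "", and URLs that start with pic.twitter (these seem to be the only URLs without http(s)).
Assuming the URLs contain no whitespace or ": r"(?:https?:\/\/|pic\.twitter)[^\s\"\\]*"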
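A short sketch of that second pattern in use, under the same assumption that URLs contain no whitespace or double quotes. The names `remove_urls` and `raw` are made up for illustration, and the `json.loads` call is only there to check that the cleaned text is still valid JSON.

```python
import json
import re

# Either an http(s):// prefix or the scheme-less "pic.twitter" prefix, followed
# by any run of characters that are not whitespace, a double quote, or a backslash.
URL_PATTERN = re.compile(r"(?:https?:\/\/|pic\.twitter)[^\s\"\\]*")


def remove_urls(text: str) -> str:
    """Delete every URL-like match from the text (hypothetical helper)."""
    return URL_PATTERN.sub("", text)


# Small stand-in for the tweet JSON, kept as a string rather than parsed.
raw = (
    '{"url":"https://t.com/TY9DlZ584c",'
    '"display_url":"pic.twitter.com/TY9DlZ584c",'
    '"name":"GIL DUPUY"}'
)

cleaned = remove_urls(raw)
print(cleaned)

# The closing quotes survive, so the result still parses as JSON.
parsed = json.loads(cleaned)
print(parsed["display_url"] == "")
```

Unlike the `\s?` variant, this pattern does not consume the whitespace in front of a URL, so a value such as `"@theblaze https://... @realDonaldTrump"` keeps its surrounding spaces after the URL is removed.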