
How to remove specific elements from a string in Python

I have a string containing a website's HTML data, stored in urldata:

urldata = BeautifulSoup(urlopen(urllib.request.Request(url, headers=headers), timeout=3).read(), features="html.parser")

When I print urldata, it shows the HTML of that page. I need to remove the http and https links from it.

I can already extract the http and https links this way:

import re

web_page = str(urldata)
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', web_page)
print(urls)

What I want now is to remove those http and https links from urldata.

I already have the URLs in the urls variable (a list).

Is there a way to compare the urls list against the web_page string and remove those URLs from it?

You can use re.sub() to substitute each URL with an empty string:

web_page = str(urldata)
web_page = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', web_page)
print(web_page)
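One thing to watch for: the character class [$-_@.&+] in the pattern above contains the range $ to _, which also covers <, > and /, so a URL that is immediately followed by a tag can take the tag with it. As a rough, self-contained sketch of the same re.sub() idea, the example below uses a simpler pattern that stops at whitespace, quotes and angle brackets. The pattern, the strip_urls helper and the sample string are my own assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical simplified pattern (not the one from the answer): match
# http:// or https://, then continue until whitespace, a quote, or an
# angle bracket is reached.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def strip_urls(text):
    """Return text with every match of URL_PATTERN removed."""
    return URL_PATTERN.sub('', text)

# Made-up sample standing in for str(urldata)
sample = '<p>Docs: https://example.com/a?b=1</p><a href="http://example.org">x</a>'
cleaned = strip_urls(sample)
print(cleaned)  # <p>Docs: </p><a href="">x</a>
```

Whether this pattern is strict enough depends on the pages being processed; it is meant as a starting point, not a complete URL grammar.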


UPDATE: To reuse the urls list you already have, replace each URL in turn:

web_page = str(urldata)
for url in urls:
    web_page = web_page.replace(url, '')
print(web_page)
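If the list can contain one URL that is a prefix of another, the order of replacement matters: removing the shorter one first leaves the tail of the longer one behind. A minimal sketch of one way around that, assuming it is acceptable to process the longest URLs first (the remove_urls helper and the sample data are hypothetical):

```python
def remove_urls(text, urls):
    """Remove every URL in urls from text using plain string replacement."""
    # Deduplicate, then go longest first so a short URL does not clip
    # the beginning of a longer URL that starts with it.
    for url in sorted(set(urls), key=len, reverse=True):
        text = text.replace(url, '')
    return text

# Made-up data standing in for web_page and the findall() result
web_page = 'see http://example.com/page and http://example.com now'
urls = ['http://example.com', 'http://example.com/page']
result = remove_urls(web_page, urls)
print(repr(result))  # 'see  and  now'
```

With the original list order and no sorting, the same input would leave a stray "/page" in the output.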


 