
Remove space from URL using regular expression

I have this paragraph:

The Daily Eastern News is a student-run newspaper published for the community of Eastern Illinois University in Charleston, Illinois. The newspaper was founded in 1915 http://media . www. dennews. com/media/storage/paper309/news/2005/11/04/News/TheNews. Turns.90-1045667. shtml and publishes on weekdays during the school year and twice-weekly in the summer.

The paper has won numerous state and national awards, including several Pacemaker awards. http://search . atomz. com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22 The paper's editorial, production, and advertising staff are composed entirely of students from a range of degree programs.

I want to remove the spaces from the URLs (the bold parts) in the paragraph above.

Expected output:

The Daily Eastern News is a student-run newspaper published for the community of Eastern Illinois University in Charleston, Illinois. The newspaper was founded in 1915 http://media.www.dennews.com/media/storage/paper309/news/2005/11/04/News/TheNews.Turns.90-1045667.shtml and publishes on weekdays during the school year and twice-weekly in the summer.

The paper has won numerous state and national awards, including several Pacemaker awards. http://search.atomz.com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22 The paper's editorial, production, and advertising staff are composed entirely of students from a range of degree programs.

Regex I tried: (http://(?:.)*?\\.) ((?:.)*?\\.) ((?:.)*?\\.) ((?:.)*?\\.) ((?:.)*?\\.)

But it works for the first URL and not for the second. I used ((?:.)*?\\.) to match each repeated group that ends in a dot and is followed by a space, and that doesn't seem to work for the second URL. Is there any way to do this for all URLs?

Check this: https://regex101.com/r/tB9oL5/7

Unfortunately, this is not possible unless you make assumptions, such as requiring the URLs to appear at the end of sentences or to end with .html (and that is unreasonable, especially because none of the links in your example appear at the end of a sentence or end with a common suffix). To illustrate why this is not possible, observe that you can't tell the difference between:

A new site: http://example.com/ appeared.

and:

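A new site: http://example.com/ appeared .

In both, the text after the space could be either the rest of the URL (http://example.com/appeared) or simply the next word of the sentence.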
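A minimal sketch of that ambiguity in Python. The two "intended" strings are hypothetical: one assumes the URL is http://example.com/ followed by the word "appeared", the other assumes the URL is http://example.com/appeared and a stray space was inserted after the slash.

```python
# Two different source texts (hypothetical, based on the example above).
intended_a = "A new site: http://example.com/ appeared."  # URL is http://example.com/
intended_b = "A new site: http://example.com/appeared."   # URL is http://example.com/appeared

# Simulate the corruption: a stray space lands inside the URL of the second text.
corrupted_b = intended_b.replace("example.com/", "example.com/ ")

# The corrupted second text is now identical to the untouched first text,
# so no function of the text alone can restore both correctly.
same = corrupted_b == intended_a
print(same)
```

Whatever rule a regex applies to this string, it has to pick one of the two readings, so it will be wrong for the other.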

Possibly something like this?

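url = 'http://search. atomz. com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22'
# split() with no arguments splits on any run of whitespace
parts = url.split()
# join the pieces back together with nothing between them
cleaned = ''.join(parts)
print(cleaned)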

which returns: http://search.atomz.com/search/?sp_a=sp01089f00&sp_f=iso-8859-1&sp_q=%22daily+eastern+news%22

You may need to extend it with variables etc., since this is a bare-bones snippet that assumes you have already isolated the URL from the surrounding text.
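To apply the same split-and-join idea to a whole paragraph, the broken URL has to be located first. The following is only a sketch under a stated assumption that is not in the question: stray whitespace inside a URL occurs only next to a dot (" ." or ". "). The names BROKEN_URL and fix_urls are hypothetical.

```python
import re

# Assumption (not guaranteed by the question): a URL continues across
# whitespace only when that whitespace is directly before or after a dot.
BROKEN_URL = re.compile(r'https?://\S+(?:\s+\.\s*\S+|(?<=\.)\s+\S+)*')

def fix_urls(text):
    """Strip all whitespace inside every span matched by BROKEN_URL."""
    return BROKEN_URL.sub(lambda m: re.sub(r'\s+', '', m.group(0)), text)

sample = ("The newspaper was founded in 1915 http://media . www. dennews. "
          "com/media/storage/paper309/news/2005/11/04/News/TheNews. "
          "Turns.90-1045667. shtml and publishes on weekdays.")
fixed = fix_urls(sample)
print(fixed)
```

This heuristic handles both URLs in the question, but it is not a general solution: a URL that ends a sentence, such as "http://example.com. Then", would be glued to the following word.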
