Convert raw HTML to text with JavaScript and regex
I have raw HTML containing link tags. My goal is to extract the href attribute from each link tag and keep all the text between tags, without the tags themselves. For example:
<br>#EXTINF:-1 tvg-name="1377",Страшное HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
<br>#EXTINF:-1 tvg-name="983" ,Первый канал HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
This has to be converted to:
#EXTINF:-1 tvg-name="1377",Страшное HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
#EXTINF:-1 tvg-name="983" ,Первый канал HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
I tried several different regexes. Here is what I did:
var source_text = $("#source").val();
var delete_start_of_link_tag = source_text.replace(/<a(.+?)href="/gi, "");
var delete_tags = delete_start_of_link_tag.replace(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gi, "");
The first replace deletes the start of each link tag up to the href value; the second deletes all tags, such as </a> and <br>.
After that, I want to delete all text from the end of the href value to the end of the line.
What regex should I use in the replace method, or is there a different way to do this conversion?
It looks like you're already using jQuery.
Get the href of each anchor:
$('a').each(function(){
var href = $(this).attr('href');
});
Get the text of each anchor:
$('a').each(function(){
var text = $(this).text();
});
You haven't shown a wrapper element around these, but you can get the text (without tags) of any selection:
var text = $('#some_id').text();
Formatting Anchor Tags
In your example, you are not removing the "> part from the HTML.
Use this code to remove everything after the closing quote (' or ") of the href value:
var delete_tags = delete_start_of_link_tag.replace(/".*/gi, "");
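Note that /".*/ removes everything from the first double quote on any line, so a line such as #EXTINF:-1 tvg-name="1377",... is cut off at tvg-name= as well. A minimal sketch of one way around this is to replace each whole anchor element with its href value before stripping the other tags. The function name htmlToPlaylist is hypothetical, and the sketch assumes every href value is quoted and that the input looks like the sample in the question.

```javascript
// Sketch only: `htmlToPlaylist` is a hypothetical helper, not code from the thread.
// Assumes href values are quoted and the markup resembles the question's sample.
function htmlToPlaylist(sourceText) {
  return sourceText
    // Replace each whole <a ...>...</a> element with the value of its href.
    // (["']) captures the opening quote and \1 requires the same closing quote.
    .replace(/<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>[\s\S]*?<\/a>/gi, "$2")
    // Remove any remaining tags, such as <br>.
    .replace(/<\/?\w+[^>]*>/g, "")
    // Trim every line and drop the empty ones.
    .split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; })
    .join("\n");
}
```

With the code from the question, this could be called as htmlToPlaylist($("#source").val()). Regexes remain fragile against arbitrary HTML, so this is only reasonable for input as regular as the sample.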
A few things to note:
1. The value in href is enclosed in single quotes (') or double quotes ("); both are valid.
2. A regex to match every href in a given string or content is href=["'].*?["']
3. Some patterns in href values that I came across are below.
http://www.so.com
https://www.so.com
www.so.com
//so.com
/socom.html
javascript*
mailto*
tel*
So if you want to format URLs, you have to consider the cases above, and I may have missed some.
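As a rough illustration of the points above, the sketch below collects href values with a regex and labels each one by the listed patterns. The names extractHrefs and classifyHref and the label strings are hypothetical; the regex uses a backreference so that the closing quote must match the opening one, and it assumes quoted attribute values.

```javascript
// Sketch only: `extractHrefs` and `classifyHref` are hypothetical helpers.
function extractHrefs(html) {
  // (["']) captures the opening quote; \1 requires the same quote to close.
  var re = /\bhref\s*=\s*(["'])(.*?)\1/gi;
  var hrefs = [];
  var match;
  while ((match = re.exec(html)) !== null) {
    hrefs.push(match[2]); // group 2 is the value between the quotes
  }
  return hrefs;
}

function classifyHref(href) {
  // javascript:, mailto: and tel: values are reported by scheme name.
  var scheme = /^(javascript|mailto|tel):/i.exec(href);
  if (scheme) return scheme[1].toLowerCase();
  if (/^https?:\/\//i.test(href)) return "absolute";   // http://www.so.com
  if (/^\/\//.test(href)) return "protocol-relative";  // //so.com
  if (/^\//.test(href)) return "root-relative";        // /socom.html
  return "relative";                                   // www.so.com and the like
}
```

The order of the checks matters: the protocol-relative test has to run before the root-relative one, because both kinds of value start with a slash.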