
Convert raw HTML to text with JavaScript and regex

I have raw HTML containing link tags. I want to extract the href attribute from each link tag and keep all the text between the tags, dropping the tags themselves. For example:

<br>#EXTINF:-1 tvg-name="1377",Страшное HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
<br>#EXTINF:-1  tvg-name="983" ,Первый канал HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>

should be converted to:

#EXTINF:-1 tvg-name="1377",Страшное HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
#EXTINF:-1  tvg-name="983" ,Первый канал HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2

I tried different regexes. Here is what I did:

  1. var source_text = $("#source").val();

  2. var delete_start_of_link_tag = source_text.replace(/<a(.+?)href="/gi, "");

    • deletes the beginning of the tag up to the href attribute
  3. var delete_tags = delete_start_of_link_tag.replace(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gi, "");

    • deletes all remaining tags, such as </a> and <br>


Then I want to delete everything after the href value up to the end of the line.

Which regex should I use in the replace method, or is there a different way to do this conversion?

Looks like you're already using jQuery.

Get the href of each anchor:

$('a').each(function(){
    var href = $(this).attr('href');
});

Get the text of each anchor:

$('a').each(function(){
    var text = $(this).text();
});

You haven't shown a wrapper element around these, but you can get the text (without tags) of any selection:

var text = $('#some_id').text();
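To apply the same idea to a raw HTML string rather than to elements already on the page, the string can first be parsed into a detached document. The following is a minimal sketch under assumptions: it uses the standard DOMParser and plain DOM properties instead of jQuery, the helper name nodesToLines is made up for illustration, and it assumes the text and the anchors are direct children of the parsed body, as in the question's sample.

```javascript
// Hypothetical helper: text nodes keep their text, <a> elements contribute
// their href, and every other node (for example <br>) is skipped.
function nodesToLines(nodes) {
  const lines = [];
  for (const node of nodes) {
    if (node.nodeType === 3) {                 // 3 is Node.TEXT_NODE
      const text = node.textContent.trim();
      if (text) lines.push(text);              // ignore whitespace-only nodes
    } else if (node.nodeName === 'A') {        // element names are upper-case in HTML documents
      const href = node.getAttribute('href');
      if (href) lines.push(href);
    }
  }
  return lines;
}

// Browser-only usage: DOMParser is not defined in Node.js.
if (typeof DOMParser !== 'undefined') {
  // Placeholder input; example.com is not from the original question.
  const html = '<br>#EXTINF:-1 tvg-name="1",Demo<br>\n' +
               '<a href="http://example.com/a.m3u8">link</a>';
  const doc = new DOMParser().parseFromString(html, 'text/html');
  console.log(nodesToLines(doc.body.childNodes).join('\n'));
}
```

Letting the parser handle attributes and quoting avoids having to cover every attribute order in a regex.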

Formatting Anchor Tags

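In your example, you are not removing the "> part from the HTML.

Use this code to remove everything from the closing quote of the href value to the end of the line. It matches from the first double quote on each line, so it also truncates lines that contain quotes of their own, such as the #EXTINF lines with tvg-name="...":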

var delete_tags = delete_start_of_link_tag.replace(/".*/gi, "");
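As an alternative that leaves the #EXTINF lines intact, the whole anchor can be replaced by its href in one pass before the remaining tags are stripped. This is only a sketch: the function name htmlToPlainLines is hypothetical, the sample input uses a placeholder example.com URL, and the patterns assume reasonably regular markup like the sample in the question, since a regex is not a general HTML parser.

```javascript
// Hypothetical helper: convert the playlist HTML into plain text lines.
function htmlToPlainLines(html) {
  return html
    // Replace each <a ...>...</a> with the value of its href attribute.
    // (["']) captures the opening quote and \1 requires the same closing quote.
    .replace(/<a\b[^>]*?\bhref=(["'])(.*?)\1[^>]*>[\s\S]*?<\/a>/gi, '$2')
    // Drop any remaining tags such as <br>.
    .replace(/<\/?\w+[^>]*>/g, '')
    // Trim each line and discard the empty ones.
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join('\n');
}

// Usage with placeholder data shaped like the question's input.
const sampleHtml = [
  '<br>#EXTINF:-1 tvg-name="1",Demo HD<br>',
  '<a title="x" rel="nofollow" href="http://example.com/go/?u=http%3A%2F%2Fhost%2Fa.m3u8" target="_blank">http://host/a.m3u8</a>',
].join('\n');
console.log(htmlToPlainLines(sampleHtml));
```

Because the link text is consumed together with the anchor, nothing is left after the href value that would need a separate cleanup step.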

A few things to note:
1. The value of href may be enclosed in single quotes (') or double quotes ("); both are valid.
2. A regex that matches every href in a given string is href=["'].*?["'] (inside a character class, | is a literal character rather than alternation, so it is not needed).
3. Some patterns I have come across in href values are listed below.

  • http://www.so.com
  • https://www.so.com
  • www.so.com
  • //so.com
  • /socom.html
  • javascript*
  • mailto*
  • tel*

So if you want to format URLs, you have to consider the above cases, and I may have missed some.
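The cases above could be handled along the following lines. This is a rough sketch, not a complete URL parser: the function names extractHrefs and classifyHref and the category labels are invented for illustration, and any shape not in the list falls through to 'other'.

```javascript
// Collect href values quoted with either ' or ". The backreference \1
// requires the closing quote to be the same character as the opening one.
function extractHrefs(html) {
  return Array.from(html.matchAll(/\bhref=(["'])(.*?)\1/gi), (m) => m[2]);
}

// Hypothetical classifier for the href shapes listed above.
function classifyHref(href) {
  if (/^https?:\/\//i.test(href)) return 'absolute';                 // http://www.so.com, https://www.so.com
  if (/^\/\//.test(href)) return 'protocol-relative';                // //so.com
  if (/^(javascript|mailto|tel):/i.test(href)) return 'non-http scheme';
  if (/^\//.test(href)) return 'root-relative';                      // /socom.html
  if (/^www\./i.test(href)) return 'missing scheme';                 // browsers resolve this as a relative path
  return 'other';
}

// Usage: print each href with its category.
const snippet = '<a href="https://www.so.com">a</a> <a href=\'/socom.html\'>b</a>';
for (const href of extractHrefs(snippet)) {
  console.log(href, classifyHref(href));
}
```

The protocol-relative check runs before the root-relative one because both start with a slash.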
