Best way to search for a plain-text string in an HTML string in C#?
This is the HTML string:
string htmlString = "<body lang=\"EN-US\" link=\"blue\" vlink=\"#954F72\"><div class=\"WordSection1\"><p class=\"MsoNormal\">Hi, </p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\">My name is Gaurav Illness.</p><p class=\"MsoNormal\"><span style=\"color:purple !important\">Today <b>MY relation</b>ship breakdown <span style=\"color:red\">happened?<o:p></o:p></span> </span></p><p class=\"MsoNormal\"><span style=\"color:red\"><o:p> </o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p><p class=\"MsoNormal\"><o:p> </o:p></p><p class=\"MsoNormal\" style=\"line-height:16.5pt\"><span style=\"font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#1F497D\">Thanks<span style=\"text-transform:uppercase\">,<o:p></o:p>";
I am extracting plain text from it using this function:
// Requires: using System.Text; using System.Text.RegularExpressions;
private static string extractTextFromHtml(string html)
{
    // Remove new lines since they are not visible in HTML
    html = html.Replace("\n", " ");
    // Remove tab spaces
    html = html.Replace("\t", " ");
    // Remove multiple white spaces from HTML
    html = Regex.Replace(html, "\\s+", " ");
    // Remove HEAD tag
    html = Regex.Replace(html, "<head.*?</head>", "",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove any JavaScript
    html = Regex.Replace(html, "<script.*?</script>", "",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Replace special characters like &, <, >, " etc.
    StringBuilder sbHTML = new StringBuilder(html);
    // Note: There are many more special characters, these are just the
    // most common. You can add new characters to these arrays if needed
    string[] OldWords = { "&nbsp;", "&amp;", "&quot;", "&lt;", "&gt;", "&reg;", "&copy;", "&bull;", "&trade;" };
    string[] NewWords = { " ", "&", "\"", "<", ">", "®", "©", "•", "™" };
    for (int i = 0; i < OldWords.Length; i++)
    {
        sbHTML.Replace(OldWords[i], NewWords[i]);
    }
    // Start a new line at each line break (<br>) or paragraph (<p>)
    sbHTML.Replace("<br>", "\n<br>");
    sbHTML.Replace("<br ", "\n<br ");
    sbHTML.Replace("<p ", "\n<p ");
    // Finally, remove all HTML tags and return plain text
    return Regex.Replace(sbHTML.ToString(), "<[^>]*>", "");
}
This function returns:
"Hi,
My name is Gaurav Illness. Today MY relationship breakdown happened?
I am GriESh and I Am drugger.
Thanks,"
Now I send this text to an API that detects whether or not there is an emotion in these sentences. The API responds with all the sentences that are emotional. For example, the API says "Today MY relationship breakdown happened?" is emotional. I want to mark this sentence in purple in the HTML, for which I have to add a span around the sentence. To do this, I have to find the start and end index of this sentence in the HTML code.
How can I find the start and end index of this sentence in the HTML code?
I have code that gives me the indexes, but I don't think it is the best way to do it. Can anyone suggest a better way? This is my code example:
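public static void findTextInHtml(string htmlCode)
{
    // Tags are skipped while matching, so the text of adjacent elements is
    // compared as if it were concatenated ("... and I" + "Am drugger.").
    string textToBeFind = "I am GriESh and IAm drugger.";
    int i = 0;            // position in htmlCode
    int j = 0;            // position in textToBeFind
    int startIndex = 0;
    int endIndex = 0;
    bool isHtml = false;  // true while i is inside a tag
    bool isbeingMatched = false;
    while (i < htmlCode.Length)
    {
        if (htmlCode[i] == '<')
        {
            isHtml = true;
            i++;
            continue;
        }
        if (htmlCode[i] == '>')
        {
            isHtml = false;
            i++;
            continue;
        }
        if (isHtml)
        {
            i++;
            continue;
        }
        if (textToBeFind[j] == htmlCode[i])
        {
            if (!isbeingMatched)
            {
                startIndex = i;
            }
            isbeingMatched = true;
            j++;
            if (j == textToBeFind.Length)
            {
                endIndex = i;
                break;
            }
        }
        else
        {
            // Mismatch: resume right after the failed start so that a match
            // beginning inside the abandoned candidate is not skipped.
            if (isbeingMatched)
            {
                i = startIndex;
            }
            isbeingMatched = false;
            j = 0;
        }
        i++;
    }
    // AddStartSpan and AddEndSpan are helpers that are not shown here.
    AddStartSpan(startIndex, htmlCode);
    AddEndSpan(endIndex, htmlCode);
}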
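The AddStartSpan and AddEndSpan helpers are not shown, so their behaviour is unknown. As a rough sketch of what the highlighting step might look like, the code below wraps each run of text between two HTML indexes in its own span rather than inserting a single start tag and a single end tag, since a sentence that crosses element boundaries would otherwise produce misnested markup. The names `SpanWrapper` and `WrapTextRuns` and the `color` parameter are hypothetical, and the sketch assumes both indexes are inclusive and point at text characters, not at characters inside a tag.

```csharp
using System;
using System.Text;

public static class SpanWrapper
{
    // Hypothetical helper: wraps every run of text between htmlStart and
    // htmlEnd (inclusive indexes into html) in its own <span>, leaving the
    // tags in between untouched so the nesting stays valid.
    public static string WrapTextRuns(string html, int htmlStart, int htmlEnd, string color)
    {
        string open = "<span style=\"color:" + color + " !important\">";
        const string close = "</span>";
        var sb = new StringBuilder(html.Length + 64);
        bool insideTag = false; // true while i is between '<' and '>'
        bool inRun = false;     // true while a highlight span is open

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<')
            {
                // Close the highlight before any tag begins.
                if (inRun) { sb.Append(close); inRun = false; }
                insideTag = true;
                sb.Append(c);
                continue;
            }
            if (c == '>' && insideTag)
            {
                insideTag = false;
                sb.Append(c);
                continue;
            }
            // Open a highlight at the first text character inside the range.
            if (!insideTag && !inRun && i >= htmlStart && i <= htmlEnd)
            {
                sb.Append(open);
                inRun = true;
            }
            sb.Append(c);
            // Close the highlight right after the last character of the range.
            if (inRun && i == htmlEnd) { sb.Append(close); inRun = false; }
        }
        if (inRun) sb.Append(close); // range ran past the end of the input
        return sb.ToString();
    }
}
```

Because the output is built in a single pass over the original string, the indexes never need adjusting for the tags that have already been inserted.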
Install the NuGet package HtmlAgilityPack. Then it's easy to parse like this:
using HtmlAgilityPack;

string htmlString = "<p class=\"MsoNormal\"><span style=\"color:red\"><o:p>&nbsp;</o:p></span></p><p class=\"MsoNormal\"><span style=\"color:red\">I am Gr</span><span style=\"font-size:15.0pt;color:red;background:yellow;mso-highlight:yellow\">iESh and I</span><span style=\"font-size:15.0pt;color:red\"><o:p></o:p></span></p><p class=\"MsoNormal\"><span style=\"font-size:15.0pt;color:#B4C7E7;mso-style-textfill-fill-color:#B4C7E7;mso-style-textfill-fill-alpha:100.0%\">Am drugger.<o:p></o:p></span></p>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlString);
// InnerText concatenates the text of all nodes and leaves entities undecoded
var inner = doc.DocumentNode.InnerText.TrimStartString("&nbsp;");
// inner = I am GriESh and IAm drugger.
To remove the &nbsp; at the start of the InnerText, use this extension method:
using System;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
    public static string TrimStartString(this string input, string prefixToRemove,
        StringComparison comparisonType = StringComparison.OrdinalIgnoreCase)
    {
        if (input != null && prefixToRemove != null
            && input.StartsWith(prefixToRemove, comparisonType))
        {
            return input.Substring(prefixToRemove.Length);
        }
        return input;
    }
}
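InnerText gives the plain text but not where that text sits in the markup, which is what the question asks for. One possible way to get the positions, sketched here with only the base class library, is to record the HTML index of every text character while stripping tags, search the plain text with `IndexOf`, and map the match back. The names `HtmlTextLocator` and `Find` are hypothetical. The sketch assumes the search text matches the concatenated text exactly: it does not decode entities such as `&nbsp;` or normalize whitespace, and it treats every `<` as the start of a tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTextLocator
{
    // Hypothetical helper: returns the inclusive start and end indexes, within
    // html, of the first occurrence of text among the characters that lie
    // outside tags, or null if the text is empty or not found.
    public static (int Start, int End)? Find(string html, string text)
    {
        if (string.IsNullOrEmpty(text)) return null;

        var plain = new StringBuilder();
        var map = new List<int>(); // map[k] = index in html of plain[k]
        bool insideTag = false;

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<') { insideTag = true; continue; }
            if (c == '>' && insideTag) { insideTag = false; continue; }
            if (insideTag) continue;
            plain.Append(c);
            map.Add(i);
        }

        int pos = plain.ToString().IndexOf(text, StringComparison.Ordinal);
        if (pos < 0) return null;
        return (map[pos], map[pos + text.Length - 1]);
    }
}
```

For example, `HtmlTextLocator.Find(htmlString, "I am GriESh and IAm drugger.")` would return the index of the leading `I` and of the final `.` in the markup, even though the sentence spans several elements.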