
Trying to convert string to proper format / encoding?

I have a program that screen-scrapes a French-language web page and finds a specific string. Once found, I take that string and save it. The string should read "User does not have a desktop configured." or, in French, "L'utilisateur ne dispose pas d'un bureau configuré.", but it actually shows up as: L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur** **. How can I get it to treat the \x26#39 as the apostrophe (') character?

Is there something in C# that I can use to read the URL and return the correct phrase?

I have looked at many of the available C# capabilities, but cannot find one that gives me the correct result.

Sample code I have been playing with:

// Translated the true French text to English to help out with this example.
//
Encoding winVar1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;

string url = "http://www.My-TEST-SITE.com/";
WebClient webClient = new WebClient();
webClient.Encoding = System.Text.Encoding.UTF8;
string result = webClient.DownloadString(url);
// Length of everything from "Search_TEXT=" to the end of the page
int cVar = result.Substring(result.IndexOf("Search_TEXT=")).Length;
result = result.Substring(result.IndexOf("Search_TEXT="),  cVar);
result = WebUtility.HtmlDecode(result);
result = WebUtility.UrlDecode(result);
result = result.Substring(0, result.IndexOf("Found: "));

This returns L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur** **. when it should return: L'utilisateur ne dispose pas d'un bureau configuré.

I am trying to get rid of the \x26#39 and get the proper French characters (é, ê, è, ç, â, etc.) to show correctly.

I can't be sure, but:

result = result.Substring(result.IndexOf("Search_TEXT="),  cVar);
result = WebUtility.HtmlDecode(result);
result = WebUtility.UrlDecode(result);

Double decoding the text can't be good. It's either a URL or HTML or neither. Not both.
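As a small illustration of why the second decode can do damage, here is a minimal sketch. The sample string and the class and method names are made up for this example and are not from the scraped page. It relies on WebUtility.UrlDecode turning every "+" into a space, which corrupts text that was never URL-encoded.

```csharp
using System;
using System.Net;

public static class DecodeOnceDemo
{
    // Decode HTML entities only.
    public static string HtmlOnly(string s) => WebUtility.HtmlDecode(s);

    // Decode HTML entities, then URL-decode as well (the double decode from the question).
    public static string HtmlThenUrl(string s) => WebUtility.UrlDecode(WebUtility.HtmlDecode(s));

    public static void Main()
    {
        // Hypothetical scraped text: HTML-encoded, but not URL-encoded.
        string scraped = "a+b &lt; c";

        Console.WriteLine(HtmlOnly(scraped));    // a+b < c
        Console.WriteLine(HtmlThenUrl(scraped)); // a b < c  (the '+' was lost)
    }
}
```

If the text only contains HTML entities, the HtmlDecode call on its own should be enough.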

It looks like your first issue is not with character encoding, but with someone's custom combination of a "\x" escape sequence and obscured HTML entities.

That funny **\x26#39**; is actually just a simple single quote. The hex escape \x26 translates to &, so you get **&#39**;. Remove the extraneous stars and you get the HTML entity &#39;. With HtmlDecode this becomes the simple apostrophe, ', which is just ASCII character 39.

Try out this snippet. Note that only in the last step are we able to do an HtmlDecode.

using System;
using System.Text.RegularExpressions;

var input = @"L**\x26#39**;utilisateur ne dispose pas d**\x26#39**;un bureau configur**�**";

// Take out the extra stars, keeping whatever they wrap
var result = Regex.Replace(input, @"\*\*([^*]*)\*\*", "$1");

// Unescape \x values (two hex digits each), e.g. \x26 becomes &
result = Regex.Replace(result,
                       @"\\x([a-fA-F0-9]{2})",
                       match => char.ConvertFromUtf32(Int32.Parse(match.Groups[1].Value,
                                                                  System.Globalization.NumberStyles.HexNumber)));

// Decode html entities, e.g. &#39; becomes '
result = System.Net.WebUtility.HtmlDecode(result);

The output is L'utilisateur ne dispose pas d'un bureau configur

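The second issue is with the accented "e". That actually is an encoding issue: the downloaded bytes are most likely being decoded with a different encoding than the one the page was written in, so check the charset the page declares (in the Content-Type header or a meta tag) and set webClient.Encoding to match. If that doesn't settle it, you'll probably have to keep playing around to get it right. You might also want to try UTF-16 or even UTF-32. But HtmlAgilityPack might just take care of this for you automatically.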
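To show what such a mismatch looks like, here is a minimal sketch that runs without any network access. It assumes, purely for illustration, that the page is served as ISO-8859-1; the real page's charset is not known from the question. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decode raw bytes using the named charset.
    public static string Decode(byte[] raw, string charset) =>
        Encoding.GetEncoding(charset).GetString(raw);

    public static void Main()
    {
        // "configuré" as ISO-8859-1 bytes: the é is the single byte 0xE9.
        // The ISO-8859-1 charset is an assumption for this example.
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes("configur\u00E9");

        // Decoded as UTF-8, the lone 0xE9 byte is invalid and becomes U+FFFD.
        Console.WriteLine(Decode(raw, "utf-8"));

        // Decoded with the matching charset, the é survives.
        Console.WriteLine(Decode(raw, "iso-8859-1"));
    }
}
```

With WebClient, the equivalent fix would be to set webClient.Encoding to the page's actual charset before calling DownloadString, rather than hard-coding UTF-8.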

