Converting accented characters and HTML entities into UTF-8?

I'm working on a project that will let me download stories from Portkey.org to read on my Kindle, and I can't for the life of me figure out how to properly encode/parse the HTML grabbed from the website. I am using simple_html_dom to grab it, and I pass the innertext of the main element that holds the story on for parsing.

What I'm trying to accomplish is the following:

  1. Grab the HTML from a Portkey.org story.
  2. Convert all HTML entities on the page to regular characters for reading (entities like &rdquo; to ”, &ldquo; to “, &hellip; to …, and so on).
  3. Any accented characters or characters of other languages (like Korean, Japanese, Chinese, etc.) should stay as they are.
  4. Fix the HTML using tidy and save it to a .html file.

Everything I have tried so far results in one of the following:

  • A diamond with a question mark inside it where the accented characters should be
  • Broken UTF-8 characters where there should be quotation marks and ellipses, while accented characters show correctly
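
One way to tell these two symptoms apart is to check whether the grabbed bytes are valid UTF-8 at all: the diamond with a question mark is what a browser draws for a byte sequence that is not valid UTF-8. The sketch below is only an illustration. `looksLikeUtf8` is a hypothetical helper name, the sample strings are made up rather than taken from the site, and it assumes the mbstring extension and PHP 7 or later.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$singleByte = "fianc\xE9";      // é as one Windows-1252 / Latin-1 byte
$utf8       = "fianc\xC3\xA9";  // é as the two-byte UTF-8 sequence

var_dump(looksLikeUtf8($singleByte)); // bool(false)
var_dump(looksLikeUtf8($utf8));       // bool(true)
```

If the check fails on the raw page, the input is probably in a single-byte encoding and needs converting before any entity handling.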

A sample from the story HTML:

<p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p>

EDIT

html_entity_decode results in the following output:

 Wel [snip] your emotions…but most impor [snip] ng fiancé

As you can see, the accented character is correct, but the &hellip; now displays incorrectly.

EDIT 2:

Results of get_html_translation_table(HTML_ENTITIES):

array(252) { ["""]=> string(6) "&quot;" ["&"]=> string(5) "&amp;" ["<"]=> string(4) "&lt;" [">"]=> string(4) "&gt;" [snip] ["é"]=> string(8) "&eacute;" [snip] ["…"]=> string(8) "&hellip;" [snip] ["♦"]=> string(7) "&diams;" }
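
The table maps each literal character to its named entity, so individual entries can be looked up directly. This is a minimal sketch, assuming PHP 7 or later (for the `"\u{...}"` escapes) and passing the flags and charset explicitly instead of relying on the defaults; the variable names are illustrative only.

```php
<?php
// Request the HTML 4.01 entity table with UTF-8 encoded keys.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

$ellipsis = "\u{2026}"; // the horizontal ellipsis character
$eAcute   = "\u{E9}";   // e with acute accent

echo $table[$ellipsis], PHP_EOL; // &hellip;
echo $table[$eAcute], PHP_EOL;   // &eacute;
```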

EDIT 3:

Just for full disclosure, here is a test file I have set up for the purpose of figuring this out. Currently, all entities display correctly, but accented characters display as � (the diamond with a question mark).

<?php

header('Content-Type: text/html; charset=UTF-8');

require_once('_RESOURCES/simple_html_dom.php');

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';

function tidyHTML($html) {
    ob_start();
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();
    $input = $tidy;
    return $input;
}

function filter($html) {
    $html = preg_replace('~>\s+<~', '><', $html);
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    $html = str_replace('<br>', '', $html);
    $output = $html;
    return $output;
}

$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }

$entities = html_entity_decode($chapter_html->innertext, ENT_QUOTES, 'UTF-8');

echo tidyHTML(filter($entities));

// var_dump(get_html_translation_table(HTML_ENTITIES));

?>

You probably want html_entity_decode. From the documentation: "converts all HTML entities in the string to their applicable characters." Depending on your PHP version and setup, you may have to specify the encoding manually. Something like:

html_entity_decode($raw_text, ENT_QUOTES, 'UTF-8');
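
As a rough illustration of that call, here is a small self-contained sketch. The sample string is made up (loosely modelled on the snippet in the question) and is assumed to be valid UTF-8 already; the `"\u{...}"` escapes assume PHP 7 or later.

```php
<?php
// Made-up sample: named entities mixed with markup, no raw accented bytes.
$raw = '<p>your emotions&hellip;but most important, fianc&eacute; &ldquo;quoted&rdquo;</p>';

// Decode named entities into UTF-8 characters; the tags are left untouched.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $decoded, PHP_EOL;
```

The charset argument only controls how decoded characters are written out. It does not repair input bytes that are in some other encoding.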

Tidy may be re-encoding your entities. I'm not sure how complex your input strings are, but you could consider just dropping the HTML tags with something like strip_tags if you don't need the formatting to match exactly.
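
A minimal sketch of the strip_tags route, using a made-up input string and assuming PHP 7 or later for the `"\u{...}"` escape in the comment:

```php
<?php
// Made-up sample markup; the real story HTML will be more complex.
$html = '<p>Well, <b>hello</b>&hellip;<br>goodbye</p>';

// Strip the tags first, then decode entities, so that a decoded
// &lt; or &gt; is never mistaken for markup and removed.
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

echo $text, PHP_EOL; // Well, hello…goodbye
```

This throws away all formatting, so it only fits if plain text is acceptable on the reading device.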

I accomplished what I set out to do by changing the encoding passed to tidy from

$tidy->parseString($html, $config, 'utf8');

to

$tidy->parseString($html, $config, 'win1252');

This converted the accented characters to HTML entities. I then used html_entity_decode to convert all of the entities into UTF-8 characters.
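
If the page's bytes really are Windows-1252 (the win1252 fix suggests this, but nothing in this thread confirms it), another option is to convert the text to UTF-8 explicitly and then decode the entities. The following is a sketch under that assumption, not the method used above; the sample bytes are made up, and it needs the mbstring extension and PHP 7 or later.

```php
<?php
// Made-up Windows-1252 input: \xE9 is é, \x93 and \x94 are curly quotes.
$win1252 = "fianc\xE9 \x93quoted\x94 emotions&hellip;";

// Step 1: re-encode the raw bytes as UTF-8.
$utf8 = mb_convert_encoding($win1252, 'UTF-8', 'Windows-1252');

// Step 2: decode named entities, writing the results as UTF-8 too.
$decoded = html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');

echo $decoded, PHP_EOL;
```

Converting first keeps both the raw accented bytes and the decoded entities in one encoding before the text reaches tidy or the browser.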

New test file (works!)

<?php

header('Content-Type: text/html; charset=UTF-8');

require_once('_RESOURCES/simple_html_dom.php');

$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';

function tidyHTML($html) {
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    // Parse as Windows-1252 so tidy emits accented characters as HTML entities.
    $tidy->parseString($html, $config, 'win1252');
    $tidy->cleanRepair();
    // Return the repaired markup as a plain string.
    return tidy_get_output($tidy);
}

function filter($html) {
    $html = preg_replace('~>\s+<~', '><', $html);
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    $html = str_replace('<br>', '', $html);
    $output = $html;
    return $output;
}

$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }

echo filter(html_entity_decode(tidyHTML($chapter_html->innertext)));

?>

Couldn't have done it without you, Skunkwaffle!
