How to handle all special characters in an HTML/PHP form that outputs to XML

I have a small PHP/MySQL app that takes input from a form, stores it in a MySQL database, and outputs the data as XML for consumption by a radio-playing hardware device.

The problem is ampersands and other special characters. The user takes descriptions of various radio stations, along with a streaming URL or playlist URL, and pastes them into the form. Some of the stations are in non-English-speaking countries (mostly French-speaking). I need to know how to preprocess these fields so that the generated XML is not corrupted, because malformed XML breaks the external hardware app.

I assume this should go into the PHP script that is called when the form is submitted. I'm pretty sure the htmlspecialchars function should be used, but I'm not sure of the best method, since I've hacked this together from a variety of sources:

UPDATE: Here is my current output code with some regex that cleans up the ampersands.

<?php
include("HLN/manager/connect.php");

$query = "SELECT * FROM hln_stations ORDER BY orderid ASC";
$result = mysql_query($query);

$num = mysql_num_rows($result);
mysql_close();

// Send the header before any XML is written.
header('Content-type: text/xml');

$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument();
$xml->setIndent(true);

// XML element name => database column
$fields = array(
    'title'            => 'station_title',
    'descriptionline1' => 'station_display_name',
    'descriptionline2' => 'station_subtitle',
    'description'      => 'station_detailed_description',
    'sdimage'          => 'sdtv_thumbnail_graphic_url',
    'hdimage'          => 'hdtv_thumbnail_graphic_url',
    'uri'              => 'stream_url_or_playlist_url',
    'linktype'         => 'link_type',
);

// Matches any ampersand that does not already start an entity.
$bareAmpersand = '/&(?![A-Za-z0-9#]{1,7};)/';

$xml->startElement('channels');

while ($row = mysql_fetch_assoc($result)) {
    $xml->startElement("channel");
    foreach ($fields as $element => $column) {
        $xml->startElement($element);
        $xml->writeRaw(preg_replace($bareAmpersand, '&amp;', $row[$column]));
        $xml->endElement();
    }
    $xml->endElement(); // end channel
}

$xml->endElement(); // end channels
$xml->flush();

?>

But I still need to solve the French character set issues that are cropping up. How can I replace the é character, for example, with something that doesn't cause problems?

Firefox reports a "not well-formed" error because the character set it detects doesn't match the character set you actually output. I tried various combinations of character sets and could reproduce the issue.

You have to specify your character set explicitly, for example:

header('Content-type: text/xml; charset=UTF-8');
$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument("1.0", "UTF-8");

If specifying UTF-8 in both the content type and the XML declaration gives you an error, your input is not valid UTF-8. Try ISO-8859-15 instead, or recode your input.
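One way to "recode your input" is to check each value before writing it and convert only the ones that are not valid UTF-8. This is a minimal sketch, assuming the mbstring extension is available and that any non-UTF-8 value is ISO-8859-15. The function name ensureUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is ISO-8859-15.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-15'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

// "Chérie" as ISO-8859-15 bytes (é is the single byte 0xE9)
$latin = "Ch\xE9rie";
echo ensureUtf8($latin), "\n";
```

Running each field through such a helper before it reaches XMLWriter keeps the output consistent with the declared UTF-8 encoding, as long as the fallback guess is right for your data.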

You have to send the Content-Type charset header on every page of your site, including the form used to enter data, or your special characters may get mangled. You also have to connect to MySQL specifying the character set you want to use for the connection, and it should match the charset and collation of your tables.

Supposing that you're using UTF-8, look at your database with phpMyAdmin over a UTF-8 connection. If your special characters don't display correctly there, something is going wrong earlier in the chain.
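Setting the connection character set could look roughly like this. Note that this sketch uses mysqli rather than the mysql_* functions from the question, and that the helper name openStationDb, the credentials passed to it, and the utf8mb4 charset are all assumptions; use whatever charset your tables actually have.

```php
<?php
// Hypothetical helper; host, user, password and database name are placeholders.
function openStationDb(string $host, string $user, string $pass, string $name): mysqli
{
    $db = new mysqli($host, $user, $pass, $name);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }
    // Make the connection charset match the tables (assumed here to be utf8mb4).
    if (!$db->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    return $db;
}
```

The same connection settings should be used by both the form handler that writes the data and the script that reads it back out as XML.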

As for the device: if, as you say, it can display only ASCII characters, does it do the conversion for you when you give it UTF-8 input, or do you have to supply a character reference such as:

Ch&#xE9;rie 
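If the device does need character references, they can be generated rather than typed by hand. The following is a sketch, assuming the input is valid UTF-8 and the mbstring extension is available; the function name toAsciiXmlText is hypothetical. It produces the decimal form (&#233;), which is equivalent to the hexadecimal &#xE9; above.

```php
<?php
// Hypothetical helper: escape XML markup characters, then replace every
// non-ASCII character with a decimal character reference.
function toAsciiXmlText(string $utf8Text): string
{
    // Escape &, < and > for XML element content (quotes are left alone).
    $escaped = htmlspecialchars($utf8Text, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
    // Map: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo toAsciiXmlText("Ch\u{E9}rie FM & Co"), "\n"; // Ch&#233;rie FM &amp; Co
```

Because the result is already escaped, it would have to be written with writeRaw(). Also note that htmlspecialchars() escapes every ampersand, so data that already contains entities such as &amp; would end up double-escaped.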

If neither of those two options works, you may want to convert to plain ASCII, such as "Cherie", but that would be a last resort.


Proof of concept code without using the DB:

<?php

header('Content-type: text/xml; charset=UTF-8');

$radioArr = array(
   array("Chérie FM @Work", "http://www.listenlive.eu/cheriefm_atwork.m3u?p&test"), 
   array("Hélène FM", "http://broadcast.infomaniak.ch/helenefm-high.mp3.m3u")
);
$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument("1.0", "UTF-8");
$xml->setIndent(true);
$xml->startElement('channels');
foreach ($radioArr AS $radio) {
     $xml->startElement("channel");

     $xml->startElement("title");
     $xml->writeRaw(preg_replace('/&(?![A-Za-z0-9#]{1,7};)/','&amp;', $radio[0]));
     $xml->endElement();

     $xml->startElement("uri");
     $xml->writeRaw(preg_replace('/&(?![A-Za-z0-9#]{1,7};)/','&amp;', $radio[1]));
     $xml->endElement();

     $xml->endElement(); //end channel
}

$xml->endElement();
$xml->flush();

?>

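If you really want to "clean up" French characters (that is, remove the accents), what about using iconv?

iconv('UTF-8', 'ASCII//TRANSLIT', $text);

Be aware that the exact transliteration depends on the iconv implementation and locale on your server, so test it with your own data.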

Wrap the data in CDATA: use writeCData() instead of writeRaw(). See the sample below.

// CData output
$xml->startElement('title');
$xml->writeCData($row['station_subtitle']);
$xml->endElement();
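As an alternative to CDATA, XMLWriter can also escape text itself: writeElement() and text() convert &, < and > in the value they are given, so no regex is needed. A minimal sketch, using an in-memory buffer so the result can be inspected and the sample values from the proof of concept in this thread:

```php
<?php
// Sketch only: writes to memory instead of php://output.
$xml = new XMLWriter();
$xml->openMemory();
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('channel');
// writeElement() escapes &, < and > in the text it is given.
$xml->writeElement('title', "Ch\u{E9}rie FM @Work");
$xml->writeElement('uri', 'http://www.listenlive.eu/cheriefm_atwork.m3u?p&test');
$xml->endElement(); // end channel
$xml->endDocument();
$out = $xml->outputMemory();
echo $out;
```

This suits data stored as plain, unescaped text. A value that already contains an entity such as &amp; would come out as &amp;amp;. With CDATA, keep in mind that a CDATA section cannot contain the sequence ]]>.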
