Using the following PowerShell script, I am converting a directory of Word documents to HTML.

$wdTypes = Add-Type -AssemblyName 'Microsoft.Office.Interop.Word' -Passthru
$docSrc = "C:\Users\Me\Desktop\TestWordDocs"
$htmlOutputPath = "C:\Users\Me\Desktop\TestHTMLDocs"
$srcFiles = Get-ChildItem $docSrc -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatHTML");
$wordApp = new-object -comobject word.application
$wordApp.Visible = $false

# Opens the current $doc in Word, saves it as HTML, then closes it.
function saveashtml {
  $openDoc = $wordApp.documents.open($doc.FullName);
  # $($doc.BaseName) is needed here: "$doc.fullname" expands only $doc inside a string.
  $openDoc.saveas([ref]"$htmlOutputPath\$($doc.BaseName).html", [ref]$saveFormat);
  $openDoc.close();
}

ForEach ($doc in $srcFiles) {
  Write-Host "Converting to html :" $doc.FullName
  saveashtml
  $doc = $null
}

# Release the Word instance once every file has been converted.
$wordApp.quit();


This converts the files successfully, but the output is not UTF-8, as the meta tag shows:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

Special characters are displayed as � in the HTML file.

How can I fix this?

Windows 10 64-bit. PowerShell 5.1 and 7rc.1.

HTML 4 and HTML 5 documents should be saved using the UTF-8 character encoding. In Windows PowerShell 5.1 the default output encoding depends on the cmdlet (Out-File and > write UTF-16LE, Set-Content writes the system ANSI code page), and -Encoding UTF8 adds a BOM; PowerShell 6 and later default to UTF-8 without a BOM. <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> only declares an encoding to the browser; it has nothing to do with the character encoding the document is actually saved in.
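To make the difference between declared and actual encoding concrete, here is a minimal sketch that encodes one sample string both ways and then misreads the windows-1252 bytes as UTF-8. The sample text and variable names are made up for illustration, and it assumes code page 1252 is available to .NET, which should be the case in Windows PowerShell 5.1 and PowerShell 7.

```powershell
# Hypothetical sample text: "caf" followed by e-acute (U+00E9), built from a
# code point so the script itself stays pure ASCII.
$text = 'caf' + [char]0xE9

$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$utf8   = New-Object System.Text.UTF8Encoding($false)   # $false = no BOM

# The same string produces different bytes in each encoding.
$bytes1252 = $cp1252.GetBytes($text)   # e-acute is the single byte 0xE9
$bytesUtf8 = $utf8.GetBytes($text)     # e-acute is the two bytes 0xC3 0xA9

# Decoding windows-1252 bytes as UTF-8 turns the invalid byte into U+FFFD,
# which is the character browsers render as the replacement glyph.
$misread = $utf8.GetString($bytes1252)

'{0} bytes as windows-1252, {1} bytes as UTF-8' -f $bytes1252.Length, $bytesUtf8.Length
'Misread text contains U+FFFD: {0}' -f $misread.Contains([string][char]0xFFFD)
```

This is why changing the meta tag alone is not enough: the bytes on disk have to match the encoding the tag declares.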

You have at least three jobs:

  1. Replace charset=windows-1252 with charset=UTF-8.
  2. Save your documents using the UTF-8 character encoding.
  3. Check your output for errors.

Use your conversion script of choice. I like Thomas Stensitzki's Convert-WordDocument.ps1 for converting Word documents with PowerShell. Like most conversion scripts, it requires Apache OpenOffice (around v4.1.7) or Microsoft Word (version 12, I think; Thomas says Word 16) to be installed locally. It converts a 5 MB Word 2003 document with 16 images to HTML in under twelve seconds.

Change your http-equiv meta element if necessary:

<meta http-equiv=Content-Type content="text/html; charset=UTF-8"> for HTML 4 documents 


<meta charset="UTF-8"> for HTML 5 documents.

A sitemap I created 012420 at xml-sitemaps.com used both:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">

Save or create the document using the UTF-8 character encoding.

What works in PowerShell 5.1 may be easier in PowerShell 6 or later, because later versions of PowerShell default to the UTF-8 character encoding. Read the links below.
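If you would rather not depend on each version's defaults, one option is to name both encodings explicitly through .NET. The sketch below is only an illustration: Convert-HtmlToUtf8 and its parameters are hypothetical names, and it assumes the source file really is saved as windows-1252, as its meta tag claims.

```powershell
# Hypothetical helper: the function name and parameters are illustrative only.
function Convert-HtmlToUtf8 {
    param(
        [Parameter(Mandatory)] [string] $Path,         # existing windows-1252 file
        [Parameter(Mandatory)] [string] $Destination   # UTF-8 file to write
    )

    # Assumption: the source bytes are windows-1252 (code page 1252).
    $cp1252    = [System.Text.Encoding]::GetEncoding(1252)
    $utf8NoBom = New-Object System.Text.UTF8Encoding($false)

    # Decode with the real source encoding, fix the declaration, then
    # re-encode as UTF-8 without a BOM.
    $html = [System.IO.File]::ReadAllText($Path, $cp1252)
    $html = $html -replace 'charset=windows-1252', 'charset=UTF-8'
    [System.IO.File]::WriteAllText($Destination, $html, $utf8NoBom)
}
```

A call might look like Convert-HtmlToUtf8 -Path "$env:userprofile\Desktop\source.html" -Destination "$env:userprofile\Desktop\output.html". Pass full paths, because .NET file methods resolve relative paths against the process working directory rather than the current PowerShell location.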

PowerShell 5.1:

# without overwriting. UTF-8 character encoding format.
$source = (gc $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8"
$output = "$env:userprofile\Desktop\output.html"
[IO.File]::WriteAllLines($output, $source)

PowerShell 7rc.1:

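# PowerShell 7 reads BOM-less files as UTF-8 by default, so -Encoding 1252 (PowerShell 6.2+)
# tells gc to decode the source as windows-1252. This assumes the file really is windows-1252.
# without overwriting. UTF-8 character encoding format.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\output.html
# with overwriting. UTF-8 character encoding format.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\source.html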

Batch convert with PowerShell 7rc.1:

# with overwriting. UTF-8 character encoding format.
# -Encoding 1252 assumes every source file is still saved as windows-1252.
foreach ($i in ls -name "$env:userprofile\Desktop\*.html") {
    (gc -Encoding 1252 "$env:userprofile\Desktop\$i") -replace "charset=windows-1252", "charset=UTF-8" | out-file -force "$env:userprofile\Desktop\$i"
}

That should display your special characters correctly.
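For the third job, checking the output, a rough sketch like the following may help. Test-HtmlEncoding is a hypothetical helper, not part of any module; it reports whether a file's bytes are valid UTF-8, whether it starts with a BOM, whether it declares UTF-8, and whether it already contains the U+FFFD replacement character.

```powershell
# Hypothetical helper: reports likely encoding problems in one HTML file.
function Test-HtmlEncoding {
    param([Parameter(Mandatory)] [string] $Path)

    $bytes = [System.IO.File]::ReadAllBytes($Path)

    # A UTF-8 BOM is the byte sequence EF BB BF at the start of the file.
    $hasBom = $bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and
              $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF

    # Strict decoder: the second argument makes it throw on invalid UTF-8
    # instead of silently substituting U+FFFD.
    $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
    try {
        $text = $strictUtf8.GetString($bytes)
        $validUtf8 = $true
    } catch {
        $text = ''
        $validUtf8 = $false
    }

    [pscustomobject]@{
        Path               = $Path
        ValidUtf8          = $validUtf8
        HasBom             = $hasBom
        DeclaresUtf8       = $text -match 'charset="?utf-8'
        HasReplacementChar = $text.Contains([string][char]0xFFFD)
    }
}
```

Running it over a folder, for example Get-ChildItem "$env:userprofile\Desktop\*.html" | ForEach-Object { Test-HtmlEncoding -Path $_.FullName }, gives one row per file. A file with ValidUtf8 false was probably never re-encoded, and one with HasReplacementChar true was likely decoded with the wrong encoding before it was saved.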

