Use PowerShell to save Word document as HTML with encoding

Using the following PowerShell script, I am converting a directory of Word documents to HTML.

# Load the Word interop assembly so the WdSaveFormat enum is available.
$wdTypes = Add-Type -AssemblyName 'Microsoft.Office.Interop.Word' -Passthru
$docSrc = "C:\Users\Me\Desktop\TestWordDocs"
$htmlOutputPath = "C:\Users\Me\Desktop\TestHTMLDocs"
$srcFiles = Get-ChildItem $docSrc -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatHTML");
$wordApp = new-object -comobject word.application
$wordApp.Visible = $false

# Takes the file as a parameter instead of relying on the loop variable.
function saveashtml($doc) {
  $openDoc = $wordApp.documents.open($doc.FullName);
  # "$doc.fullname.html" would expand only $doc; $($doc.BaseName) expands the property.
  $openDoc.saveas([ref]"$htmlOutputPath\$($doc.BaseName).html", [ref]$saveFormat);
  $openDoc.close();
}

ForEach ($doc in $srcFiles) {
  Write-Host "Converting to html :" $doc.FullName
  saveashtml $doc
}

$wordApp.quit();

This successfully converts the files, but the output is not UTF-8, as the meta tag shows:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

Special characters are displayed as � in the HTML file.

How can I fix this?

Windows 10 64-bit. PowerShell 5.1 and 7rc.1.

Use PowerShell to convert Microsoft Word documents to HTML 4 / 5 documents.

HTML 4 and 5 documents should be saved using the UTF-8 character encoding. Windows PowerShell (versions below 6) does not write BOM-less UTF-8 by default: Out-File and > default to UTF-16LE, and Set-Content defaults to the system ANSI code page. The element <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> only declares an encoding to the browser; editing it does not change the encoding the file's bytes are actually stored in.

You have at least three jobs:

  1. Replace charset=windows-1252 with charset=UTF-8.
  2. Save your documents using the UTF-8 character encoding.
  3. Check your output for errors.
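Jobs 1 and 2 can be sketched as a single helper that does not depend on any PowerShell version's default encoding. This is only a sketch: the function name Convert-HtmlToUtf8 and its parameters are made up for illustration, and it assumes the source file really is Windows-1252, as Word's meta tag declares. Pass full paths, because the .NET file APIs resolve relative paths against the process directory rather than the current PowerShell location.

```powershell
# Hypothetical helper (not from the original answer): re-encode one HTML file.
function Convert-HtmlToUtf8 {
    param(
        [Parameter(Mandatory)] [string] $Path,        # existing Windows-1252 HTML file (full path)
        [Parameter(Mandatory)] [string] $Destination  # where to write the UTF-8 copy (full path)
    )

    # On PowerShell 6+ (.NET Core) legacy code pages come from a provider type.
    # That type is absent on Windows PowerShell 5.1, where 1252 is built in.
    $provider = 'System.Text.CodePagesEncodingProvider' -as [type]
    if ($provider) { [System.Text.Encoding]::RegisterProvider($provider::Instance) }

    $cp1252    = [System.Text.Encoding]::GetEncoding(1252)
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false   # $false = no BOM

    # Decode explicitly as Windows-1252 (assumption: that is what Word wrote).
    $html = [System.IO.File]::ReadAllText($Path, $cp1252)

    # Job 1: fix the declaration (-replace is case-insensitive).
    $html = $html -replace 'charset=windows-1252', 'charset=UTF-8'

    # Job 2: store the bytes as UTF-8 without a BOM.
    [System.IO.File]::WriteAllText($Destination, $html, $utf8NoBom)
}
```

For example, Convert-HtmlToUtf8 -Path "$env:userprofile\Desktop\source.html" -Destination "$env:userprofile\Desktop\output.html" would write a converted copy next to the original.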

Use your conversion script of choice. I like Thomas Stensitzki's Convert-WordDocument.ps1 for converting Word documents with PowerShell. Like most conversion scripts, it requires Apache OpenOffice ~v4.1.7 or Microsoft Word (~version 12? Thomas says Word 16) to be installed locally. It converts a 5MB Word 2003 document with 16 images to HTML in under twelve seconds.
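If you drive Word directly over COM, another option is to ask Word itself to write UTF-8 by setting the document's WebOptions.Encoding property before saving. The sketch below is not taken from any of the scripts mentioned here; the function name and parameters are hypothetical, the numeric values are the documented WdSaveFormat.wdFormatHTML (8) and MsoEncoding.msoEncodingUTF8 (65001) constants, and it can only run on a machine with Microsoft Word installed.

```powershell
# Hypothetical helper; requires Microsoft Word (COM automation) to actually run.
function Save-WordDocumentAsUtf8Html {
    param(
        [Parameter(Mandatory)] $WordApp,              # a Word.Application COM object
        [Parameter(Mandatory)] [string] $SourcePath,  # full path to the .doc file
        [Parameter(Mandatory)] [string] $TargetPath   # full path of the .html file to write
    )

    $wdFormatHTML    = 8       # WdSaveFormat.wdFormatHTML
    $msoEncodingUTF8 = 65001   # MsoEncoding.msoEncodingUTF8

    $doc = $WordApp.Documents.Open($SourcePath)
    try {
        # Ask Word to encode the saved web page as UTF-8.
        $doc.WebOptions.Encoding = $msoEncodingUTF8
        $doc.SaveAs([ref]$TargetPath, [ref]$wdFormatHTML)
    }
    finally {
        # Close the document even if saving failed.
        $doc.Close()
    }
}

# Usage sketch (commented out because it needs Word):
# $wordApp = New-Object -ComObject Word.Application
# Save-WordDocumentAsUtf8Html -WordApp $wordApp -SourcePath 'C:\Users\Me\Desktop\TestWordDocs\a.doc' -TargetPath 'C:\Users\Me\Desktop\TestHTMLDocs\a.html'
# $wordApp.Quit()
```

If this works in your setup, Word should write the matching charset=utf-8 meta tag itself, so the search-and-replace step becomes unnecessary.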

Change your http-equiv meta element if necessary:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

to

<meta http-equiv=Content-Type content="text/html; charset=UTF-8"> for HTML 4 documents

or

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

to

<meta charset="UTF-8"> for HTML 5 documents.

A sitemap I created (012420) at xml-sitemaps.com used both:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">

Save / Create the document using the UTF-8 character encoding.

What works in PowerShell 5.1 might be easier in PowerShell 6 or later. Read the links below. Later versions of PowerShell default to the UTF-8 character encoding.

PowerShell 5.1:

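# Without overwriting: writes a new file and leaves the source untouched.
# gc reads a BOM-less file using the system ANSI code page (Windows-1252 on most Western systems).
# [IO.File]::WriteAllLines writes UTF-8 without a BOM.
$source = (gc $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8"
$output = "$env:userprofile\Desktop\output.html"
[IO.File]::WriteAllLines($output, $source)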

PowerShell 7rc.1:

# PowerShell 7 reads BOM-less files as UTF-8 by default, so the Windows-1252 source is read
# with an explicit -Encoding 1252; otherwise its special characters are lost.
# Without overwriting. Output is UTF-8.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\output.html
# With overwriting. Output is UTF-8. Run it only once per file.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\source.html

Batch convert with PowerShell 7rc.1:

# With overwriting. Output is UTF-8. Assumes every file is still Windows-1252, so run it only once.
foreach ($i in ls -name "$env:userprofile\Desktop\*.html")
{
    (gc -Encoding 1252 "$env:userprofile\Desktop\$i") -replace "charset=windows-1252", "charset=UTF-8" | out-file -force "$env:userprofile\Desktop\$i"
}

That should display your special characters correctly.
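To check the output for errors (job 3), one option is to decode the file with a strict UTF-8 decoder that rejects invalid byte sequences. This is a minimal sketch: the function name Test-Utf8File is hypothetical, and a file containing only ASCII passes because ASCII is valid UTF-8.

```powershell
# Hypothetical helper: returns $true when the file's bytes are valid UTF-8.
function Test-Utf8File {
    param([Parameter(Mandatory)] [string] $Path)   # full path to the file to check

    # First argument: do not emit a BOM. Second argument: throw on invalid bytes.
    $strictUtf8 = New-Object System.Text.UTF8Encoding $false, $true
    try {
        [void]$strictUtf8.GetString([System.IO.File]::ReadAllBytes($Path))
        return $true
    }
    catch {
        # Decoding failed, so the file is not valid UTF-8
        # (for example, it is still Windows-1252).
        return $false
    }
}
```

A file that was only relabelled with charset=UTF-8 but still holds Windows-1252 bytes for characters such as é will fail this check, which is exactly the case that produces � in the browser.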

Understanding file encoding

HTML Charset - W3Schools

Declaring character encodings in HTML

HTML http-equiv Attribute

Using PowerShell to write a file in UTF-8 without the BOM

Understanding file encoding 2

Understand default encoding and change the same in PowerShell

What version of PowerShell do you have? $PSVersionTable.PSVersion

Always declare the encoding of your document using a meta element with a charset attribute. The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag. How do you find the first 1024 bytes of an .html file in Windows 10 64-bit? Download http://unxutils.sourceforge.net/UnxUpdates.zip and use head -c 1024 myfilenamehere.html
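The same check can probably be done without downloading anything, by reading the first bytes through .NET from PowerShell. The helper below is a sketch with a made-up name (Get-FileHead); it decodes the bytes as ISO-8859-1 only so that every byte maps to one character and the ASCII markup stays readable whatever the file's real encoding is.

```powershell
# Hypothetical helper: return the first $Count bytes of a file as text.
function Get-FileHead {
    param(
        [Parameter(Mandatory)] [string] $Path,   # full path to the file
        [int] $Count = 1024                      # number of bytes to read
    )

    $buffer = New-Object byte[] $Count
    $read = 0
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Read may return fewer bytes than requested, so loop until done or EOF.
        while ($read -lt $Count) {
            $n = $stream.Read($buffer, $read, $Count - $read)
            if ($n -eq 0) { break }
            $read += $n
        }
    }
    finally {
        $stream.Dispose()
    }

    # ISO-8859-1 maps each byte to exactly one character.
    return [System.Text.Encoding]::GetEncoding('iso-8859-1').GetString($buffer, 0, $read)
}
```

For example, (Get-FileHead "$env:userprofile\Desktop\output.html") -match 'charset=' tells you whether a charset declaration appears within the first 1024 bytes.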

None of the following worked, but they should be read.

Changing PowerShell's default output encoding to UTF-8

Changing source files encoding and some fun with PowerShell

Convert Word documents using PowerShell

How to convert a word document to other formats using PowerShell

Saving Word document as HTML

Convert word document to text file using powershell

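Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.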

 