
Convert HTML file into .CSV using PowerShell

I have an HTML file generated by a third party that gets e-mailed to me (and my group) daily. It contains a table of ID numbers, names, and multiple e-mail addresses where applicable. It is used to update group membership in AD, and I would like to do this in PowerShell since the group membership update portion is easy. Parsing the HTML file to pull out the e-mail addresses, which are also the AD usernames, is the tough part. I'm kind of stumped. I've tried using HtmlAgilityPack, which doesn't seem to work all that well for my purpose. If I could somehow get the data into a .CSV for ease of use, that would be great.

What I need is to either A) pull the e-mail addresses directly from the HTML and place them in a CSV file, or B) convert the HTML file to a .CSV that can then be parsed.

The reason is that this data comes in daily, so this will have to be automated.

Thanks!

Sample from the HTML file; all identifying info has been removed and/or adjusted:

<table>
<tr>
<td class=xl27>
<span class=font7>ID</span>
</td>
<td class=xl27>
<span class=font7>Name</span>
</td>
<td class=xl27>
<span class=font7>Primary E-Mail</span>
</td>
<td class=xl27>
<span class=font7>Alternate E-Mail</span>
</td>
</tr>
<tr>
<td class=xl28>
<span class=font8>00000000</span>
</td>
<td class=xl28>
<span class=font8>Smith,John R</span>
</td>
<td class=xl28>
<span class=font8></span>
</td>
<td class=xl28>
<span class=font8>John_Smith@addr</span>
</td>
</tr>

Here is the beginning of a solution, though not a polished one. It assumes that HtmlAgilityPack.dll is in an Html-Agility-Pack directory next to the script file.

# Load the Html Agility Pack assembly from a folder next to this script.
Add-Type -Path "$(Split-Path -parent $PSCommandPath)\Html-Agility-Pack\HtmlAgilityPack.dll"

$webGraber = New-Object -TypeName HtmlAgilityPack.HtmlWeb
$webDoc = $webGraber.Load("C:\temp\t.htm")
$trDatas = $webDoc.DocumentNode.ChildNodes.Elements("tr")

# Start from a clean file; ignore the error if it does not exist yet.
Remove-Item "c:\temp\t.csv" -ErrorAction SilentlyContinue

foreach ($trData in $trDatas)
{
  $tdDatas = $trData.Elements("td")
  # Quote every cell so that values such as "Smith,John R" stay in one column.
  $cells = foreach ($tdData in $tdDatas)
  {
    '"' + $tdData.InnerText.Trim().Replace('"', '""') + '"'
  }
  $cells -join ',' | Out-File -FilePath "c:\temp\t.csv" -Append
}
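Once a CSV like this exists, the e-mail addresses still have to be pulled back out of it. The following is only a sketch of that step, not part of the answer above: it parses inline sample text with `ConvertFrom-Csv` so that it runs on its own, and it assumes the file has the header row from the sample table and that the Name field is quoted. The variable names (`$csvText`, `$records`, `$emails`) are illustrative.

```powershell
# Sample CSV text standing in for the exported file. The quoting of the
# Name field is an assumption about how the file was written.
$csvText = @'
ID,Name,Primary E-Mail,Alternate E-Mail
00000000,"Smith,John R",,John_Smith@addr
'@

# ConvertFrom-Csv parses text; Import-Csv -Path does the same for a file on disk.
$records = $csvText | ConvertFrom-Csv

# Collect every non-empty address from both e-mail columns.
$emails = foreach ($record in $records) {
    foreach ($column in 'Primary E-Mail', 'Alternate E-Mail') {
        $address = $record.$column
        if ($address) { $address.Trim() }
    }
}

# De-duplicate in case the same address appears in both columns.
$emails = $emails | Sort-Object -Unique
$emails
```

For the real file, replacing the here-string with `Import-Csv -Path "c:\temp\t.csv"` should give the same kind of objects, provided the file was written with quoted fields.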

I hesitate to post this answer as it is extremely specific to this case, but this can be accomplished with simple string methods. First get the content of the HTML file:

$htmlContent = Get-Content -Path 'thePath\andFile.html'

Next, select the strings from the HTML data that contain the values you are looking for. This part is absolutely specific to the structure of your HTML:

$stringsWithDesiredValues = $htmlContent.Where({$_ -like '*<span class=font8>*'})

Now we can use a foreach loop and the indices of '>' and '<' to get a substring containing only the desired value.

foreach ($htmlString in $stringsWithDesiredValues) {
    # The value sits between the first '>' and the last '<' on the line.
    $firstIndex = $htmlString.IndexOf('>') + 1
    $lastIndex = $htmlString.LastIndexOf('<')
    $lengthOfSubstring = $lastIndex - $firstIndex
    $desiredValue = $htmlString.Substring($firstIndex, $lengthOfSubstring)
    $desiredValue
}

Of course I'm not doing anything with the desired value here, but this script will write out the values so you can see that they are correct. You can capture those values in the loop and do with them what you will. An ugly solution to be sure; I only posted it because no other answers had been suggested.
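One way to capture the values and turn them into CSV is sketched below. It is not the answerer's code: it assumes every table row contributes exactly four `font8` cells in column order (ID, name, primary e-mail, alternate e-mail), and the property names and `$columnsPerRow` are illustrative. The sample lines are copied from the question so the block runs on its own.

```powershell
# Sample lines standing in for Get-Content output (taken from the question's HTML).
$htmlContent = @(
    '<td class=xl28>',
    '<span class=font8>00000000</span>',
    '</td>',
    '<td class=xl28>',
    '<span class=font8>Smith,John R</span>',
    '</td>',
    '<td class=xl28>',
    '<span class=font8></span>',
    '</td>',
    '<td class=xl28>',
    '<span class=font8>John_Smith@addr</span>',
    '</td>'
)

# Same substring extraction as above, but collecting the values instead of printing them.
$values = foreach ($htmlString in ($htmlContent -like '*<span class=font8>*')) {
    $firstIndex = $htmlString.IndexOf('>') + 1
    $lastIndex  = $htmlString.LastIndexOf('<')
    $htmlString.Substring($firstIndex, $lastIndex - $firstIndex)
}

# Assumption: every row contributes exactly four font8 cells, in column order.
$columnsPerRow = 4
$rows = for ($i = 0; $i + $columnsPerRow -le $values.Count; $i += $columnsPerRow) {
    [pscustomobject]@{
        ID             = $values[$i]
        Name           = $values[$i + 1]
        PrimaryEmail   = $values[$i + 2]
        AlternateEmail = $values[$i + 3]
    }
}

# ConvertTo-Csv handles quoting, so "Smith,John R" stays in a single column.
$csvLines = $rows | ConvertTo-Csv -NoTypeInformation
$csvLines
```

To write a file instead of printing, `$rows | Export-Csv -Path <your path> -NoTypeInformation` does the same conversion. If the report ever contains a row with a missing cell, the fixed grouping by four will drift, which is one reason a real HTML parser is the sturdier choice.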
