
Use PowerShell to find SSNs in Word and Excel documents

I am new to PowerShell and have a small amount of Linux bash scripting experience. I have been looking for a way to list the files on a server that contain Social Security numbers. I found the commands below in my research, and they did exactly what I wanted when I tested them on my home computer, except that they returned no results from my Word and Excel test documents. Is there a way to use a PowerShell command to get results from the various Office documents as well? The files on this server are almost all Word and Excel documents, with a few PowerPoints.

PS C:\Users\Stephen> Get-ChildItem -Path C:\Users -Recurse -Exclude *.exe, *.dll | `
Select-String "\d{3}[- ]\d{2}[- ]\d{4}"

Documents\\SSN:1:222-33-2345
Documents\\SSN:2:111-22-1234
Documents\\SSN:3:111 11 1234

PS C:\Users\Stephen> Get-ChildItem -Recurse | ?{ findstr.exe /mprc:. $_.FullName } | `
Select-String "[0-9]{3}[- ][0-9]{2}[- ][0-9]{4}"

Documents\\SSN:1:222-33-2345
Documents\\SSN:2:111-22-1234
Documents\\SSN:3:111 11 1234

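Select-String reads files as plain text, and .docx and .xlsx files are zip-compressed packages, so the regex never sees the document text. When interacting with MS Office files, the most reliable way is generally to use the applications' COM interfaces to grab the information you need.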

If you are new to PowerShell, COM will definitely be something of a learning curve, as very little beginner-level documentation exists on the internet.

Therefore I strongly advise starting small:

  • First, focus on opening a single Word document and reading its contents into a string.
  • Once that works, focus on extracting the relevant information (the PowerShell -match operator is very helpful).
  • Once you can work with a single Word document, find all files named *.docx in a folder and repeat the process on each one: foreach ($file in (ls *.docx)) { # work on $file }
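The three steps above could be sketched roughly as follows. This is only a sketch under assumptions: Microsoft Word must be installed on the machine running the script, the folder C:\Temp\Docs is a made-up example path, and the pattern is a tightened variant of the one in the question. It uses [regex]::Matches rather than -match so that every hit in a document is reported, not just the first.

```powershell
# Sketch only: assumes Microsoft Word is installed (the COM class comes from Word itself).
$pattern = '\b\d{3}[- ]\d{2}[- ]\d{4}\b'   # SSN-like: 3-2-4 digits, dash or space separated
$folder  = 'C:\Temp\Docs'                  # hypothetical folder, replace with your own

$word = New-Object -ComObject Word.Application
$word.Visible = $false
$word.DisplayAlerts = 0                    # 0 = wdAlertsNone, suppress dialogs

try {
    # Skip the "~$..." lock files Word leaves next to documents that are open.
    $files = Get-ChildItem -Path $folder -Filter *.docx -Recurse |
        Where-Object { $_.Name -notlike '~$*' }

    foreach ($file in $files) {
        # Documents.Open(FileName, ConfirmConversions, ReadOnly)
        $doc = $word.Documents.Open($file.FullName, $false, $true)
        try {
            # Step 1: read the main body text into a string.
            $text = $doc.Content.Text

            # Step 2: extract every SSN-like match from that string.
            foreach ($m in [regex]::Matches($text, $pattern)) {
                [pscustomobject]@{ File = $file.FullName; Match = $m.Value }
            }
        }
        finally {
            $doc.Close()                   # opened read-only, nothing to save
        }
    }
}
finally {
    # Always shut Word down, otherwise WINWORD.EXE processes pile up.
    $word.Quit()
    [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
}
```

Note that Content.Text covers the main document body only; text in headers, footers, or text boxes would need extra handling.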

Here's some reading (admittedly, all of this is for Excel, as I build automated Excel charting tools, but the lessons will be very helpful for automating any Office application).
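Since the question also covers Excel workbooks, the same COM pattern might look roughly like this for a single .xlsx file. It is a sketch under assumptions: Excel must be installed, and the path C:\Temp\Docs\sample.xlsx is a made-up example. It checks each cell's displayed text, so a number formatted to look like an SSN is compared as shown on screen.

```powershell
# Sketch only: assumes Microsoft Excel is installed.
$pattern = '\b\d{3}[- ]\d{2}[- ]\d{4}\b'
$path    = 'C:\Temp\Docs\sample.xlsx'      # hypothetical file, replace with your own

$excel = New-Object -ComObject Excel.Application
$excel.Visible = $false
$excel.DisplayAlerts = $false

try {
    # Workbooks.Open(Filename, UpdateLinks, ReadOnly)
    $workbook = $excel.Workbooks.Open($path, 0, $true)
    try {
        foreach ($sheet in $workbook.Worksheets) {
            # Walking cell by cell is slow on big sheets but easy to follow.
            foreach ($cell in $sheet.UsedRange.Cells) {
                $text = [string]$cell.Text     # the text as displayed in the cell
                if ($text -match $pattern) {
                    [pscustomobject]@{
                        File  = $path
                        Sheet = $sheet.Name
                        Cell  = $cell.Address($false, $false)   # e.g. B7 instead of $B$7
                        Match = $Matches[0]
                    }
                }
            }
        }
    }
    finally {
        $workbook.Close($false)            # close without saving
    }
}
finally {
    $excel.Quit()
    [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($excel)
}
```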

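If you only need to cover docx and xlsx files, you might also consider simply unzipping them and searching the contents while ignoring any XML tags (that is, allowing one or more XML elements between each digit).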
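A minimal sketch of that idea, without needing Office installed. It uses the .NET ZipFile class to read the XML parts inside the package. As a simplification, it strips all XML tags before matching instead of building a pattern that tolerates elements between digits; this is easier to read but can occasionally join text from neighbouring elements. The function name Find-SsnInOfficePackage is hypothetical.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: searches the XML parts of a .docx/.xlsx/.pptx package.
function Find-SsnInOfficePackage {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [string] $Pattern = '\b\d{3}[- ]\d{2}[- ]\d{4}\b'
    )

    # .NET does not follow the PowerShell location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try {
        foreach ($entry in $zip.Entries) {
            if ($entry.FullName -notlike '*.xml') { continue }

            $reader = New-Object System.IO.StreamReader($entry.Open())
            try { $xml = $reader.ReadToEnd() } finally { $reader.Dispose() }

            # Simplification: drop every tag, then match on what is left.
            $text = $xml -replace '<[^>]+>', ''

            foreach ($m in [regex]::Matches($text, $Pattern)) {
                [pscustomobject]@{ File = $fullPath; Part = $entry.FullName; Match = $m.Value }
            }
        }
    }
    finally {
        $zip.Dispose()
    }
}

# Example use (the folder is an assumption):
# Get-ChildItem -Path C:\Users -Recurse -Include *.docx, *.xlsx, *.pptx |
#     ForEach-Object { Find-SsnInOfficePackage -Path $_.FullName }
```

One caveat: Excel stores a cell's underlying value rather than its formatted display, so a number that only looks like an SSN because of cell formatting appears in the XML as plain digits and would not match this pattern.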
