
Print each line of a PDF file with C#

I have a PDF file stored at a server URL, and I want to get each line of the file. Later I want to export it to an Excel file, so I need to read the lines one by one. My code is below. Note: the URL of the PDF stops working after 3 hours; I will keep it updated in the comments. Thanks.

using System;
using System.Net.Http;
using System.Threading.Tasks;
                    
    public class Program
    {
        public static async Task Main()
        {
                var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
                var client = new HttpClient();
                var response = await client.GetAsync(pdfUrl);
    
                using (var stream = await response.Content.ReadAsStreamAsync())
                {
                    Console.WriteLine("print each line of my pdf file");
                }
        }
    }

Extracting text from a PDF is not a trivial task. If you need a truly generic solution that works with any PDF, the state-of-the-art approach is to use an AI-based API, such as those offered by cloud platforms like Google, AWS, or Azure:

https://cloud.google.com/vision/docs/pdf

https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/

https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html

So: read the PDF as bytes, send the bytes to the external AI-based API, and receive the parsed content back.

Of course, you will need some setup to use the cloud services mentioned above, and they cost money.

The best way I can explain why you need a PDF decoder like pdftotext is this: the first line, when decoded by an application (this is not the raw byte stream), comes in three separate parts. Luckily they are stored as whole-word strings (they need not be), and, also luckily in this case, they all use the same ASCII font table.

BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET 

Once converted to ASCII, we can easily see that all three parts sit at the same height, 793.70, so a library can assume they form one line with three different horizontal offsets. Hence you need a third-party library to decode the fragments and reassemble them as if they were a single line string. That requires first saving the PDF as a file, then parsing the whole file, which mixes several common encodings such as ASCII, hex, and UTF-16 (there is generally no UTF-8), and saving the result as a plain text file with UTF-8 encoding. Then you can extract the UTF-8 lines as required.
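The reassembly step described above can be sketched in C#. This is only an illustration of the idea, not how pdftotext or any particular library works internally: the `Fragment` record, the `LineAssembler` class, and the 0.5 point tolerance are all hypothetical names and values chosen for the example, and the input is assumed to be already decoded text with the coordinates from each `Td` operator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one decoded text fragment with the position from its Td operator.
public record Fragment(double X, double Y, string Text);

public static class LineAssembler
{
    // Groups fragments whose baselines (Y) match within a tolerance, then
    // orders each group left to right and joins the pieces with a single space.
    public static List<string> Assemble(IEnumerable<Fragment> fragments, double tolerance = 0.5)
    {
        var lines = new List<(double Y, List<Fragment> Parts)>();
        foreach (var f in fragments)
        {
            var index = lines.FindIndex(l => Math.Abs(l.Y - f.Y) <= tolerance);
            if (index < 0)
                lines.Add((f.Y, new List<Fragment> { f }));
            else
                lines[index].Parts.Add(f);
        }

        // PDF y coordinates grow upward, so the largest Y is the top line of the page.
        return lines
            .OrderByDescending(l => l.Y)
            .Select(l => string.Join(" ", l.Parts.OrderBy(p => p.X).Select(p => p.Text.Trim())))
            .ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        // The three fragments from the content stream shown above.
        var fragments = new[]
        {
            new Fragment(42.52, 793.70, "Espelho de Valores Atualizados."),
            new Fragment(439.37, 793.70, "Data: "),
            new Fragment(481.89, 793.70, "05/07/2021"),
        };

        foreach (var line in LineAssembler.Assemble(fragments))
            Console.WriteLine(line);
    }
}
```

With the three fragments above this prints the single line `Espelho de Valores Atualizados. Data: 05/07/2021`. A real extractor also has to handle fonts, encodings, and spacing between fragments, which is why a dedicated tool is the practical choice.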

It is unclear what format of line output you are hoping for, since a PDF does not have numbered lines. However, if we allocate numbers to lines with text (and some without) based on the human concept of layout, we can do it in a few lines using Poppler utils and native OS text parsing. Here Cme.bat could have loops and arguments, but it is hardcoded for demonstration. Note that the console output would need a local chcp setting, but the text file is good.

Poppler\poppler-22.04.0\Library\bin>Cme.bat |more

@curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
@pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
@pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
@find /N /V "Never2BFound" page1.txt
@find /N /V "Never2BFound" page2.txt

responds:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3749  100  3749    0     0   4051      0 --:--:-- --:--:-- --:--:--  4052

---------- PAGE1.TXT
[1]Espelho de Valores Atualizados.                                    Data:   05/07/2021
[2]


page1.txt

Espelho de Valores Atualizados.                                    Data:   05/07/2021

PROCESSO         : 5018290-57.2021.4.04.9388
ORIGINÁRIO       : 5002262-05.2018.4.04.7000/PR
TIPO             : Precatório

REQUERENTE       : ERCILIA GRACIE RIBEIRO
ADVOGADO         : ANA PAULA HORIGUCHI - PR064269

REQUERIDO  : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4

DEPRECANTE       : Juízo Substituto da 10ª VF de Curitiba

etc.
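Since the question asks for C#, the same steps can be driven from a console program. The following is a minimal sketch, assuming pdftotext from Poppler is on the PATH and a .NET 5 or later runtime. The `PdfLines` class, the `NumberLines` helper, and the temp file names are hypothetical, and the short-lived URL is taken from the first command-line argument rather than hardcoded. The flags are the ones used in the batch file, minus `-f`/`-l`, so every page is converted.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PdfLines
{
    // Prefixes each line with [n], mirroring the `find /N` output format.
    public static IEnumerable<string> NumberLines(IEnumerable<string> lines) =>
        lines.Select((line, i) => $"[{i + 1}]{line}");

    public static async Task Main(string[] args)
    {
        // Assumption: the PDF URL is passed as the first argument.
        var pdfUrl = args[0];
        var pdfPath = Path.Combine(Path.GetTempPath(), "brtemp.pdf");
        var txtPath = Path.Combine(Path.GetTempPath(), "brtemp.txt");

        // Save the PDF to disk first, because pdftotext works on files.
        using (var client = new HttpClient())
        {
            var bytes = await client.GetByteArrayAsync(pdfUrl);
            await File.WriteAllBytesAsync(pdfPath, bytes);
        }

        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext", // assumed to be on PATH
            UseShellExecute = false,
        };
        foreach (var arg in new[] { "-nopgbrk", "-layout", "-enc", "UTF-8", pdfPath, txtPath })
            startInfo.ArgumentList.Add(arg);

        using (var process = Process.Start(startInfo))
        {
            await process.WaitForExitAsync();
            if (process.ExitCode != 0)
                throw new InvalidOperationException($"pdftotext exited with code {process.ExitCode}");
        }

        // Each line of the UTF-8 text file is now available one by one.
        foreach (var line in NumberLines(File.ReadLines(txtPath, Encoding.UTF8)))
            Console.WriteLine(line);
    }
}
```

From here each `line` could be split into columns and written to a spreadsheet instead of the console; how to split depends on the layout of the particular document.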
