i have a pdf file stored in a server url, and i want to get each line of the file, i want later export it to an excel file so i need to get every line, one by one, i will put the code here. OBS: the url of the pdf stop working after 3 hours, i will always update it here in the comments. thanks.
using System;
using System.Net.Http;
using System.Threading.Tasks;
public class Program
{
public static async Task Main()
{
var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
var client = new HttpClient();
var response = await client.GetAsync(pdfUrl);
using (var stream = await response.Content.ReadAsStreamAsync())
{
Console.WriteLine("print each line of my pdf file");
}
}
}
Well, extracting text from PDF is not an ordinary task. If you need really generic solution works with any pdf, then state of art solution here is to use AI based API provided for example by some cloud platforms like Google, AWS or Azure:
https://cloud.google.com/vision/docs/pdf
https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/
So, read pdf as bytes, send bytes to external AI based API, receive parsed content back.
Of course, you will need to do some preparation to use cloud services mentioned above and also it costs some money
How can I best explain why you need a pdf decompressor like pdftotext, is that, the first line when decoded by an app (this is not the raw byte stream) comes in three separate parts. Luckily as whole word strings (they do not need to) and also luckily in this case from the same ascii font table.
BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET
so we can easily see when converted into ascii that all three parts are at level 793.70 thus a lib can assume they are one line with only 3 different offsets, hence you need a 3rd party lib to decode and reassemble a line of text as if it is just one line string. That requires first save pdf as file, parse the whole file into several common encodings like ascii, hex and UTF-16 mixed (there is generally no UTF-8) then save those as a plain text file with UTF-8 encoding, Then you can extract the UTF-8 lines as required.
Unclear what format of line output you are hoping for since a PDF does not have numbered lines, however if we allocate numbers to lines with text (and some without) based on Human concept of Layout we can run a few lines using poppler utils and native OS text parsing. Here Cme could have loops and arguments, but hardcoded for demonstration. Note the console output would need local chcp but the text file is good
Poppler\poppler-22.04.0\Library\bin>Cme.bat |more
@curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
@pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
@pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
@find /N /V "Never2BFound" page1.txt
@find /N /V "Never2BFound" page2.txt
responds
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3749 100 3749 0 0 4051 0 --:--:-- --:--:-- --:--:-- 4052
---------- PAGE1.TXT
[1]Espelho de Valores Atualizados. Data: 05/07/2021
[2]
Page 1.txt
Espelho de Valores Atualizados. Data: 05/07/2021
PROCESSO : 5018290-57.2021.4.04.9388
ORIGINÁRIO : 5002262-05.2018.4.04.7000/PR
TIPO : Precatório
REQUERENTE : ERCILIA GRACIE RIBEIRO
ADVOGADO : ANA PAULA HORIGUCHI - PR064269
REQUERIDO : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4
DEPRECANTE : Juízo Substituto da 10ª VF de Curitiba
etc.....
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.