Extracting text from an online PDF file (SharePoint)
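I know similar questions have been asked, but I couldn't find a solution to my problem. Basically, I want to extract text from an online PDF file using C#, which can be done with the iTextSharp library. However, this only works for random PDF files that I find by searching Google. My goal is to do the same against a private SharePoint account that stores multiple PDF files. My Chrome browser is set to remember the username and password, but my code still fails. Authentication doesn't look like the problem to me, but I may be wrong. Here is the code: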
public static string pdfTX(string path)
{
    CookieContainer cookieJar = new CookieContainer();
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(path);
    request.CookieContainer = cookieJar; // attach the jar so cookies set by the server are kept
    request.Proxy.Credentials = CredentialCache.DefaultCredentials;
    request.UserAgent = @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";
    request.Credentials = CredentialCache.DefaultCredentials;
    // request.Credentials = new NetworkCredential(uName, pWord);
    Thread.Sleep(3000);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    // Reuse the response obtained above rather than calling GetResponse() again
    Stream stream = response.GetResponseStream();
    string textPDF = string.Empty;
    PdfReader reader = new PdfReader(stream); //error here (iTextSharp.text.exceptions.InvalidPdfException: 'PDF header signature not found.')
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        textPDF += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    response.Close();
    // readStream.Close();
    // public string passtext = text;
    return textPDF;
}
I'd be glad for any help or information! Thank you for your time and effort!
Here is my CSOM test code (if you read the file with a stream, you get the same error).
using System;
using System.Linq;
using System.Security;
using Microsoft.SharePoint.Client;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        string login = "lee@domain.onmicrosoft.com"; //give your username here
        using (var context = new ClientContext("https://domain.sharepoint.com/sites/tst"))
        {
            // SharePointOnlineCredentials expects the password as a SecureString
            string password = "pw";
            SecureString sec_pass = new SecureString();
            Array.ForEach(password.ToArray(), sec_pass.AppendChar);
            sec_pass.MakeReadOnly();
            context.Credentials = new SharePointOnlineCredentials(login, sec_pass);

            var path = "/sites/tst/mydoc3/beginning_sharepoint_2013_development.pdf";
            var file = context.Web.GetFileByServerRelativeUrl(path);
            context.Load(file);
            context.ExecuteQuery();

            // The stream is only populated once ExecuteQuery() has run
            ClientResult<System.IO.Stream> data = file.OpenBinaryStream();
            context.ExecuteQuery();

            string textPDF = string.Empty;
            using (System.IO.MemoryStream mStream = new System.IO.MemoryStream())
            {
                if (data != null)
                {
                    // Copy to memory and pass a byte array to PdfReader instead of the stream
                    data.Value.CopyTo(mStream);
                    byte[] array = mStream.ToArray();
                    PdfReader reader = new PdfReader(array);
                    for (int page = 1; page <= reader.NumberOfPages; page++)
                    {
                        textPDF += PdfTextExtractor.GetTextFromPage(reader, page);
                    }
                    reader.Close();
                }
            }
        }
        Console.WriteLine("done");
        Console.ReadKey();
    }
}
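If `PdfReader` still reports `PDF header signature not found`, it can help to look at what was actually downloaded before parsing it. A PDF file begins with the marker `%PDF-`, so when that marker is missing, the bytes are probably something else, for example an HTML sign-in or error page returned because the request was not authenticated. The following is a minimal sketch of such a check. `PdfSniffer`, `LooksLikePdf` and `Preview` are hypothetical names introduced here for illustration and are not part of iTextSharp or CSOM, and the 1024-byte search window is an assumption chosen for this sketch.

```csharp
using System;
using System.Text;

// Hypothetical diagnostic helper, not part of iTextSharp or CSOM.
static class PdfSniffer
{
    // Returns true if the "%PDF-" marker appears within the first 1024 bytes.
    // The 1024-byte window is an assumption made for this sketch.
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        int limit = Math.Min(data.Length, 1024) - magic.Length;
        for (int i = 0; i <= limit; i++)
        {
            int j = 0;
            while (j < magic.Length && data[i + j] == magic[j])
            {
                j++;
            }
            if (j == magic.Length)
            {
                return true;
            }
        }
        return false;
    }

    // Decodes the first few bytes as text so an HTML login or error page
    // can be recognised by eye.
    public static string Preview(byte[] data, int count = 200)
    {
        return Encoding.UTF8.GetString(data, 0, Math.Min(count, data.Length));
    }
}
```

In the CSOM code above, this could be called on `array` just before `new PdfReader(array)`: if `PdfSniffer.LooksLikePdf(array)` is false, printing `PdfSniffer.Preview(array)` shows what the server really sent.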