C# iText 7/iTextSharp: how to find the coordinates of a particular term in a PDF file?
I am using iText 7/iTextSharp in C#.
How can I find the coordinates of every occurrence of a particular word?
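The way to find the position of a search term (or, more generally, a search expression) in iText 7 is the RegexBasedLocationExtractionStrategy. It lets you search a page for all matches of a regular expression and returns those matches, including the exact matched text and its position on the page: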
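using System;
using System.Globalization;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// SEARCH_EXPRESSION is the regular expression to look for, defined elsewhere.
PdfDocument pdfDocument = ...
for (int page = 1; page <= pdfDocument.GetNumberOfPages(); page++)
{
    Console.WriteLine("Page {0}", page);
    // Use a fresh strategy per page so results do not carry over between pages.
    RegexBasedLocationExtractionStrategy strategy = new RegexBasedLocationExtractionStrategy(SEARCH_EXPRESSION);
    new PdfCanvasProcessor(strategy).ProcessPageContent(pdfDocument.GetPage(page));
    foreach (IPdfTextLocation location in strategy.GetResultantLocations())
    {
        if (location != null)
        {
            // The rectangle is the bounding box of the matched text on the page.
            Rectangle rect = location.GetRectangle();
            Console.WriteLine(String.Format(CultureInfo.InvariantCulture, " - '{0}' at ({1}, {2}), {3}\u00d7{4}", location.GetText(), rect.GetX(), rect.GetY(), rect.GetWidth(), rect.GetHeight()));
        }
    }
}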
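For example, consider the PDF ENaB 20180317.pdf originally shared by the OP of this question.
Its tables contain multiple occurrences of "SAN XXX" with varying XXX. Applying the code above with a corresponding regular expression to that file results in: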
Page 1
- 'SAN IGNACIO' at (183.6127, 81.85992), 45.49265×9.500404
- 'SAN CERNIN' at (260.1665, 203.9058), 42.94177×9.500397
- 'SAN IGNACIO' at (183.6058, 244.58), 45.49265×9.500397
- 'SAN CERNIN' at (239.0477, 244.58), 42.94247×9.500397
- 'SAN JORGE' at (392.0537, 298.8239), 40.27756×9.500397
- 'SAN CERNIN' at (183.6128, 407.3039), 42.93692×9.500397
- 'SAN IGNACIO' at (183.6058, 434.42), 45.49265×9.500397
- 'SAN IGNACIO' at (392.0432, 434.42), 45.48703×9.500397
Page 2
- 'SAN ADRIAN' at (279.3961, 136.1134), 42.57495×9.500397
- 'SAN CERNIN' at (183.6475, 149.6715), 42.93692×9.500397
- 'SAN CERNIN' at (392.0876, 149.6715), 42.94247×9.500397
- 'SAN CERNIN' at (183.6127, 231.0199), 42.9418×9.500397
- 'SAN CERNIN' at (392.0528, 231.0199), 42.94247×9.500397
- 'SAN IGNACIO' at (308.1654, 312.3896), 45.48703×9.500397
- 'SAN ADRIAN' at (472.0908, 339.5058), 42.57428×9.500397
- 'SAN CERNIN' at (263.1662, 380.18), 42.94247×9.500397
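The search expression that produced the output above is not shown here. As a rough sketch only, a pattern such as "SAN [A-Z]+" (an assumption, not taken from the original answer) would match terms of this shape. The snippet below uses plain .NET Regex on a made-up sample string, without iText, just to show what such an expression picks up; the class name, method name, and sample text are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchExpressionDemo
{
    // Hypothetical pattern: the literal "SAN ", followed by one or more
    // uppercase ASCII letters. Not confirmed by the original answer.
    public const string SearchExpression = @"SAN [A-Z]+";

    // Returns the text of every match of SearchExpression in the input.
    public static List<string> FindMatches(string text)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(text, SearchExpression))
        {
            results.Add(match.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Made-up sample text, not extracted from the PDF.
        string sample = "SAN IGNACIO - SAN CERNIN vs. SAN JORGE";
        foreach (string hit in FindMatches(sample))
        {
            Console.WriteLine(hit);
        }
    }
}
```

If a pattern like this fits your document, the same string could be passed as SEARCH_EXPRESSION to the RegexBasedLocationExtractionStrategy constructor used in the loop above. Whether it matches as expected depends on how the text is laid out in the page content.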