
Extract JSON-LD from HTML using Apache Any23

My aim is to extract structured data from webpages. I'm using the code mentioned in this SO question, with the Apache Any23 CLI library as a dependency in my Spring project.

With this setup I can extract the HTML5 Microdata (Schema.org) from webpages, but I can't extract the JSON-LD that is present in the same pages. According to the Apache Any23 documentation, the JSON-LD format is supported, but I couldn't find any further documentation on how to use it.

Usually, if you create a new Any23 extractor with new Any23(), it should work out of the box. If you use another constructor, such as Any23(String... extractorNames), you have to make sure that the extractor for embedded JSON-LD is included in the list. Its name is "html-embedded-jsonld".
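As a rough illustration of the explicit-extractor case, here is a minimal sketch that assumes the Any23 core library is on the classpath. The class name EmbeddedJsonLdDemo, the method extractNTriples, the sample HTML, and the example.org URI are all made up for this example. The sample uses an inline @context on purpose, so that no remote context has to be fetched during parsing.

```java
import java.io.ByteArrayOutputStream;

import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public class EmbeddedJsonLdDemo {

    // Hypothetical helper: runs only the embedded JSON-LD extractor over an
    // HTML string and returns whatever triples it produced as N-Triples.
    public static String extractNTriples(String html, String documentUri) throws Exception {
        // Explicit extractor list: only the names given here are used,
        // so "html-embedded-jsonld" has to be among them.
        Any23 runner = new Any23("html-embedded-jsonld");

        DocumentSource source = new StringDocumentSource(html, documentUri, "text/html");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            // The report says whether any configured extractor matched the input.
            if (!runner.extract(source, handler).hasMatchingExtractors()) {
                System.err.println("No extractor matched the document.");
            }
        } finally {
            handler.close();
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Sample page with an inline @context (an assumption of this sketch).
        String html =
              "<html><head><title>Demo</title>"
            + "<script type=\"application/ld+json\">"
            + "{\"@context\": {\"@vocab\": \"http://schema.org/\"},"
            + " \"@type\": \"Person\", \"name\": \"Jane Doe\"}"
            + "</script></head><body></body></html>";

        System.out.println(extractNTriples(html, "http://example.org/page"));
    }
}
```

If the output is empty even though the page clearly contains a JSON-LD script block, the extractor most likely hit an error that was swallowed along the way.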

Note that if any errors occur during the extraction process, Any23 drops them silently. (It's great, I know!)

I found that it is possible to set a breakpoint in the notifyIssue method of org.apache.any23.extractor.ExtractionResultImpl. With this you may be able to find a more detailed reason for your problems.
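A complementary check that needs no debugger is to confirm that the HTML you pass to Any23 actually contains JSON-LD script blocks. Any23 does not execute JavaScript, so markup that is injected at runtime will not be visible to it. The sketch below uses only the JDK. The class JsonLdProbe and its regular expression are illustrative assumptions, not part of Any23, and a regex is only a rough probe rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonLdProbe {

    // Matches <script ... type="application/ld+json" ...> ... </script>,
    // case-insensitively and across line breaks.
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
        "<script\\b[^>]*\\btype\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body of every JSON-LD script block found in the HTML.
    public static List<String> findJsonLdBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = JSON_LD_SCRIPT.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html =
              "<html><head>"
            + "<script type=\"application/ld+json\">{\"@type\": \"Person\"}</script>"
            + "<script type=\"text/javascript\">var x = 1;</script>"
            + "</head><body></body></html>";

        List<String> blocks = findJsonLdBlocks(html);
        System.out.println("JSON-LD blocks found: " + blocks.size());
        for (String block : blocks) {
            System.out.println(block);
        }
    }
}
```

If this finds no blocks in the HTML you fetched, the problem lies upstream of Any23. If it does find blocks, the breakpoint approach is the next step.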
