
Java SAX parser: child elements that have the same tag as their parent

I'm trying to crawl data from a website that has a list of items, each wrapped in a div tag. Within a single item, two separate parts are also built with div tags: one with the image, and one with the text and description. In startElement I can identify them by their attributes, but in endElement I can't tell which div is closing, so I can't detect the end of the item. How can I parse items whose children use the same tag?

Example of an item I want to crawl:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>JSP Page</title>
</head>
<body>
    <div class="o-ResultCard__m-MediaBlock m-MediaBlock">
        <div class="m-MediaBlock__m-TextWrap">
            <h3 class="m-MediaBlock__a-Headline">
                <a href="abc.com"><span class="m-MediaBlock__a-HeadlineText">Air Fryer Chicken Wings</span></a>
            </h3>
            <div class="parbase recipeInfo time">
                <section class="o-RecipeInfo__o-Time">
                    <dl>
                        <dt class="o-RecipeInfo__a-Headline a-Headline">Total Time: 40 minutes</dt>
                    </dl>
                </section>
            </div>
        </div>
        <div class="m-MediaBlock__m-MediaWrap">
            <a href="abc.com" class="" title="Air Fryer Chicken Wings">
                <img src="https://dinnerthendessert.com/wp-content/uploads/2019/01/Fried-Chicken-2.jpg" class="m-MediaBlock__a-Image" alt="Air Fryer Chicken Wings">
            </a>
        </div>
    </div>
</body>

My handler:

private String currentTag;
private FoodDAO dao;
private FoodsDTO dto;
private String itemIdentify = "o-ResultCard__m-MediaBlock m-MediaBlock";
private String itemMedia = "m-MediaBlock__m-MediaWrap";
private String itemText = "m-MediaBlock__m-TextWrap";
private boolean foundItem;

public FoodHandler() {
    dao = new FoodDAO();
    foundItem = false;
}

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    String attrVal = attributes.getValue(0);

    if (qName.equals("div") && attrVal.equals(itemIdentify)) {
        dto = new FoodsDTO();
        foundItem = true;
    }
    currentTag = qName;
}

@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    if (qName.endsWith("div")) {
        foundItem = false;
        try {
            dao.manageCrawl(dto);
        } catch (Exception e) {
            Logger.getLogger(FoodHandler.class.getName()).log(Level.SEVERE, null, e);
        }
    }
    currentTag = "";
}

Store the attributes in a stack.

More specifically, store a copy of the attributes in a Deque, so that endElement can pop the attributes that belong to the element being closed:

private Deque<Attributes> attributesStack = new ArrayDeque<>();

@Override
public void startDocument() throws SAXException {
    // Clear the stack at start of parsing, in case this handler is
    // re-used for multiple parsing operations, and previous parse failed.
    attributesStack.clear();
}

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    attributesStack.push(new AttributesImpl(attributes)); // Attributes must be copied
    
    // code here
}

@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    Attributes attributes = attributesStack.pop();
    
    // code here
}
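
To show how the stack fits the question's case, here is a minimal sketch that fills in the "code here" parts. It is only an illustration under a few assumptions: the page has already been cleaned into well-formed XML (the sample's unclosed img and meta tags would otherwise make an XML parser fail), the class name ItemDivHandler and the parse helper are made up for this example, and a plain list of titles stands in for the FoodsDTO / dao.manageCrawl(dto) call from the question.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one headline per item div.
class ItemDivHandler extends DefaultHandler {
    private static final String ITEM_CLASS = "o-ResultCard__m-MediaBlock m-MediaBlock";
    private static final String HEADLINE_CLASS = "m-MediaBlock__a-HeadlineText";

    private final Deque<Attributes> attributesStack = new ArrayDeque<>();
    private final List<String> titles = new ArrayList<>(); // stands in for the DAO call
    private final StringBuilder text = new StringBuilder();
    private String currentTitle;

    @Override
    public void startDocument() {
        // Reset state in case the handler is re-used after a failed parse.
        attributesStack.clear();
        titles.clear();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        attributesStack.push(new AttributesImpl(attributes)); // Attributes must be copied
        text.setLength(0);
        // Look the attribute up by name; getValue returns null when it is absent.
        if ("div".equals(qName) && ITEM_CLASS.equals(attributes.getValue("class"))) {
            currentTitle = null; // a new item begins
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The popped attributes belong to the element that is closing right now.
        Attributes attributes = attributesStack.pop();
        String cssClass = attributes.getValue("class");
        if ("span".equals(qName) && HEADLINE_CLASS.equals(cssClass)) {
            currentTitle = text.toString().trim();
        } else if ("div".equals(qName) && ITEM_CLASS.equals(cssClass)) {
            titles.add(currentTitle); // only the outer item div reaches this branch
        }
        text.setLength(0);
    }

    List<String> getTitles() {
        return titles;
    }

    // Convenience helper for the example: parse an XML string and return the titles.
    static List<String> parse(String xml) throws Exception {
        ItemDivHandler handler = new ItemDivHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }
}
```

Because every startElement pushes and every endElement pops, the inner TextWrap and MediaWrap divs pop their own attributes, and the item is only stored when the div carrying the item class closes.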
