Extracting data from an HTML file with Java

I have an HTML file full of addresses that I need to extract. It looks like this, but with about 60 streets and multiple numbers on each street:

    <BR>
    <Font Color=#FF0000 Size=7>MACARTNEY STREET (L)</Font>
    <BR>
    <BR>
    10........<Font Color=#FFFFFF> CM </Font>
    <BR>
    <BR>
    15........<Font Color=#FF0000> SH </Font>
    <BR>
    <BR>
    43A.......<Font Color=#FFFFFF> CM </Font>
    <BR>

I've been using regex to extract the data, which works a treat for getting the street names:

    final Pattern STREETNAME = Pattern.compile("<Font Color=#FF0000 Size=7>(.+?)</Font>");
    Matcher stMatcher;
    while ((line = reader.readLine()) != null) {
        stMatcher = STREETNAME.matcher(line);
        if (stMatcher.find()) {
            String street = stMatcher.group(1);
            customerList.add(new Customer(street));
        }
    }
    // customerList is an array of Customer objects, defined elsewhere in the program

But no matter what I try, I just can't get it to read the house numbers (10, 15 and 43A in the example).

Ideally I would store the street name string, extract the house number and concatenate the two before creating the Customer object. I'll need to check for the CM or SH marker on each line as well, but that can wait.

Does anyone have an idea that might help? I'm pretty stumped right now.

Thanks!

It's easy. I also changed your existing regexp so that it matches any color definition. This should work and capture the numbers along with the additional CM or SH:

    final Pattern STREETNAME = Pattern.compile("<Font Color=.* Size=7>(.+?)</Font>");
    // \\.* consumes the run of dots between the house number and the Font tag
    final Pattern STREETNUMBER = Pattern.compile("^(\\d[^\\.]*)\\.*<Font Color=.*>\\s*(.+?)\\s*</Font>");
    Matcher stMatcher;
    String street = null;          // street name seen most recently
    Customer lastCustomer = null;  // customer created most recently
    while ((line = reader.readLine()) != null) {
        stMatcher = STREETNAME.matcher(line);
        if (stMatcher.find()) {
            street = stMatcher.group(1);
        } else {
            stMatcher = STREETNUMBER.matcher(line);
            if (stMatcher.find() && street != null) {
                // create one Customer per house number so list entries do not share an object
                lastCustomer = new Customer(street);
                lastCustomer.setStreetNumber(stMatcher.group(1));
                lastCustomer.setStreetCmSh(stMatcher.group(2));
                customerList.add(lastCustomer);
            }
        }
    }

How does it work?

  • The pattern looks for a digit (\\d)
  • at the beginning (^) of a line,
  • then accepts any number (*) of characters up to the first dot, that is, anything other than a dot ([^\\.]),
  • and puts the result in group 1.

Group 2 is captured without the surrounding whitespace (\\s).
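
To see which text ends up in each group, the following minimal sketch runs a house-number pattern against lines from the question's sample. It is an illustration under assumptions, not the original program: the class name `StreetNumberDemo` and the `parse` helper are made up, the pattern tolerates leading whitespace, and it includes an explicit `\\.*` to skip the run of dots before the `<Font>` tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original program.
class StreetNumberDemo {
    // Number at the start of the line, then the filler dots, then the Font tag holding CM/SH.
    static final Pattern STREETNUMBER = Pattern.compile(
            "^\\s*(\\d[^.]*)\\.*<Font Color=[^>]*>\\s*(.+?)\\s*</Font>");

    // Returns {number, code} for a house-number line, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = STREETNUMBER.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "10........<Font Color=#FFFFFF> CM </Font>",
            "43A.......<Font Color=#FF0000> SH </Font>",
            "<BR>"
        };
        for (String line : lines) {
            String[] parts = parse(line);
            // Prints "10 / CM", "43A / SH", then "no match"
            System.out.println(parts == null ? "no match" : parts[0] + " / " + parts[1]);
        }
    }
}
```

Because `[^.]*` stops at the first dot, group 1 holds only the number (including a letter suffix such as `43A`), and the lazy `(.+?)` bracketed by `\\s*` leaves the marker in group 2 without padding.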


I guess you want separate attributes for the number and the additional info. If not, just concatenate the matches, e.g.:

    String lastStreet = null; // instead of lastCustomer
    // ... in the 1st if (street name matched):
    lastStreet = stMatcher.group(1);
    // ... in the 2nd if (house number matched):
    customerList.add(new Customer(lastStreet
        + " " + stMatcher.group(1)
        + " " + stMatcher.group(2)));
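
Put together, the concatenating variant might look like the self-contained sketch below. It rests on several assumptions: plain strings stand in for the `Customer` class (whose definition is not shown in the question), the class name `AddressExtractor` and method `extract` are hypothetical, the input is read from an in-memory string instead of the real file, and the number pattern includes a `\\.*` to skip the filler dots.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; plain strings stand in for the Customer class.
class AddressExtractor {
    static final Pattern STREETNAME = Pattern.compile(
            "<Font Color=[^ >]+ Size=7>(.+?)</Font>");
    static final Pattern STREETNUMBER = Pattern.compile(
            "^\\s*(\\d[^.]*)\\.*<Font Color=[^>]*>\\s*(.+?)\\s*</Font>");

    // Returns one "STREET NUMBER CODE" string per house-number line.
    static List<String> extract(BufferedReader reader) {
        List<String> addresses = new ArrayList<>();
        String lastStreet = null; // street name seen most recently
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = STREETNAME.matcher(line);
                if (m.find()) {
                    lastStreet = m.group(1);
                    continue;
                }
                m = STREETNUMBER.matcher(line);
                // Ignore number lines that appear before any street name.
                if (m.find() && lastStreet != null) {
                    addresses.add(lastStreet + " " + m.group(1) + " " + m.group(2));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return addresses;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<BR>",
                "<Font Color=#FF0000 Size=7>MACARTNEY STREET (L)</Font>",
                "<BR>",
                "10........<Font Color=#FFFFFF> CM </Font>",
                "<BR>",
                "15........<Font Color=#FF0000> SH </Font>");
        // Prints "MACARTNEY STREET (L) 10 CM" and "MACARTNEY STREET (L) 15 SH"
        extract(new BufferedReader(new StringReader(html))).forEach(System.out::println);
    }
}
```

To run this against the real file, the `StringReader` could be swapped for a `FileReader` or `Files.newBufferedReader`, and each resulting string passed to `new Customer(...)`.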
