
Java - for-loop not adding all elements to list

I am writing a Java program that parses text written in a special markup language and does the following:

  • extract specific content, such as definitions
  • strip the special syntax to get cleaner text output

I have been struggling with a for loop in my code for days and I cannot find the problem: the loop appears to add the first element of the list to the second one again, and I cannot see which part of the code causes that. This piece of code is the result of "repairing" several NullPointerExceptions, so it is not very pretty. I hope someone can read through it and give me a hint about my mistake:

//we want to use the advantages of both ArrayList and String Array so we will work with both types
    List<String> temp = new ArrayList<String>();
    for (String line : dr.getAllLines()){
        temp.add(line);
    }
    String[] tempArray= new String[temp.size()];
    temp.toArray(tempArray); //fill the array with the contents of the temp list

    for (int i=0; i<temp.size(); i++){//this for loop goes through the lines looking for our pattern

        //define this pattern :
        Pattern patternS = Pattern.compile("^=== (.+) ==="); //new entry is always characterized by this regex
        Matcher matcherS = patternS.matcher(tempArray[i]);

        if (matcherS.find()){ //if current line matches pattern

            for(int ii=i+1; ii<temp.size(); ii++){ //this for loop adds content to our Eintraege list (because we found the pattern)

                //clean up current line (i)
                tempArray[ii-1]=tempArray[ii-1].replaceAll("[^a-zA-ZßüöäÜÖÄ|\\s+]", "");

                //add current line (i) to temporary Eintrag_lines String
                Eintrag_lines=Eintrag_lines + "\n" + tempArray[ii-1]; 

                //define again pattern (for next entry)
                Pattern patternStop = Pattern.compile("^=== (.+) ===");
                Matcher MatcherSTOP = patternStop.matcher(temp.get(ii)); //look at next line (ii)

                if(MatcherSTOP.find()){//if we find the line corresponding to the next Eintrag

                    //Eintraege is a list of all entries for one word (one element=one entry)
                    Eintraege.add(Eintrag_lines); 

                    Eintrag_lines = ""; //clear current entry String
                    break;//stop adding to our Eintraege list and go back to 
                }

            } //this for-loop adds lines for 1 entry until it finds the first line of the next entry (MatcherSTOP)
        }   else {
            continue;
        }
    }

    return Eintraege;
}

//method
public String getSpecificEintrag(int inputNumber){ //input "1" : first element of list
    parseEintraege();
    for (int i=0;i<Eintraege.size();i++){
        System.out.println(i + Eintraege.get(i) + "\n next : \n");
    }


    //try{
    //  System.out.println(Eintraege.get(inputNumber-1));
    //  return Eintraege.get(inputNumber-1);
    //} catch (IndexOutOfBoundsException e){
    //  System.err.println("IndexOutOfBoundsException : try a smaller number for the entry");
    //}
    return "";
}

The input text file is the Wiktionary entry for the German word "Ton" and is roughly structured like this:

== Ton ({{Sprache|Deutsch}}) ==

=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===

----------content, definitions, examples, ...------------------

=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===

----------more content, ... -----------------------------------

This is what I get:

0
 Wortart|Substantiv|Deutsch m 

Deutsch Substantiv Übersicht
|Genusm
|Nominativ SingularTon
|Genitiv SingularTons
|Genitiv SingularTones
|Genitiv PluralTone
|Dativ SingularTon
|Dativ SingularTone
|Dativ PluralTonen
|Akkusativ SingularTon
|Akkusativ PluralTone


Worttrennung
Ton Pl Tone

Aussprache
IPA Lautschrift|ton Pl Lautschrift|ton
Hörbeispiele Audio|DeTonogg Pl Audio|DeTöneogg|Tone
Reime Reim|on|Deutsch

Bedeutungen
 feinkörniges Verwitterungsprodukt Bodenart Töpfermaterial

Herkunft
Durch Verdumpfung von  zu  aus dem frühneuhochdeutschen tahen than welches wiederum aus dem spätmittelhochdeutschen dhe the Genitiv dhen then Lehm althochdeutsch thha Ton Lehm Töpfererde irdenes Gefäß hervorgegangenen war Belegt seit der Zeit um  Verwandt sind das mittelniederdeutsche d das altenglische  he und gotisch h Ton Lehm Allen zugrunde liegt des protogermanische germ anhn beim Trocknen schrumpfende dichter werdende Erde Während die frühen Formen noch feminin waren fand ein Genuswechsel zum maskulinen Genus wohl in Anlehnung an Lehm statt refLiteratur|AutorWolfgang Pfeifer Leitung|TitelEtymologisches Wörterbuch des Deutschen|Auflage durchgesehene und erweiterte|VerlagDeutscher Taschenbuch Verlag|OrtMünchen|Jahr|ISBN Stichwort supsupTonref

Synonyme
 Lehm Mergel

Beispiele
 Der Boden hier besteht zum größten Teil aus Ton

Wortbildungen
 tönern Tonerde Tonpfeife Tontaube

 Übersetzungen 
ÜTabelle|Ülinks
en  Ü|en|clay
fr  Ü|fr|argile f
it  Ü|it|argilla f
ca  Ü|ca|argila f
pl  Ü|pl|glina f  Ü|pl|i m
pt  Ü|pt|argila f
|Ürechts
ro  Ü|ro|lut
ru  Üt|ru||
sv  Ü|sv|lera
es  Ü|es|arcilla f
hu  Ü|hu|agyag


Referenzen
 Wikipedia|Ton
 RefDWDS|Ton
 RefDuden|TonSediment|Ton Sediment
 RefCanoo|Ton
 RefUniLeipzig|Ton

Quellen



 next : 

So I only get the first entry. Maybe the mistake is very simple, but I just cannot see it.

Thank you very much, and sorry for the long post!

PS: if you need a translation of any German words or variable names, please let me know.

I'll expand my comments with a rough outline of what I'd do to simplify/improve things:

//your Eintraege
List<String> entries = ...; 

//compile the pattern only once
Pattern startPattern= Pattern.compile("^=== (.+) ===");

//use a StringBuilder for better performance
StringBuilder entry = new StringBuilder();  
boolean inEntry = false;

for (String line : dr.getAllLines()){   
  //you could also create the matcher outside and call matcher.reset(line) instead
  Matcher matcher = startPattern.matcher( line );

  if( matcher.find() ) {
    if( inEntry ) {
      //we're in an entry already, so add the current one and start a new one
      entries.add( entry.toString() );
      entry.setLength(0); //StringBuilder has no clear(); setLength(0) empties it
    }

    //now we're definitely in an entry
    inEntry = true;
  }
  //no entry start but in an entry already 
  else if( inEntry ) {
    //apply whatever replacements you want and add the line to the current entry
    //(the newline keeps the lines separated, as in your original code)
    entry.append( line.replace("foo", "bar" ) ).append( '\n' );
  }
}

//if we're still in an entry here we need to add it as it didn't already happen in the loop
if( inEntry ) {
   entries.add( entry.toString() );
   entry.setLength(0); //StringBuilder has no clear(); setLength(0) empties it
}

As you can see there are a few differences to your code:

  • No need for any additional list or array.
  • The pattern is compiled only once (no need to do that multiple times).
  • No need to look at what's next, just react based on the current line.
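
To make the outline concrete, here is a minimal, self-contained sketch of the same idea. The class name `EntrySplitter`, the method `splitEntries` and the sample lines are made up for illustration, and a plain `List<String>` stands in for `dr.getAllLines()`. Unlike the outline above, this version keeps the cleaned header line as the first line of each entry, which seems to be what your output expects; treat that as an assumption. The header regex and the character whitelist are copied from your code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class EntrySplitter {

    // Same header regex as in the question, compiled once.
    private static final Pattern ENTRY_START = Pattern.compile("^=== (.+) ===");

    // Same character whitelist as in the question's replaceAll call.
    private static final Pattern UNWANTED = Pattern.compile("[^a-zA-ZßüöäÜÖÄ|\\s+]");

    static List<String> splitEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        StringBuilder entry = new StringBuilder();
        boolean inEntry = false;
        Matcher matcher = ENTRY_START.matcher(""); // reused via reset(line)

        for (String line : lines) {
            if (matcher.reset(line).find()) {
                if (inEntry) {
                    // a new header closes the entry collected so far
                    entries.add(entry.toString());
                    entry.setLength(0); // empty the builder for the next entry
                }
                inEntry = true;
            }
            if (inEntry) {
                // assumption: the header line itself belongs to the entry
                entry.append(UNWANTED.matcher(line).replaceAll("")).append('\n');
            }
        }

        // the last entry has no following header, so flush it here
        if (inEntry) {
            entries.add(entry.toString());
        }
        return entries;
    }

    public static void main(String[] args) {
        // made-up sample input in the shape described in the question
        List<String> lines = Arrays.asList(
                "== Ton ({{Sprache|Deutsch}}) ==",
                "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===",
                "first entry content",
                "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===",
                "second entry content");

        List<String> entries = splitEntries(lines);
        System.out.println(entries.size() + " entries");
        for (String e : entries) {
            System.out.println(e + "---");
        }
    }
}
```

With the sample input this yields two entries. The flush after the loop is the part to look at: an approach that only stores an entry when it sees the next header will never store the last entry, which may well be why you only get the first one.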
