I am currently writing a Java program to parse text content written in a special markup language and do the following:
I have been struggling with a for loop in my code for days and cannot find the problem: my for loop adds the first element of the list again to the second one, and I do not understand which part of my code causes this bug. This piece of code is the result of "repairing" several NullPointerExceptions and it is not very pretty. I hope some of you can read it and give me a hint about my mistake:
    //we want to use the advantages of both ArrayList and String Array so we will work with both types
    List<String> temp = new ArrayList<String>();
    for (String line : dr.getAllLines()){
        temp.add(line);
    }
    String[] tempArray= new String[temp.size()];
    temp.toArray(tempArray); //fill the array with the contents of the temp list
    for (int i=0; i<temp.size(); i++){//this for loop goes through the lines looking for our pattern
        //define this pattern :
        Pattern patternS = Pattern.compile("^=== (.+) ==="); //new entry is always characterized by this regex
        Matcher matcherS = patternS.matcher(tempArray[i]);
        if (matcherS.find()){ //if current line matches pattern
            for(int ii=i+1; ii<temp.size(); ii++){ //this for loop adds content to our Eintraege list (because we found the pattern)
                //clean up current line (i)
                tempArray[ii-1]=tempArray[ii-1].replaceAll("[^a-zA-ZßüöäÜÖÄ|\\s+]", "");
                //add current line (i) to temporary Eintrag_lines String
                Eintrag_lines=Eintrag_lines + "\n" + tempArray[ii-1];
                //define again pattern (for next entry)
                Pattern patternStop = Pattern.compile("^=== (.+) ===");
                Matcher MatcherSTOP = patternStop.matcher(temp.get(ii)); //look at next line (ii)
                if(MatcherSTOP.find()){//if we find the line corresponding to the next Eintrag
                    //Eintraege is a list of all entries for one word (one element=one entry)
                    Eintraege.add(Eintrag_lines);
                    Eintrag_lines = ""; //clear current entry String
                    break;//stop adding to our Eintraege list and go back to
                }
            } //this for-loop adds lines for 1 entry until it finds the first line of the next entry (MatcherSTOP)
        } else {
            continue;
        }
    }
    return Eintraege;
}
//method
public String getSpecificEintrag(int inputNumber){ //input "1" : first element of list
    parseEintraege();
    for (int i=0;i<Eintraege.size();i++){
        System.out.println(i + Eintraege.get(i) + "\n next : \n");
    }
    //try{
    //    System.out.println(Eintraege.get(inputNumber-1));
    //    return Eintraege.get(inputNumber-1);
    //} catch (IndexOutOfBoundsException e){
    //    System.err.println("IndexOutOfBoundsException : try a smaller number for the entry");
    //}
    return "";
}
The initial text file is a Wiktionary entry for the German word "Ton" and is roughly structured like this:
== Ton ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===
----------content, definitions, examples, ...------------------
=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===
----------more content, ... -----------------------------------
This is what I get :
0
Wortart|Substantiv|Deutsch m
Deutsch Substantiv Übersicht
|Genusm
|Nominativ SingularTon
|Genitiv SingularTons
|Genitiv SingularTones
|Genitiv PluralTone
|Dativ SingularTon
|Dativ SingularTone
|Dativ PluralTonen
|Akkusativ SingularTon
|Akkusativ PluralTone
Worttrennung
Ton Pl Tone
Aussprache
IPA Lautschrift|ton Pl Lautschrift|ton
Hörbeispiele Audio|DeTonogg Pl Audio|DeTöneogg|Tone
Reime Reim|on|Deutsch
Bedeutungen
feinkörniges Verwitterungsprodukt Bodenart Töpfermaterial
Herkunft
Durch Verdumpfung von zu aus dem frühneuhochdeutschen tahen than welches wiederum aus dem spätmittelhochdeutschen dhe the Genitiv dhen then Lehm althochdeutsch thha Ton Lehm Töpfererde irdenes Gefäß hervorgegangenen war Belegt seit der Zeit um Verwandt sind das mittelniederdeutsche d das altenglische he und gotisch h Ton Lehm Allen zugrunde liegt des protogermanische germ anhn beim Trocknen schrumpfende dichter werdende Erde Während die frühen Formen noch feminin waren fand ein Genuswechsel zum maskulinen Genus wohl in Anlehnung an Lehm statt refLiteratur|AutorWolfgang Pfeifer Leitung|TitelEtymologisches Wörterbuch des Deutschen|Auflage durchgesehene und erweiterte|VerlagDeutscher Taschenbuch Verlag|OrtMünchen|Jahr|ISBN Stichwort supsupTonref
Synonyme
Lehm Mergel
Beispiele
Der Boden hier besteht zum größten Teil aus Ton
Wortbildungen
tönern Tonerde Tonpfeife Tontaube
Übersetzungen
ÜTabelle|Ülinks
en Ü|en|clay
fr Ü|fr|argile f
it Ü|it|argilla f
ca Ü|ca|argila f
pl Ü|pl|glina f Ü|pl|i m
pt Ü|pt|argila f
|Ürechts
ro Ü|ro|lut
ru Üt|ru||
sv Ü|sv|lera
es Ü|es|arcilla f
hu Ü|hu|agyag
Referenzen
Wikipedia|Ton
RefDWDS|Ton
RefDuden|TonSediment|Ton Sediment
RefCanoo|Ton
RefUniLeipzig|Ton
Quellen
next :
So I only get the first entry. Maybe the mistake is very dumb and simple, but I just cannot see it.
Thank you very much, and sorry for the long post!
PS: if you need a translation for some German words or variable names, please let me know.
I'll expand my comments with a rough outline of what I'd do to simplify/improve things:
//your Eintraege
List<String> entries = ...;
//compile the pattern only once
Pattern startPattern = Pattern.compile("^=== (.+) ===");
//use a StringBuilder for better performance
StringBuilder entry = new StringBuilder();
boolean inEntry = false;
for (String line : dr.getAllLines()) {
    //you could also create the matcher outside and call matcher.reset(line) instead
    Matcher matcher = startPattern.matcher(line);
    if (matcher.find()) {
        if (inEntry) {
            //we're in an entry already so add the current one and start a new one
            entries.add(entry.toString());
            //StringBuilder has no clear() method; setLength(0) empties it
            entry.setLength(0);
        }
        //now we're definitely in an entry
        inEntry = true;
    }
    //no entry start but in an entry already
    else if (inEntry) {
        //apply whatever replacements you want and add the line to the current entry
        //(the "\n" keeps the line breaks, as your string concatenation did)
        entry.append(line.replace("foo", "bar")).append("\n");
    }
}
//if we're still in an entry here we need to add it as it didn't already happen in the loop
if (inEntry) {
    entries.add(entry.toString());
    entry.setLength(0);
}
As you can see there are a few differences to your code:
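The one that matters most for the "only the first entry" symptom is, as far as I can tell, the check after the loop. Your inner loop only adds an entry once it sees the heading of the next one, so the last entry in the file is built up in Eintrag_lines but never reaches the list. Here is a minimal, self-contained sketch of the outline above that you can run to see this. The class name EntrySplitter, the method splitEntries and the sample lines in main are made up for illustration, and a plain List<String> stands in for dr.getAllLines(), which I cannot see.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the method signature are illustrative only.
class EntrySplitter {

    // Same heading regex as in the question, compiled once.
    private static final Pattern ENTRY_START = Pattern.compile("^=== (.+) ===");

    // Splits the lines into entries; each entry holds the lines that follow one heading.
    static List<String> splitEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        StringBuilder entry = new StringBuilder();
        boolean inEntry = false;
        // One Matcher, reused for every line via reset(line).
        Matcher matcher = ENTRY_START.matcher("");

        for (String line : lines) {
            if (matcher.reset(line).find()) {
                if (inEntry) {
                    // A new heading closes the entry collected so far.
                    entries.add(entry.toString());
                    entry.setLength(0); // StringBuilder has no clear()
                }
                inEntry = true;
            } else if (inEntry) {
                // Any clean-up of the line (replaceAll etc.) would go here.
                entry.append(line).append('\n');
            }
        }
        // Flush the last entry: no further heading follows it.
        if (inEntry) {
            entries.add(entry.toString());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Invented sample input in the shape described in the question.
        List<String> lines = Arrays.asList(
            "== Ton ({{Sprache|Deutsch}}) ==",
            "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===",
            "first entry, line 1",
            "first entry, line 2",
            "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===",
            "second entry, line 1");
        List<String> entries = splitEntries(lines);
        for (int i = 0; i < entries.size(); i++) {
            System.out.println("entry " + (i + 1) + ":\n" + entries.get(i));
        }
    }
}
```

With the sample input the list ends up with two entries. If you remove the final if (inEntry) block, the second entry disappears, which is the behaviour you are describing.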