Regex: how to find a pattern?

Question

I need to split the text below using a regex. I have already found patterns for the dddd-dddd and dddd-ddd[x] parts, but what about the text? I need to extract the title as a string, for example "British Journal of Applied Science & Technology". How do I write that as a regex?

337 British Journal of Applied Science & Technology 2231-0843 5
338 British Journal of Economics, Management & Trade 2278-098X 5
339 British Journal of Education, Society & Behavioural Science 2278-0998 6
340 British Journal of Environment and Climate Change 2231-4784 5
341 British Journal of Mathematics & Computer Science 2231-0851 4
342 British Journal of Medicine and Medical Research 2231-0614 8
343 British Journal of Pharmaceutical Research 2231-2919 4
344 British Microbiology Research Journal 2231-0886 9
345 Bromatologia i Chemia Toksykologiczna 0365-9445 5
346 Budownictwo Górnicze i Tunelowe 1234-5342 5
347 Budownictwo i Architektura 1899-0665 3
348 Budownictwo, Technologie, Architektura 1644-745X 3
349 Builder 1896-0642 2
350 Built Environment 0263-7960 10
351 Bulgarian Journal of Veterinary Medicine 1311-1477 8
352 Bulgarian Medicine 1314-3387 2
353 Bulletin de la Société des sciences et des lettres de Łódź, Série: Recherches sur les déformations 0459-6854 7
354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6
355 Bulletin of Geography. Socio-economic Series 1732-4254 10
356 Bulletin of Geography: Physical Geography Series 2080-7686 9
357 Bulletin of the Polish Academy of Sciences. Mathematics 0239-7269 9
358 Business and Economic Horizons 1804-1205 8
359 Business and Economics Research Journal 1309-2448 10
360 Business Process Management Journal 1463-7154 10

Answer 1

(?<=\d\s)\D+(?=\s\d)

That should find what you need. If you are interested in how it works: the first part of the regex ( (?<=\d\s) ) is a lookbehind. It declares that the match must come right after a digit ( \d ) followed by a whitespace character ( \s ).

The second part ( \D+ ) is what is actually matched: one or more non-digit characters.

The third part ( (?=\s\d) ) is a lookahead. It makes sure that the match is followed by a whitespace character and a digit.
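As a quick way to try the three parts together, here is a minimal Python sketch. The choice of Python and the variable names are assumptions for illustration; the question does not name a language at this point. The sample lines are copied from the question.

```python
import re

# Pattern from this answer: a run of non-digits that sits between
# "digit + whitespace" and "whitespace + digit".
TITLE = re.compile(r"(?<=\d\s)\D+(?=\s\d)")

# Three lines copied from the question.
lines = [
    "337 British Journal of Applied Science & Technology 2231-0843 5",
    "338 British Journal of Economics, Management & Trade 2278-098X 5",
    "349 Builder 1896-0642 2",
]

# search() returns the first match in each line; group(0) is the matched text.
titles = [TITLE.search(line).group(0) for line in lines]
for title in titles:
    print(title)
```

Because the lookbehind and lookahead consume nothing, the match contains only the title, without the surrounding spaces.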

Answer 2

You can do it with an expression that uses lookahead and lookbehind, like this:

(?<=\d{3}\s).*(?=\s\d{4}-)

This expression requires three digits followed by space in front of the text, and four digits preceded by space and followed by a dash after the text. The name itself is matched by a straight .* pattern.
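A minimal Python sketch of this expression, for anyone who wants to check it against the sample data. Python is an assumption here (the question does not name a language), and the variable names are only illustrative.

```python
import re

# Pattern from this answer: text preceded by "3 digits + whitespace" and
# followed by "whitespace + 4 digits + dash".
TITLE = re.compile(r"(?<=\d{3}\s).*(?=\s\d{4}-)")

# Two lines copied from the question.
lines = [
    "345 Bromatologia i Chemia Toksykologiczna 0365-9445 5",
    "350 Built Environment 0263-7960 10",
]

# The greedy .* backtracks until the lookahead finds " dddd-" right after it.
titles = [TITLE.search(line).group(0) for line in lines]
print(titles)
```

Since the title is matched with .* rather than \D+, this version does not care whether the title itself contains digits.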

Answer 3

Since you don't specify a target language or anything like that, here's how you could do it with perl:

cat test.txt | perl -pe 's/^\d+\s//' | perl -pe 's/[0-9X "-]+$//'

The second expression might need adapting, depending on what the rest of your data looks like.

This prints:

British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
[snip]
Bulletin of the Polish Academy of Sciences. Mathematics
Business and Economic Horizons
Business and Economics Research Journal
Business Process Management Journal
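
The same two-step idea (strip the leading index, then strip the trailing reference) can be sketched in Python. One adaptation is assumed here: the character class in the second Perl expression also contains the double quote, so it would strip the closing quote from line 354 of the sample. The suffix pattern below is a stricter, hypothetical replacement that removes only the ISSN and the final number. The function name strip_ends is made up for this example.

```python
import re


def strip_ends(line: str) -> str:
    """Remove the leading index and the trailing ISSN + number from one line."""
    # Same as s/^\d+\s// in the Perl one-liner.
    line = re.sub(r"^\d+\s", "", line)
    # Assumed stricter replacement for s/[0-9X "-]+$//: it removes only
    # " dddd-dddd n" or " dddd-dddX n", so a closing quote in the title stays.
    return re.sub(r"\s\d{4}-\d{3}[\dX]\s\d+$", "", line)


print(strip_ends('354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6'))
# prints: Bulletin of Alfred Nobel University. Series "Legal Science"
```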

Answer 4

\d+ (.+) ....-.... \d+

Capture group 1 then holds the title, extracting:

British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
British Journal of Mathematics & Computer Science
British Journal of Medicine and Medical Research
British Journal of Pharmaceutical Research
[... cut ...]

Answer 5

(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})

This splits the string into 3 capture groups:

1. Three digits (the leading index)

2. Anything NOT containing a digit, up to the next digit (the title; note that it includes the trailing space)

3. The reference at the end (assumes it begins with 4 digits and is in a consistent format)
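
To see what each group captures, here is a minimal Python sketch. Python and the variable names are assumptions made for illustration. The title group is stripped because [\D]+ also consumes the space before the ISSN.

```python
import re

# Pattern from this answer, unchanged.
ROW = re.compile(r"(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})")

# One line copied from the question.
m = ROW.match("338 British Journal of Economics, Management & Trade 2278-098X 5")

index = m.group(1)            # the three leading digits
title = m.group(2).strip()    # the title, minus the trailing space
reference = m.group(3)        # ISSN plus the final number

print(index, title, reference, sep=" | ")
# prints: 338 | British Journal of Economics, Management & Trade | 2278-098X 5
```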

Answer 6

I understand you are looking for a regex, but if you want something slightly more straightforward, it looks like your document can easily be parsed with simple string manipulation. I offer this idea as an alternative for people who would rather not use regex.

import java.util.StringTokenizer;

String tmp = "340 British Journal of Environment and Climate Change 2231-4784 5";
String ending = tmp.substring(tmp.length() - 11); // last 11 characters: "2231-4784 5"
tmp = tmp.substring(0, tmp.length() - 11); // cut off the ending
StringTokenizer st = new StringTokenizer(tmp, " ");
String index = st.nextToken(); // reads the leading number up to the first space
tmp = tmp.substring(index.length()).trim(); // cut off the index and the surrounding spaces

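Now tmp is the name of the journal, index is the leading number, and the reference at the end is saved in ending. This method only works if every line has exactly the layout shown above: it assumes the ending is always 11 characters long, so a line whose final number has two digits (such as "350 Built Environment 0263-7960 10") would need a different offset.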
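
If the fixed offset is a concern, the same no-regex idea can be sketched by splitting on spaces from both ends instead. This is a Python illustration under the assumption that every line is "index, title, ISSN, number" separated by single spaces; the function name split_row is made up for this example.

```python
def split_row(line: str) -> tuple:
    """Split one line into (index, title, issn, number) without regex."""
    # The index is everything before the first space.
    index, rest = line.split(" ", 1)
    # The ISSN and the final number are the last two space-separated fields;
    # whatever remains in front of them is the title.
    title, issn, number = rest.rsplit(" ", 2)
    return index, title, issn, number


print(split_row("350 Built Environment 0263-7960 10"))
# prints: ('350', 'Built Environment', '0263-7960', '10')
```

Splitting from the right means the width of the final number no longer matters.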

Answer 7

This one:

(?<=\d\s)\D+(?=\s\d)

works very well, but I found that titles in my PDF can contain digits, for example

338 British Journal of 5Economics, Management & Trade 2278-098X 5

How do I parse that properly? PS: I am writing my app in C# (.NET).
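
One possible direction, sketched in Python rather than C#: anchor the pattern on the fixed parts of the line (leading index, ISSN, final number) and capture whatever lies between them, so digits inside the title no longer matter. The pattern below is an assumption based on the sample lines, not something posted in this thread. It uses only anchors, \d, \s, a lazy quantifier, and one capture group, which .NET's Regex class also supports, but porting it is left as an assumption.

```python
import re

# Hypothetical anchored pattern: index, title (captured), ISSN, final number.
ROW = re.compile(r"^\d+\s+(.+?)\s+\d{4}-\d{3}[\dX]\s+\d+$")

line = "338 British Journal of 5Economics, Management & Trade 2278-098X 5"

# The \D+ pattern stops at the first digit inside the title.
old = re.search(r"(?<=\d\s)\D+(?=\s\d)", line).group(0)
# The anchored pattern captures the whole title.
new = ROW.match(line).group(1)

print(old)  # prints: British Journal of
print(new)  # prints: British Journal of 5Economics, Management & Trade
```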

7 answers

Answer 1: 2 (accepted), 2015-07-09 14:10:03
Answer 2: 1, 2015-07-09 14:03:23
Answer 3: 0, 2015-07-09 14:03:03
Answer 4: 0, 2015-07-09 14:03:53
Answer 5: 0, 2015-07-09 14:29:06
Answer 6: 0, 2015-07-09 14:39:33
Answer 7: 0, 2015-07-10 12:03:31
