
Regex to extract bibliography text from paragraph - Python

In my Python task I have a bibliography string (one paragraph) that I want to parse into a list of strings.

Here is the whole string:

A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia. A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386. E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565. C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98. C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear. I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69. A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language. F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall. L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40. L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32. N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318. M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June. A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March. R. Rosenfeld. 1996. 
A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10. D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142. D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813. D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904. P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC. D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95. D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131

This is the output I want:

A. Berger and H. Printz. 1998. Recognition performance of a large-scale dependency-grammar language model. In Int'l Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia.


A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.

C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for language modeling. In COLINGA CL '98.

C. Cumby and D. Roth. 2000. Relational representations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To appear.

I. Dagan, L. Lee, and F. Pereira. 1999. Similaritybased models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.

A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.

F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

D. Jurafsky and J. H. Martin. 200. Speech and Language Processing. Prentice Hall. 

and so on...

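I have tried different regular expressions but could not get a correct result, because the entries have no specific ending.

However, every new entry starts with the author name(s), followed by the year and then the paper title.

For example, in the first entry the author name (A. Berger) is followed by "and" plus another author name (H. Printz.), then 1998. In the second entry, the author name (A. Blum.) is followed directly by 1992.

Any kind of help would be appreciated.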


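This might be good enough. I wrote a regex that works on your whole sample,
but it is still subjective. Any addition to or removal from the name forms
or the punctuation will blow it out of the water.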

((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)

Replace with \r\n\1

See the example here -> https://regex101.com/r/ylZKDH/1

Python sample, as requested

>>> import re
>>>
>>> # The whole bibliography must stay on one line: the pattern only allows
>>> # spaces and tabs (not newlines) inside the author/year header.
>>> biblioStr = '''A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia. A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386. E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565. C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98. C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear. I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69. A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language. F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall. L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40. L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32. N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318. M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June. A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March. R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10. D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142. D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813. D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904. P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC. D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95. D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131
... '''
>>>
>>> # Matches the "Authors. YYYY." header that opens every entry.
>>> Rx = re.compile( r"((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)" )
>>>
>>> # Put a line break in front of every header that was found.
>>> print (Rx.sub( r'\r\n\1', biblioStr ))

A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia.
A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.
E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.
C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98.
C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear.
I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.
A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.
F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall.
L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40.
L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32.
N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.
M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June.
A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March.
R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10.
D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142.
D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813.
D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904.
P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC.
D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95.
D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131
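
Since the question asks for a list of strings rather than one string with line breaks, here is a minimal sketch of how the same header pattern could be used to slice the text into entries. The names AUTHOR, ENTRY_START, split_entries and join_hyphenated are made up for this illustration and are not part of the answer above. Rejoining words such as "perfor- mance" is a heuristic and an assumption on my part: it removes a hyphen followed by whitespace only when it sits between two lowercase letters, so it also turns "Similarity- based" into "Similaritybased" (which is what the desired output in the question shows) and leaves "COLING- A CL" alone.

```python
import re

# One author: one or more initials ("A." or "A. R.") followed by a surname.
AUTHOR = r"(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+"

# Entry header: comma-separated authors, optional "and" author, then ". YYYY."
ENTRY_START = re.compile(
    rf"{AUTHOR}(?:[ \t]*,[ \t]*{AUTHOR})*(?:[ \t]*,)?"
    rf"(?:[ \t]+and[ \t]+{AUTHOR})*[ \t]*\.[ \t]*\d{{4}}[ \t]*\.(?!\S)"
)


def split_entries(text):
    """Return the entries as a list, cutting the text at every header match.

    Any text before the first header is dropped.
    """
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    if not starts:
        return []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def join_hyphenated(entry):
    """Heuristic (assumption): drop '- ' between two lowercase letters."""
    return re.sub(r"(?<=[a-z])-[ \t]+(?=[a-z])", "", entry)


if __name__ == "__main__":
    sample = (
        "A. Berger and H. Printz. 1998. Recognition perfor- mance of a "
        "large-scale model. A. Blum. 1992. Learning boolean functions. "
        "Machine Learning, 9(4):373-386."
    )
    for entry in split_entries(sample):
        print(join_hyphenated(entry))
```

The same limitation seen in the output above applies here: the pattern requires a four-digit year, so the "D. Jurafsky and J. H. Martin. 200." entry (three digits in the sample) stays attached to the entry before it.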
