
Extracting specific information from data

How can I convert a sentence like:

James Smith was born on November 17, 1948

into something like

("James Smith", DOB, "November 17, 1948")

without having to rely on the positional index of strings?

I have tried the following:

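import nltk  # needed for nltk.RegexpParser below
from nltk import word_tokenize, pos_tag

new = "James Smith was born on November 17, 1948"
sentences = word_tokenize(new)   # split the sentence into tokens
sentences = pos_tag(sentences)   # tag each token with its part of speech
# chunk two consecutive proper-noun tags (NNP, NNPS) together, e.g. a full name
grammar = "Chunk: {<NNP*><NNP*>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)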

How do I proceed from here to get the output in the desired format?

Split the string on "was born on", then trim the whitespace and assign the two parts to name and dob.
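A minimal sketch of that split-based approach, assuming the sentence always contains the exact phrase "was born on". The function name parse_birth_sentence and the use of the string "DOB" as the middle label are illustrative choices, not something from the question:

```python
def parse_birth_sentence(sentence, marker="was born on"):
    """Split a sentence on `marker` and return (name, "DOB", date_text)."""
    # partition() returns (before, separator, after); the separator is
    # empty when the marker does not occur in the sentence.
    name, sep, dob = sentence.partition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in {sentence!r}")
    # Trim the whitespace left over around the marker.
    return (name.strip(), "DOB", dob.strip())


result = parse_birth_sentence("James Smith was born on November 17, 1948")
print(result)  # ('James Smith', 'DOB', 'November 17, 1948')
```

This avoids positional indexing, but it only handles sentences that use this exact wording.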

You could always use a regular expression. The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+) matches sentences in exactly the format above and captures the first name, last name, month, day, and year as separate groups.

Here it is in action: https://regex101.com/r/W2ykKS/1

The regex in Python:

import re

regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"

matches = re.search(regex, test_str)

# group 0 is the entire match; groups 1-5 are the captured parts

print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
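To get from the separate groups to the tuple the question asks for, one option is a variant of the pattern with named groups. This is only a sketch: the names PATTERN and extract_dob, the "DOB" string label, and the assumption that dates look like "Month DD, YYYY" are all illustrative. Unlike the regex above, it allows names with more than two words and returns None when nothing matches:

```python
import re
from datetime import datetime

# Name: everything before "was born on"; date: "Month DD, YYYY".
PATTERN = re.compile(
    r"^(?P<name>.+?)\s+was born on\s+(?P<dob>\w+\s+\d{1,2},\s+\d{4})$"
)


def extract_dob(sentence):
    """Return (name, "DOB", date_text), or None if the sentence does not match."""
    match = PATTERN.search(sentence.strip())
    if match is None:
        return None
    return (match.group("name"), "DOB", match.group("dob"))


triple = extract_dob("James Smith was born on November 17, 1948")
print(triple)  # ('James Smith', 'DOB', 'November 17, 1948')

# Optionally turn the date text into a date object
# (%B expects English month names under the default C/English locale).
parsed = datetime.strptime(triple[2], "%B %d, %Y").date()
print(parsed)
```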

