How to create meaningful column-value pair lists from a string?
I am trying to categorize columns and values (column=value) meaningfully from an input string using Python dictionaries.
input_string = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
I have created dictionaries of key-value pairs. In the first scenario, the key is the column name. The value represents the lowest index at which the key is found in input_string.
Here is the dictionary of column names:
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
In the above dictionary, 'status' has the lowest index of 4 in input_string.
Similarly, here is the dictionary of values:
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
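For reference, both dictionaries can be rebuilt from the string itself. The sketch below assumes the lookup is case-insensitive (the value keys are stored in lowercase) and relies on `str.find`, which returns the lowest index of a substring. The helper name `lowest_indices` is hypothetical and not part of the original question.

```python
input_string = ("the status is processing and product subtypes are "
                "HL year 30 ARM and applicant name is Ryan")

def lowest_indices(text, terms):
    """Map each term to the lowest index at which it occurs in text.

    Hypothetical helper: the match is case-insensitive and terms that
    do not occur in the text are skipped.
    """
    lowered = text.lower()
    found = {}
    for term in terms:
        idx = lowered.find(term.lower())  # -1 when the term is absent
        if idx != -1:
            found[term] = idx
    return found

dict_columns = lowest_indices(
    input_string, ["status", "product subtypes", "applicant name"])
dict_values = lowest_indices(
    input_string, ["processing", "hl", "year", "30", "arm", "ryan"])
```

Note that `str.find` matches substrings rather than whole words, so a term such as 'arm' would also match inside a longer word like 'alarm'; a word-boundary regex would be stricter.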
The question is:
How to get the expected output as:
list_parsed_values = ['processing', 'hl year 30 arm', 'ryan']
and the (optional) corresponding list of columns as:
list_parsed_columns = ['status', 'product subtypes', 'applicant name']
How to clearly distinguish the values in a list?
Check the following approach: build a regex from the dict_columns keys and use it to split the text.
Here is the code I have come up with so far:
import re
import nltk  # the stopwords corpus must be available, e.g. via nltk.download('stopwords')
s = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
# Build the regex to remove irrelevant words (stopwords) from the results
rx_stopwords = r"\b(?:{})\b".format("|".join(re.escape(x) for x in nltk.corpus.stopwords.words("english")))
# Build the regex to split the text using the dict_columns keys
rx_split = r"\b({})\b".format("|".join(re.escape(x) for x in dict_columns))
chunks = re.split(rx_split, s)
# After splitting, zip the resulting list into a tuple list
it = iter(chunks[1:])
lst = list(zip(it, it))
# Remove the irrelevant words from the values and trim them (this can be further enhanced)
res = [(x, re.sub(rx_stopwords, "", y).strip()) for x, y in lst]
# =>
# [('status', 'processing'), ('product subtypes', 'HL year 30 ARM'), ('applicant name', 'Ryan')]
# It can be cast to a dictionary
dict(res)
# =>
# {'product subtypes': 'HL year 30 ARM', 'status': 'processing', 'applicant name': 'Ryan'}
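If the two index dictionaries from the question are already available, the same grouping can be sketched without a regex or NLTK: sort the columns by position and attach each value to the nearest column that starts before it. This is an alternative to the approach above, not a replacement for it. The function name `group_values_by_column` is hypothetical, and the sketch assumes every value belongs to the closest preceding column.

```python
from bisect import bisect_right

dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}

def group_values_by_column(columns, values):
    """Group values under the closest column that precedes them.

    Hypothetical helper: both arguments map a term to its index in
    the input string. Values located before the first column are dropped.
    """
    ordered = sorted(columns.items(), key=lambda kv: kv[1])
    names = [name for name, _ in ordered]
    starts = [start for _, start in ordered]
    grouped = [[] for _ in names]
    # Visit the values in the order they appear in the string
    for value, idx in sorted(values.items(), key=lambda kv: kv[1]):
        pos = bisect_right(starts, idx) - 1  # last column starting at or before idx
        if pos >= 0:
            grouped[pos].append(value)
    return names, [" ".join(words) for words in grouped]

list_parsed_columns, list_parsed_values = group_values_by_column(dict_columns, dict_values)
```

Because it works on the lowercase keys of dict_values, this variant yields the lowercase lists asked for in the question, whereas the regex approach keeps the original casing of the input string.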