How to create a meaningful list of column-value pairs from a string?

Question

I am trying to extract columns and their values (column=value) from an input string in a meaningful way, using Python dictionaries.

input_string = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"

I have created two dictionaries of key-value pairs. In the first, each key is a column name and each value is the lowest index at which that key occurs in input_string.

Here is the dictionary of column names:

 dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}

In the above dictionary, 'status' first occurs at index 4 of input_string.

Similarly, here is the dictionary of values, keyed by the lowercased word:

dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
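
For readers who want to reproduce the two dictionaries, here is a minimal sketch of how they could be built with str.find on a lowercased copy of the input. The lists known_columns and known_values and the helper lowest_indices are hypothetical; the question does not say where the column names and values come from.

```python
input_string = ("the status is processing and product subtypes are "
                "HL year 30 ARM and applicant name is Ryan")

# Hypothetical vocabularies: the question does not say where these come from.
known_columns = ["status", "product subtypes", "applicant name"]
known_values = ["processing", "hl", "year", "30", "arm", "ryan"]


def lowest_indices(terms, text):
    """Map each term to the lowest index at which it occurs in text."""
    lowered = text.lower()
    found = {}
    for term in terms:
        position = lowered.find(term.lower())  # -1 when the term is absent
        if position != -1:
            found[term] = position
    return found


dict_columns = lowest_indices(known_columns, input_string)
dict_values = lowest_indices(known_values, input_string)
print(dict_columns)
print(dict_values)
```

Note that str.find matches substrings rather than whole words, so a term such as 'arm' would also match inside a longer word; a word-boundary regex would be stricter.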

The question is how to get the expected output:

list_parsed_values = ['processing', 'hl year 30 arm', 'ryan']

and, optionally, the corresponding list of columns:

list_parsed_columns = ['status', 'product subtypes', 'applicant name']

How can I clearly separate the values that belong to each column into a list?

Answer 1

Consider the following approach:

Build a regex that removes irrelevant words from the results, based on the English NLTK stopword list.
Build a regex that splits the text on the dict_columns keys.
After splitting, zip the resulting list into a list of tuples.
Remove the irrelevant words from the values and strip the surrounding whitespace.
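
The split-and-pair steps can be seen in isolation on a shortened input. This is only an illustrative sketch: the string and the columns list below are made up for the demo, and no stopword removal is applied yet.

```python
import re

s = "the status is processing and applicant name is Ryan"
columns = ["status", "applicant name"]  # shortened, illustrative list

# The capturing group makes re.split keep the matched column names
# in the result instead of discarding them.
pattern = r"\b({})\b".format("|".join(re.escape(c) for c in columns))
chunks = re.split(pattern, s)
# chunks[0] is the text before the first column name, so it is skipped.

# Passing the same iterator to zip twice pairs up neighbouring items:
# (column, text that follows it), (column, text that follows it), ...
it = iter(chunks[1:])
pairs = list(zip(it, it))
print(pairs)
# [('status', ' is processing and '), ('applicant name', ' is Ryan')]
```

The leftover words such as "is" and "and" are what the stopword regex removes in the last step.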

Here is the code I have come up with so far:

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the stopword corpus if it is missing

s = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}  # not needed by this approach
# Build the regex to remove irrelevant words from the results
# (the corpus file id is lowercase "english"; words are escaped to be safe)
rx_stopwords = r"\b(?:{})\b".format("|".join(re.escape(w) for w in stopwords.words("english")))
# Build the regex to split the text using the dict_columns keys
rx_split = r"\b({})\b".format("|".join(re.escape(c) for c in dict_columns))
chunks = re.split(rx_split, s)
# After splitting, zip the resulting list into a list of (column, text) tuples
it = iter(chunks[1:])
lst = list(zip(it, it))
# Remove the irrelevant words from the values and trim them (this can be further enhanced)
res = [(x, re.sub(rx_stopwords, "", y).strip()) for x, y in lst]
print(res)
# =>
#   [('status', 'processing'), ('product subtypes', 'HL year 30 ARM'), ('applicant name', 'Ryan')]
# It can be cast to a dictionary
print(dict(res))
# =>
#   {'status': 'processing', 'product subtypes': 'HL year 30 ARM', 'applicant name': 'Ryan'}
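
As an alternative that needs neither regexes nor NLTK, the index dictionaries from the question can be used directly: each value is attached to the nearest column name that starts before it. This is a minimal sketch, not part of the original answer; the helper name group_values_by_column is hypothetical, and the code relies on dictionaries preserving insertion order (Python 3.7+).

```python
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}


def group_values_by_column(columns, values):
    """Attach each value to the nearest column name that starts before it."""
    ordered_columns = sorted(columns.items(), key=lambda item: item[1])
    grouped = {name: [] for name, _ in ordered_columns}
    for value, position in sorted(values.items(), key=lambda item: item[1]):
        owner = None
        for name, start in ordered_columns:
            if start < position:
                owner = name  # keep the last column that precedes the value
            else:
                break
        if owner is not None:  # values before the first column are skipped
            grouped[owner].append(value)
    return grouped


grouped = group_values_by_column(dict_columns, dict_values)
list_parsed_columns = list(grouped)
list_parsed_values = [" ".join(words) for words in grouped.values()]
print(list_parsed_columns)  # ['status', 'product subtypes', 'applicant name']
print(list_parsed_values)   # ['processing', 'hl year 30 arm', 'ryan']
```

Because it works on the lowercased dictionary keys, this variant yields the lowercase values shown in the question, whereas the regex approach keeps the original casing of the input.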


Answer 1: 2 votes, accepted, 2017-04-25 12:56:23
