
How to create meaningful column value pair lists from a string?

I am trying to categorize columns and values (column=value) meaningfully from an input string using Python dictionaries.

input_string = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"

I have created dictionaries of key-value pairs. In the first dictionary, the key is the column name and the value is the lowest index at which that key is found in input_string.

Here is the dictionary of column names:

 dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}

In the above dictionary, 'status' first occurs at index 4 in input_string.


Similarly, here is the dictionary of values:

dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
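
For reference, both dictionaries can be rebuilt from the input string with a case-insensitive `str.find`. This is only a sketch of how the indexes above might have been produced; the helper name `lowest_indexes` is hypothetical and not part of the original code.

```python
input_string = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"

def lowest_indexes(text, terms):
    """Map each term to the lowest index at which it occurs in text (case-insensitive)."""
    lowered = text.lower()
    found = {}
    for term in terms:
        # str.find returns -1 when the term is absent.
        # Note: this is a plain substring search, so it can also match inside longer words.
        idx = lowered.find(term.lower())
        if idx != -1:
            found[term] = idx
    return found

dict_columns = lowest_indexes(input_string, ["status", "product subtypes", "applicant name"])
dict_values = lowest_indexes(input_string, ["processing", "hl", "year", "30", "arm", "ryan"])
print(dict_columns)
print(dict_values)
```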

The question is: how do I get the expected output as:

list_parsed_values = ['processing', 'hl year 30 arm', 'ryan']

and the (optional) corresponding list of columns as:

list_parsed_columns = ['status', 'product subtypes', 'applicant name']

How can I clearly distinguish the values in a list?
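
One possible way to get these lists directly from the two index dictionaries is to assign each value to the last column that starts before it. The sketch below assumes every value belongs to the nearest preceding column, which holds for this input but may not hold in general; the function name `group_values_by_column` is hypothetical.

```python
from bisect import bisect_right

dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}

def group_values_by_column(columns, values):
    # Order the columns by their position in the input string.
    ordered_columns = sorted(columns, key=columns.get)
    column_starts = [columns[name] for name in ordered_columns]
    grouped = {name: [] for name in ordered_columns}
    # Walk the values in the order they appear in the input string.
    for value in sorted(values, key=values.get):
        # Position of the last column that starts before this value.
        pos = bisect_right(column_starts, values[value]) - 1
        if pos >= 0:  # skip values that appear before the first column
            grouped[ordered_columns[pos]].append(value)
    return grouped

grouped = group_values_by_column(dict_columns, dict_values)
list_parsed_columns = list(grouped)
list_parsed_values = [" ".join(words) for words in grouped.values()]
print(list_parsed_columns)
print(list_parsed_values)
```

A column with no values after it would yield an empty string in `list_parsed_values`, so it may need filtering depending on the use case.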

Check the following approach:

  • Build the regex to remove irrelevant words from the results based on the English nltk stopword list
  • Build the regex to split the text using the dict_columns keys
  • After splitting, zip the resulting list into a list of tuples
  • Remove the irrelevant words from the values and strip the whitespace

Here is the code I have come up with so far:

import nltk, re
s = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
# Build the regex to remove irrelevant words from the results
# (requires the NLTK stopword corpus: run nltk.download("stopwords") once)
rx_stopwords = r"\b(?:{})\b".format("|".join(re.escape(x) for x in nltk.corpus.stopwords.words("english")))
# Build the regex to split the text using the dict_columns keys
# (re.escape guards against regex metacharacters in a column name)
rx_split = r"\b({})\b".format("|".join(re.escape(x) for x in dict_columns))
chunks = re.split(rx_split, s)
# After splitting, zip the resulting list into a tuple list
it = iter(chunks[1:])
lst = list(zip(it, it))
# Remove the irrelevant words from the values and trim them (this can be further enhanced)
res = [(x, re.sub(rx_stopwords, "", y).strip()) for x, y in lst]
# =>
#   [('status', 'processing'), ('product subtypes', 'HL year 30 ARM'), ('applicant name', 'Ryan')]
# It can be converted to a dictionary (insertion order is kept on Python 3.7+)
dict(res)
# =>
#   {'status': 'processing', 'product subtypes': 'HL year 30 ARM', 'applicant name': 'Ryan'}
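
The result above keeps the original casing, while the expected output in the question is lowercase. A minimal sketch of turning the tuple list into the two requested lists, assuming lowercasing the values is acceptable (the `res` literal is copied from the output shown above so the snippet runs on its own):

```python
# Result of the regex approach, copied from the output above.
res = [('status', 'processing'), ('product subtypes', 'HL year 30 ARM'), ('applicant name', 'Ryan')]

# Split the (column, value) pairs into two parallel lists.
list_parsed_columns = [column for column, _ in res]
list_parsed_values = [value.lower() for _, value in res]

print(list_parsed_columns)
print(list_parsed_values)
```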
