Python extracting contents from list

I am putting together a text analysis script in Python using pyLDAvis, and I am trying to clean up one of the outputs into something easier to read. The function that returns the top 5 words for each of 4 topics gives a list that looks like:

    [(0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
     (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
     (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
     (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"')]

Ideally I want to turn this into a dataframe where the first row contains the first word of each topic together with its score, and the columns alternate between a word and its score, i.e.:

r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009, etc.

Is there a way to extract the contents of the list and separate the values, given the format it is in?

Here is a solution: use the regex "(.*?)" to extract the text between double quotes, enumerate over the extracted values to label them, and join the results with the delimiter ", ".

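import re

# `values` is the list of (topic_id, topic_string) tuples from the question.
for k, v in values:
    # re.findall returns every double-quoted word in the string, in order.
    print(
        ", ".join([f"r{k + 1}col{i + 1} is {j}"
                   for i, j in enumerate(re.findall(r'"(.*?)"', v))])
    )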

r1col1 is de, r1col2 is sas, r1col3 is la, r1col4 is et, r1col5 is see
r2col1 is sas, r2col2 is de, r2col3 is les, r2col4 is recovery, r2col5 is data
r3col1 is sas, r3col2 is data, r3col3 is de, r3col4 is recovery, r3col5 is raid
r4col1 is sas, r4col2 is expensive, r4col3 is disgustingly, r4col4 is cool., r4col5 is houses
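
The snippet above pulls out only the words. As a rough sketch of one way to capture the scores as well and arrange them the way the question describes, assuming every term has the form `score*"word"` as in the sample (the names `TERM`, `per_topic` and `rows` are illustrative, not from the original answer):

```python
import re

values = [
    (0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
    (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
    (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
    (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"'),
]

# One capture group for the score, one for the quoted word.
TERM = re.compile(r'(\d+\.?\d*)\*"(.*?)"')

# per_topic[t] is a list of (word, score) pairs for topic t, in ranked order.
per_topic = [
    [(word, float(score)) for score, word in TERM.findall(text)]
    for _, text in values
]

# Row i holds the i-th word and score of every topic: word, score, word, score, ...
rows = [
    [cell for pair in rank for cell in pair]
    for rank in zip(*per_topic)
]

print(rows[0])
# ['de', 0.008, 'sas', 0.009, 'sas', 0.007, 'sas', 0.019]
```

If pandas is available, passing `rows` to `pandas.DataFrame(rows)` should give a table with one row per rank and alternating word and score columns.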

Assuming the output is consistent with your example, this should be fairly straightforward. The list contains 2-tuples whose second element is a string, and Python offers plenty of operations on strings.

str.split("+") returns a list of the pieces of str separated at each '+' character.
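
As an illustration of that first step on one topic string from the question (the variable names `topic` and `terms` are made up for this sketch):

```python
# One topic string from the question.
topic = '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'

# Splitting on "+" leaves the surrounding spaces, so strip each piece.
terms = [part.strip() for part in topic.split("+")]

print(terms)
# ['0.008*"de"', '0.007*"sas"', '0.004*"la"', '0.003*"et"', '0.003*"see"']
```

This relies on no word containing a '+' itself, which holds for the sample data but is an assumption in general.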

To then extract the word and the score from each piece, you could use Python's built-in 're' module for matching regular expressions.

score = re.search(r'\d+\.?\d*', str)
word = re.search(r'".*?"', str)

You then call .group() on each match object to get the matched text:

score.group()
word.group()
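
A small self-contained sketch of the search-then-group step on a single term (the name `term` is illustrative, and the capture group in the word pattern is an addition so the quotes can be dropped):

```python
import re

term = '0.008*"de"'  # one piece produced by splitting on "+"

# The dot is escaped so it matches only a literal decimal point.
score = re.search(r'\d+\.?\d*', term)
# The parentheses capture the word without its surrounding quotes.
word = re.search(r'"(.*?)"', term)

print(score.group())   # 0.008
print(word.group())    # "de"  (whole match, quotes included)
print(word.group(1))   # de    (captured group only)
```

Note that `.group()` returns a string, so the score still needs `float(...)` if it is to be used as a number.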

You could also simply use split again, this time along '*', to separate the two parts. The returned list keeps the original order, so the score comes first and the quoted word second.

l = str.split('*')
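
For instance, a minimal sketch of the split-only approach on one term, with no regular expressions involved (variable names are illustrative):

```python
term = '0.008*"de"'

# maxsplit=1 guards against a "*" appearing inside the word itself.
score_text, quoted_word = term.split("*", 1)

score = float(score_text)        # convert the score text to a number
word = quoted_word.strip('"')    # drop the surrounding double quotes

print(word, score)
# de 0.008
```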
