
Python extracting contents from list

I am putting together a text analysis script in Python using pyLDAvis, and I am trying to turn one of its outputs into something cleaner and easier to read. The function that returns the top 5 words for each of 4 topics gives a list that looks like:

    [(0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
     (1,
      '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
     (2,
      '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
     (3,
      '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"')]

Ideally, I want to turn this into a dataframe where the first row contains the first word of each topic together with its score, and the columns alternate between a word and its score, i.e.:

r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009, and so on.

Is there a way to extract the contents of the list and separate the values given the format it is in?

Here is a solution: use the regex "(.*?)" to extract the text between double quotes, enumerate the extracted values to build the labels, and join them with the delimiter ", ".

    import re

    # values is the list of (topic_id, topic_string) tuples from the question
    for k, v in values:
        print(
            ", ".join([f"r{k + 1}col{i + 1} is {j}"
                       for i, j in enumerate(re.findall(r'"(.*?)"', v))])
        )

r1col1 is de, r1col2 is sas, r1col3 is la, r1col4 is et, r1col5 is see
r2col1 is sas, r2col2 is de, r2col3 is les, r2col4 is recovery, r2col5 is data
r3col1 is sas, r3col2 is data, r3col3 is de, r3col4 is recovery, r3col5 is raid
r4col1 is sas, r4col2 is expensive, r4col3 is disgustingly, r4col4 is cool., r4col5 is houses
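
The loop above extracts only the words. If the scores are needed as well, one possible extension is a pattern with two capture groups, one for the number and one for the quoted word. This is a minimal sketch, assuming every term has the form `score*"word"` as in the question; the names `TERM` and `parse_topic` are made up for illustration, and only the first two topics are copied in to keep it short.

```python
import re

# Sample data copied from the question (first two topics only).
values = [
    (0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
    (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
]

# Group 1 captures the score, group 2 the text between the double quotes.
TERM = re.compile(r'(\d+\.\d+)\*"(.*?)"')

def parse_topic(text):
    """Return [(word, score), ...] for one topic string."""
    return [(word, float(score)) for score, word in TERM.findall(text)]

# Map each topic id to its list of (word, score) pairs.
topics = {k: parse_topic(v) for k, v in values}
print(topics[0][:2])  # [('de', 0.008), ('sas', 0.007)]
```

With two groups in the pattern, `re.findall` returns a list of tuples, so each match already comes back as a score and word pair.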

Assuming the output is always formatted like your example, this should be fairly straightforward. The list contains 2-tuples whose second element is a string, and Python offers plenty of operations on strings.

Calling .split("+") on that string returns a list of its pieces, split at each '+' character.

To then extract the word and the score from each piece, you could use Python's re module for matching regular expressions.

    # part is one piece of the split string, e.g. '0.008*"de"'
    score = re.search(r'\d+\.?\d*', part)

    word = re.search(r'".*?"', part)

You then call .group() on each match object to get the matched text:

    score.group()

    word.group()
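
As a small worked example of the two searches, the sketch below runs them on a single piece taken from the question's data. The variable name `part` is an assumption for illustration. Two details are worth noting: `re.search` returns `None` when nothing matches, so the result is checked before `.group()` is called, and putting a capture group inside the quotes lets `.group(1)` return the word without its surrounding quotes.

```python
import re

# One piece of a topic string after splitting on '+', taken from the question's data.
part = ' 0.007*"sas" '

# The dot is escaped so it matches only a literal decimal point.
score = re.search(r'\d+\.?\d*', part)
# The parentheses capture the text between the quotes.
word = re.search(r'"(.*?)"', part)

# Guard against a piece that does not match before calling .group().
if score and word:
    print(float(score.group()), word.group(1))  # 0.007 sas
```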

You could also simply split again, this time on '*', to separate the two parts. The returned list keeps their order: the score first, then the quoted word.

    parts = part.split('*')  # '0.008*"de"' gives ['0.008', '"de"']
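
Putting the two splits together, the layout asked for in the question (one row per rank, with columns alternating word and score across topics) could be built roughly as follows. This is a sketch under the assumption that no word contains a '+' character; `split_topic`, `per_topic` and `rows` are illustrative names, and only the first two topics are copied in.

```python
# Sample data copied from the question (first two topics only).
values = [
    (0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
    (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
]

def split_topic(text):
    """Return [(word, score), ...] using only str.split and str.strip."""
    pairs = []
    for part in text.split('+'):
        # Split once on '*': the score is on the left, the quoted word on the right.
        score, word = part.strip().split('*', 1)
        pairs.append((word.strip('"'), float(score)))
    return pairs

per_topic = [split_topic(v) for _, v in values]

# zip(*per_topic) groups the i-th term of every topic; each group is then
# flattened into one row: word, score, word, score, ...
rows = [[cell for pair in rank for cell in pair] for rank in zip(*per_topic)]
print(rows[0])  # ['de', 0.008, 'sas', 0.009]
```

Since `rows` is a plain list of lists, it should be possible to pass it directly to `pandas.DataFrame(rows)` if a dataframe is the end goal.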

