I am working with a pandas dataframe that contains titles, sources, and links to various news articles pulled from the GoogleNews API. I then categorized the data by the keywords I used to find the articles. I am attempting to iterate through the 'keyword' column to print the data neatly, and then export the result to Word using python-docx.
To pull the GoogleNews data, I am using a for loop with various keywords set up in a list. It looks like:
for i in list:
    googlenews = GoogleNews()
    googlenews.get_news(i)
    googlenews.set_lang('en')
    googlenews.set_period('1d')
    result = googlenews.result()
    df_ivar = pd.DataFrame(result)
    df_ivar = df_ivar[df_ivar['date'].notna()]
    df_ivar = df_ivar[df_ivar['date'].str.contains('hours ago')]  # to only pull articles from within the last 24 hours
    df_ivar = df_ivar[['site', 'title', 'desc', 'link']]
    df_ivar['keyword'] = i
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df_ivar, df], ignore_index=True)
So far, I have found a way to print the data correctly, but I cannot find a way to only show each keyword once, and then print all the article titles, descriptions, and links below their appropriate keywords.
My data currently looks like this:
article 1    link 1    description 1    keyword 1
article 2    link 2    description 2    keyword 1
article 3    link 3    description 3    keyword 2
article 4    link 4    description 4    keyword 3
Upon export, I would like the python-docx document to display the data categorically, such as:
keyword 1
    article 1
    article 2
keyword 2
    article 3
keyword 3
    article 4
I have the python-docx script in working order, but every time I print the document, I am stuck with the keyword being presented ahead of every article name, when I would simply like the keyword displayed once, and any relevant articles posted below it. Currently, my for loop looks like:
for i in df.index:
    document.add_heading(df['keyword'][i], level=1)
    document.add_paragraph().add_run(df['title'][i]).underline = True
    document.add_paragraph(df['desc'][i], style='List Bullet')
    document.add_paragraph(df['link'][i], style='List Bullet')
    document.add_paragraph('Source: ' + df['site'][i], style='List Bullet')
Any help or guidance would be greatly appreciated! Thank you in advance!
You could use pandas groupby with the keyword column as the parameter. Iterating over the result yields pairs of the group name (the keyword in this particular case) and the dataframe holding the rows for that keyword. You can then pass the name to the add_heading function once per group and reuse the logic you already built, but iterating over the group's index (for i in g.index) instead of the whole dataframe's.
for name, g in df.groupby('keyword'):
    # One heading per keyword
    document.add_heading(name, level=1)
    # g.index holds the original row labels, so df[...][i] still finds the right row
    for i in g.index:
        document.add_paragraph().add_run(df['title'][i]).underline = True
        document.add_paragraph(df['desc'][i], style='List Bullet')
        document.add_paragraph(df['link'][i], style='List Bullet')
        document.add_paragraph('Source: ' + df['site'][i], style='List Bullet')
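As a variation on the loop above, here is a minimal sketch that wraps the same idea in a function. The helper name write_grouped and the sample rows are made up for illustration and are not part of pandas or python-docx; the function only assumes that the document object offers add_heading and add_paragraph the way a python-docx Document does. Note that groupby sorts the keys by default, so sort=False is passed here on the assumption that you want keywords in the order they first appear.

```python
import pandas as pd


def write_grouped(document, df):
    """Write one heading per keyword, followed by that keyword's articles.

    Hypothetical helper: `document` is assumed to behave like a
    python-docx Document (add_heading / add_paragraph).
    """
    # sort=False keeps keywords in order of first appearance;
    # the default (sort=True) would order them by key instead.
    for keyword, group in df.groupby('keyword', sort=False):
        document.add_heading(keyword, level=1)
        # itertuples yields one row at a time, with columns as attributes.
        for row in group.itertuples(index=False):
            document.add_paragraph().add_run(row.title).underline = True
            document.add_paragraph(row.desc, style='List Bullet')
            document.add_paragraph(row.link, style='List Bullet')
            document.add_paragraph('Source: ' + row.site, style='List Bullet')


# Made-up sample rows standing in for the GoogleNews results.
sample = pd.DataFrame({
    'site': ['site 1', 'site 2', 'site 3', 'site 4'],
    'title': ['article 1', 'article 2', 'article 3', 'article 4'],
    'desc': ['description 1', 'description 2', 'description 3', 'description 4'],
    'link': ['link 1', 'link 2', 'link 3', 'link 4'],
    'keyword': ['keyword 1', 'keyword 1', 'keyword 2', 'keyword 3'],
})
```

With python-docx installed, this could be used by creating document = Document(), calling write_grouped(document, df), and then saving with document.save and a file name of your choice.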