I'm very new to pandas and matplotlib.
I ran a questionnaire, and one question asked people which social networks they use. The options were Facebook, Instagram, Twitter, and others. Respondents could select more than one option.
I want to organize this data to plot a bar chart. I have used the following code:
listsocial = df["SocialNetworks"].str.split(', ', expand=True)
listsocial.head()
listsocial = 100*listsocial.stack().value_counts(normalize=True)
and then:
sns.set(font_scale=1.4)
ax = listsocial.plot(kind='bar', figsize=(15,7), color=('#009C3B'), grid=True)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(decimals=False))
plt.xticks(rotation=80)
plt.suptitle('Most used social networks', fontsize=20)
plt.xlabel('Social network', fontsize=14, labelpad=20)
plt.ylabel('Respondents\n(%)', fontsize=14, labelpad=20)
plt.show()
However, the result does not take into account the fact that people could select more than one option, so the total should not be 100%. I want the chart to display data like: 70% use Facebook, 60% use Instagram, etc.
Thanks in advance.
Splitting and stacking is not the way to go in this case.
I would create a separate column for each social network of interest and set it to True
if that network appears in the string (a sort of one-hot encoding):
social_networks = pd.DataFrame()
for sn in ['Facebook', 'Twitter', ...]:
    social_networks[sn] = df['SocialNetworks'].str.contains(sn)
Then you can get the percentage of respondents per network with
social_networks = 100 * social_networks.mean()
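As a minimal, self-contained sketch of this approach: the sample responses and the list of networks below are made up for illustration, and `regex=False` is an added assumption so that the names are matched literally.

```python
import pandas as pd

# Hypothetical sample data: one comma-separated string per respondent.
df = pd.DataFrame({
    "SocialNetworks": [
        "Facebook, Instagram",
        "Facebook",
        "Instagram, Twitter",
        "Facebook, Twitter, Instagram",
    ]
})

# Assumed list of networks of interest; adjust it to the questionnaire's options.
networks = ["Facebook", "Instagram", "Twitter"]

# One boolean column per network: True if the respondent selected it.
flags = pd.DataFrame({
    sn: df["SocialNetworks"].str.contains(sn, regex=False) for sn in networks
})

# The mean of a boolean column is the share of True values; scale to percent.
percentages = 100 * flags.mean()
print(percentages)
```

Because each percentage is computed per respondent, the values can add up to more than 100%. Note that `str.contains` matches substrings, so an option whose name is contained in another option's name would be counted for both.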
Instead of calling value_counts(normalize=True), which divides by the total number of selections,
you could divide the counts by the number of rows (i.e. the number of respondents):
from matplotlib import pyplot as plt
from matplotlib import ticker as mtick
import numpy as np
import pandas as pd
import seaborn as sns
networks = np.array(['facebook', 'twitter', 'instagram', 'other'])
socnetw = [", ".join(networks[np.random.randint(0, 2, 4, dtype=bool)]) for _ in range(100)]
df = pd.DataFrame({"SocialNetworks": socnetw})
listsocial = df["SocialNetworks"].str.split(', ', expand=True)
listsocial = 100 * listsocial.stack().value_counts() / len(listsocial)
listsocial = listsocial.drop('', errors='ignore')  # drop the empty-string entry from respondents who selected nothing (if any)
sns.set(font_scale=1.4)
ax = listsocial.plot(kind='bar', figsize=(15, 7), color=('#009C3B'), grid=True)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(decimals=False))
plt.xticks(rotation=80)
plt.suptitle('Most used social networks', fontsize=20)
plt.xlabel('Social network', fontsize=14, labelpad=20)
plt.ylabel('Respondents (%)', fontsize=14, labelpad=20)
plt.tight_layout()
plt.show()
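To see why the denominator matters, here is a small check on made-up data (four hypothetical respondents). `value_counts(normalize=True)` divides by the total number of selections, whereas dividing by the number of rows gives the share of respondents.

```python
import pandas as pd

# Hypothetical responses: 4 respondents, 7 selections in total.
df = pd.DataFrame({
    "SocialNetworks": [
        "facebook, instagram",
        "facebook",
        "instagram, twitter",
        "facebook, twitter",
    ]
})

split = df["SocialNetworks"].str.split(", ", expand=True)
selections = split.stack()  # one entry per selected option

# Share of all selections: always sums to 100.
share_of_selections = 100 * selections.value_counts(normalize=True)

# Share of respondents: divides by the number of rows, so it can exceed 100 in total.
share_of_respondents = 100 * selections.value_counts() / len(df)

print(share_of_selections)
print(share_of_respondents)
```

In this sample, facebook was selected by 3 of 4 respondents, so the second series reports 75 for it, while the first reports 3 of 7 selections.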