How do I extract data from a text field in a pandas DataFrame?

Question

I want to get the distribution of tags from this DataFrame:

import pandas as pd

df = pd.DataFrame([
    [43,{"tags":["webcom","start","temp","webcomfoto","dance"],"image":["https://image.com/Kqk.jpg"]}],
    [83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
    [76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
    [77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
    [81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])

I need a table showing how many ids have a given number of tags. For example:

 Number of posts | Number of tags 
      31                9
      44                8
      ...
      129               1

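When 'tags' was the only field, I used this approach. In this DataFrame I also have 'image', 'users', and other text fields with values. How should I handle the data in this case?

Thanks.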

Answer 1

Sticking with collections.Counter, here is one way:

from collections import Counter
from operator import itemgetter

c = Counter(map(len, map(itemgetter('tags'), df['tags'])))

res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['Tags', 'Posts']

print(res)

   Tags  Posts
0     5      2
1     3      1
2     2      1
3     1      1
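
If some rows might lack the 'tags' key, a small variation on the Counter approach could look like the sketch below. The four-row frame is a made-up stand-in for the question's data, and sorting by tag count is only an assumption about the desired layout.

```python
from collections import Counter

import pandas as pd

# Small stand-in for the question's data; the second row has no 'tags' key.
df = pd.DataFrame(
    [
        [43, {"tags": ["webcom", "start"], "image": ["https://image.com/Kqk.jpg"]}],
        [76, {"users": ["otole"]}],
        [81, {"tags": ["webcomfotografiya"], "users": ["myself", "boattva"]}],
        [82, {"tags": ["en", "webcom"]}],
    ],
    columns=["_id", "tags"],
)

# dict.get returns an empty list when 'tags' is absent, so len() yields 0.
c = Counter(len(d.get("tags", [])) for d in df["tags"])

# Sort by tag count, largest first, to mirror the table sketched in the question.
res = pd.DataFrame(sorted(c.items(), reverse=True), columns=["Tags", "Posts"])
print(res)
```

With itemgetter('tags'), a row without that key would raise a KeyError, so dict.get is the more forgiving choice here.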

Answer 2

You can use the str accessor to look up the dictionary key, then len together with value_counts:

df.tags.str['tags'].str.len().value_counts()\
  .rename('Posts')\
  .rename_axis('Tags')\
  .reset_index()

Output:

   Tags  Posts
0     5      2
1     3      1
2     2      1
3     1      1
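
To see what each link in that chain does, here is the same idea split into steps. This is only an illustrative sketch on a made-up three-row frame; the intermediate variable names are not part of the answer above.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
df = pd.DataFrame(
    [
        [43, {"tags": ["webcom", "start", "temp"], "image": ["https://image.com/Kqk.jpg"]}],
        [83, {"tags": ["yourself", "start", ""]}],
        [81, {"tags": ["webcomfotografiya"], "users": ["myself", "boattva"]}],
    ],
    columns=["_id", "tags"],
)

# Step 1: .str['tags'] looks up the 'tags' key in each dict, giving a Series of lists.
tag_lists = df["tags"].str["tags"]

# Step 2: .str.len() measures each list, giving the number of tags per post.
tag_counts = tag_lists.str.len()

# Step 3: value_counts() tallies how many posts share each tag count.
distribution = tag_counts.value_counts()
print(distribution)
```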

Answer 3

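Update: a combination of f-strings, a dict comprehension, and a list comprehension concisely extracts the lengths of all lists in the tags column: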

extract_dict = [{f'count {y}':len(z) for y,z in x.items()} for x in df.tags]

# construct new df with only extracted counts
pd.DataFrame.from_records(extract_dict)

# new df with extracted counts & original data
df.assign(**pd.DataFrame.from_records(extract_dict))

# outputs:

   _id                                               tags  count image  \
0   43  {'tags': ['webcom', 'start', 'temp', 'webcomfo...          1.0
1   83  {'tags': ['yourself', 'start', ''], 'image': [...          1.0
2   76  {'tags': ['en', 'webcom'], 'links': ['http://w...          NaN
3   77  {'tags': ['webcomznakomstvo', 'webcomzhiznx', ...          2.0
4   81  {'tags': ['webcomfotografiya'], 'users': ['mys...          NaN

   count links  count tags  count users
0          NaN           5          NaN
1          NaN           3          NaN
2          2.0           2          1.0
3          NaN           5          NaN
4          1.0           1          2.0
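
The NaN values above mark keys that a row does not have. If a missing key should count as zero entries (an assumption about the intended meaning), one possible follow-up is to fill and cast, sketched here on a made-up two-row frame:

```python
import pandas as pd

# Made-up sample: the two rows have different sets of keys.
df = pd.DataFrame(
    [
        [43, {"tags": ["webcom", "start"], "image": ["https://image.com/Kqk.jpg"]}],
        [76, {"tags": ["en"], "links": ["http://webcom.webcomdb.com"], "users": ["otole"]}],
    ],
    columns=["_id", "tags"],
)

extract_dict = [{f"count {k}": len(v) for k, v in d.items()} for d in df["tags"]]

# Keys missing from a row become NaN; fillna(0) treats them as "no entries",
# and astype(int) restores integer counts.
counts = pd.DataFrame.from_records(extract_dict).fillna(0).astype(int)
print(counts)
```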

Original answer:

If you know the column names in advance, you can do this with a list comprehension:

extract = [
    # the image lists are stored under the 'image' key in the sample data
    (len(x.get('tags', [])), len(x.get('image', [])), len(x.get('users', [])))
    for x in df.tags
]
# extract outputs:
[(5, 1, 0), (3, 1, 0), (2, 0, 1), (5, 2, 0), (1, 0, 2)]

This can then be used to create a new DataFrame or to assign additional columns:

# creates new df
pd.DataFrame.from_records(
    extract,
    columns=['count tags', 'count images', 'count users']
)

# creates new dataframe with extracted data and original df
df.assign(
    **pd.DataFrame.from_records(
        extract,
        columns=['count tags', 'count images', 'count users'])
)

The last statement produces the following output:

   _id                                               tags  count tags  \
0   43  {'tags': ['webcom', 'start', 'temp', 'webcomfo...           5
1   83  {'tags': ['yourself', 'start', ''], 'image': [...           3
2   76  {'tags': ['en', 'webcom'], 'links': ['http://w...           2
3   77  {'tags': ['webcomznakomstvo', 'webcomzhiznx', ...           5
4   81  {'tags': ['webcomfotografiya'], 'users': ['mys...           1

   count images  count users
0             1            0
1             1            0
2             0            1
3             2            0
4             0            2
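
To get from per-post counts to the posts-per-tag-count table the question asks for, something along these lines might work. The `counts` frame below is a hand-written stand-in shaped like the extracted columns, and the output column names are borrowed from the question.

```python
import pandas as pd

# Hand-written per-post counts, shaped like the frame built from `extract`.
counts = pd.DataFrame(
    {"count tags": [5, 3, 2, 5, 1], "count users": [0, 0, 1, 0, 2]}
)

# Tally how many posts share each tag count, largest tag count first.
dist = (
    counts["count tags"]
    .value_counts()
    .sort_index(ascending=False)
    .rename_axis("Number of tags")
    .reset_index(name="Number of posts")
)
print(dist)
```

The same chain can be pointed at any other count column to get that field's distribution.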

Answer 4

The data in the tags column are strings, not dictionaries.

So a first step is needed:

import ast

df['tags'] = df['tags'].apply(ast.literal_eval)
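
ast.literal_eval expects a string, so it fails on values that are already dicts. If the column may hold a mix of both, a guarded version could look like this sketch; the helper name `to_dict` and the sample Series are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical column mixing a string-encoded dict with an already-parsed dict.
s = pd.Series(
    [
        "{'tags': ['en', 'webcom'], 'users': ['otole']}",
        {"tags": ["webcomfotografiya"]},
    ]
)


def to_dict(value):
    """Parse a string into a Python literal; pass other values through unchanged."""
    return ast.literal_eval(value) if isinstance(value, str) else value


parsed = s.apply(to_dict)
print(parsed.apply(type))
```

Unlike eval, ast.literal_eval only accepts Python literals, which makes it the safer choice for text read from a file or database.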

Then use your original approach, which works well even with multiple fields.

Verification:

df=pd.DataFrame([
    [43,{"tags":[],"image":["https://image.com/Kqk.jpg"]}],
    [83,{"tags":["yourself","start",""],"image":["https://images.com/test.jpg"]}],
    [76,{"tags":["en","webcom"],"links":["http://webcom.webcomdb.com","http://webcom.webcomstats.com"],"users":["otole"]}],
    [77,{"tags":["webcomznakomstvo","webcomzhiznx","webcomistoriya","webcomosebe","webcomfotografiya"],"image":["https://images.com/nt4wzguoh/y_a3d735b4.jpg","https://images.com/sucb0u24x/b1sd_Naju.jpg"]}],
    [81,{"tags":["webcomfotografiya"],"users":["myself","boattva"],"links":["https://webcom.com/nk"]}],
],columns=["_id","tags"])
#print (df)

# convert the column to strings to verify the solution
df['tags'] = df['tags'].astype(str)

print (df['tags'].apply(type))
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
3    <class 'str'>
4    <class 'str'>
Name: tags, dtype: object

#convert back
df['tags'] = df['tags'].apply(ast.literal_eval)

print (df['tags'].apply(type))
0    <class 'dict'>
1    <class 'dict'>
2    <class 'dict'>
3    <class 'dict'>
4    <class 'dict'>
Name: tags, dtype: object

from collections import Counter

c = Counter([len(x['tags']) for x in df['tags']])

df = pd.DataFrame({'Number of posts':list(c.values()), ' Number of tags ': list(c.keys())})
print (df)
   Number of posts   Number of tags 
0                1                 0
1                1                 3
2                1                 2
3                1                 5
4                1                 1

4 answers

Answer 1
1 2018-06-06 16:05:50

Answer 2
1 2018-06-06 16:14:14

Answer 3
0 2018-06-06 16:23:26

Answer 4
0 accepted 2018-06-06 16:43:19
