
How to filter text data containing keywords from an unnamed Excel column with Python pandas and print it to a txt file

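I'm very new to this, so please bear with me.

I have an Excel sheet containing certain text strings that I want to extract and copy into a text file. I have been doing this manually for a long time, and I'm tired of it.

So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.

This is how far I've gotten: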

#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk     # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename

Tk().withdraw() 
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)

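The data I want is in the first column (column A), but not always in the same row. I am looking for 3 separate keywords:

  1. "POP"
  2. "TVS"
  3. "PLANET"

The strings look like this: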

Channel2021_1_DRU_POP_15s_16062021 Channel2021_2_FANT_POP_15s_16062021 Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021 Channel2021_2_FANT_TVS_15s_16062021 Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021 Channel2021_2_FANT_PLANET_15s_16062021 Channel2021_3_ITA_PLANET_15s_16062021

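This is the form in which I want the extracted data written to the txt file.

So essentially I want to search column A for the strings containing POP and print them, then the strings containing TVS and print them, and finally the strings containing PLANET and print them.

Any help would be greatly appreciated!

Thanks!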

杜尚

PS: This is the output of df:

                                         Unnamed: 0  ...                                        Unnamed: 16
0                                               NaN  ...                                                NaN
1                                               NaN  ...                                                NaN
2                                       Spot 1 15 s  ...                                                NaN
3                                               NaN  ...                                        Indicazioni
4                                         106290.01  ...                        dire tutto + grafica ITALIA
5                                         138575.01  ...                                                NaN
6                                         142956.01  ...                                                NaN
7                                          85146.01  ...                                                NaN
8      Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021   ...                                                NaN
9       Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021  ...                                                NaN
10   Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021  ...                                                NaN
11                                              NaN  ...                                                NaN
12                                              NaN  ...                                                NaN
13                                      Spot 2 15 s  ...                                                NaN
14                                              NaN  ...                                        Indicazioni
15                                        164171.01  ...                       dire tutto +  grafica ITALIA
16                                       9003309.01  ...                                                NaN
17                                         88310.01  ...                                                NaN
18      Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021  ...                                                NaN
19      Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021  ...                                                NaN
20   Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021  ...                                                NaN
21                                              NaN  ...                                                NaN
22                                              NaN  ...                                                NaN
23                                      Spot 3 15 s  ...                                                NaN
24                                              NaN  ...                                         Istruzione
25                                        800214.01  ...  dire tutto + dire al kg dopo il prezzo per la ...
26                                       9001392.01  ...                                                NaN
27                                       9002306.01  ...                                                NaN
28                                        147804.01  ...                                                NaN
29     Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021  ...                                                NaN
30     Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021  ...                                                NaN
31  Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021  ...                                                NaN

[32 rows x 17 columns]

If you're still looking for a solution, here's a suggestion:

With the sample frame

df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
                                         0    1
0       Channel2021_1_DRU_POP_15s_16062021  ...
1      Channel2021_2_FANT_POP_15s_16062021  ...
2       Channel2021_3_ITA_POP_15s_16062021  ...
3                                        1  ...
4                                        2  ...
5       Channel2021_1_DRU_TVS_15s_16062021  ...
6      Channel2021_2_FANT_TVS_15s_16062021  ...
7       Channel2021_3_ITA_TVS_15s_16062021  ...
8                                        3  ...
9                                        4  ...
10   Channel2021_1_DRU_PLANET_15s_16062021  ...
11  Channel2021_2_FANT_PLANET_15s_16062021  ...
12   Channel2021_3_ITA_PLANET_15s_16062021  ...
13                                       5  ...

this

selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)

prints the entries you want

0         Channel2021_1_DRU_POP_15s_16062021
1        Channel2021_2_FANT_POP_15s_16062021
2         Channel2021_3_ITA_POP_15s_16062021
5         Channel2021_1_DRU_TVS_15s_16062021
6        Channel2021_2_FANT_TVS_15s_16062021
7         Channel2021_3_ITA_TVS_15s_16062021
10     Channel2021_1_DRU_PLANET_15s_16062021
11    Channel2021_2_FANT_PLANET_15s_16062021
12     Channel2021_3_ITA_PLANET_15s_16062021

and writes them to the file items.txt

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

Since I'm not sure about the column names, I used only index-based selection (.iloc).
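As a minimal sketch of what the positional selection and the na=False argument do, the snippet below uses a small hypothetical stand-in for the first column of the sheet (the values are borrowed from the question's df output, the frame itself is made up for illustration). It assumes the column is called "Unnamed: 0", as in the printed df.

```python
import pandas as pd

# Hypothetical stand-in for the first spreadsheet column:
# text, a number and an empty cell mixed together.
df = pd.DataFrame({
    "Unnamed: 0": [
        "Spot 1 15 s",
        106290.01,
        None,
        "Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021",
    ]
})

first_col = df.iloc[:, 0]    # positional: works whatever the column is called
same_col = df["Unnamed: 0"]  # label-based: requires knowing the name

# Without na=..., cells that are not strings yield a missing value
# instead of True/False, so the result cannot be used as a boolean mask.
raw_mask = first_col.str.contains("POP")

# With na=False those cells count as "no match".
mask = first_col.str.contains("POP", na=False)

matches = first_col[mask].tolist()
print(matches)
```

Both selections return the same column here; .iloc simply avoids depending on the auto-generated "Unnamed: 0" label.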

If you want the results in the order you listed, then this

df = pd.concat([
         df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
         for tag in ('POP', 'TVS', 'PLANET')
     ])

should work (afterwards, just print df or write it to a file).
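For illustration, here is a self-contained sketch of the ordered variant followed by the write step. The input Series is a made-up, deliberately shuffled sample, and the output goes to a temporary directory rather than the working directory; both are assumptions for the sake of the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical first column, deliberately out of order.
col = pd.Series([
    "Channel2021_1_DRU_PLANET_15s_16062021",
    2.0,
    "Channel2021_1_DRU_TVS_15s_16062021",
    "Channel2021_1_DRU_POP_15s_16062021",
])

# One filtered slice per tag, concatenated in the requested tag order.
ordered = pd.concat([
    col[col.str.contains(tag, na=False)]
    for tag in ("POP", "TVS", "PLANET")
])

# Write one entry per line, without index or header.
out_path = os.path.join(tempfile.mkdtemp(), "items.txt")
ordered.to_csv(out_path, index=False, header=False)

with open(out_path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

Note that a string containing more than one of the tags would be selected once per matching tag and so appear more than once in the result.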

BTW: This is more complicated than it needs to be

data = pd.read_excel (filename)
df = pd.DataFrame(data)

You only need pd.read_excel

df = pd.read_excel(filename)
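The reason is that pd.read_excel already returns a DataFrame when a single sheet is read (the default), so wrapping the result in pd.DataFrame adds nothing. A small sketch, using a hand-built frame as a stand-in for the read_excel result so that no Excel file is needed:

```python
import pandas as pd

# Stand-in for `data = pd.read_excel(filename)`; the content is made up.
data = pd.DataFrame({"Unnamed: 0": ["a", "b"]})

# The extra wrapping step from the question.
df = pd.DataFrame(data)

already_frame = isinstance(data, pd.DataFrame)
same_content = df.equals(data)
print(already_frame, same_content)
```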

EDIT: Regarding the comment:

with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
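The same report can also be built as a string first, which makes it easy to inspect before writing. The sketch below is one possible variation, not part of the original answer: the helper name build_report is hypothetical, and it swaps the plain substring test for a stricter pattern that accepts a tag only as a whole underscore-delimited token, on the assumption that names are always separated by underscores as in the examples.

```python
import re

import pandas as pd


def build_report(column, tags, header="The following has been sent:"):
    """Return the grouped report text for the given tags (hypothetical helper)."""
    parts = [header]
    for tag in tags:
        # Accept the tag only between underscores or at the string ends,
        # so "POP" does not match inside a longer token such as "POPCORN".
        pattern = rf"(?:^|_){re.escape(tag)}(?:_|$)"
        items = column[column.str.contains(pattern, na=False)].tolist()
        parts.append(f"\n{tag}:\n" + "\n".join(items))
    return "".join(parts)


# Made-up sample column for illustration.
col = pd.Series(["X_1_POP_15s", "X_1_POPCORN_15s", 3.0, "X_1_TVS_15s"])
report = build_report(col, ("POP", "TVS", "PLANET"))
print(report)
```

Writing the result is then a single call such as open('items.txt', 'wt').write(report) inside a with block, as in the snippet above.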

