How to use Python pandas to filter text data containing keywords from an unnamed Excel column and print it to a txt file

Question

I'm pretty new to this, so please bear with me.

I have an Excel sheet that contains certain text strings I would like to extract and copy to a text file. I have been doing this manually for a long time and I'm sick of it.

So my plan was to write a script that would extract this data from the Excel sheet and create a txt file.

This is how far I have gotten:

#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk     # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename

Tk().withdraw() 
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)

The data I want is located in column A, but is not always in the same row. There are 3 separate keywords I want to look for:

"POP"
"TVS"
"PLANET"

The strings look something like this:

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

This is the form of the extracted data I would like to write to a txt file.

So in essence I want to search column A for strings containing POP and print them, then strings containing TVS and print them, and lastly strings containing PLANET and print them.

Any help would be greatly appreciated!

Thank you!

Dusan

PS: Here is the output of df:

                                         Unnamed: 0  ...                                        Unnamed: 16
0                                               NaN  ...                                                NaN
1                                               NaN  ...                                                NaN
2                                       Spot 1 15 s  ...                                                NaN
3                                               NaN  ...                                        Indicazioni
4                                         106290.01  ...                        dire tutto + grafica ITALIA
5                                         138575.01  ...                                                NaN
6                                         142956.01  ...                                                NaN
7                                          85146.01  ...                                                NaN
8      Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021   ...                                                NaN
9       Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021  ...                                                NaN
10   Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021  ...                                                NaN
11                                              NaN  ...                                                NaN
12                                              NaN  ...                                                NaN
13                                      Spot 2 15 s  ...                                                NaN
14                                              NaN  ...                                        Indicazioni
15                                        164171.01  ...                       dire tutto +  grafica ITALIA
16                                       9003309.01  ...                                                NaN
17                                         88310.01  ...                                                NaN
18      Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021  ...                                                NaN
19      Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021  ...                                                NaN
20   Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021  ...                                                NaN
21                                              NaN  ...                                                NaN
22                                              NaN  ...                                                NaN
23                                      Spot 3 15 s  ...                                                NaN
24                                              NaN  ...                                         Istruzione
25                                        800214.01  ...  dire tutto + dire al kg dopo il prezzo per la ...
26                                       9001392.01  ...                                                NaN
27                                       9002306.01  ...                                                NaN
28                                        147804.01  ...                                                NaN
29     Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021  ...                                                NaN
30     Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021  ...                                                NaN
31  Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021  ...                                                NaN

[32 rows x 17 columns]

Answer 1

Here's a proposal if you're still looking for a solution:

With the sample frame

df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})

                                         0    1
0       Channel2021_1_DRU_POP_15s_16062021  ...
1      Channel2021_2_FANT_POP_15s_16062021  ...
2       Channel2021_3_ITA_POP_15s_16062021  ...
3                                        1  ...
4                                        2  ...
5       Channel2021_1_DRU_TVS_15s_16062021  ...
6      Channel2021_2_FANT_TVS_15s_16062021  ...
7       Channel2021_3_ITA_TVS_15s_16062021  ...
8                                        3  ...
9                                        4  ...
10   Channel2021_1_DRU_PLANET_15s_16062021  ...
11  Channel2021_2_FANT_PLANET_15s_16062021  ...
12   Channel2021_3_ITA_PLANET_15s_16062021  ...
13                                       5  ...

this

selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)

prints the desired entries

0         Channel2021_1_DRU_POP_15s_16062021
1        Channel2021_2_FANT_POP_15s_16062021
2         Channel2021_3_ITA_POP_15s_16062021
5         Channel2021_1_DRU_TVS_15s_16062021
6        Channel2021_2_FANT_TVS_15s_16062021
7         Channel2021_3_ITA_TVS_15s_16062021
10     Channel2021_1_DRU_PLANET_15s_16062021
11    Channel2021_2_FANT_PLANET_15s_16062021
12     Channel2021_3_ITA_PLANET_15s_16062021

and writes them to a file items.txt

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
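A note on na=False in the selection above: the first column mixes strings with numbers and empty cells, and .str.contains has no True/False answer for a non-string cell. Passing na=False makes those cells count as "no match", so the mask is purely boolean. A small sketch, where the sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical column mixing strings, a number, and an empty cell
col = pd.Series([
    'Channel2021_1_DRU_POP_15s_16062021',
    106290.01,
    None,
    'Spot 1 15 s',
])

# na=False turns the "no answer" result for non-string cells into False,
# so the mask contains only True/False and is safe to index with
mask = col.str.contains(r'POP|TVS|PLANET', na=False)
selected = col[mask]

print(mask.to_list())
print(selected.to_list())
```

Only the first entry survives the filter; the number, the empty cell, and the string without a keyword are all dropped.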

Since I'm unsure about the column names, I have only used index-based selection (.iloc).
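Judging by the df output in the question, pandas has labelled the first column "Unnamed: 0". If that holds for your file, selecting by that label should give the same column as selecting by position. A minimal sketch on a made-up frame (the data and the second column are assumptions, only the label is taken from the question):

```python
import pandas as pd

# Hypothetical stand-in for the sheet, using the "Unnamed: N" labels
# that appear in the question's df output
df = pd.DataFrame({
    'Unnamed: 0': [
        'Spot 1 15 s',
        106290.01,
        'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
        None,
    ],
    'Unnamed: 1': ['...', '...', '...', '...'],
})

by_position = df.iloc[:, 0]      # first column, whatever it is called
by_label = df['Unnamed: 0']      # same column, selected by its label

same = by_position.equals(by_label)
print(same)
```

Positional selection keeps working if the label changes, which is why it is the safer choice here.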

If you want the results in the order you've given, then this

df = pd.concat([
         df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
         for tag in ('POP', 'TVS', 'PLANET')
     ])

should work (just print df afterwards or write it to a file).
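For completeness, here is a sketch of that ordered variant together with the file output. It stores the result under a separate name (ordered, my choice) instead of overwriting df, and writes to a temporary placeholder path; the sample data is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of the first column: strings mixed with a number
df = pd.DataFrame({
    0: ['A_TVS_15s', 1.0, 'B_POP_15s', 'C_PLANET_15s', 'D_POP_15s'],
    1: '...',
})

first_col = df.iloc[:, 0]

# One filtered slice per keyword, concatenated in keyword order
ordered = pd.concat([
    first_col[first_col.str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])

# Write one string per line; the path is a placeholder in a temp directory
out_path = Path(tempfile.mkdtemp()) / 'items_ordered.txt'
ordered.to_csv(out_path, index=False, header=False)

print(ordered.to_list())
```

Within each keyword the entries keep their original row order; only the keyword groups are rearranged.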

By the way, this is too complicated:

data = pd.read_excel (filename)
df = pd.DataFrame(data)

You only need pd.read_excel:

df = pd.read_excel(filename)

EDIT: Regarding the comments:

with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
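If you want to check the text before it goes to disk, the same logic can be split into a helper that only builds the report string. This is just a sketch: build_report is a name I made up, and the sample column is invented.

```python
import pandas as pd


def build_report(column, tags=('POP', 'TVS', 'PLANET')):
    """Return the report text: a header line, then each tag with its matches."""
    lines = ['The following has been sent:']
    for tag in tags:
        # na=False makes numeric and empty cells count as "no match"
        matches = column[column.str.contains(tag, na=False)]
        lines.append(f'{tag}:')
        lines.extend(matches.to_list())
    return '\n'.join(lines)


# Hypothetical first column with strings, a number, and an empty cell
column = pd.Series(['X_POP_15s', 2.0, 'X_TVS_15s', None, 'X_PLANET_15s'])
report = build_report(column)
print(report)
```

Writing the file is then a single file.write(report) inside the with block, and the helper can be tried out on a small series without touching any files.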
