How to use Python pandas to filter text data containing keywords from an unnamed Excel column and write it to a txt file

Question

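I'm very new to this, so please bear with me.

I have an Excel sheet containing certain text strings that I want to extract and copy into a text file. I have been doing this manually for a long time, and I'm tired of it.

So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.

This is how far I've gotten: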

#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk     # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename

Tk().withdraw() 
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)

The data I want is in column A1, but not always in the same row. I'm looking for 3 separate keywords:

"POP"
"TVS"
"PLANET"

The strings look like this:

Channel2021_1_DRU_POP_15s_16062021 Channel2021_2_FANT_POP_15s_16062021 Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021 Channel2021_2_FANT_TVS_15s_16062021 Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021 Channel2021_2_FANT_PLANET_15s_16062021 Channel2021_3_ITA_PLANET_15s_16062021

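This is the form in which I'd like the extracted data written to the txt file.

So essentially I want to search column A1 for the strings containing POP and print them, then the strings containing TVS and print them, and finally the strings containing PLANET and print them.

Any help would be greatly appreciated!

Thanks!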

Dusan

PS: this is the output of df:

                                         Unnamed: 0  ...                                        Unnamed: 16
0                                               NaN  ...                                                NaN
1                                               NaN  ...                                                NaN
2                                       Spot 1 15 s  ...                                                NaN
3                                               NaN  ...                                        Indicazioni
4                                         106290.01  ...                        dire tutto + grafica ITALIA
5                                         138575.01  ...                                                NaN
6                                         142956.01  ...                                                NaN
7                                          85146.01  ...                                                NaN
8      Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021   ...                                                NaN
9       Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021  ...                                                NaN
10   Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021  ...                                                NaN
11                                              NaN  ...                                                NaN
12                                              NaN  ...                                                NaN
13                                      Spot 2 15 s  ...                                                NaN
14                                              NaN  ...                                        Indicazioni
15                                        164171.01  ...                       dire tutto +  grafica ITALIA
16                                       9003309.01  ...                                                NaN
17                                         88310.01  ...                                                NaN
18      Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021  ...                                                NaN
19      Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021  ...                                                NaN
20   Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021  ...                                                NaN
21                                              NaN  ...                                                NaN
22                                              NaN  ...                                                NaN
23                                      Spot 3 15 s  ...                                                NaN
24                                              NaN  ...                                         Istruzione
25                                        800214.01  ...  dire tutto + dire al kg dopo il prezzo per la ...
26                                       9001392.01  ...                                                NaN
27                                       9002306.01  ...                                                NaN
28                                        147804.01  ...                                                NaN
29     Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021  ...                                                NaN
30     Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021  ...                                                NaN
31  Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021  ...                                                NaN

[32 rows x 17 columns]

Answer 1

If you're still looking for a solution, here's a suggestion:

With the sample frame

df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})

                                         0    1
0       Channel2021_1_DRU_POP_15s_16062021  ...
1      Channel2021_2_FANT_POP_15s_16062021  ...
2       Channel2021_3_ITA_POP_15s_16062021  ...
3                                        1  ...
4                                        2  ...
5       Channel2021_1_DRU_TVS_15s_16062021  ...
6      Channel2021_2_FANT_TVS_15s_16062021  ...
7       Channel2021_3_ITA_TVS_15s_16062021  ...
8                                        3  ...
9                                        4  ...
10   Channel2021_1_DRU_PLANET_15s_16062021  ...
11  Channel2021_2_FANT_PLANET_15s_16062021  ...
12   Channel2021_3_ITA_PLANET_15s_16062021  ...
13                                       5  ...

this

selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)

prints the entries you want

0         Channel2021_1_DRU_POP_15s_16062021
1        Channel2021_2_FANT_POP_15s_16062021
2         Channel2021_3_ITA_POP_15s_16062021
5         Channel2021_1_DRU_TVS_15s_16062021
6        Channel2021_2_FANT_TVS_15s_16062021
7         Channel2021_3_ITA_TVS_15s_16062021
10     Channel2021_1_DRU_PLANET_15s_16062021
11    Channel2021_2_FANT_PLANET_15s_16062021
12     Channel2021_3_ITA_PLANET_15s_16062021

and writes them to the file items.txt

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
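The `na=False` argument matters here because the column mixes text with numbers and empty cells. A small sketch of the difference, using a made-up column (the values and variable names are only for illustration):

```python
import pandas as pd

# Mixed column similar to the sheet: text, a number and an empty cell.
col = pd.Series([
    "Channel2021_1_DRU_POP_15s_16062021",
    106290.01,
    None,
    "Spot 1 15 s",
])

# Without na=False, non-string cells produce missing values in the mask,
# which cannot be used directly for boolean indexing.
raw_mask = col.str.contains("POP|TVS|PLANET")

# With na=False, those cells simply count as "no match".
safe_mask = col.str.contains("POP|TVS|PLANET", na=False)

matches = col[safe_mask].tolist()
print(matches)
```

Note that the pattern is interpreted as a regular expression, so `|` means "or".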

Since I'm not sure about the column names, I just used index-based selection (.iloc).

If you want the results in the order you've given, then this
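For comparison, here is a minimal sketch of positional versus label-based selection. It assumes the first column is labelled `Unnamed: 0`, as in the `df` output shown in the question; the sample values are stand-ins.

```python
import pandas as pd

# Stand-in frame; the labels mimic what pandas generates for header-less columns.
df = pd.DataFrame({
    "Unnamed: 0": ["Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021", 106290.01, None],
    "Unnamed: 1": ["x", "y", "z"],
})

first_by_position = df.iloc[:, 0]     # works whatever the label is
first_by_label = df["Unnamed: 0"]     # works only if the label is exactly this

same = first_by_position.equals(first_by_label)
print(same)
```

Positional selection keeps working if the generated label ever changes, which is why it is the safer choice when the header is unknown.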

df = pd.concat([
         df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
         for tag in ('POP', 'TVS', 'PLANET')
     ])

should work (afterwards just print df or write it to a file).

By the way: this is too complicated

data = pd.read_excel (filename)
df = pd.DataFrame(data)

You only need pd.read_excel:

df = pd.read_excel(filename)

EDIT: Regarding the comment:

with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
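Putting the pieces together, the whole workflow could be sketched roughly as below. The helper name `export_clip_ids`, its parameters and the sample data are assumptions made for illustration; in the real script `df` would come from `pd.read_excel(filename)`. Stripping whitespace is also an assumption, in case some cells carry trailing spaces.

```python
import pandas as pd


def export_clip_ids(df, path, tags=("POP", "TVS", "PLANET")):
    """Write first-column entries containing each tag to `path`, in tag order.

    Hypothetical helper: name and signature are illustrative only.
    """
    first_col = df.iloc[:, 0]  # positional, so the column label does not matter
    lines = []
    for tag in tags:
        # regex=False: plain substring match; na=False: skip numbers and empty cells
        mask = first_col.str.contains(tag, regex=False, na=False)
        lines.extend(first_col[mask].str.strip().tolist())
    with open(path, "w", encoding="utf-8") as file:
        file.write("\n".join(lines) + "\n")
    return lines


# Stand-in data; in practice: df = pd.read_excel(filename)
sample = pd.DataFrame({
    "Unnamed: 0": [
        "Spot 1 15 s",
        106290.01,
        "Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021 ",
        "Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021",
        "Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021",
        None,
        "Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021",
    ],
})
written = export_clip_ids(sample, "items.txt")
print(written)
```

Looping over the tags, rather than using one combined pattern, is what keeps all POP entries before the TVS and PLANET ones.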
