How to filter text data containing keywords from an unnamed Excel column with Python pandas and print to a txt file
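I'm new to this, so please bear with me.
I have an Excel sheet containing certain text strings that I want to extract and copy into a text file. I have been doing this by hand for a long time and I'm tired of it.
So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.
This is how far I've got: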
#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename
Tk().withdraw()
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)
The data I want is in column A, but not always in the same row. I'm looking for 3 separate keywords:
The strings look like this:
Channel2021_1_DRU_POP_15s_16062021 Channel2021_2_FANT_POP_15s_16062021 Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021 Channel2021_2_FANT_TVS_15s_16062021 Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021 Channel2021_2_FANT_PLANET_15s_16062021 Channel2021_3_ITA_PLANET_15s_16062021
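This is the form in which I want the extracted data written to the txt file.
So essentially I want to search column A for strings that contain POP and print them, then strings that contain TVS and print them, and finally strings that contain PLANET and print them.
Any help would be greatly appreciated!
Thanks!
Dusan
PS: This is the output of df: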
Unnamed: 0 ... Unnamed: 16
0 NaN ... NaN
1 NaN ... NaN
2 Spot 1 15 s ... NaN
3 NaN ... Indicazioni
4 106290.01 ... dire tutto + grafica ITALIA
5 138575.01 ... NaN
6 142956.01 ... NaN
7 85146.01 ... NaN
8 Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021 ... NaN
9 Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021 ... NaN
10 Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021 ... NaN
11 NaN ... NaN
12 NaN ... NaN
13 Spot 2 15 s ... NaN
14 NaN ... Indicazioni
15 164171.01 ... dire tutto + grafica ITALIA
16 9003309.01 ... NaN
17 88310.01 ... NaN
18 Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021 ... NaN
19 Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021 ... NaN
20 Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021 ... NaN
21 NaN ... NaN
22 NaN ... NaN
23 Spot 3 15 s ... NaN
24 NaN ... Istruzione
25 800214.01 ... dire tutto + dire al kg dopo il prezzo per la ...
26 9001392.01 ... NaN
27 9002306.01 ... NaN
28 147804.01 ... NaN
29 Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021 ... NaN
30 Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021 ... NaN
31 Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021 ... NaN
[32 rows x 17 columns]
If you are still looking for a solution, here is a suggestion:
With the sample frame
df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
0 1
0 Channel2021_1_DRU_POP_15s_16062021 ...
1 Channel2021_2_FANT_POP_15s_16062021 ...
2 Channel2021_3_ITA_POP_15s_16062021 ...
3 1 ...
4 2 ...
5 Channel2021_1_DRU_TVS_15s_16062021 ...
6 Channel2021_2_FANT_TVS_15s_16062021 ...
7 Channel2021_3_ITA_TVS_15s_16062021 ...
8 3 ...
9 4 ...
10 Channel2021_1_DRU_PLANET_15s_16062021 ...
11 Channel2021_2_FANT_PLANET_15s_16062021 ...
12 Channel2021_3_ITA_PLANET_15s_16062021 ...
13 5 ...
this
selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)
prints the entries you want
0 Channel2021_1_DRU_POP_15s_16062021
1 Channel2021_2_FANT_POP_15s_16062021
2 Channel2021_3_ITA_POP_15s_16062021
5 Channel2021_1_DRU_TVS_15s_16062021
6 Channel2021_2_FANT_TVS_15s_16062021
7 Channel2021_3_ITA_TVS_15s_16062021
10 Channel2021_1_DRU_PLANET_15s_16062021
11 Channel2021_2_FANT_PLANET_15s_16062021
12 Channel2021_3_ITA_PLANET_15s_16062021
and writes them to the file items.txt
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
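The na=False argument in the selection above matters because the column mixes strings with numbers and empty cells. The following is a minimal sketch of that effect, assuming a small made-up Series (the values and the names col, raw_mask and clean_mask are only illustrative):

```python
import pandas as pd

# Hypothetical mixed column: one string, one float, one empty cell,
# similar to the first column of the spreadsheet.
col = pd.Series(["Channel2021_1_DRU_POP_15s_16062021", 106290.01, None])

# Without na=False, entries that are not strings produce a missing
# value in the mask instead of False.
raw_mask = col.str.contains("POP")

# With na=False, entries that are not strings simply count as "no match",
# so the mask is purely boolean and can be used as a filter.
clean_mask = col.str.contains("POP", na=False)

print(raw_mask.isna().sum())
print(col[clean_mask].tolist())
```

A mask that still contains missing values is not a clean boolean filter, which is why the selection passes na=False before indexing.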
Since I'm not sure about the column names, I've only used index-based selection (.iloc).
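To illustrate the difference between positional and name-based selection, here is a small sketch. It assumes a made-up two-column frame whose column names imitate the "Unnamed: 0" style shown in the question's output; the data itself is hypothetical:

```python
import pandas as pd

# Hypothetical frame imitating a sheet read without usable headers:
# the columns carry placeholder names such as "Unnamed: 0".
df = pd.DataFrame({
    "Unnamed: 0": ["Spot 1 15 s", "Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021"],
    "Unnamed: 1": [None, None],
})

by_position = df.iloc[:, 0]    # first column, whatever it is called
by_name = df["Unnamed: 0"]     # same column, but only if the name is known

# Both selections return the same Series here.
print(by_position.equals(by_name))
```

Positional selection keeps working if the placeholder name changes, at the cost of breaking if the column order changes.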
If you want the results in the order you gave, then this
df = pd.concat([
    df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])
should work (afterwards just print df or write it to a file).
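As a rough end-to-end check of the ordering, the sketch below uses a deliberately shuffled, made-up column and writes the result to a temporary file. The names col, ordered and out_path, and the use of tempfile instead of a fixed items.txt, are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd

# Hypothetical column whose rows are NOT in POP, TVS, PLANET order.
df = pd.DataFrame({0: [
    "Channel2021_1_DRU_TVS_15s_16062021",
    2.0,
    "Channel2021_1_DRU_PLANET_15s_16062021",
    "Channel2021_1_DRU_POP_15s_16062021",
]})

col = df.iloc[:, 0]

# One filtered piece per keyword, concatenated in the requested order.
ordered = pd.concat(
    [col[col.str.contains(tag, na=False)] for tag in ("POP", "TVS", "PLANET")]
)

# Write one entry per line, without index or header.
out_path = os.path.join(tempfile.mkdtemp(), "items.txt")
ordered.to_csv(out_path, index=False, header=False)

with open(out_path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

The entries come out grouped by keyword rather than in their original row order, which is the behaviour the question asks for.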
By the way: this is more complicated than it needs to be
data = pd.read_excel (filename)
df = pd.DataFrame(data)
You only need pd.read_excel:
df = pd.read_excel(filename)
EDIT: Regarding the comment:
with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
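One caveat with plain substring matching is that a keyword such as POP would also match inside a longer token. If that could happen in your data, a stricter variant might require the keyword to sit between underscores. This is only a sketch under that assumption; the helper has_token and the sample values are hypothetical and not part of the original answer:

```python
import re

import pandas as pd

# Hypothetical data: the second entry contains "POP" only inside "POPULAR".
col = pd.Series([
    "Channel2021_1_DRU_POP_15s_16062021",
    "Channel2021_1_POPULAR_TVS_15s_16062021",
    3.0,
])


def has_token(series, tag):
    """Return a boolean mask for entries where tag is a whole
    underscore-delimited token (illustrative helper)."""
    pattern = rf"(?:^|_){re.escape(tag)}(?:_|$)"
    return series.str.contains(pattern, na=False)


plain = col.str.contains("POP", na=False)   # substring match
strict = has_token(col, "POP")              # whole-token match

print(plain.tolist())
print(strict.tolist())
```

For the file names shown in the question, both approaches select the same rows, so the simpler substring match is sufficient there.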