How to filter text data containing keywords from an unnamed Excel column with Python pandas and write it to a txt file
I'm pretty new to this, so please bear with me.
I have an Excel sheet that contains certain text strings I would like to extract and copy to a text file. I have been doing this manually for a long time and I'm sick of it.
So my plan was to write a script that would extract this data from the Excel sheet and create a txt file.
This is how far I have gotten:
#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename
Tk().withdraw()
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)
The data I want is located in column A, but is not always in the same row. There are 3 separate keywords I want to look for: POP, TVS and PLANET.
The strings look something like this:
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
This is the form of the extracted data I would like to write to a txt file.
So in essence I want to search column A for strings containing POP and print them, then strings containing TVS and print them, and lastly strings containing PLANET and print them.
Any help would be greatly appreciated!
Thank you!
Dusan
PS: Here is the output of df:
Unnamed: 0 ... Unnamed: 16
0 NaN ... NaN
1 NaN ... NaN
2 Spot 1 15 s ... NaN
3 NaN ... Indicazioni
4 106290.01 ... dire tutto + grafica ITALIA
5 138575.01 ... NaN
6 142956.01 ... NaN
7 85146.01 ... NaN
8 Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021 ... NaN
9 Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021 ... NaN
10 Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021 ... NaN
11 NaN ... NaN
12 NaN ... NaN
13 Spot 2 15 s ... NaN
14 NaN ... Indicazioni
15 164171.01 ... dire tutto + grafica ITALIA
16 9003309.01 ... NaN
17 88310.01 ... NaN
18 Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021 ... NaN
19 Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021 ... NaN
20 Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021 ... NaN
21 NaN ... NaN
22 NaN ... NaN
23 Spot 3 15 s ... NaN
24 NaN ... Istruzione
25 800214.01 ... dire tutto + dire al kg dopo il prezzo per la ...
26 9001392.01 ... NaN
27 9002306.01 ... NaN
28 147804.01 ... NaN
29 Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021 ... NaN
30 Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021 ... NaN
31 Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021 ... NaN
[32 rows x 17 columns]
Here's a proposal if you're still looking for a solution:
With the sample frame
df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
0 1
0 Channel2021_1_DRU_POP_15s_16062021 ...
1 Channel2021_2_FANT_POP_15s_16062021 ...
2 Channel2021_3_ITA_POP_15s_16062021 ...
3 1 ...
4 2 ...
5 Channel2021_1_DRU_TVS_15s_16062021 ...
6 Channel2021_2_FANT_TVS_15s_16062021 ...
7 Channel2021_3_ITA_TVS_15s_16062021 ...
8 3 ...
9 4 ...
10 Channel2021_1_DRU_PLANET_15s_16062021 ...
11 Channel2021_2_FANT_PLANET_15s_16062021 ...
12 Channel2021_3_ITA_PLANET_15s_16062021 ...
13 5 ...
this
selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)
prints the desired entries
0 Channel2021_1_DRU_POP_15s_16062021
1 Channel2021_2_FANT_POP_15s_16062021
2 Channel2021_3_ITA_POP_15s_16062021
5 Channel2021_1_DRU_TVS_15s_16062021
6 Channel2021_2_FANT_TVS_15s_16062021
7 Channel2021_3_ITA_TVS_15s_16062021
10 Channel2021_1_DRU_PLANET_15s_16062021
11 Channel2021_2_FANT_PLANET_15s_16062021
12 Channel2021_3_ITA_PLANET_15s_16062021
and writes them into a file items.txt:
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
Since I'm unsure about the column names, I have only used index-based selection (.iloc).
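To illustrate why positional selection is enough here, the following is a minimal sketch on a made-up excerpt of the sheet. The frame contents and the names first_col, mask and matches are assumptions for illustration only; the column labels mimic the "Unnamed: 0" style shown in the question's output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: the first column is labelled
# "Unnamed: 0" and mixes NaN, numbers and text, as in the printed df.
df = pd.DataFrame({
    "Unnamed: 0": [
        np.nan,
        "Spot 1 15 s",
        106290.01,
        "Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021",
        "Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021",
        "Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021",
    ],
    "Unnamed: 16": [np.nan, np.nan, "dire tutto + grafica ITALIA",
                    np.nan, np.nan, np.nan],
})

# Positional selection: picks the first column whatever its label is.
first_col = df.iloc[:, 0]

# Non-string cells (NaN, numbers) produce NaN from .str.contains;
# na=False turns those into False so the mask is purely boolean.
mask = first_col.str.contains(r"POP|TVS|PLANET", na=False)
matches = first_col[mask].tolist()
print(matches)
```

Selecting by label (df["Unnamed: 0"]) should give the same column in this sketch, but it depends on how pandas happened to name the header row, which is why the positional form is used above.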
If you want the results in the order you've given, then this
df = pd.concat([
    df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])
should work (just print df afterwards or write it to a file).
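The difference between the two approaches only shows up when the tags are interleaved, as they are in the real sheet from the question (POP, TVS, PLANET repeated per spot). A small sketch, assuming a shortened and partly made-up excerpt of that column; the names col, in_sheet_order and grouped are hypothetical:

```python
import pandas as pd

# Hypothetical excerpt of the real column, keyed by the original row numbers.
col = pd.Series({
    8: "Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021",
    9: "Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021",
    10: "Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021",
    17: 88310.01,
    18: "Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021",
    19: "Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021",
    20: "Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021",
})

# One combined pattern keeps the rows in sheet order.
in_sheet_order = col[col.str.contains(r"POP|TVS|PLANET", na=False)]

# One selection per tag, concatenated, groups the rows by tag instead.
grouped = pd.concat([
    col[col.str.contains(tag, na=False)]
    for tag in ("POP", "TVS", "PLANET")
])

print(in_sheet_order.index.tolist())  # [8, 9, 10, 18, 19, 20]
print(grouped.index.tolist())         # [8, 18, 9, 19, 10, 20]
```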
Btw.: This is more complicated than it needs to be:
data = pd.read_excel (filename)
df = pd.DataFrame(data)
You only need pd.read_excel, which already returns a DataFrame:
df = pd.read_excel(filename)
EDIT: Regarding the comments:
with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        # Header line for this tag, then one matching entry per line
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
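If you end up running this regularly, it may be convenient to split the grouping from the writing. The following is only a sketch under that assumption: the helper names group_by_tag and write_report, the shortened sample frame, and the temporary output path are all made up for illustration.

```python
import os
import tempfile

import pandas as pd


def group_by_tag(column, tags):
    """Map each tag to the list of cells in `column` that contain it."""
    return {
        # regex=False: treat the tag as a literal substring
        tag: column[column.str.contains(tag, na=False, regex=False)].tolist()
        for tag in tags
    }


def write_report(groups, path, title="The following has been sent:"):
    """Write the title, then each tag followed by its entries, one per line."""
    lines = [title]
    for tag, items in groups.items():
        lines.append(f"{tag}:")
        lines.extend(items)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")


# Hypothetical, shortened sample frame (numbers stand in for unrelated cells).
df = pd.DataFrame({0: [
    "Channel2021_1_DRU_POP_15s_16062021",
    1.0,
    "Channel2021_1_DRU_TVS_15s_16062021",
    "Channel2021_1_DRU_PLANET_15s_16062021",
]})

groups = group_by_tag(df.iloc[:, 0], ("POP", "TVS", "PLANET"))

# Written to a temporary directory here; use 'items.txt' in practice.
out_path = os.path.join(tempfile.mkdtemp(), "items.txt")
write_report(groups, out_path)
```

With the real sheet you would pass the frame returned by pd.read_excel(filename) instead of the sample frame.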