How to filter text data containing keywords from an unnamed Excel column with Python pandas and write it to a txt file
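I'm quite new to this, so please bear with me.
I have an Excel sheet containing certain text strings that I want to extract and copy into a text file. I have been doing this by hand for a long time, and I'm tired of it.
So my plan is to write a script that extracts this data from the Excel sheet and creates a txt file.
This is how far I've got: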
#EXTRACT CLIPID FROM XCEL SHEET
import pandas as pd
from tkinter import Tk # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename
Tk().withdraw()
filename = askopenfilename()
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)
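The data I want is in column A (the first column), but not always in the same row. I'm looking for 3 separate keywords: POP, TVS and PLANET.
The strings look like this: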
Channel2021_1_DRU_POP_15s_16062021 Channel2021_2_FANT_POP_15s_16062021 Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021 Channel2021_2_FANT_TVS_15s_16062021 Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021 Channel2021_2_FANT_PLANET_15s_16062021 Channel2021_3_ITA_PLANET_15s_16062021
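This is the form in which I want the extracted data written to the txt file.
So essentially I want to search column A for strings containing POP and print them, then for strings containing TVS, and finally for strings containing PLANET.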
Any help would be greatly appreciated!
Thanks!
Dusan
PS: this is the output of df:
Unnamed: 0 ... Unnamed: 16
0 NaN ... NaN
1 NaN ... NaN
2 Spot 1 15 s ... NaN
3 NaN ... Indicazioni
4 106290.01 ... dire tutto + grafica ITALIA
5 138575.01 ... NaN
6 142956.01 ... NaN
7 85146.01 ... NaN
8 Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021 ... NaN
9 Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021 ... NaN
10 Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021 ... NaN
11 NaN ... NaN
12 NaN ... NaN
13 Spot 2 15 s ... NaN
14 NaN ... Indicazioni
15 164171.01 ... dire tutto + grafica ITALIA
16 9003309.01 ... NaN
17 88310.01 ... NaN
18 Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021 ... NaN
19 Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021 ... NaN
20 Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021 ... NaN
21 NaN ... NaN
22 NaN ... NaN
23 Spot 3 15 s ... NaN
24 NaN ... Istruzione
25 800214.01 ... dire tutto + dire al kg dopo il prezzo per la ...
26 9001392.01 ... NaN
27 9002306.01 ... NaN
28 147804.01 ... NaN
29 Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021 ... NaN
30 Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021 ... NaN
31 Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021 ... NaN
[32 rows x 17 columns]
If you're still looking for a solution, here is a suggestion:
With the sample frame
df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
0 1
0 Channel2021_1_DRU_POP_15s_16062021 ...
1 Channel2021_2_FANT_POP_15s_16062021 ...
2 Channel2021_3_ITA_POP_15s_16062021 ...
3 1 ...
4 2 ...
5 Channel2021_1_DRU_TVS_15s_16062021 ...
6 Channel2021_2_FANT_TVS_15s_16062021 ...
7 Channel2021_3_ITA_TVS_15s_16062021 ...
8 3 ...
9 4 ...
10 Channel2021_1_DRU_PLANET_15s_16062021 ...
11 Channel2021_2_FANT_PLANET_15s_16062021 ...
12 Channel2021_3_ITA_PLANET_15s_16062021 ...
13 5 ...
this
selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)
prints the entries you want
0 Channel2021_1_DRU_POP_15s_16062021
1 Channel2021_2_FANT_POP_15s_16062021
2 Channel2021_3_ITA_POP_15s_16062021
5 Channel2021_1_DRU_TVS_15s_16062021
6 Channel2021_2_FANT_TVS_15s_16062021
7 Channel2021_3_ITA_TVS_15s_16062021
10 Channel2021_1_DRU_PLANET_15s_16062021
11 Channel2021_2_FANT_PLANET_15s_16062021
12 Channel2021_3_ITA_PLANET_15s_16062021
and writes them to the file items.txt
Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021
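A short sketch of why `na=False` matters in the selection above: the column mixes strings with numbers and empty cells, and `.str.contains` yields a missing value rather than False for every cell that is not a string. The series `col` below is made-up illustration data, not the real sheet.

```python
import pandas as pd

# Hypothetical mixed-type column, similar to one read from a spreadsheet
col = pd.Series([
    'Channel2021_1_DRU_POP_15s_16062021',
    1.0,
    None,
    'Channel2021_1_DRU_TVS_15s_16062021',
])

# Without na=False, the non-string cells produce missing values in the mask
raw_mask = col.str.contains('POP|TVS|PLANET')

# With na=False, those cells become False, so the mask is safe for indexing
mask = col.str.contains('POP|TVS|PLANET', na=False)

matches = col[mask].tolist()
print(matches)
```

Indexing with `raw_mask` directly would fail because of the missing values, which is why the mask is built with `na=False`.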
Since I'm not sure about the column names, I used only index-based selection (.iloc).
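For comparison, a minimal sketch of both selection styles. It assumes the first column carries the label `Unnamed: 0`, as the printed frame in the question suggests; the frame below is a small made-up stand-in for the real sheet.

```python
import pandas as pd

# Hypothetical frame mimicking what read_excel returns for a sheet without headers
df = pd.DataFrame({
    'Unnamed: 0': [
        'Spot 1 15 s',
        106290.01,
        'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
    ],
    'Unnamed: 1': [None, None, None],
})

by_position = df.iloc[:, 0]   # first column, whatever its label is
by_label = df['Unnamed: 0']   # same column, selected by its label

# Both selections refer to the same data
print(by_position.equals(by_label))
```

Position-based selection keeps working if the label changes, for example when the sheet later gets a real header row.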
If you want the results in the order you gave, then this
df = pd.concat([
    df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])
should work (afterwards just print df or write it to a file).
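A sketch of that last step, keeping the tag order and writing the result to a file. The helper name `select_in_tag_order`, the sample values and the output path are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd


def select_in_tag_order(column, tags=('POP', 'TVS', 'PLANET')):
    """Return the matching entries, grouped in the order the tags are given."""
    return pd.concat([column[column.str.contains(tag, na=False)] for tag in tags])


# Hypothetical column: strings from two spots plus one numeric cell
column = pd.Series([
    'Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021',
    85146.01,
    'Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021',
    'Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021',
])

ordered = select_in_tag_order(column)

# One entry per line, without index or header
out_path = os.path.join(tempfile.gettempdir(), 'items_ordered.txt')
ordered.to_csv(out_path, index=False, header=False)
```

Returning a new Series instead of overwriting `df` leaves the original frame available for further selections.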
By the way: this is more complicated than necessary
data = pd.read_excel (filename)
df = pd.DataFrame(data)
You only need pd.read_excel:
df = pd.read_excel(filename)
EDIT: Regarding the comment:
with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
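One thing to watch: this snippet expects `df` to still be the full frame. If `df` was overwritten with the result of the `pd.concat` snippet, it is a Series and `df.iloc[:, 0]` no longer works. A possible way around that is to pass the column explicitly, sketched here with a hypothetical helper `write_grouped_report` and made-up sample data.

```python
import os
import tempfile

import pandas as pd


def write_grouped_report(column, path, tags=('POP', 'TVS', 'PLANET')):
    """Write the matching entries to `path`, one heading per tag."""
    with open(path, 'wt') as file:
        file.write('The following has been sent:')
        for tag in tags:
            # Matched items are always strings, so joining them is safe
            items = column[column.str.contains(tag, na=False)].to_list()
            file.write(f'\n{tag}:\n')
            file.write('\n'.join(items))


# Hypothetical frame standing in for the sheet read with pd.read_excel
frame = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        1.0,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_1_DRU_PLANET_15s_16062021',
    ],
    1: '...',
})

first_column = frame.iloc[:, 0]  # take the column once, from the full frame
out_path = os.path.join(tempfile.gettempdir(), 'items_grouped.txt')
write_grouped_report(first_column, out_path)
```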