
How to filter text data containing keywords from an unnamed Excel column with Python pandas and write it to a txt file

I'm pretty new to this, so please bear with me.

I have an Excel sheet that contains certain text strings that I would like to extract and copy to a text file. I have been doing this manually for a long time and I'm sick of it.

So my plan was to write a script that extracts this data from the Excel sheet and creates a txt file.

This is how far I have gotten:

# Extract clip IDs from the Excel sheet
import pandas as pd
from tkinter import Tk  # Python 3.x
from tkinter.filedialog import askopenfilename

Tk().withdraw()  # hide the empty root window so only the file dialog shows
filename = askopenfilename()  # let the user pick the Excel file
data = pd.read_excel (filename)
df = pd.DataFrame(data)
print (df)

The data I want is located in column A, but it is not always in the same row. There are 3 separate keywords I want to look for:

  1. "POP"
  2. "TVS"
  3. "PLANET"

The strings look something like this:

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021

Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021

Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

This is the form in which I would like to write the extracted data to a txt file.

So in essence I want to search column A for strings containing POP and print them, then strings containing TVS and print them, and lastly strings containing PLANET and print them.

Any help would be greatly appreciated!

Thank you!

Dusan

PS: Here is the output of df:

                                         Unnamed: 0  ...                                        Unnamed: 16
0                                               NaN  ...                                                NaN
1                                               NaN  ...                                                NaN
2                                       Spot 1 15 s  ...                                                NaN
3                                               NaN  ...                                        Indicazioni
4                                         106290.01  ...                        dire tutto + grafica ITALIA
5                                         138575.01  ...                                                NaN
6                                         142956.01  ...                                                NaN
7                                          85146.01  ...                                                NaN
8      Eurospin2021_16bis_1_POP_ITA_15s_24_06_2021   ...                                                NaN
9       Eurospin2021_16bis_1_TVS_ITA_15s_24_06_2021  ...                                                NaN
10   Eurospin2021_16bis_1_PLANET_ITA_15s_24_06_2021  ...                                                NaN
11                                              NaN  ...                                                NaN
12                                              NaN  ...                                                NaN
13                                      Spot 2 15 s  ...                                                NaN
14                                              NaN  ...                                        Indicazioni
15                                        164171.01  ...                       dire tutto +  grafica ITALIA
16                                       9003309.01  ...                                                NaN
17                                         88310.01  ...                                                NaN
18      Eurospin2021_16bis_2_POP_ITA_15s_24_06_2021  ...                                                NaN
19      Eurospin2021_16bis_2_TVS_ITA_15s_24_06_2021  ...                                                NaN
20   Eurospin2021_16bis_2_PLANET_ITA_15s_24_06_2021  ...                                                NaN
21                                              NaN  ...                                                NaN
22                                              NaN  ...                                                NaN
23                                      Spot 3 15 s  ...                                                NaN
24                                              NaN  ...                                         Istruzione
25                                        800214.01  ...  dire tutto + dire al kg dopo il prezzo per la ...
26                                       9001392.01  ...                                                NaN
27                                       9002306.01  ...                                                NaN
28                                        147804.01  ...                                                NaN
29     Eurospin2021_16bis_3_POP_DRUZ_15s_24_06_2021  ...                                                NaN
30     Eurospin2021_16bis_3_TVS_DRUZ_15s_24_06_2021  ...                                                NaN
31  Eurospin2021_16bis_3_PLANET_DRUZ_15s_24_06_2021  ...                                                NaN

[32 rows x 17 columns]

Here's a proposal if you're still looking for a solution:

With the sample frame

df = pd.DataFrame({
    0: [
        'Channel2021_1_DRU_POP_15s_16062021',
        'Channel2021_2_FANT_POP_15s_16062021',
        'Channel2021_3_ITA_POP_15s_16062021',
        1.,
        2.,
        'Channel2021_1_DRU_TVS_15s_16062021',
        'Channel2021_2_FANT_TVS_15s_16062021',
        'Channel2021_3_ITA_TVS_15s_16062021',
        3.,
        4.,
        'Channel2021_1_DRU_PLANET_15s_16062021',
        'Channel2021_2_FANT_PLANET_15s_16062021',
        'Channel2021_3_ITA_PLANET_15s_16062021',
        5.
    ],
    1: '...',
})
                                         0    1
0       Channel2021_1_DRU_POP_15s_16062021  ...
1      Channel2021_2_FANT_POP_15s_16062021  ...
2       Channel2021_3_ITA_POP_15s_16062021  ...
3                                        1  ...
4                                        2  ...
5       Channel2021_1_DRU_TVS_15s_16062021  ...
6      Channel2021_2_FANT_TVS_15s_16062021  ...
7       Channel2021_3_ITA_TVS_15s_16062021  ...
8                                        3  ...
9                                        4  ...
10   Channel2021_1_DRU_PLANET_15s_16062021  ...
11  Channel2021_2_FANT_PLANET_15s_16062021  ...
12   Channel2021_3_ITA_PLANET_15s_16062021  ...
13                                       5  ...

this

selection = df.iloc[:, 0].str.contains(r'POP|TVS|PLANET', na=False)
print(df.iloc[:, 0][selection])
df.iloc[:, 0][selection].to_csv('items.txt', index=False, header=False)

prints the desired entries

0         Channel2021_1_DRU_POP_15s_16062021
1        Channel2021_2_FANT_POP_15s_16062021
2         Channel2021_3_ITA_POP_15s_16062021
5         Channel2021_1_DRU_TVS_15s_16062021
6        Channel2021_2_FANT_TVS_15s_16062021
7         Channel2021_3_ITA_TVS_15s_16062021
10     Channel2021_1_DRU_PLANET_15s_16062021
11    Channel2021_2_FANT_PLANET_15s_16062021
12     Channel2021_3_ITA_PLANET_15s_16062021

and writes them into a file items.txt

Channel2021_1_DRU_POP_15s_16062021
Channel2021_2_FANT_POP_15s_16062021
Channel2021_3_ITA_POP_15s_16062021
Channel2021_1_DRU_TVS_15s_16062021
Channel2021_2_FANT_TVS_15s_16062021
Channel2021_3_ITA_TVS_15s_16062021
Channel2021_1_DRU_PLANET_15s_16062021
Channel2021_2_FANT_PLANET_15s_16062021
Channel2021_3_ITA_PLANET_15s_16062021

Since I'm unsure about the column names, I have only used position-based selection (.iloc).
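One caveat with a plain 'POP|TVS|PLANET' pattern is that it also matches longer words that merely contain a keyword. Since the clip IDs in the question separate their parts with underscores, a stricter pattern can require an underscore on both sides. The following is only a sketch: the 'POPULAR_item' entry and the variable names are made up for illustration, and the underscore rule is an assumption based on the sample strings.

```python
import pandas as pd

# Mixed column, like the first column of the sheet: strings, numbers, empty cells.
col = pd.Series([
    'Channel2021_1_DRU_POP_15s_16062021',
    'POPULAR_item',  # hypothetical entry that a plain 'POP' search would also match
    106290.01,
    None,
    'Channel2021_3_ITA_PLANET_15s_16062021',
])

# na=False turns the NaN results for non-string cells into False.
loose = col.str.contains(r'POP|TVS|PLANET', na=False)

# Non-capturing group, so pandas does not warn about match groups.
strict = col.str.contains(r'_(?:POP|TVS|PLANET)_', na=False)

loose_hits = col[loose].tolist()
strict_hits = col[strict].tolist()
print(loose_hits)
print(strict_hits)
```

If the keywords can also appear at the very start or end of a string, the underscore requirement would need to be relaxed accordingly.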

If you want the results in the order you've given, then this

df = pd.concat([
         df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)]
         for tag in ('POP', 'TVS', 'PLANET')
     ])

should work (just print df afterwards or write it to a file).
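For completeness, here is a small self-contained sketch of the ordered variant including the write step. The sample values, the name `ordered`, and the temporary directory are placeholders chosen for illustration; in a real script the path would simply be 'items.txt'.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in frame; the entries are deliberately out of keyword order.
df = pd.DataFrame({0: ['A_TVS_1', 1.0, 'A_POP_1', 'A_PLANET_1', 'A_POP_2']})
col = df.iloc[:, 0]

# One filtered slice per keyword, concatenated in the requested order.
ordered = pd.concat([
    col[col.str.contains(tag, na=False)]
    for tag in ('POP', 'TVS', 'PLANET')
])

# Write one entry per line, without index or header.
out_path = os.path.join(tempfile.mkdtemp(), 'items.txt')
ordered.to_csv(out_path, index=False, header=False)

with open(out_path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

Note that a row containing more than one keyword would appear once per matching keyword with this approach.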

By the way, this is more complicated than it needs to be:

data = pd.read_excel (filename)
df = pd.DataFrame(data)

You only need pd.read_excel, which already returns a DataFrame:

df = pd.read_excel(filename)

EDIT: Regarding the comments:

with open('items.txt', 'wt') as file:
    file.write('The following has been sent:')
    for tag in ('POP', 'TVS', 'PLANET'):
        file.write(f'\n{tag}:\n')
        items = df.iloc[:, 0][df.iloc[:, 0].str.contains(tag, na=False)].to_list()
        file.write('\n'.join(items))
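The same report can also be built as a string first, which makes it easy to check before anything is written to disk. This is just a sketch: `build_report` and the sample data are hypothetical names and values, and `regex=False` is used on the assumption that the keywords are plain text rather than patterns.

```python
import pandas as pd


def build_report(column, tags=('POP', 'TVS', 'PLANET')):
    """Return the report text: a heading line, then one labelled section per tag."""
    parts = ['The following has been sent:']
    for tag in tags:
        # Literal substring match; non-string cells count as non-matches.
        hits = column[column.str.contains(tag, na=False, regex=False)]
        parts.append(f'{tag}:')
        parts.extend(hits.tolist())
    return '\n'.join(parts) + '\n'


sample = pd.Series(['a_TVS_1', 2.0, 'a_POP_1', None, 'a_PLANET_1'])
report = build_report(sample)
print(report)
```

Writing the result is then a single `file.write(report)` inside the `with open(...)` block.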
