
Tokenize text in Pandas dataframe

I have a Pandas DataFrame with scripts collected from an external source. The column text_content contains the script contents. The longest script consists of 85,617 characters.

A sample to give you an idea:

[image: sample script contents]

The scripts contain table names and other useful information. Currently, the DataFrame is written to a SQLite database table, which can then be searched using ad-hoc SQL statements (and distributed to a larger crowd).

A common use case is that we'll have a list of table names and would like to know which scripts they appear in. Doing this in SQL requires wildcard searches with the LIKE operator, which performs poorly: SQLite can't use an index for '%...%' patterns, so every lookup scans the full text of every script.
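To make that concrete, this is roughly the kind of query the current setup implies. The database path, table name (scripts) and the second table name in the list are illustrative, not taken from the original post:

import sqlite3

conn = sqlite3.connect('scripts.db')  # illustrative path
table_names = ['CACHETABLEDB', 'SOMEOTHERTABLE']

# One wildcard LIKE per table name; SQLite cannot use an index for
# '%...%' patterns, so each lookup scans every script in full.
where = ' OR '.join('text_content LIKE ?' for _ in table_names)
params = ['%{}%'.format(name) for name in table_names]
rows = conn.execute(
    'SELECT rowid, text_name FROM scripts WHERE ' + where, params).fetchall()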

Thus, I wanted to extract the words from the script while it's still in a DataFrame, resulting in a two-column table, with each row consisting of:

  • a link to the original script row
  • a word that was found in the script

Each script would result in a number of rows (depending on the number of matches).

So far, I wrote this to extract the words from the script:

import re
from pandas import DataFrame

# extract word-like tokens (a letter followed by word characters) from each script
DataFrame(df[df.text_type == 'DISCRIPT']
    .dropna(subset=['text_content'])
    .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
    .tolist())

The result:

[image: the tokenized result]

So far, so good (?).

There are two more steps I need to go through, but I'm a little stuck here.

  1. Remove a list of common words (e.g. SQL reserved words).
  2. Reshape the DataFrame so each row is a match, but with a link to the script in the original DataFrame.

I could use T to transpose the DataFrame, use replace() with a predefined list of keywords (replacing them with an NA value), and finally use dropna() to shorten the result to just the words of interest. However, I'm not sure this is the best approach.
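A minimal sketch of that transpose/replace/dropna idea, assuming the wide word table from the snippet above is stored in a variable (here called words_wide, a hypothetical name) and using an illustrative keyword list:

import numpy as np

reserved = ['SELECT', 'FROM', 'WHERE', 'FOR']   # illustrative keyword list
cleaned = (words_wide.T                         # one column per script
           .replace(reserved, np.nan)           # blank out the common words
           .dropna(how='all'))                  # drop positions where every script had a common word

Note that dropna(how='all') only removes a position when every script had a common word there, and the reshaping in step 2 still remains, which is part of the doubt expressed above.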

I'd very much appreciate your comments and suggestions!

IIUC you can try passing the index of the filtered rows to the df2 constructor, then reshape with stack and filter with isin:

print(df)
                            text_content text_name text_type
1614  CHECK FOR LOCK STATUS CACHETABLEDB      TEXT  DISCRIPT
1615  CHECK FOR LOCK STATUS CACHETABLEDB      TEXT  DISCRIPT

filtered = df[df.text_type == 'DISCRIPT'].dropna(subset=['text_content'])
df2 = pd.DataFrame(filtered
    .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
    .tolist(), index=filtered.index)
print(df2)
          0    1     2       3             4
1614  CHECK  FOR  LOCK  STATUS  CACHETABLEDB
1615  CHECK  FOR  LOCK  STATUS  CACHETABLEDB

# reshape to long format: one row per (script, word) pair
df2 = df2.stack().reset_index(level=0)
df2.columns = ['id', 'words']

L = ['CACHETABLEDB','STATUS']
# remove reserved words
df2 = df2.loc[~df2.words.isin(L)].reset_index(drop=True)
print(df2)
     id  words
0  1614  CHECK
1  1614    FOR
2  1614   LOCK
3  1615  CHECK
4  1615    FOR
5  1615   LOCK
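As a follow-up, once df2 has the id/words shape above it can be written back to SQLite next to the original table, so the table-name lookups become exact matches instead of LIKE scans over the full script text. A minimal sketch, assuming a SQLite database file and a table name (script_words) that are illustrative, with lookup values taken from the toy example above:

import sqlite3
import pandas as pd

conn = sqlite3.connect('scripts.db')   # illustrative path
df2.to_sql('script_words', conn, if_exists='replace', index=False)

# exact-match lookup instead of a wildcard LIKE over the full script text
hits = pd.read_sql_query(
    "SELECT DISTINCT id FROM script_words WHERE words IN ('CHECK', 'LOCK')",
    conn)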
