Tokenize text in Pandas dataframe
I have a Pandas DataFrame with scripts collected from an external source. The column text_content contains the script contents. The longest script consists of 85,617 characters.
A sample to give you an idea:
The scripts contain table names and other useful information. Currently, the dataframe is written to a SQLite database table, which can then be searched using ad-hoc SQL statements (and distributed to a larger crowd).
A common use case is that we'll have a list of table names and would like to know the scripts in which they appear. Doing this in SQL would require us to execute wildcard searches using the LIKE operator, which kinda sucks performance-wise.
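For context, the ad-hoc search described above might look something like this with the sqlite3 module (the table and column names, and the sample row, are assumptions for illustration):

```python
import sqlite3

# In-memory stand-in for the real database; schema and data are assumed.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE scripts (text_name TEXT, text_content TEXT)')
conn.execute("INSERT INTO scripts VALUES "
             "('TEXT', 'CHECK FOR LOCK STATUS CACHETABLEDB')")

# Wildcard search for one table name -- a leading '%' forces a full scan,
# which is the performance concern mentioned above.
rows = conn.execute(
    "SELECT text_name FROM scripts WHERE text_content LIKE ?",
    ('%CACHETABLEDB%',)
).fetchall()
print(rows)  # [('TEXT',)]
```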
Thus, I wanted to extract the words from the script while it's still in a DataFrame, resulting in a two-column table, with each row consisting of:
Each script would result in a number of rows (depending on the number of matches).
So far, I wrote this to extract the words from the scripts:
import re
import pandas as pd

pd.DataFrame(df[df.text_type == 'DISCRIPT']
             .dropna(subset=['text_content'])
             .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
             .tolist())
The result:
So far, so good (?).
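To show what the regular expression extracts from a single script line, here is a minimal standalone run (the sample text is an assumption):

```python
import re

# Assumed sample script content for illustration.
sample = 'CHECK FOR LOCK STATUS CACHETABLEDB'

# \w+ after a leading letter: each match is one word token.
words = re.findall(r'[a-zA-Z]\w+', sample)
print(words)  # ['CHECK', 'FOR', 'LOCK', 'STATUS', 'CACHETABLEDB']
```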
There are two more steps I need to go through, but I'm a little stuck here.
I can use T to transpose the DataFrame, use replace() in combination with a predefined list of keywords (replacing them with an NA value), and finally use dropna() to shorten the list to just the keywords. However, I'm not sure if this is the best approach.
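Sketched out, that transpose/replace/dropna idea might look like this (the sample word row and the keyword list are assumptions):

```python
import numpy as np
import pandas as pd

# One row of words extracted from a script; contents are assumed.
df2 = pd.DataFrame([['CHECK', 'FOR', 'LOCK', 'STATUS', 'CACHETABLEDB']])

# Predefined keyword list to strip out (an assumed example).
keywords = ['CHECK', 'FOR', 'LOCK', 'STATUS']

# Transpose to one word per row, blank out the keywords, then drop them.
result = df2.T.replace(keywords, np.nan).dropna()
print(result[0].tolist())  # ['CACHETABLEDB']
```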
I'd very much appreciate your comments and suggestions!
IIUC, you can try adding index=df.index to the df2 constructor, then reshape with stack and filter with isin:
print df
text_content text_name text_type
1614 CHECK FOR LOCK STATUS CACHETABLEDB TEXT DISCRIPT
1615 CHECK FOR LOCK STATUS CACHETABLEDB TEXT DISCRIPT
df2 = pd.DataFrame(df[df.text_type == 'DISCRIPT']
                   .dropna(subset=['text_content'])
                   .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
                   .tolist(), index=df.index)
print df2
0 1 2 3 4
1614 CHECK FOR LOCK STATUS CACHETABLEDB
1615 CHECK FOR LOCK STATUS CACHETABLEDB
#reshape all rows to column
df2 = df2.stack().reset_index(level=0)
df2.columns = ['id', 'words']
L = ['CACHETABLEDB','STATUS']
#remove reserved words
df2 = df2.loc[~df2.words.isin(L)].reset_index(drop=True)
print df2
id words
0 1614 CHECK
1 1614 FOR
2 1614 LOCK
3 1615 CHECK
4 1615 FOR
5 1615 LOCK
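As a side note, on recent pandas versions (0.25+) the same result can be reached without the intermediate wide frame, using str.findall plus explode (the data and reserved-word list mirror the example above):

```python
import pandas as pd

# Reconstructed sample data from the answer above.
df = pd.DataFrame({
    'text_content': ['CHECK FOR LOCK STATUS CACHETABLEDB'] * 2,
    'text_type': ['DISCRIPT'] * 2,
}, index=[1614, 1615])

# findall gives a list of words per row; explode turns each list
# element into its own row, repeating the original index.
words = (df.loc[df.text_type == 'DISCRIPT', 'text_content']
           .dropna()
           .str.findall(r'[a-zA-Z]\w+')
           .explode()
           .rename('words')
           .reset_index()
           .rename(columns={'index': 'id'}))

# Remove reserved words, as in the answer.
L = ['CACHETABLEDB', 'STATUS']
words = words.loc[~words.words.isin(L)].reset_index(drop=True)
print(words)
```

This keeps the script id alignment automatically, so no manual index= argument is needed.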