Insert new rows in pandas dataframe

I have parsed an XML file containing part-of-speech tagged text. Since the file is not perfect, I am adding the data to a pandas dataframe so that I can clean it later.

At this point I need to duplicate some rows based on certain values, and modify only one or two values in both the duplicated row and the original one.

This is what the actual dataframe looks like:

In [8]: df.head()
Out[8]: 
      text     lemma       pos markintext  doublemma  multiwordexpr nodetail
0      Per       per      epsf          0          0              0        0
1   correr   correre    vta2fp          0          0              0        0
2  miglior  migliore      a2fp          0          0              0        0
3    acque     acqua     sf1fp          0          0              0        0
4     alza    alzare  vta1ips3          0          0              0        0

Now if, for example, multiwordexpr is equal to 1, I want to duplicate the row and insert it in the dataframe. So, I would like to go from this:

In [10]: df[df['multiwordexpr'] == 1]
Out[10]: 
          text     lemma      pos markintext  doublemma  multiwordexpr
16    dietro a  dietro a   eilksl          0          0              1  

to this:

          text     lemma      pos markintext  doublemma  multiwordexpr
16    dietro    dietro a   eilksl          0          0              1  
17    a         dietro a   eilksl          0          0              1  

This is my code:

#!/usr/bin/python
# -*- coding: latin-1 -*-

from lxml import etree
import locale
import sys
import os
import glob
import pandas as pd
import numpy as np
import re
from string import punctuation
import random
import unicodedata

def manage_tail(taillist):
    z = []
    for line in taillist:
        y = list(line.strip())
        for punkt in y:
            z.append(punkt)
    return z if len(z) > 0 else 0

def checkmark(text):
    pattern = re.compile(r"\w|'", re.UNICODE)  # raw string so \w is not treated as a string escape
    if re.match(pattern,text[-1]):
        return 0
    else:
        return text[-1]

path = os.path.expanduser("~/working_corpus/")  # glob does not expand "~" by itself
output_path = os.path.expanduser("~/devel_output/")
file_pattern = "*.xml"

docs = glob.glob(os.path.join(path, file_pattern))
parser = etree.XMLParser(load_dtd= True,resolve_entities=True)

x = []
for d in docs:

    tree = etree.parse(d,parser)

    for node in tree.iterfind(".//LM"):
        text = node.text.strip()
        multiwordexpr = 1 if (' ' in text.replace('  ', ' ')) else 0
        lemma = node.get('lemma')
        markintext = checkmark(text)
        pos = node.get('catg')
        doublemma = 1 if (node.getparent() is not None and node.getparent().tag == 'LM1') else 0
        # use 0 (not None) for "no tail" so that the later `!= 0` check skips these rows
        nodetail = manage_tail(node.tail.splitlines()) if node.tail else 0
        row = [text,lemma,pos,markintext,doublemma,multiwordexpr,nodetail]
        x.append(row)


df = pd.DataFrame(x,columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))

I've thought about something like the code below for the case in which nodetail is set (not exactly the multiwordexpr problem, but the point is the same: how to efficiently add a row at an arbitrary position rather than at the end), but I don't know how to do it efficiently. I am looking for a function that, given one or more conditions, inserts a certain number of duplicated rows below the selected row and modifies one or two values in the other columns (in this case, it splits the text and duplicates the row).

l = []
i = 0
while i < len(df):
    if (df.iloc[i,6] != 0):
        ntail = df.iloc[i,6]
        df.iloc[i,6] = 0
        i += 1
        for w in range(len(ntail)):
            line = pd.DataFrame({'text': ntail[w],
            'lemma': ntail[w],
            'pos':'NaN',
            'markintext':0,
            'doublemma':0,
            'multiwordexpr':0,
            'nodetail':0},index=[i+w], columns=('text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail'))
            l.append(line)
    else:
        pass
    i += 1
    sys.stdout.write("\r%d/%d" % (i,len(df)))
    sys.stdout.flush()
print("...done extracting.")

for i in range(len(l)):    
    start = int((l[i].index[0])-1)
    end = int(l[i].index[0])
    # .ix was removed in pandas 1.0; .loc gives the same label-based, inclusive slices here
    df = pd.concat([df.loc[:start], l[i], df.loc[end:]]).reset_index(drop=True)
    sys.stdout.write("\r%d/%d" % (i,len(l)))
    sys.stdout.flush()
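
For reference, the goal of the loop above (emit one extra row per punctuation mark stored in nodetail, directly below the row it came from) can be sketched without positional inserts by building a flat list of records and constructing the dataframe once. This is only a sketch under assumptions: the sample rows and the helper name expand_nodetail are hypothetical, and nodetail is assumed to hold either 0/None or a list of marks, as produced by manage_tail.

```python
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext',
           'doublemma', 'multiwordexpr', 'nodetail']

# Hypothetical sample: the second row carries two trailing punctuation marks.
df = pd.DataFrame(
    [['Per', 'per', 'epsf', 0, 0, 0, 0],
     ['acque', 'acqua', 'sf1fp', 0, 0, 0, [',', '.']],
     ['alza', 'alzare', 'vta1ips3', 0, 0, 0, 0]],
    columns=columns)


def expand_nodetail(frame):
    """Return a new frame with one extra row per mark in 'nodetail'."""
    records = []
    for rec in frame.to_dict('records'):
        tail = rec['nodetail'] or []   # assumption: 0 or None means "no tail"
        rec['nodetail'] = 0            # the marks move into their own rows
        records.append(rec)
        for mark in tail:
            # Same filler values as in the loop above.
            records.append({'text': mark, 'lemma': mark, 'pos': 'NaN',
                            'markintext': 0, 'doublemma': 0,
                            'multiwordexpr': 0, 'nodetail': 0})
    # Build the dataframe once instead of concatenating for every insert.
    return pd.DataFrame(records, columns=frame.columns)


expanded = expand_nodetail(df)
```

Because every row is appended to a Python list in order, no index arithmetic is needed and earlier inserts cannot shift the position of later ones.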

EDIT: You can preallocate your df; the required length is len(df) + df.multiwordexpr.sum(), assuming each multiword row splits into exactly two words. You can then use .loc[] (.ix[] in older pandas; it was removed in pandas 1.0) to set the correct rows. You still have to iterate over your original df and split it, though. That might be faster.

# orig_df is your original dataframe; columns is the list of column names
row = ['', '', '', 0, 0, 0, 0]
# calculate the correct length depending on your original df
df_len = len(orig_df) + orig_df.multiwordexpr.sum()

# allocate a new df
result_df = pd.DataFrame([row for x in range(df_len)],
                      columns=columns)
# write to it instead of appending (index is the target row label)
result_df.loc[index] = ['Per', 'per', 'epsf', 0, 0, 0, 0]

EDIT END
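
A minimal runnable sketch of the preallocation idea, written against current pandas with positional writes via .iloc. The sample rows and the variable pos are illustrative assumptions, and the length formula only holds if every flagged row contains exactly two words.

```python
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext',
           'doublemma', 'multiwordexpr', 'nodetail']

# Hypothetical sample data in the shape shown in the question.
orig_df = pd.DataFrame(
    [['Per', 'per', 'epsf', 0, 0, 0, 0],
     ['dietro a', 'dietro a', 'eilksl', 0, 0, 1, 0],
     ['correr', 'correre', 'vta2fp', 0, 0, 0, 0]],
    columns=columns)

# Assumption: each flagged row splits into exactly two words,
# so it needs exactly one extra slot.
df_len = len(orig_df) + int(orig_df['multiwordexpr'].sum())

# Preallocate the result with placeholder rows.
placeholder = ['', '', '', 0, 0, 0, 0]
result_df = pd.DataFrame([placeholder for _ in range(df_len)], columns=columns)

pos = 0  # next free row position in result_df
for row in orig_df.itertuples(index=False):
    values = list(row)
    words = values[0].split() if row.multiwordexpr == 1 else [values[0]]
    for word in words:
        values[0] = word               # only the text column changes
        result_df.iloc[pos] = values   # write in place instead of appending
        pos += 1
```

If a flagged row could contain more than two words, the length would have to be computed from the actual word counts instead.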

Maybe creating a new dataframe and only appending to it would be faster than modifying the original?

You could iterate over your original df and append to a new one while splitting the multiwordexpr rows. No idea whether that will perform better, though.

import pandas as pd

columns = ['text','lemma','pos','markintext','doublemma','multiwordexpr','nodetail']

rows = [['Per','per','epsf',0,0,0,0],
    ['dietro a','dietro a','eilksl',0,0,1,0],
    ['Per','per','epsf',0,0,0,0]]

orig_df = pd.DataFrame(rows, columns=columns)
# collect the output rows in a list; DataFrame.append was removed in pandas 2.0
new_rows = []

for index, row in orig_df.iterrows():
    # check for multiwordexpr
    if row['multiwordexpr'] == 1:
        s = row.copy()
        s['text'] = row['text'].split(' ')[0]      # first word goes into the copy
        row['text'] = row['text'].split(' ')[1]    # second word stays in the original
        new_rows.append(s)
        new_rows.append(row)
    else:
        new_rows.append(row)

# build the new dataframe once, then renumber the index
df = pd.DataFrame(new_rows, columns=columns).reset_index(drop=True)
# there are no more multi words
df.loc[df['multiwordexpr'] == 1, 'multiwordexpr'] = 0
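
As an alternative to row-by-row iteration, pandas 0.25 and later provide DataFrame.explode, which can do the split in one pass. This is a different technique from the loop above, not the answerer's method; the helper name split_multiword_rows and the sample rows are hypothetical.

```python
import pandas as pd

columns = ['text', 'lemma', 'pos', 'markintext',
           'doublemma', 'multiwordexpr', 'nodetail']

# Hypothetical sample data in the shape shown in the question.
orig_df = pd.DataFrame(
    [['Per', 'per', 'epsf', 0, 0, 0, 0],
     ['dietro a', 'dietro a', 'eilksl', 0, 0, 1, 0],
     ['Per', 'per', 'epsf', 0, 0, 0, 0]],
    columns=columns)


def split_multiword_rows(frame, text_col='text', flag_col='multiwordexpr'):
    """Return a copy in which each flagged row is repeated once per word."""
    out = frame.copy()
    flagged = out[flag_col] == 1
    # Flagged rows become a list of words; other rows a one-item list.
    out[text_col] = [t.split() if f else [t]
                     for t, f in zip(out[text_col], flagged)]
    # explode() emits one row per list element and copies the other columns.
    return out.explode(text_col).reset_index(drop=True)


result = split_multiword_rows(orig_df)
```

Unlike the loop above, this sketch leaves multiwordexpr at 1 on both resulting rows, which matches the output requested in the question; the flag can be reset afterwards if needed. It also handles expressions with more than two words.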
