How to append many files to a dataframe in a loop

Question

I am trying to extract data from many .docx files and append it to a single dataframe.

The code I wrote works well for a single file, but I can't get it to append the data from additional files to the dataframe.

import re
import docx2txt
import pandas as pd
import glob

df2=pd.DataFrame()
appennded_data=[]

for file in glob.glob("*.docx"):
    text = docx2txt.process(file)
    a1=text.split()
    d2=a1[37]
    doc2=re.findall("HB0....",text)
    units2=re.findall("00[0-9]...",text) 
    df2['Units']=units2
    df2['Doc']=doc2[0]
    df2['Date']=d2
df2

This gives the error "Length of values does not match length of index".

Expected output:

From docx1 (which I already get):

Units | Doc     | Date
001   | HB00001 | 23/4/12
002   | HB00001 | 23/4/12
003   | HB00001 | 23/4/12
004   | HB00001 | 23/4/12
005   | HB00001 | 23/4/12

From docx2:

Units | Doc     | Date
010   | HB00002 | 2/6/16
011   | HB00002 | 2/6/16

Final output:

Units | Doc     | Date
001   | HB00001 | 23/4/12
002   | HB00001 | 23/4/12
003   | HB00001 | 23/4/12
004   | HB00001 | 23/4/12
005   | HB00001 | 23/4/12
010   | HB00002 | 2/6/16
011   | HB00002 | 2/6/16

Any help would be appreciated.

Answer 1

The error occurs because the column lengths do not match. When the second file is processed, the code tries to assign values whose length differs from the index that the first file established. You cannot assign a column whose length differs from that of the existing dataframe index.
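
The mismatch can be reproduced without any .docx files. The following is a minimal sketch using made-up unit values in place of the regex results; only the column name 'Units' comes from the question.

```python
import pandas as pd

df = pd.DataFrame()
# The first assignment on an empty frame defines a 3-row index.
df["Units"] = ["001", "002", "003"]

message = ""
try:
    # A second "file" yields only 2 values, which cannot fit the 3-row index.
    df["Units"] = ["010", "011"]
except ValueError as exc:
    message = str(exc)

print(message)
```

The frame keeps its original three rows after the failed assignment, which is why reusing one dataframe across loop iterations does not work here.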

Since you want the final df to have the columns ['Units', 'Doc', 'Date'], you can create a blank df and append each file's results to it. Use ignore_index=True so the new rows are simply stacked below the existing ones without trying to match row indexes.

import re
import docx2txt
import pandas as pd
import glob


final_df = pd.DataFrame()

for file in glob.glob("*.docx"):
    text = docx2txt.process(file)
    a1 = text.split()
    d2 = a1[37]
    doc2 = re.findall("HB0....", text)
    units2 = re.findall("00[0-9]...", text)

    # build a frame for this file only; the single Doc and Date values are
    # repeated (broadcast) for every unit found in the file
    df2 = pd.DataFrame({'Units': units2, 'Doc': doc2[0], 'Date': d2})

    # at the end of the loop, append it to the final_df
    final_df = pd.concat([final_df, df2], ignore_index=True)

print(final_df)
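
A common variation, not part of the answer above, is to collect the per-file frames in a list and call pd.concat once after the loop, which avoids re-copying the growing frame on every iteration. The sketch below assumes the per-file values have already been extracted; the `parsed` list is hypothetical stand-in data based on the expected output in the question.

```python
import pandas as pd

# Hypothetical per-file results standing in for what the loop extracts:
# (units found, document id, date)
parsed = [
    (["001", "002", "003", "004", "005"], "HB00001", "23/4/12"),
    (["010", "011"], "HB00002", "2/6/16"),
]

frames = []
for units, doc, date in parsed:
    # Scalar doc and date values are broadcast to the length of the units list.
    frames.append(pd.DataFrame({"Units": units, "Doc": doc, "Date": date}))

# One concat at the end; ignore_index=True renumbers the rows from 0.
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

In the real loop, the tuple unpacking would be replaced by the docx2txt and re.findall calls from the question.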

Answer 2

My suggestion is to first build a dict with the contents and create the DataFrame at the end:

import re
import docx2txt
import pandas as pd
import glob

columns = ['Units', 'Doc', 'Date']

data = {col: [] for col in columns}

for file in glob.glob("*.docx"):
    text = docx2txt.process(file)
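    a1 = text.split()
    d2 = a1[37]
    doc2 = re.findall("HB0....", text)
    units2 = re.findall("00[0-9]...", text)
    data['Units'].extend(units2)
    # repeat the single Doc and Date values once per unit so all lists stay the same length
    data['Doc'].extend([doc2[0]] * len(units2))
    data['Date'].extend([d2] * len(units2))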

df2 = pd.DataFrame(data)
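
One detail worth checking in this pattern is how list.extend treats a string: it iterates over it and adds one character per element, so a single document id or date has to be wrapped in a list and repeated once per unit. The sketch below illustrates this with hypothetical values taken from the expected output in the question.

```python
import pandas as pd

units = ["001", "002"]
doc = "HB00001"
date = "23/4/12"

# Passing a bare string to extend adds its characters one by one.
chars = []
chars.extend(doc)

# Wrapping the scalar in a list and repeating it keeps all columns aligned.
data = {"Units": [], "Doc": [], "Date": []}
data["Units"].extend(units)
data["Doc"].extend([doc] * len(units))
data["Date"].extend([date] * len(units))

df = pd.DataFrame(data)
print(chars)
print(df)
```

pd.DataFrame requires every list in the dict to have the same length, so keeping the three lists in step inside the loop is what makes the final construction succeed.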
