Pandas pd.merge "TypeError: string indices must be integers, not str"

I have researched this simple problem extensively but can't find an answer. I am trying to merge two files using pandas' pd.merge on a common column named "JN". I believe it is treating my joined (os.path.join) filename as a string instead of a DataFrame/CSV file. After I call pd.merge, the error says "string indices must be integers, not str".

import pandas as pd
import os

path = r"C:/Users/St/Documents/House/m2"

dirs = os.listdir(path)

for file in dirs:
    if file.endswith("J.csv"):
        J = file
        if len(J) is 12: #some filenames are 12 chars others 11
            jroot = J[:7]
        else:
            jroot = J[:6]

for file in dirs:
    if file.endswith("2.csv"):
        W = file
        if len(W) is 12:
            root2 = W[:7]
        else:
            root2 = W[:6]

JJ = os.path.join(path, J)
WW = os.path.join(path, W)

if jroot == root2:          # if the first 7 (or 6) characters match, then merge
    JW = pd.merge(JJ, WW, on="JN")

The pd.merge call above raises this error:

TypeError: string indices must be integers, not str

I am wondering how to make it read my filename string as an actual file or DataFrame. JJ and WW are full paths when printed out. I tried to turn these 'filenames' into DataFrames using pd.DataFrame but wasn't able to do so.

You cannot merge two strings. I think you're confused about what os.path.join returns: it returns a string. You have to actually read in the DataFrames from the files named by JJ and WW, then perform the merge.
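Applied to the question's variables, the fix could look like the sketch below. The temporary directory, the file names, and every column other than "JN" are made up for illustration; only the read-then-merge pattern matters.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the two CSV files; only the "JN" column name
# comes from the question.
tmpdir = tempfile.mkdtemp()
JJ = os.path.join(tmpdir, "1234567J.csv")
WW = os.path.join(tmpdir, "12345672.csv")
pd.DataFrame({"JN": [1, 2, 3], "j_value": ["a", "b", "c"]}).to_csv(JJ, index=False)
pd.DataFrame({"JN": [2, 3, 4], "w_value": [20.0, 30.0, 40.0]}).to_csv(WW, index=False)

# JJ and WW are still plain strings at this point, not DataFrames.
assert isinstance(JJ, str) and isinstance(WW, str)

# Read each path into a DataFrame first, then merge the DataFrames.
j_df = pd.read_csv(JJ)
w_df = pd.read_csv(WW)
JW = pd.merge(j_df, w_df, on="JN")  # inner join on "JN" by default
print(JW)
```

Only rows whose "JN" value appears in both files survive the default inner join; pass how="outer" (or "left"/"right") to pd.merge if unmatched rows should be kept.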

Here's a full example of writing two DataFrames, reading them back with read_csv, and then merging them on a column named group:

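In [47]: import numpy as np; import pandas as pd

In [48]: from numpy.random import randn; from pandas import DataFrame, read_csv

In [49]: df1 = DataFrame(randn(10, 1), columns=['a'])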

In [50]: df1['group'] = np.random.choice(['b', 'c'], size=len(df1))

In [51]: df2 = DataFrame(randn(10, 1), columns=['b'])

In [52]: df2['group'] = np.random.choice(['b', 'c'], size=len(df2))

In [53]: df1.to_csv('df1.csv', index=False)

In [54]: cat df1.csv
a,group
-1.590035935931282,b
0.5496398501891229,c
-0.6484689548035797,b
0.19162302248253205,b
-0.9852064283582675,c
0.5975155551821989,b
0.29443634291217047,b
-0.7929994157215382,b
-1.9546460886048795,b
0.19195457928475546,c

In [55]: df2.to_csv('df2.csv', index=False)

In [56]: cat df2.csv
b,group
-1.2874060006117918,c
1.1037959548210117,b
0.47172389260467507,c
0.12802538607490285,c
-0.8753708425917293,b
-0.09187827793091947,b
1.140204215271196,c
0.4862940170888638,b
-1.1080430563137758,b
-1.3698112665693232,c

In [57]: df1_csv = read_csv('df1.csv', index_col=None)

In [58]: df2_csv = read_csv('df2.csv', index_col=None)

In [59]: df1_csv
Out[59]:
       a group
0 -1.590     b
1  0.550     c
2 -0.648     b
3  0.192     b
4 -0.985     c
5  0.598     b
6  0.294     b
7 -0.793     b
8 -1.955     b
9  0.192     c

In [60]: df2_csv
Out[60]:
       b group
0 -1.287     c
1  1.104     b
2  0.472     c
3  0.128     c
4 -0.875     b
5 -0.092     b
6  1.140     c
7  0.486     b
8 -1.108     b
9 -1.370     c

In [61]: df3 = pd.merge(df1_csv, df2_csv, on='group')

In [62]: df3
Out[62]:
        a group      b
0  -1.590     b  1.104
1  -1.590     b -0.875
2  -1.590     b -0.092
3  -1.590     b  0.486
4  -1.590     b -1.108
5  -0.648     b  1.104
6  -0.648     b -0.875
7  -0.648     b -0.092
8  -0.648     b  0.486
9  -0.648     b -1.108
10  0.192     b  1.104
11  0.192     b -0.875
12  0.192     b -0.092
13  0.192     b  0.486
14  0.192     b -1.108
15  0.598     b  1.104
16  0.598     b -0.875
17  0.598     b -0.092
18  0.598     b  0.486
19  0.598     b -1.108
20  0.294     b  1.104
21  0.294     b -0.875
22  0.294     b -0.092
23  0.294     b  0.486
24  0.294     b -1.108
25 -0.793     b  1.104
26 -0.793     b -0.875
27 -0.793     b -0.092
28 -0.793     b  0.486
29 -0.793     b -1.108
30 -1.955     b  1.104
31 -1.955     b -0.875
32 -1.955     b -0.092
33 -1.955     b  0.486
34 -1.955     b -1.108
35  0.550     c -1.287
36  0.550     c  0.472
37  0.550     c  0.128
38  0.550     c  1.140
39  0.550     c -1.370
40 -0.985     c -1.287
41 -0.985     c  0.472
42 -0.985     c  0.128
43 -0.985     c  1.140
44 -0.985     c -1.370
45  0.192     c -1.287
46  0.192     c  0.472
47  0.192     c  0.128
48  0.192     c  1.140
49  0.192     c -1.370

A couple of other things:

Don't use is to compare objects for equality; use ==. The is operator tests identity (whether both operands are the same object), not equality. It only happens to work for small integers because CPython caches them, and you shouldn't rely on that because it's an implementation detail of CPython.
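A small sketch of the difference, using lists so the result does not depend on any integer caching. The file name is a made-up 12-character example in the style of the question.

```python
# Two distinct list objects holding equal values.
first = [1, 2, 3]
second = [1, 2, 3]

print(first == second)  # True: the values are equal
print(first is second)  # False: they are two separate objects

alias = first
print(alias is first)   # True: both names refer to the same object

# The length check from the question, written with == instead of is.
filename = "1234567J.csv"  # hypothetical 12-character name
jroot = filename[:7] if len(filename) == 12 else filename[:6]
print(jroot)  # 1234567
```

Recent Python versions (3.8 and later) also emit a SyntaxWarning for comparisons such as `len(J) is 12`, precisely because is with a literal is almost always a mistake.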

Instead of checking the file name with str.endswith, just iterate over the files you want by globbing first:

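import glob
import os

path = r"C:/Users/St/Documents/House/m2"

for f in glob.glob(os.path.join(path, '*J.csv')):
    # glob returns full paths, so measure the bare file name
    if len(os.path.basename(f)) == 12:
        pass  # do all the thingz!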
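Going one step further, the two loops in the question could be collapsed into a single pass that pairs each `*J.csv` file with its `*2.csv` counterpart. This is only a sketch under some assumptions: `pair_by_root` is a hypothetical helper, the root is taken to be the file name minus its 5-character suffix (which matches the 7-of-12 and 6-of-11 slicing in the question), and the demo files are empty placeholders.

```python
import glob
import os
import tempfile


def pair_by_root(directory, j_suffix="J.csv", w_suffix="2.csv"):
    """Return (j_path, w_path) pairs whose file names share the same root."""
    pairs = []
    for j_path in sorted(glob.glob(os.path.join(directory, "*" + j_suffix))):
        # Strip the suffix instead of branching on the name length.
        root = os.path.basename(j_path)[: -len(j_suffix)]
        w_path = os.path.join(directory, root + w_suffix)
        if os.path.exists(w_path):
            pairs.append((j_path, w_path))
    return pairs


# Demo with empty placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name in ("123456J.csv", "1234562.csv", "7654321J.csv", "76543212.csv", "999999J.csv"):
    open(os.path.join(demo_dir, name), "w").close()

pairs = pair_by_root(demo_dir)
for j_path, w_path in pairs:
    print(os.path.basename(j_path), "<->", os.path.basename(w_path))
```

Each pair can then be read with pd.read_csv and merged on "JN". Note that `999999J.csv` is skipped here because it has no matching `2.csv` file.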
