
How can I read in values from a text file and calculate how many times a value repeats and then find the average?

I have a text file called text.txt which looks like this:

5.H6 7.891 0.3
6.H6 7.693 0.3
7.H8 8.16859 0.3
8.H6 7.446 0.3
5.H6 7.72158 0.3
9.H8 8.1053 0.3
8.H6 7.65014 0.3
10.H6 7.54 0.3
12.H6 8.067 0.3
13.H6 8.047 0.3
14.H6 7.69624 0.3
6.H6 7.70272 0.3
17.H8 7.169 0.3
16.H8 8.27957 0.3
18.H6 7.385 0.3
19.H8 7.657 0.3
20.H8 7.78512 0.3
21.H8 8.06057 0.3

I want to create a new output text file which looks like this:

 Atom nVa  predppm   avgppm    
  7.H2   2   7.674   7.853    
  9.H2   2   7.434   7.458    
  20.H2  2   7.602   7.898   
  21.H2  1   7.959   7.898   
  8.H1'  1   5.363   5.238   

Essentially, I want to read the values from text.txt and check whether values in the first column repeat. For example, 5.H6 appears in rows 1 and 5, with second-column values 7.891 and 7.72158. I want to average those values and write the result to the avgppm column of the output file. The second column of the output file, nVa, should count how many times each value from the first column of text.txt occurs. For example, 5.H6 occurs twice, so nVa should be 2 for Atom 5.H6.

Right now, I'm only trying to produce the first, second, and fourth columns of my sample output file. Later I want to add further columns such as predppm, stdev, delta, etc.

This is my current code:

import pandas as pd

filename = 'text.txt'
df = pd.read_csv(filename,sep = r'/s+', header = None)
df[df.duplicated([' '], keep=False)]
df.sum(axis=1) / len(df.columns)


df.to_csv("output.txt",sep = r'/s+',header=None)

I'm not sure how to proceed. I can't test my code because I keep getting errors.

Edit: Error

    gb = (df.groupby("Atom", as_index=False).agg({"ppm":["count","mean"]}).rename(columns={"count":"nVa", "mean":"avgppm"}))
  File "/Library/Python/2.7/site-packages/pandas-0.20.3-py2.7-macosx-10.11-intel.egg/pandas/core/generic.py", line 4416, in groupby
    **kwargs)
  File "/Library/Python/2.7/site-packages/pandas-0.20.3-py2.7-macosx-10.11-intel.egg/pandas/core/groupby.py", line 1699, in groupby
    return klass(obj, by, **kwds)
  File "/Library/Python/2.7/site-packages/pandas-0.20.3-py2.7-macosx-10.11-intel.egg/pandas/core/groupby.py", line 392, in __init__
    mutated=self.mutated)
  File "/Library/Python/2.7/site-packages/pandas-0.20.3-py2.7-macosx-10.11-intel.egg/pandas/core/groupby.py", line 2690, in _get_grouper
    raise KeyError(gpr)
KeyError: 'Atom'

With df as:

     Atom      ppm  unclear
0    5.H6  7.89100      0.3
1    6.H6  7.69300      0.3
2    7.H8  8.16859      0.3
3    8.H6  7.44600      0.3
4    5.H6  7.72158      0.3
5    9.H8  8.10530      0.3
6    8.H6  7.65014      0.3
7   10.H6  7.54000      0.3
8   12.H6  8.06700      0.3
9   13.H6  8.04700      0.3
10  14.H6  7.69624      0.3
11   6.H6  7.70272      0.3
12  17.H8  7.16900      0.3
13  16.H8  8.27957      0.3
14  18.H6  7.38500      0.3
15  19.H8  7.65700      0.3
16  20.H8  7.78512      0.3
17  21.H8  8.06057      0.3
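
The KeyError: 'Atom' in the edit above suggests the frame was read without column names, so groupby("Atom") has nothing to look up. Here is a minimal sketch of one way to build the frame shown above, assuming the file is whitespace-delimited with no header row. The column names are simply the labels used in this answer, and the inlined rows stand in for text.txt. Note that the whitespace regex is \s+ with a backslash, not /s+.

```python
import io
import pandas as pd

# A few rows from text.txt, inlined so the sketch is self-contained;
# in practice pass the path "text.txt" instead of the StringIO object.
sample = io.StringIO(
    "5.H6 7.891 0.3\n"
    "6.H6 7.693 0.3\n"
    "7.H8 8.16859 0.3\n"
    "8.H6 7.446 0.3\n"
    "5.H6 7.72158 0.3\n"
)

# header=None says the file has no header row; names= supplies the
# labels (chosen for this answer) so that groupby("Atom") can find them.
df = pd.read_csv(sample, sep=r"\s+", header=None,
                 names=["Atom", "ppm", "unclear"])
```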

Use groupby() to collect information per Atom, then apply aggregation functions as desired:

gb = (df.groupby("Atom", as_index=False)
        .agg({"ppm":["count","mean"]})
        .rename(columns={"count":"nVa", "mean":"avgppm"}))
gb.head()
     Atom ppm         
          nVa   avgppm
0   10.H6   1  7.54000
1   12.H6   1  8.06700
2   13.H6   1  8.04700
3   14.H6   1  7.69624
4   16.H8   1  8.27957

That gives the workflow for grouping and aggregating, but it's not quite in the format you requested. We can drop the multi-level column structure, although it's not strictly necessary to compute the values you're interested in:

gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"":"Atom"})

     Atom  nVa   avgppm
0   10.H6    1  7.54000
1   12.H6    1  8.06700
2   13.H6    1  8.04700
3   14.H6    1  7.69624
4   16.H8    1  8.27957
5   17.H8    1  7.16900
6   18.H6    1  7.38500
7   19.H8    1  7.65700
8   20.H8    1  7.78512
9   21.H8    1  8.06057
10   5.H6    2  7.80629
11   6.H6    2  7.69786
12   7.H8    1  8.16859
13   8.H6    2  7.54807
14   9.H8    1  8.10530
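
As an alternative, on pandas 0.25 or later, named aggregation gives flat column names directly, so the droplevel step is not needed. This assumes a newer install than the 0.20.3 shown in the traceback, where the syntax is not available. The sketch below uses a subset of the data above. It also passes sort=False so that groups keep the order in which each Atom first appears, and adds a stdev column as an example of a further statistic.

```python
import pandas as pd

# Subset of the rows shown above, inlined for a self-contained example.
df = pd.DataFrame({
    "Atom": ["5.H6", "6.H6", "7.H8", "8.H6", "5.H6"],
    "ppm": [7.891, 7.693, 8.16859, 7.446, 7.72158],
})

# Each keyword names an output column; the tuple is (input column, function).
# sort=False keeps groups in order of first appearance rather than sorting
# the Atom labels as strings (which puts "10.H6" before "5.H6").
gb = (df.groupby("Atom", as_index=False, sort=False)
        .agg(nVa=("ppm", "count"),
             avgppm=("ppm", "mean"),
             stdev=("ppm", "std")))
```

The "std" aggregation is the sample standard deviation, so Atoms that occur only once get NaN in the stdev column.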

See groupby() docs for a full treatment.
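
To write the result to a text file, note that to_csv() expects a single-character separator, so a regex such as \s+ will not work there. Here is a minimal sketch, assuming a tab-separated file with three decimal places is acceptable. The small frame stands in for the gb computed above, and output.txt is the file name from the question.

```python
import pandas as pd

# Stand-in for the aggregated frame computed above.
gb = pd.DataFrame({
    "Atom": ["5.H6", "6.H6", "7.H8"],
    "nVa": [2, 1, 1],
    "avgppm": [7.80629, 7.693, 8.16859],
})

# With no path given, to_csv returns the text instead of writing a file.
# sep must be one character; float_format applies to float columns only.
text = gb.to_csv(sep="\t", index=False, float_format="%.3f")

with open("output.txt", "w") as fh:
    fh.write(text)
```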
