
split column into two columns using numpy

I have a text file containing 11 columns, which I opened with np.genfromtxt.

The third column looks like this:

   The Column
+220.18094-0.28421
+58.24577+0.08044
+58.24498+0.08177
+58.24552+0.08175
+86.55739-0.04768
+179.60575-0.34409
+86.55622-0.04726
+86.55649-0.04723
+86.55548-0.04718
+86.55879-0.04705
+86.55696-0.04685
+43.95906+0.14121
+356.95494+0.21770
+356.95594+0.21763 

I want to save only this column to a new text file, split into two columns like this:

Txt file:

+220.18094 -0.28421
+58.24577  +0.08044
+58.24498  +0.08177
+58.24552  +0.08175
+86.55739  -0.04768
+179.60575 -0.34409
+86.55622  -0.04726
+86.55649  -0.04723
+86.55548  -0.04718
+86.55879  -0.04705
+86.55696  -0.04685
+43.95906  +0.14121
+356.95494 +0.21770
+356.95594 +0.21763 

How can I do this?

Assuming you've read this column as a list of strings, use re to split each string into its two numbers:

In [479]: d
Out[479]: 
['+220.18094-0.28421',
 '+58.24577+0.08044',
 '+58.24498+0.08177',
 '+58.24552+0.08175',
 '+86.55739-0.04768',
 '+179.60575-0.34409',
 '+86.55622-0.04726',
 '+86.55649-0.04723',
 '+86.55548-0.04718',
 '+86.55879-0.04705',
 '+86.55696-0.04685',
 '+43.95906+0.14121',
 '+356.95494+0.21770',
 '+356.95594+0.21763']

In [480]: import re
     ...: [list(map(float, re.findall(r'[-+][^-+]*', i))) for i in d]
Out[480]: 
[[220.18094, -0.28421],
 [58.24577, 0.08044],
 [58.24498, 0.08177],
 [58.24552, 0.08175],
 [86.55739, -0.04768],
 [179.60575, -0.34409],
 [86.55622, -0.04726],
 [86.55649, -0.04723],
 [86.55548, -0.04718],
 [86.55879, -0.04705],
 [86.55696, -0.04685],
 [43.95906, 0.14121],
 [356.95494, 0.2177],
 [356.95594, 0.21763]]
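The same idea can be written as a small self-contained function. This is only a sketch, assuming Python 3 and assuming each entry holds signed numbers with no exponent notation; the names SIGNED_NUMBER and split_signed_pair are made up for illustration.

```python
import re

# Matches a sign followed by everything up to the next sign.
SIGNED_NUMBER = re.compile(r'[-+][^-+]*')

def split_signed_pair(text):
    """Split a string such as '+220.18094-0.28421' into a list of floats."""
    return [float(part) for part in SIGNED_NUMBER.findall(text)]

rows = ['+220.18094-0.28421', '+58.24577+0.08044']
pairs = [split_signed_pair(row) for row in rows]
print(pairs)
```

Because float() ignores surrounding whitespace, a trailing space on an entry should not matter here.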

EDIT:

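When I define the column as d = data[:,2], d is array([ nan, nan, nan, ..., nan, nan, nan]). Why?

Your file may contain a mixture of numbers and strings. By default np.genfromtxt parses every field as a float and stores nan for any field it cannot convert. Use np.genfromtxt(fname, dtype=object) and print the result to check whether the file was read successfully.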
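The nan behaviour can be reproduced on a tiny in-memory sample. This is a sketch under assumptions: the three-column sample is invented, and dtype=str is used here as one way to keep the raw text (an assumption, not necessarily what the answer above intended).

```python
import io
import numpy as np

# Hypothetical three-column sample; the last column fuses two signed numbers.
sample = "1 2 +220.18094-0.28421\n3 4 +58.24577+0.08044\n"

# The default dtype is float, so the unparseable third column becomes nan.
as_float = np.genfromtxt(io.StringIO(sample))
print(as_float[:, 2])

# Reading everything as text keeps the original strings intact.
as_text = np.genfromtxt(io.StringIO(sample), dtype=str, encoding='utf-8')
print(as_text[:, 2])
```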

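def edit_file():
    # Read every line first, because the same file is rewritten in place.
    with open('file.txt', 'r') as f:
        lines = f.readlines()

    with open('file.txt', 'w') as f1:
        for line in lines:
            # Put two spaces before every sign so the fused numbers separate.
            line = line.replace('+', '  +')
            line = line.replace('-', '  -')
            f1.write(line)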

file.txt:

  +220.18094  -0.28421
  +58.24577  +0.08044
  +58.24498  +0.08177
  +58.24552  +0.08175
  +86.55739  -0.04768
  +179.60575  -0.34409
  +86.55622  -0.04726
  +86.55649  -0.04723
  +86.55548  -0.04718
  +86.55879  -0.04705
  +86.55696  -0.04685
  +43.95906  +0.14121
  +356.95494  +0.21770
  +356.95594  +0.21763

This can be done in a simple way using replace(), if you prefer.
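Note that replace() on the whole line touches every sign in the line, including signs in the other ten columns. A variant restricted to the third field might look like the sketch below. It assumes whitespace-delimited lines and no exponent notation; split_third_field and the sample line are hypothetical.

```python
def split_third_field(line):
    """Return the third whitespace-separated field with its two numbers spaced apart."""
    field = line.split()[2]
    # Keep the first character (the leading sign) in place, then separate the rest.
    head, tail = field[0], field[1:]
    tail = tail.replace('+', ' +').replace('-', ' -')
    return head + tail

line = "a b +220.18094-0.28421 c"
print(split_third_field(line))
```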

To read in only the third column do:

d = np.genfromtxt('yourfile.txt', usecols=2, dtype=str)

To split and convert to floats you could do this:

g = np.array([x.replace('+', ' ').replace('-', ' -').split() for x in d], dtype=float)

And to save to file:

np.savetxt('yournewfile.txt',g)
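By default np.savetxt writes floats in exponential notation (fmt='%.18e'). If the signed, five-decimal layout from the question is wanted, a format string can be passed. This is a sketch: the array below is a stand-in for g, and the temporary directory exists only so the example cleans up after itself.

```python
import os
import tempfile
import numpy as np

# Stand-in for the float array g built in the previous step.
g = np.array([[220.18094, -0.28421], [58.24577, 0.08044]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'yournewfile.txt')
    # '%+.5f' always prints the sign and keeps five decimals.
    np.savetxt(path, g, fmt='%+.5f')
    with open(path) as f:
        written = f.read().splitlines()

print(written)
```

With a single format specifier, the columns are separated by one space and are not padded to a common width.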

Answer for 2021

Things have changed in the seven years since this question was originally asked, and the previous answers don't seem to actually answer the question as it was defined. I recently ran into this problem and discovered a solution after finding this incompletely answered question. If other people stumble across this question when they are trying to do the same thing, I hope this solution gets them back in action.

The Problem

The original question states there are 11 columns which are loaded using numpy's genfromtxt function. The third column should be split and saved to a separate file in a fixed column format.

A proper answer to the question will show how to take that specific column, split it, then write it to a separate file in the correct format. The method we use will work on any column in a numpy array, so this solution can be applied to other problems very easily.

The Solution

This solution is how I did it. If there is a more efficient way, comment below.

1. Import libraries

We're working with numpy, so we need to import it.

import numpy as np

We'll also use re to split the string in the column, so import that as well.

import re

2. Read the data

First, we will read using genfromtxt, as required. The question does not state which parameters were used, so we will rely on many defaults.

d = np.genfromtxt('data.csv', dtype=str, delimiter=',', skip_header=1, encoding='UTF-8')

In this line, we are loading a data.csv that has comma-separated string values, one header row, and everything encoded in UTF-8. You'll notice dtype is set explicitly to str. This is important. The structure of the array will be different if you use None and the code below will fail, so make sure you use dtype=str.

3. Split column three

Here is the tricky part. We need to take out the single column to split, run a map function on it to split the string in the column, then put it all back together again.

c1 = np.hstack([*map(lambda x: re.findall(r'[-+]\d+\.\d+',x), d[:, 2])]).reshape(d.shape[0], 2)

That was a lot to unpack, so let's take a closer look. The lambda x: re.findall(r'[-+]\d+\.\d+', x) function splits the input string into two separate strings and retains their signs. It is used inside map(..., d[:, 2]), which applies the function to each row of the third column (column index 2, since indexing is zero-based). A recent change to np.hstack will throw a warning if you place the map function in it directly, so we need to convert it to a list before using it as an argument in the np.hstack function call. One way to do that is [*map(...)], and that is what we've done.

That explains the np.hstack call, but we're not done there. It returns a 1d array instead of a 2d array, so we need to reshape it to one row per input row and one column per split value. The column count is 2, not 11, because we are only working from a single column that was split into two.
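If the hstack-then-reshape step feels opaque, a list comprehension may give the same 2d result more directly. This is a sketch rather than the method used in this answer; it assumes every entry matches the pattern exactly twice, and the array named column is a stand-in for d[:, 2].

```python
import re
import numpy as np

# Stand-in for d[:, 2], the third column read as strings.
column = np.array(['+220.18094-0.28421', '+58.24577+0.08044', '+86.55739-0.04768'])

pattern = re.compile(r'[-+]\d+\.\d+')
# Each findall call returns two strings, so the nested list is already n x 2.
c1 = np.array([pattern.findall(value) for value in column])

print(c1.shape)
```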

4. Join columns (optional)

The question does not ask to reassemble the columns, but I needed to do it. I imagine other people do too, so this is how I did it. It is easy with np.hstack like so:

d = np.hstack((d[:,0:2], c1, d[:,3:]))

Notice how we're passing a tuple of three arrays. The first holds the columns leading up to column three, the second holds the two columns that column three turned into, and the third holds the columns after column three. The double parentheses are not a typo. The np.hstack function takes a single argument, so we wrap the three arrays in a tuple rather than passing them as three separate arguments. If you split the first or last column, you would only have two items in your tuple.
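Here is the join on a toy table, to make the slicing concrete. The 2 x 4 array and its values are invented for illustration; only the np.hstack call mirrors the line above.

```python
import numpy as np

# Hypothetical 2 x 4 table of strings; column index 2 is the one being split.
d = np.array([['a', 'b', '+220.18094-0.28421', 'z'],
              ['c', 'd', '+58.24577+0.08044', 'y']])
c1 = np.array([['+220.18094', '-0.28421'],
               ['+58.24577', '+0.08044']])

# Columns before index 2, the two new columns, then the columns after it.
joined = np.hstack((d[:, 0:2], c1, d[:, 3:]))
print(joined.shape)
```

The result has one more column than the input, since one column was replaced by two.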

5. Write to file

Whew! That was a lot, but we're not done. Now we need to write the split column to a data file in the specified format. That format appears to be a left aligned, 10-character string for the first column, and an 8-character string for the right column, separated by a space. We will use np.savetxt for this.

np.savetxt('data.txt',c1,fmt='%-10s %8s')
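To see what that format string does to a single row, it can be tried with Python's % operator, which uses the same printf-style syntax. A small sketch with two rows copied from the question:

```python
# '%-10s' left-aligns in 10 characters; '%8s' right-aligns in 8 characters.
fmt = '%-10s %8s'

rows = [('+220.18094', '-0.28421'), ('+58.24577', '+0.08044')]
lines = [fmt % row for row in rows]

for line in lines:
    print(line)
```

The shorter first value is padded on the right, which is what produces the aligned second column in the requested output.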

Final notes

If you've made it this far, you've split a single column out of many and possibly recombined them to make a table with one more column than you started with or you have a file with space-separated values. Great! There is a caveat though. We've forced everything to be strings for the duration of this exercise. If you want to work on the values as floating point numbers, or anything else, you'll have to convert the numpy array.
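For the conversion mentioned above, astype(float) is one option. A minimal sketch, assuming every string in the array parses as a number; the small c1 array here is a stand-in for the split column.

```python
import numpy as np

# Stand-in for the split column, still stored as strings.
c1 = np.array([['+220.18094', '-0.28421'],
               ['+58.24577', '+0.08044']])

# Convert the whole string array to floating point in one step.
values = c1.astype(float)
print(values.dtype, values[0, 1])
```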

As it stands, I believe this answer fully answers the original question and hopefully proves to be useful to other people who are trying to split a column of a 2d numpy array.

Here is everything put all together:

import numpy as np
import re
d = np.genfromtxt('data.csv', dtype=str, delimiter=',', skip_header=1, encoding='UTF-8')
c1 = np.hstack([*map(lambda x: re.findall(r'[-+]\d+\.\d+',x), d[:, 2])]).reshape(d.shape[0], 2)
d = np.hstack((d[:,0:2], c1, d[:,3:]))
np.savetxt('data.txt',c1,fmt='%-10s %8s')

Cheers!
