How to convert the “rows” of a pandas Series into columns of a DataFrame?
I have the following pandas Series, ser1, with shape (100,).
import pandas as pd
ser1 = pd.Series(...)
print(ser1.shape)
## prints (100,)
Each ndarray in this Series has length 150000, and each element is a character.
print(len(ser1.iloc[0]))
## prints 150000
ser1.head()
sample1 xhtrcuviuvjhgfsrexvuvhfgshgckgvghfsgfdsdsg...
sample2 jhkjhgkjvkjgfjyqerwqrbxcvmkoshfkhgjknlkdfk...
sample3 sdfgfdxcvybnjbvtcyuikjhbgfdftgyhujhghjkhjn...
sample4 bbbbbbadfashdwkjhhguhoadfopnpbfjhsaqeqjtyi...
sample5 gfjyqedxcvrexvuvcvmkoshdftgyhujhgcvmkoshfk...
dtype: object
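I would like to convert this pandas Series into a pandas DataFrame such that each element of a Series "row" becomes a DataFrame column. That is, every character of each Series entry gets its own column. In this case, the DataFrame built from ser1 would have 150000 columns.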
print(type(df_ser1)) # DataFrame of ser1
## outputs <class 'pandas.core.frame.DataFrame'>
df_ser1.head()
samples char1 char2 char3 char4 char5 char6
0 sample1 x h t r c u
1 sample2 j h k j h g
2 sample3 s d f g f d
3 sample4 b b b b b b
........
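How does one convert a pandas Series into a DataFrame in this manner?
The most obvious idea would be
df_ser = ser1.to_frame()
but this does not split the elements into separate DataFrame columns:
df_ser = ser1.to_frame()
df_ser.head()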
0
sample1 xhtrcuviuvjhgfsrexvuvhfgshgckgvghfsgfdsdsg...
sample2 jhkjhgkjvkjgfjyqerwqrbxcvmkoshfkhgjknlkdfk...
sample3 sdfgfdxcvybnjbvtcyuikjhbgfdftgyhujhghjkhjn...
......
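As a quick check of the behaviour above, here is a minimal sketch on a three-row stand-in Series (the short strings are illustrative, not the real data). It only confirms that to_frame() keeps each string whole in a single column.

```python
import pandas as pd

# Small stand-in for ser1; the real Series holds 150000-character strings.
ser1 = pd.Series(['xhtrcu', 'jhkjhg', 'sdfgfd'],
                 index=['sample1', 'sample2', 'sample3'])

df_ser = ser1.to_frame()

# to_frame() wraps the Series as-is: one column, one whole string per row.
print(df_ser.shape)  # (3, 1)
```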
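Somehow one would have to iterate over each element of the Series "row" and create a column, though I'm not sure how computationally feasible that is. (Not very pythonic.)
How would one do this?
Consider the sample Series ser1: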
ser1 = pd.Series(
'abc def ghi'.split(),
'sample1 sample2 sample3'.split())
After converting each string to a list of characters, use pd.Series.apply with pd.Series.
ser1.apply(lambda x: pd.Series(list(x))) \
.rename(columns=lambda x: 'char{}'.format(x + 1))
char1 char2 char3
sample1 a b c
sample2 d e f
sample3 g h i
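The layout requested in the question keeps the sample names in a samples column with a default integer index. One possible way to get there, sketched under the assumption that the column name samples is what is wanted, is to name the index and then reset it:

```python
import pandas as pd

ser1 = pd.Series(
    'abc def ghi'.split(),
    'sample1 sample2 sample3'.split())

df_ser1 = (
    ser1.apply(lambda x: pd.Series(list(x)))
        .rename(columns=lambda x: 'char{}'.format(x + 1))
        .rename_axis('samples')   # give the index a name ...
        .reset_index()            # ... and move it into a regular column
)

print(df_ser1)
```

The result has the columns samples, char1, char2, char3 and a 0-based integer index.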
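My approach is to work with the data as a numpy array and then store the final product in a pandas DataFrame. Overall, though, creating 100k+ columns in a DataFrame seems to be slow.
Compared to piRSquared's solution, mine isn't really any better, but I figured I would post it anyway since it is a different approach.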
import pandas as pd
from timeit import default_timer as timer
# setup some sample data
a = ["c"]
a = a*100
a = [x*10**5 for x in a]
a = pd.Series(a)
print("shape of the series = %s" % a.shape)
print("length of each string in the series = %s" % len(a[0]))
Output:
shape of the series = 100
length of each string in the series = 100000
# get a numpy array representation of the pandas Series
b = a.values
# split each string in the series into a list of individual characters
c = [list(x) for x in b]
# save it as a dataframe
df = pd.DataFrame(c)
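A variation on the steps above, offered only as a sketch: when all strings have the same length, the Python-level list(x) split can be replaced by reinterpreting a fixed-width numpy unicode array as single characters. The names arr, width and chars are illustrative, and the equal-length assumption matters, since shorter strings would be padded with empty strings.

```python
import numpy as np
import pandas as pd

a = pd.Series(['abc', 'def', 'ghi'],
              index=['sample1', 'sample2', 'sample3'])

# Fixed-width unicode array; the dtype is '<U3' for these three-character strings.
arr = np.array(a.tolist())

# Number of characters per string, derived from the dtype's item size.
width = arr.dtype.itemsize // np.dtype('U1').itemsize

# View each fixed-width string as `width` one-character strings, one row per string.
chars = arr.view('U1').reshape(len(arr), width)

# Keep the original index and label the columns char1, char2, ...
df = pd.DataFrame(chars, index=a.index,
                  columns=['char{}'.format(i + 1) for i in range(width)])
print(df)
```

This avoids building one Python list per string, but it does not avoid the cost of constructing a very wide DataFrame.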
Since piRSquared has already posted a solution, I should include a runtime analysis.
execTime=[]
start = timer()
# get a numpy array representation of the pandas Series
b = a.values
end = timer()
execTime.append(end-start)
start = timer()
# split each string in the series into a list of individual characters
c = [list(x) for x in b]
end = timer()
execTime.append(end-start)
start = timer()
# save it as a dataframe
df = pd.DataFrame(c)
end = timer()
execTime.append(end-start)
start = timer()
a.apply(lambda x: pd.Series(list(x))).rename(columns=lambda x: 'char{}'.format(x + 1))
end = timer()
execTime.append(end-start)
print("get numpy array = %s" % execTime[0])
print("Split each string into chars runtime = %s" % execTime[1])
print("Save 2D list as Dataframe runtime = %s" % execTime[2])
print("piRSquared's solution runtime = %s" % execTime[3])
Output:
get numpy array = 7.788003131281585e-06
Split each string into chars runtime = 0.17509693499960122
Save 2D list as Dataframe runtime = 12.092364584001189
piRSquared's solution runtime = 13.954442440001003
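Single start/stop timings like the ones above can be noisy. A hedged sketch of the same comparison using the standard timeit module is shown below, on much smaller stand-in data (20 strings of 1000 characters, an arbitrary choice) so that it runs quickly. The helper names via_list and via_apply are hypothetical.

```python
import timeit
import pandas as pd

# Smaller stand-in data than the original benchmark (sizes are assumptions).
a = pd.Series(['c' * 1000] * 20)

def via_list():
    # Split into character lists, then build the frame in one call.
    return pd.DataFrame([list(x) for x in a.values])

def via_apply():
    # Build one pd.Series per row via apply.
    return a.apply(lambda x: pd.Series(list(x)))

# Take the best of a few repeats to dampen noise.
t_list = min(timeit.repeat(via_list, number=1, repeat=3))
t_apply = min(timeit.repeat(via_apply, number=1, repeat=3))

print("list + DataFrame runtime = %s" % t_list)
print("apply(pd.Series) runtime = %s" % t_apply)
```

Absolute numbers will differ by machine and pandas version, so only the relative ordering on your own data is meaningful.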