How to split string data into a single column without affecting the dataset itself (Python Pandas)?

The dataset I'm working on has a column with zipcodes in it. Some entries have only one zipcode, while others have 2, 5, or 10+ zipcodes. Like this:

Zipcode(s)
1245
5863, 5682, 1995
6978, 1123, 5659, 34554
4539, 6453

I want to do some simple analysis: apply value_counts() to the column to see which zipcodes are the most popular. But I can't do that properly, since most cells contain multiple zipcodes. That's also why I want an approach that doesn't affect the dataset itself and only produces a separate result in which all zipcodes are split out into one column.

I've tried splitting them into multiple columns with .str.split(',', n=20, expand=True), but that's not really what I'm looking for. I want them all split into a single column.

I think pandas.DataFrame.explode is what you're looking for.
With it, every value in a list (the lists you create with the split function) is moved to its own row.

import pandas as pd

df = pd.DataFrame({
    "Zipcodes":["8000", "2000, 2002, 3003", "8000, 2002", "3004, 2004, 3003"]
})

df

(
    df.Zipcodes
    .str.replace(" ", "")  # strip spaces; optional if you split on ", " instead
    .str.split(",")        # turn each cell into a list of zipcodes
    .explode()             # one row per zipcode; returns a new Series, df is unchanged
    .value_counts()
)

Output:

8000    2
2002    2
3003    2
2000    1
3004    1
2004    1
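
To check that this approach really leaves the dataset untouched, here is a minimal sketch using the sample data from the question. The helper name count_zipcodes is made up for illustration, and the sketch assumes the zipcodes are stored as comma-separated strings.

```python
import pandas as pd


def count_zipcodes(frame, column):
    """Count individual zipcodes in a column of comma-separated strings."""
    return (
        frame[column]
        .str.split(",")   # each cell becomes a list of strings
        .explode()        # one row per list element, as a new Series
        .str.strip()      # drop whitespace left around each zipcode
        .value_counts()
    )


df = pd.DataFrame({
    "Zipcode(s)": ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})
original = df.copy()  # snapshot taken only to compare against afterwards

counts = count_zipcodes(df, "Zipcode(s)")
print(counts)
print(df.equals(original))  # True: nothing was assigned back to df
```

Because every step in the chain returns a new object and nothing is assigned back to df, the original column keeps its comma-separated strings.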

You can use the Python snippet below:

import pandas as pd

df = pd.DataFrame({
    "Zipcode(s)": ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})
# Split each cell into a list without assigning back, so df stays unchanged
split_codes = df["Zipcode(s)"].map(lambda zcode: zcode.split(", "))
# Flatten the list of lists into one flat list of zipcodes
zipcodes = sum(split_codes.to_list(), [])
# Build a separate one-column DataFrame just for counting
dummydf = pd.DataFrame({"Zipcode(s)": zipcodes})
print(dummydf["Zipcode(s)"].value_counts())

Output:

1245     1
5863     1
5682     1
1995     1
6978     1
1123     1
5659     1
34554    1
4539     1
6453     1
Name: Zipcode(s), dtype: int64
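
If the counts don't need to end up in a pandas object, the flattening step could also be done with the standard library. The sketch below swaps in itertools.chain and collections.Counter; it is not part of the answer above, and the variable names are illustrative. It assumes the cells are separated by ", " exactly, as in the sample data.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "Zipcode(s)": ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})

# Split each cell lazily; df itself is only read, never modified
split_codes = (cell.split(", ") for cell in df["Zipcode(s)"])

# chain.from_iterable flattens the lists in linear time, unlike sum(lists, [])
counts = Counter(chain.from_iterable(split_codes))

# most_common(n) returns the n most frequent zipcodes with their counts
print(counts.most_common(3))
```

The main design difference is that sum(..., []) builds a new list on every addition, which can get slow with many rows, while chain.from_iterable walks the lists once.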
