How to split string data into a single column without affecting the dataset itself (Python Pandas)?
The data set I'm working on has a column with zipcodes in it. Some entries have only one zipcode; others have 2, 5, or 10+ zipcodes. Like this:
| Zipcode(s) |
|---|
| 1245 |
| 5863, 5682, 1995 |
| 6978, 1123, 5659, 34554 |
| 4539, 6453 |
I want to do some simple analysis: apply value_counts() to the column to see which zipcodes are the most popular. But I can't do that properly, since most cells contain multiple zipcodes. That's also why I want an approach that doesn't affect the dataset itself and only produces a separate result in which all zipcodes are split out into one column.
I've tried splitting them into multiple columns with .str.split(',', n=20, expand=True), but that's not really what I'm looking for. I want them all split into a single column.
I think pandas.DataFrame.explode is what you're looking for (the same method exists on a Series as Series.explode). It takes every value from the lists (which you created with the split function) and puts each one on its own row.
import pandas as pd
df = pd.DataFrame({
"Zipcodes":["8000", "2000, 2002, 3003", "8000, 2002", "3004, 2004, 3003"]
})
df
(
df.Zipcodes
.str.replace(" ", "") # optional: removes the spaces; if you skip this step,
.str.split(",") # split on ", " instead of ","
.explode()
.value_counts()
)
Output:
8000 2
2002 2
3003 2
2000 1
3004 1
2004 1
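As a quick check of the "without affecting the dataset" requirement, here is a minimal sketch that assumes the same sample frame as above. The name `counts` is made up for illustration, and `.str.strip()` is swapped in for the `.str.replace(" ", "")` step. Each method in the chain returns a new object, so `df` should come out unchanged.

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({
    "Zipcodes": ["8000", "2000, 2002, 3003", "8000, 2002", "3004, 2004, 3003"]
})

# Every step returns a new Series; df is never modified in place.
counts = (
    df["Zipcodes"]
    .str.split(",")   # one list of raw pieces per row
    .explode()        # one piece per row, original index repeated
    .str.strip()      # drop the space left after each comma
    .value_counts()
)

print(counts)
print(df.shape)  # the original frame still has 4 rows and 1 column
```

Because nothing is assigned back to `df`, the split values exist only in `counts`.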
You can use the Python snippet below:
import pandas as pd
df = pd.DataFrame({
"Zipcode(s)" : ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})
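# split each cell into a list, without assigning back to df (the original stays untouched)
split_lists = df["Zipcode(s)"].map(lambda zcode: zcode.split(", "))
# flatten the list of lists into one flat list of zipcodes
zipcodes = sum(split_lists.to_list(), [])
# create a separate helper dataframe with one zipcode per row
dummydf = pd.DataFrame({"Zipcode(s)" : zipcodes})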
print(dummydf["Zipcode(s)"].value_counts())
Output:
1245 1
5863 1
5682 1
1995 1
6978 1
1123 1
5659 1
34554 1
4539 1
6453 1
Name: Zipcode(s), dtype: int64
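If only the counts are needed, the helper dataframe can be skipped. The following is a rough sketch of a different technique (the standard library's `collections.Counter`, not the method used in the answer above), assuming the same sample data. The name `zip_counts` is hypothetical.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Zipcode(s)": ["1245", "5863, 5682, 1995", "6978, 1123, 5659, 34554", "4539, 6453"]
})

# Count every zipcode across all cells; df is only read, never modified.
zip_counts = Counter(
    code.strip()                 # remove the space after each comma
    for cell in df["Zipcode(s)"]
    for code in cell.split(",")
)

# most_common(n) returns the n most frequent (zipcode, count) pairs
print(zip_counts.most_common(3))
```

This also avoids `sum(list_of_lists, [])`, which builds a new list on every addition and can get slow on large columns.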