
Character replacement and split for new columns in CSV dataframe using Python, sklearn, Pandas

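I am currently trying to convert column 6 from a date format that uses slashes (e.g. 2/4/09) to one that uses dashes and has no zeros (2-4-9). In addition, I would like to take each value and give it its own column (as shown in the desired output). I have researched and tried several solutions, but I can't seem to figure it out. I am still trying to work out how to replace and remove characters (as shown below). I am very new to working with dataframes in Python. Any tips or help would be greatly appreciated. Thank you.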

from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import ensemble
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')

df[6].replace(['\/'],['-'],regex=True, inplace=True)
df[6].replace('0','',regex=True,inplace=True)

Error:

classifier_v1.4.py:18: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True, subset=['Name', 'TRY', 'LOC', 'OUTPUT', 'TYPE_A', 'SIGNAL', 'A-B', 'SPOT'])
Traceback (most recent call last):
  File "/Users/namel/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 5

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "file.py", line 20, in <module>
    df[5].replace(['\/'],['-'],regex=True)
  File "/Users/name/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/name/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 5

Current dataframe:

         0    1    2        3          4       5        6     7  
0     Name  TRY  LOC   OUTPUT     TYPE_A   SIGNAL     A-B  SPOT 
1    inc 1    2   20   TYPE-1    TORPEDO   ULTRA   2/4/09   -21
2    inc 2    3   16   TYPE-2    TORPEDO     ILH   2/4/09   -14
3    inc 3    2   20  BLACK47    TORPEDO    LION   2/4/09    49
4    inc 4    3   12   TYPE-2  CENTRALPA    LION   2/4/09    25
5    inc 5    3   10   TYPE-2      THREE    LION   2/4/09   -21
6    inc 6    2   20   TYPE-2        ATF    LION   2/4/09   -48
7    inc 7    4    2  NIVEA-1        ATF    LION   7/3/03   -23
8    inc 8    3   16  NIVEA-1        ATF    LION   7/3/03    18
9    inc 9    3   18  BLENDER  CENTRALPA    LION   7/3/03    48
10   inc 10   4   20    DELCO        ATF    LION   7/3/03   -26
11   inc 11   3   20    VE248        ATF    LION   7/3/03    44
12   inc 12   1   20   SILVER  CENTRALPA    LION   5/9/02   -35
13   inc 13   2   20  CALVIN3     SEVENX    LION   5/9/02   -20
14   inc 14   3   14  DECK-BT  CENTRALPA    LION   5/9/02   -38
15   inc 15   4    4  10-LEVI    BERWYEN     OWL   5/9/02   -29
16   inc 16   4   14   TYPE-2        ATF     NOV   5/9/02   -31
17   inc 17   4   10     NYNY    TORPEDO     NOV   5/9/02    21
18   inc 18   2   20  NIVEA-1  CENTRALPA     NOV   1/7/06    45
19   inc 19   3   27   FMRA97    TORPEDO     NOV   1/7/06   -26
20   inc 20   4   18   SILVER        ATF     NOV   1/7/06   -46

Desired output:

         0    1    2        3          4       5       6   7   8   9    10   
0     Name  TRY  LOC   OUTPUT     TYPE_A   SIGNAL    A-B  D1  D2  D3  SPOT 
1    inc 1    2   20   TYPE-1    TORPEDO   ULTRA   2-4-9   2   4   9   -21
2    inc 2    3   16   TYPE-2    TORPEDO     ILH   2-4-9   2   4   9   -14
3    inc 3    2   20  BLACK47    TORPEDO    LION   2-4-9   2   4   9    49
4    inc 4    3   12   TYPE-2  CENTRALPA    LION   2-4-9   2   4   9    25
5    inc 5    3   10   TYPE-2      THREE    LION   2-4-9   2   4   9   -21
6    inc 6    2   20   TYPE-2        ATF    LION   2-4-9   2   4   9   -48
7    inc 7    4    2  NIVEA-1        ATF    LION   7-3-3   7   3   3   -23
8    inc 8    3   16  NIVEA-1        ATF    LION   7-3-3   7   3   3    18
9    inc 9    3   18  BLENDER  CENTRALPA    LION   7-3-3   7   3   3    48
10   inc 10   4   20    DELCO        ATF    LION   7-3-3   7   3   3   -26
11   inc 11   3   20    VE248        ATF    LION   7-3-3   7   3   3    44
12   inc 12   1   20   SILVER  CENTRALPA    LION   5-9-2   5   9   2   -35
13   inc 13   2   20  CALVIN3     SEVENX    LION   5-9-2   5   9   2   -20
14   inc 14   3   14  DECK-BT  CENTRALPA    LION   5-9-2   5   9   2   -38
15   inc 15   4    4  10-LEVI    BERWYEN     OWL   5-9-2   5   9   2   -29
16   inc 16   4   14   TYPE-2        ATF     NOV   5-9-2   5   9   2   -31
17   inc 17   4   10     NYNY    TORPEDO     NOV   5-9-2   5   9   2    21
18   inc 18   2   20  NIVEA-1  CENTRALPA     NOV   1-7-6   1   7   6    45
19   inc 19   3   27   FMRA97    TORPEDO     NOV   1-7-6   1   7   6   -26
20   inc 20   4   18   SILVER        ATF     NOV   1-7-6   1   7   6   -46

There may be a more efficient way to do this, but the code below will achieve what you want.

from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import ensemble
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')

# insert empty placeholder columns right after 'A-B' so that 'SPOT' stays last
df.insert(7, 'D1', '')
df.insert(8, 'D2', '')
df.insert(9, 'D3', '')

# replace the slashes with dashes, then strip every zero (literal matches, no regex)
df['A-B'] = df['A-B'].str.replace('/', '-', regex=False)
df['A-B'] = df['A-B'].str.replace('0', '', regex=False)

# fill the new columns with the three dash-separated parts
df['D1'] = df.apply(lambda x: str(x['A-B']).split('-')[0], axis=1)
df['D2'] = df.apply(lambda x: str(x['A-B']).split('-')[1], axis=1)
df['D3'] = df.apply(lambda x: str(x['A-B']).split('-')[2], axis=1)

print(df)
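
If the row-wise apply calls above turn out to be slow on a larger file, one possible vectorized variant is sketched below. It assumes the column is named 'A-B' and that every value has exactly three slash-separated parts; the small inline sample is made up and only stands in for file.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for file.csv.
csv_text = """Name,TRY,LOC,OUTPUT,TYPE_A,SIGNAL,A-B,SPOT
inc 1,2,20,TYPE-1,TORPEDO,ULTRA,2/4/09,-21
inc 7,4,2,NIVEA-1,ATF,LION,7/3/03,-23
inc 18,2,20,NIVEA-1,CENTRALPA,NOV,1/7/06,45
"""
df = pd.read_csv(io.StringIO(csv_text))

# Same two literal replacements as above, chained on the whole column.
df['A-B'] = (df['A-B'].str.replace('/', '-', regex=False)
                      .str.replace('0', '', regex=False))

# expand=True returns one column per piece, so no row-wise apply is needed.
parts = df['A-B'].str.split('-', expand=True)

# Insert the pieces right after 'A-B' so that 'SPOT' stays last.
start = df.columns.get_loc('A-B') + 1
for offset, name in enumerate(['D1', 'D2', 'D3']):
    df.insert(start + offset, name, parts[offset])

print(df)
```

Looking up the position with get_loc avoids hard-coding 7, 8 and 9, which only holds while 'A-B' is the seventh column.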

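Given that you are working with dates, you could load the dates as DateTime when reading the csv and process them further. Because of the unusual format you want for the year (no zero padding), it does require some extra processing: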

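from datetime import datetime

# pd.datetime was removed in pandas 2.0, so use the standard library's datetime
dateparser = lambda x: datetime.strptime(x, '%d/%m/%y')
# note: date_parser is deprecated as of pandas 2.0
df = pd.read_csv('file.csv', parse_dates=['A-B'], date_parser=dateparser)
df['D1'] = df['A-B'].dt.day
df['D2'] = df['A-B'].dt.month
# keep only the last digit of the year (2009 -> 9)
df['D3'] = df.apply(lambda row: int(str(row['A-B'].year)[3:]), axis=1)
# rebuild the string with dashes and strip the zeros
df['A-B'] = df['A-B'].apply(lambda x: str(x.strftime('%d-%m-%y')).replace("0", ""))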
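
One caveat with stripping every "0" is that it also alters values such as 10 or 20 (10/10/10 would become 1-1-1). A hedged alternative, sketched below, parses the column after reading with pd.to_datetime and builds the unpadded string from the numeric parts. The two-row sample is made up, and the day-first order is an assumption carried over from the '%d/%m/%y' format above.

```python
import io
import pandas as pd

# Hypothetical sample; the second row shows a date that contains real zeros.
csv_text = """Name,A-B,SPOT
inc 1,2/4/09,-21
inc 2,10/10/10,5
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse once, then work with integer components instead of string surgery.
dates = pd.to_datetime(df['A-B'], format='%d/%m/%y')

df['D1'] = dates.dt.day
df['D2'] = dates.dt.month
df['D3'] = dates.dt.year % 100   # two-digit year as a number: 2009 -> 9, 2010 -> 10

# Integers carry no zero padding, so joining them gives the desired format.
df['A-B'] = (df['D1'].astype(str) + '-'
             + df['D2'].astype(str) + '-'
             + df['D3'].astype(str))

print(df)
```

Note that this keeps 10 as the year for 2010, whereas the slicing in the code above would give 0; which of the two is wanted depends on the data.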

Output:

    Name  TRY  LOC   OUTPUT     TYPE_A  SIGNAL    A-B  SPOT  D1  D2  D3
0  inc 1    2   20   TYPE-1    TORPEDO   ULTRA  2-4-9   -21   2   4   9
1  inc 2    3   16   TYPE-2    TORPEDO     ILH  2-4-9   -14   2   4   9
2  inc 3    2   20  BLACK47    TORPEDO    LION  2-4-9    49   2   4   9
3  inc 4    3   12   TYPE-2  CENTRALPA    LION  2-4-9    25   2   4   9
4  inc 5    3   10   TYPE-2      THREE    LION  2-4-9   -21   2   4   9

KeyError: 5 means that the key 5 does not exist. In this case the column label is not an integer but a string, so you need to use quotes.
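
The difference between integer and string labels can be reproduced with a tiny made-up CSV. This is only a sketch: the sample text and the variable names by_position and by_name are illustrative and not taken from the question.

```python
import io
import pandas as pd

# Hypothetical one-row sample with a header line.
csv_text = """Name,TRY,A-B
inc 1,2,2/4/09
"""

# header=None: the labels are the integers 0, 1, 2 and the names end up in row 0.
by_position = pd.read_csv(io.StringIO(csv_text), header=None)
print(by_position[2].tolist())

# Default header handling: the labels are the strings from the first line,
# so an integer key is not found.
by_name = pd.read_csv(io.StringIO(csv_text))
try:
    by_name[2]
except KeyError as err:
    print('KeyError:', err)

# Either use the quoted label or select by position explicitly with iloc.
print(by_name['A-B'].tolist())
print(by_name.iloc[:, 2].tolist())
```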

Another (probably more practical) approach is to drop the first row and use row 1 as the column headers.
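
For a frame that is already loaded with integer labels and the names sitting in its first data row, promoting that row to the header might look like the sketch below. The three-column frame is a made-up, shortened stand-in for the one in the question.

```python
import pandas as pd

# Hypothetical frame shaped like the question's: integer labels, names in row 0.
df = pd.DataFrame([
    ['Name', 'A-B', 'SPOT'],
    ['inc 1', '2/4/09', '-21'],
    ['inc 2', '7/3/03', '18'],
])

# Use the first row as the column labels...
df.columns = df.iloc[0].tolist()
# ...then drop that row from the data and renumber the index from 0.
df = df.iloc[1:].reset_index(drop=True)

print(df.columns.tolist())
print(df['A-B'].tolist())
```

All values stay strings after this step, so numeric columns such as SPOT may still need a conversion, for example with pd.to_numeric.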

.replace works fine with lists of original and new values. There are several alternative ways; two of them are shown below.
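
As a rough illustration of how these replacement styles compare, the sketch below applies them to a made-up two-value Series. The variable names are hypothetical; each call returns a new Series, so the result has to be assigned for the change to be kept.

```python
import pandas as pd

s = pd.Series(['2/4/09', '7/3/03'])

# Lists of patterns and replacements, applied as regular expressions.
with_lists = s.replace(['/', '0'], ['-', ''], regex=True)

# A dict that maps each pattern to its replacement.
with_dict = s.replace({'/': '-', '0': ''}, regex=True)

# Chained string-accessor calls with literal (non-regex) matching.
with_str = (s.str.replace('/', '-', regex=False)
             .str.replace('0', '', regex=False))

print(with_lists.tolist())
print(with_dict.tolist())
print(with_str.tolist())
```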

Using split as shown below, you can add all three new columns at once.

from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import ensemble
import pandas as pd
import numpy as np

df = pd.read_csv('/Users/ciit2/downloads/test.csv', header=1)
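# assign the results back instead of relying on inplace=True on a selected column
df['A-B'] = df['A-B'].replace({'/': '-'}, regex=True)
df['A-B'] = df['A-B'].replace('0', '', regex=True)
# expand=True returns the three parts as separate columns, aligned on the index
df[['D1', 'D2', 'D3']] = df['A-B'].str.split('-', expand=True)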
df

