Python Multiple Keys and Transform to Dictionary

I am importing a txt file in Python 2.6.6 and need to do some data wrangling. I am new to Python and am struggling to google every step to complete the task. Could you help or suggest something?

Here is my input, myData.txt, shown below. The header is not in the data, but I put it here for easier reading.

key1|key2|group|v1|v2|v3|v4
1|A|-1|10|100|1|2
1|A|2|20|35|2|3
1|B|1|15|5|3|5
2|B|5|23|25|4|2
2|B|2|33|20|22|98    
2|D|4|23|21|20|32
...

Here is my desired output in a pandas DataFrame, shown below. Basically, I want to merge key1 and key2 to form a combo key, and put group, v1, and v2 into a dictionary with group as the key and v1 and v2 as the values in a list (v1 being the first element and v2 the second). I don't need v3 or v4 in the output.

     comboKey1  new_v1
     1_A        {"-1":[10,100], "2":[20,35]}
     1_B        {"1":[15,5]}
     2_B        {"2":[33,20], "5":[23,25]}
     2_D        {"4":[23,21]}

Here is what I have now. Could someone kindly advise?

import pandas as pd
df1 = pd.read_csv('myData.txt', header=None, sep='|')
# The file has seven columns, so all seven need a name
df1.columns = ['key1', 'key2', 'group', 'v1', 'v2', 'v3', 'v4']
df1['comboKey1'] = df1['key1'].map(str) + "_" + df1['key2']

import pandas as pd

# Dict where the info will be stored
main_dict = {}

# Reading the file, 'r' -> read; 'with' closes it when the block ends
with open('myData.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines

        # Getting the list of values in the line
        # values -> [key1, key2, group, v1, v2, v3, v4]
        # index  ->   0     1      2     3   4   5   6
        values = line.split('|')

        # creating combo_key
        combo_key = values[0] + "_" + values[1]

        # tests if the key already exists
        # if not, creates a new dict for it
        if combo_key not in main_dict:
            main_dict[combo_key] = {}

        # split() returns strings, so v1 and v2 are converted to int
        main_dict[combo_key][values[2]] = [int(values[3]), int(values[4])]

data = []
for key in main_dict:
    data.append([key, str(main_dict[key])])

df = pd.DataFrame(data, columns=['comboKey1', 'new_v1'])

print(df)

Just sort the dict, then (:
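A minimal sketch of what that sorting could look like, assuming `main_dict` has the nested shape built by the loop above. The literal values are copied from the sample data, and `rows` and `ordered` are names made up for this illustration. A plain dict on older Python versions does not guarantee any order, so the keys are sorted explicitly when the rows are built:

```python
# Sample of the nested dict built by the loop above (values from the question's data).
main_dict = {
    '2_B': {'5': [23, 25], '2': [33, 20]},
    '1_A': {'-1': [10, 100], '2': [20, 35]},
    '2_D': {'4': [23, 21]},
    '1_B': {'1': [15, 5]},
}

rows = []
for combo_key in sorted(main_dict):
    groups = main_dict[combo_key]
    # Sort the group keys numerically, so '-1' comes before '2' and '2' before '5'.
    ordered = [(g, groups[g]) for g in sorted(groups, key=int)]
    rows.append((combo_key, ordered))

for combo_key, ordered in rows:
    print(combo_key, ordered)
```

The group keys are sorted with `key=int` because they are stored as strings; a plain string sort would put '10' before '2'.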

If you just want to achieve the desired output, the following code works too.

import pandas as pd
from io import StringIO

YOUR_TXT_DATA = """\
1|A|-1|10|100|1|2
1|A|2|20|35|2|3
1|B|1|15|5|3|5
2|B|5|23|25|4|2
2|B|2|33|20|22|98    
2|D|4|23|21|20|32
"""

df = pd.read_csv(StringIO(YOUR_TXT_DATA), header=None,
                 usecols=[_ for _ in range(0, 5)],
                 names=['key1', 'key2', 'group', 'v1', 'v2'],
                 sep='|')
result_dict = dict(comboKey1=[], new_v1=[])
for key1, key2, group, v1, v2 in df.values:
    key = str(key1) + '_' + str(key2)
    if key not in result_dict['comboKey1']:
        result_dict['comboKey1'].append(key)
        result_dict['new_v1'].append({str(group): [v1, v2]})
    else:
        index = result_dict['comboKey1'].index(key)
        result_dict['new_v1'][index].update({str(group): [v1, v2]})

result_df = pd.DataFrame.from_dict(result_dict)
print(result_df)

Output

  comboKey1                            new_v1
0       1_A  {'-1': [10, 100], '2': [20, 35]}
1       1_B                    {'1': [15, 5]}
2       2_B    {'5': [23, 25], '2': [33, 20]}
3       2_D                   {'4': [23, 21]}
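The loop above looks up each key with `in` and `.index()` on a list, which scans the list for every row. As a sketch of an alternative (not the answer author's method), the rows can be grouped in a dict keyed by the combo key. `SAMPLE` and `grouped` are names made up for this illustration, and the standard `csv` module stands in for `pd.read_csv` so the example has no third-party dependency:

```python
import csv
from io import StringIO

# Same sample rows as in the question.
SAMPLE = """\
1|A|-1|10|100|1|2
1|A|2|20|35|2|3
1|B|1|15|5|3|5
2|B|5|23|25|4|2
2|B|2|33|20|22|98
2|D|4|23|21|20|32
"""

grouped = {}
for row in csv.reader(StringIO(SAMPLE), delimiter='|'):
    if not row:
        continue  # skip blank lines
    # Only the first five fields are needed; v3 and v4 are ignored.
    key1, key2, group, v1, v2 = [field.strip() for field in row[:5]]
    combo = key1 + '_' + key2
    # setdefault creates the inner dict the first time a combo key is seen.
    grouped.setdefault(combo, {})[group] = [int(v1), int(v2)]

for combo, groups in grouped.items():
    print(combo, groups)
```

If a DataFrame is needed afterwards, `pd.DataFrame(list(grouped.items()), columns=['comboKey1', 'new_v1'])` should give the same two columns.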

About Test Data

I think there are some special cases that you may need to consider. Suppose the data is as follows.

key1|key2|group|v1|v2|v3|v4
1|A|-1|10|100|1|2
1|A|-1|10|100|1|2
1|A|-1|20|35|2|3

What is your expected output? (cases 1 to 3)

  • case 1: the last row wins: 1_A {'-1': [20, 35]} (solution: dict)
  • case 2: keep all rows, without duplicates: 1_A {('-1', (10, 100)), ('-1', (20, 35))} (solution: set)
  • case 3: keep all rows: 1_A [('-1', (10, 100)), ('-1', (10, 100)), ('-1', (20, 35))] (solution: list)
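The difference between the three cases comes down to how each container treats a repeated group. A small sketch without pandas, where `rows`, `as_dict`, `as_set`, and `as_list` are names made up for this illustration:

```python
# The three rows for combo key 1_A, reduced to (group, v1, v2).
rows = [('-1', 10, 100), ('-1', 10, 100), ('-1', 20, 35)]

as_dict = {}    # case 1: a later row overwrites an earlier one with the same group
as_set = set()  # case 2: exact duplicates collapse into one entry
as_list = []    # case 3: every row is kept, duplicates included

for group, v1, v2 in rows:
    as_dict[group] = [v1, v2]
    # Set members must be hashable, so (v1, v2) is a tuple rather than a list.
    as_set.add((group, (v1, v2)))
    as_list.append((group, (v1, v2)))

print(as_dict)
print(as_set)
print(as_list)
```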

Code:

from unittest import TestCase
import pandas as pd
from io import StringIO

OTHER_TXT_DATA = """\
1|A|-1|10|100|1|2
1|A|-1|10|100|1|2
1|A|-1|20|35|2|3
"""

class MyTests(TestCase):
    def __init__(self, *args, **options):
        super().__init__(*args, **options)
        self.df = pd.read_csv(StringIO(OTHER_TXT_DATA), header=None,
                              usecols=[_ for _ in range(0, 5)],
                              names=['key1', 'key2', 'group', 'v1', 'v2'],
                              sep='|')

    def setUp(self) -> None:
        # init on every test case.
        self.result_dict = dict(comboKey1=[], new_v1=[])

    def solution_base(self, new_v1_fun, update_v1_fun) -> pd.DataFrame:

        result_dict = self.result_dict

        for key1, key2, group, v1, v2 in self.df.values:
            key = str(key1) + '_' + str(key2)
            if key not in result_dict['comboKey1']:
                result_dict['comboKey1'].append(key)
                new_v1_fun(group, v1, v2)  # result_dict['new_v1'].append({str(group): [v1, v2]})
            else:
                index = result_dict['comboKey1'].index(key)
                update_v1_fun(index, group, v1, v2)  # result_dict['new_v1'][index].update({str(group): [v1, v2]})

        df = pd.DataFrame.from_dict(result_dict)
        print(df)
        return df

    def test_case_1_dict(self):
        df = self.solution_base(new_v1_fun=lambda group, v1, v2: self.result_dict['new_v1'].append({str(group): [v1, v2]}),
                                update_v1_fun=lambda index, group, v1, v2: self.result_dict['new_v1'][index].update({str(group): [v1, v2]}))
        self.assertTrue(df.equals(pd.DataFrame(
            columns=['comboKey1', 'new_v1'],
            data=[
                ['1_A', {'-1': [20, 35]}],
            ]
        )))

    def test_case_2_set(self):
        df = self.solution_base(new_v1_fun=lambda group, v1, v2: self.result_dict['new_v1'].append({(str(group), (v1, v2))}),
                                update_v1_fun=lambda index, group, v1, v2: self.result_dict['new_v1'][index].add((str(group), (v1, v2))))
        self.assertTrue(df.equals(pd.DataFrame(
            columns=['comboKey1', 'new_v1'],
            data=[
                ['1_A', {('-1', (20, 35)), ('-1', (10, 100))}],
            ]
        )))

    def test_case_3_list(self):
        df = self.solution_base(new_v1_fun=lambda group, v1, v2: self.result_dict['new_v1'].append([(str(group), (v1, v2))]),
                                update_v1_fun=lambda index, group, v1, v2: self.result_dict['new_v1'][index].append((str(group), (v1, v2))))
        self.assertTrue(df.equals(pd.DataFrame(
            columns=['comboKey1', 'new_v1'],
            data=[
                ['1_A', [('-1', (10, 100)), ('-1', (10, 100)), ('-1', (20, 35))]],
            ]
        )))

Note: annotations (see PEP 484) are not supported in Python 2.
