Iterating over pandas rows to get the minimum

Question

Here is my dataframe:

Date         cell         tumor_size(mm)
25/10/2015    113           51
22/10/2015    222           50
22/10/2015    883           45
20/10/2015    334           35
19/10/2015    564           47
19/10/2015    123           56  
22/10/2014    345           36
13/12/2013    456           44

What I want to do is compare the sizes of tumors detected on different days. Take cell 222 as an example: I want to compare its size with the sizes of other cells, but only those detected on earlier days. So I will not compare it with cell 883, because both were detected on the same day, and I will not compare it with cell 113, because that one was detected later. As my dataset is very large, I have to iterate over the rows. To explain it in a non-Pythonic way:

for the cell 222:
     get_size_distance(absolute value):
          (50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (50 - 44 = 6)
     get_minimum = 3, I got this value when I compared it with 564, so I will name it as the pair for the cell 222
Then do it for the cell 883

The resulting output should look like this:

Date         cell         tumor_size(mm)   pair    size_difference
25/10/2015    113           51            222        1
22/10/2015    222           50            564        3
22/10/2015    883           45            456        1
20/10/2015    334           35            345        1
19/10/2015    564           47            456        3
19/10/2015    123           56            456        12
22/10/2014    345           36            456        8
13/12/2013    456           44            NaN        NaN

I would really appreciate your help.

Answer 1

It's not pretty, but I believe it does the trick:

import pandas as pd

a = pd.read_clipboard()
# Cut off last row since it was a faulty date. You can skip this.
df = a.iloc[:-1].copy()

# Convert to dates and sort newest first. The sort keeps rows that share a
# date next to each other, which the list assignment at the end relies on.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df = df.sort_values('Date', ascending=False, kind='mergesort')
# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})

# These will be our lists of pairs and size differences.
pairs = []
diffs = []

# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Otherwise take the lowest absolute difference and fill it in.
        else:
            compare_df['size_diff'] = (compare_df.tumor_size - row.tumor_size).abs()
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])

# The lists were filled in the same order as the rows of df.
df['pair'] = pairs
df['size_difference'] = diffs

This returns:

        Date  cell  tumor_size   pair  size_difference
0 2015-10-25   113          51  222.0              1.0
1 2015-10-22   222          50  564.0              3.0
2 2015-10-22   883          45  564.0              2.0
3 2015-10-20   334          35  345.0              1.0
4 2015-10-19   564          47  345.0             11.0
5 2015-10-19   123          56  345.0             20.0
6 2014-10-22   345          36    NaN              NaN
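
If the loop turns out to be too slow, one possible alternative is to build the whole table of pairwise differences at once with NumPy broadcasting. The block below is only a sketch and not part of the code above. It assumes the column has already been renamed to tumor_size, it types the question's data in by hand instead of reading the clipboard, and it keeps the last row (13/12/2013). The names diff, earlier, best and has_pair are made up for the illustration.

```python
import numpy as np
import pandas as pd

# The question's data, typed in by hand (assumption: no clipboard needed).
df = pd.DataFrame({
    "Date": ["25/10/2015", "22/10/2015", "22/10/2015", "20/10/2015",
             "19/10/2015", "19/10/2015", "22/10/2014", "13/12/2013"],
    "cell": [113, 222, 883, 334, 564, 123, 345, 456],
    "tumor_size": [51, 50, 45, 35, 47, 56, 36, 44],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

dates = df["Date"].to_numpy()
sizes = df["tumor_size"].to_numpy(dtype=float)
cells = df["cell"].to_numpy()

# diff[i, j] = |size of row i - size of row j|
diff = np.abs(sizes[:, None] - sizes[None, :])

# earlier[i, j] is True when row j was detected strictly before row i.
earlier = dates[None, :] < dates[:, None]

# Rule out same-day and later rows by making their difference infinite.
diff[~earlier] = np.inf

best = diff.argmin(axis=1)        # position of the closest earlier row
has_pair = earlier.any(axis=1)    # False for rows with nothing earlier

df["pair"] = np.where(has_pair, cells[best], np.nan)
df["size_difference"] = np.where(has_pair, diff.min(axis=1), np.nan)
print(df)
```

Keep in mind that this holds an n x n matrix in memory, so it only helps while the number of rows is moderate. When several earlier cells are equally close, argmin picks the first one in row order.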
