
Controlling python outputs to console

I'm building a movie recommendation system using Hadoop/MapReduce. For now I'm implementing the MapReduce steps in pure Python.

So what I'm basically doing is running each mapper and reducer separately and piping the mapper's console output into the reducer.

The issue I'm having is that Python writes values to the terminal as strings, so when I'm working with numbers they come out as strings, and converting them back adds more load on the server.

So how do I resolve this issue? I'm looking to implement it using pure Python and no 3rd-party libs.

import sys

def mapper():
    '''
        From Mapper1 we need only UserID , (MovieID , rating)
        as output.
    '''

    #* First mapper

    # Read input lines from stdin
    for line in sys.stdin:
        # Strip whitespace and split on the ',' delimiter
        data = line.strip().split(',')

        if len(data) == 4:
            userid, movieid, rating, timestamp = data
            # Format the output string ourselves: printing a tuple
            # directly would emit Python's repr of it, e.g. with
            # quotes around the strings
            print("{0},({1},{2})".format(userid, movieid, rating))
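The mapper's parsing step can be smoke-tested locally without Hadoop. A minimal sketch, where `map_line` is a hypothetical helper (not from the original code) that applies the same split-and-format logic to one CSV line:

```python
# Hypothetical helper mirroring the mapper's per-line logic,
# so it can be tested without piping through stdin.
def map_line(line):
    data = line.strip().split(',')
    if len(data) == 4:
        userid, movieid, rating, timestamp = data
        return "{0},({1},{2})".format(userid, movieid, rating)
    return None  # malformed lines are skipped

print(map_line("671,4973,4.5,1260759144"))  # -> 671,(4973,4.5)
```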

Here's the reducer:

def reducer():

    oldKey = None
    rating_arr = []

    for line in sys.stdin:
        # We receive user,(movie,rating) lines and need to group
        # the tuples for each unique user, so we append the tuples
        # to an array. Given that we have two data points, we split
        # the string only at the first occurrence of ','
        data = line.strip().split(',', 1)

        # Check for 2 data values
        if len(data) != 2:
            continue

        x, y = data

        # A new key arrived: flush the ratings collected so far
        if oldKey and oldKey != x:
            print("{0},{1}".format(oldKey, rating_arr))
            rating_arr = []
        oldKey = x
        rating_arr.append(y)

    if oldKey is not None:
        print("{0},{1}".format(oldKey, rating_arr))
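Locally, the whole mapper | sort | reducer pipeline can be simulated in-process. A sketch with the sample data hard-coded for illustration (Hadoop Streaming performs the sort between the two stages):

```python
# Sample ratings in the mapper's input format (illustrative values)
raw = [
    "671,4973,4.5,1260759144",
    "670,4995,4.0,1260759300",
    "671,4993,5.0,1260759200",
]

# Mapper step: emit "user,(movie,rating)" records
mapped = []
for line in raw:
    userid, movieid, rating, timestamp = line.strip().split(',')
    mapped.append("{0},({1},{2})".format(userid, movieid, rating))

# Shuffle/sort step: Hadoop Streaming sorts records by key
mapped.sort()

# Reducer step: group the (movie,rating) parts per user
grouped = {}
for rec in mapped:
    user, pair = rec.split(',', 1)
    grouped.setdefault(user, []).append(pair)

print(grouped)  # {'670': ['(4995,4.0)'], '671': ['(4973,4.5)', '(4993,5.0)']}
```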


The input is:

671,(4973,4.5)
671,(4993,5.0)
670,(4995,4.0)

The output is:

671,['(4973,4.5)', '(4993,5.0)']
670,['(4995,4.0)']

I need the tuples as they are, not as strings.

Since data is a string, the y you get after splitting it and assigning to it is still a string.

If you want the raw values of the tuple, as numbers, you need to parse them.

ast.literal_eval can help.

For example,

In [1]: line = """671,(4973,4.5)"""

In [2]:  data = line.strip().split(',',1)

In [3]: data
Out[3]: ['671', '(4973,4.5)']

In [4]: x , y = data

In [5]: type(y)
Out[5]: str

In [6]: import ast

In [7]: y = ast.literal_eval(y)

In [8]: y
Out[8]: (4973, 4.5)

In [9]: type(y)
Out[9]: tuple

In [10]: type(y[0])
Out[10]: int
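Putting that into the reducer's loop, the parsing could look like this. A minimal sketch; `parse_record` is an illustrative name, not from the original code:

```python
import ast

# Parse one "user,(movie,rating)" line into a key and a real tuple,
# so the reducer accumulates numbers rather than strings.
def parse_record(line):
    data = line.strip().split(',', 1)
    if len(data) != 2:
        return None  # malformed line
    user, pair = data
    # ast.literal_eval safely evaluates the tuple literal,
    # yielding (int, float) instead of a raw string
    return user, ast.literal_eval(pair)

user, pair = parse_record("671,(4973,4.5)")
print(user, pair)     # 671 (4973, 4.5)
print(type(pair[0]))  # <class 'int'>
```

Unlike eval, ast.literal_eval only accepts Python literals, so it won't execute arbitrary code from the input stream.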

Now, if you were willing to switch to PySpark, you would have much more control over variable/object types, rather than everything being strings as with Hadoop Streaming.
