How to reconstruct and change the structure of a dataset using Python?

I have a dataset, and I need to restructure some of its data into a new format.

My dataset looks like this (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362,
285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the style below (and store the result in a new file named train.txt):

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….

My Python version is 2.7.13 and my operating system is Ubuntu 14.04 LTS. I would appreciate any help. Thank you so much.

I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is very powerful.

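import re

def return_no_commas(string):
    # \d* matches runs of digits, but it also matches the empty string
    regex = r'\d*'
    # findall returns every match, left to right, as a list of strings
    matches = re.findall(regex, string)
    for match in matches:
        print(match)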


numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what everything does.

import re

just imports Python's regular expression module. The regular expression I wrote is

regex = r'\d*'

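The "r" at the beginning marks the pattern as a raw string, so the backslash reaches the regex engine unchanged. The "\d" part matches any single digit, and the "*" part lets it repeat any number of times, including zero. Because zero repetitions are allowed, the pattern also matches the empty string at every comma and space, which is why blank lines appear between the numbers when we print out all the matches. Using "\d+" (one or more digits) would match only the numbers themselves.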
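
To see what the quantifier does in practice, here is a small sketch that compares `\d*` with `\d+` on a short sample string. The sample text and the helper name `find_runs` are illustrative assumptions, not part of the original answer.

```python
import re

def find_runs(pattern, text):
    """Hypothetical helper: return every match of pattern in text, in order."""
    return re.findall(pattern, text)

sample = "12, 34"

# \d* may match zero digits, so empty strings show up at the comma and space
star_matches = find_runs(r'\d*', sample)

# \d+ needs at least one digit, so only the numbers are returned
plus_matches = find_runs(r'\d+', sample)

print(star_matches)
print(plus_matches)
```

Filtering the empty strings out of the first list gives the same result as the second, so switching the quantifier is usually the simpler fix.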

I saved your numbers in a string called numbers, but you could just as easily read in a file and work with its contents.

You'll get something like:

2342728


2414939


2397722


2386848


2398737


2367906


2384003


2399896


2359702


2414293


2411228


2416802


2322710


2387437


2397274


2344681


2396522


2386676


2413824


2328225


2413833


2335374


2328594


497966


2384001


2372746


2386538


2348518


2380037


2374364


2352054


2377990


2367915


2412520


2348070


2356469


2353541


2413446


2391930


2366968


2364762


2347618


2396550


2370538


2393212
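
The blank lines in the output above come from `\d*` matching empty strings. As a rough sketch of how the same idea could go from train1.txt straight to train.txt, the version below uses `\d+` so that only non-empty runs of digits are kept. The function names `extract_numbers` and `convert_file` are made up for this example, and it assumes the file contains nothing but non-negative integers and separators.

```python
import re

def extract_numbers(text):
    """Return every run of one or more digits found in text, in order."""
    return re.findall(r'\d+', text)

def convert_file(src_path, dst_path):
    """Read numbers from src_path and write them to dst_path, one per line.

    Returns how many numbers were written.
    """
    with open(src_path) as src:
        numbers = extract_numbers(src.read())
    with open(dst_path, 'w') as dst:
        # join with newlines and end the file with a final newline
        dst.write('\n'.join(numbers) + '\n')
    return len(numbers)
```

Calling `convert_file("train1.txt", "train.txt")` should then produce the one-number-per-line layout asked for in the question.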

It sounds to me like your original data is separated by commas. However, you want the data separated by newline characters (\n) instead. This is very easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert(rfilename != wfilename)

    # read the whole input file into a string;
    # the with-block closes the file automatically
    with open(rfilename, "r") as rfile:
        rstryng = rfile.read()

    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst    == ["1", "2", "3", "4"]

    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]

    wstryng = "\n".join(lyst)

    # write() takes a single string; writelines() is meant for a list of lines
    with open(wfilename, "w") as wfile:
        wfile.write(wstryng)


convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
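
If the input might contain a trailing comma or be spread over several lines, splitting on commas alone can leave empty entries in the output. The sketch below is one possible way to handle that, processing the file line by line and skipping empty tokens. The names `split_numbers` and `convert_streaming` are invented for this example, and it assumes every number is separated from the next by a comma or a line break.

```python
def split_numbers(text):
    """Split on commas, trim whitespace, and drop empty entries."""
    return [token.strip() for token in text.split(',') if token.strip()]

def convert_streaming(src_path, dst_path):
    """Copy numbers from src_path to dst_path, one per line.

    Reads one input line at a time, so the whole file is never held in memory.
    Returns how many numbers were written.
    """
    count = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            for number in split_numbers(line):
                dst.write(number + '\n')
                count += 1
    return count
```

For a file as small as the one in the question the difference is negligible. Reading line by line mainly helps when the dataset grows large.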

Since others have added answers, I will include one using numpy. If you are OK with using numpy, it is as simple as:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list instead of a numpy array:

data.tolist()

[2342728,
 2414939,
 2397722,
 2386848,
 2398737,
 2367906,
 2384003,
 2399896,
 ....
]
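
The snippet above only loads the numbers. To also write them out one per line, `np.savetxt` is one option. This is a minimal sketch, assuming the input holds plain integers on a single comma-separated line with no trailing comma; the function name `convert_with_numpy` is made up for illustration.

```python
import numpy as np

def convert_with_numpy(src_path, dst_path):
    """Load comma-separated integers and save them one per line."""
    data = np.genfromtxt(src_path, dtype=int, delimiter=',')
    # atleast_1d covers a file with a single value; ravel flattens to 1-D
    flat = np.atleast_1d(data).ravel()
    # fmt='%d' writes plain integers instead of scientific notation
    np.savetxt(dst_path, flat, fmt='%d')
    return flat
```

Here `convert_with_numpy('train1.txt', 'train.txt')` would return the flattened array and write train.txt as a side effect.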
