How to reconstruct and change the structure of a dataset using Python?

Question

I have a dataset and I need to restructure some of its data into a new layout.

My dataset looks like the following (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362,
285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the style below (stored in a new file named train.txt):

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….

My Python version is 2.7.13 and my operating system is Ubuntu 14.04 LTS. I would appreciate any help. Thank you so much.

Answer 1

I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.

import re


def return_no_commas(string):
    # \d+ matches one or more digits; \d* would also match empty strings
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)


numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what everything does.

import re

just imports the regular expression module. The regular expression I use is

regex = r'\d+'

The "r" at the beginning makes it a raw string, so the backslash reaches the regex engine unchanged. "\d" matches any digit, and "+" says it must repeat one or more times. Use "+" rather than "*" here: "\d*" also matches the empty string, so re.findall would return empty strings between the numbers. Then we print out all the matches.
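To see how the "*" and "+" quantifiers differ on comma-separated input, here is a small sketch. The `sample` string is made up for illustration and is not taken from train1.txt.

```python
import re

# Hypothetical two-number sample, not the real dataset.
sample = "12, 34"

# \d* may match zero digits, so findall also reports empty strings
# at positions where no digit is present.
loose = re.findall(r"\d*", sample)

# \d+ requires at least one digit, so only the numbers come back.
strict = re.findall(r"\d+", sample)

print(loose)
print(strict)
```

Filtering the empty strings out of the first result gives the same list as the second pattern, which is why "+" is the simpler choice here.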

I saved your numbers in a string called numbers, but you could just as easily read in a file and work with those contents.

You'll get each number printed on its own line.
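Since the question asks for a file rather than printed output, the same idea could be applied to files roughly as follows. This is only a sketch: the function name `extract_numbers_to_file` is hypothetical, and it assumes the input is small enough to read into memory and that every run of digits in it is a value you want to keep.

```python
import re


def extract_numbers_to_file(src_path, dst_path):
    """Write every run of digits found in src_path to dst_path, one per line.

    Returns how many numbers were written.
    """
    with open(src_path, "r") as src:
        text = src.read()

    # One or more digits; commas, spaces and line breaks are skipped.
    numbers = re.findall(r"\d+", text)

    with open(dst_path, "w") as dst:
        for number in numbers:
            dst.write(number + "\n")

    return len(numbers)
```

Calling `extract_numbers_to_file("train1.txt", "train.txt")` would then produce the one-number-per-line file described in the question.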

Answer 2

It sounds to me like your original data is separated by commas. However, you want the data separated by newline characters (\n) instead. This is very easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read from
    wfilename -- name of file to write to
    """
    assert rfilename != wfilename

    # read the whole input file into a string
    with open(rfilename, "r") as rfile:
        rstryng = rfile.read()

    lyst = rstryng.split(",")
    # EXAMPLE:
    #     rstryng == "1,2,3,4"
    #     lyst    == ["1", "2", "3", "4"]

    # remove leading and trailing whitespace (including line breaks)
    lyst = [s.strip() for s in lyst]

    wstryng = "\n".join(lyst)

    # write the joined string in one call; `with` closes the file
    with open(wfilename, "w") as wfile:
        wfile.write(wstryng)


convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
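One edge case worth checking with the split-and-strip approach: if the input ends with a comma (or contains two commas in a row), splitting on "," yields an empty field, which would become a blank line in the output. A minimal sketch of a filter for that follows; the helper name `split_comma_fields` is hypothetical and not part of the answer above.

```python
def split_comma_fields(text):
    """Return the non-empty, whitespace-stripped fields of comma-separated text."""
    # strip() removes spaces and line breaks around each field
    fields = [field.strip() for field in text.split(",")]
    # drop fields that are empty after stripping (e.g. after a trailing comma)
    return [field for field in fields if field]
```

The resulting list could be joined with "\n" and written out in the same way as in the function above.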

Answer 3

Since others have added answers, I will include one using numpy. If you are OK with using numpy, it is as simple as:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list instead of a numpy array:

data.tolist()

[2342728,
 2414939,
 2397722,
 2386848,
 2398737,
 2367906,
 2384003,
 2399896,
 ....
]
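The snippet above stops at a Python list, while the question asks for a file named train.txt. A minimal sketch of the remaining step, using only the standard library, might look like this. The helper name `write_one_per_line` is hypothetical, and it assumes `values` is a flat list of integers like the one shown above.

```python
def write_one_per_line(values, path):
    """Write each value in `values` to `path` on its own line."""
    with open(path, "w") as out:
        for value in values:
            # str() turns each integer back into text
            out.write(str(value) + "\n")
```

With the list from above, `write_one_per_line(data.tolist(), "train.txt")` would write the numbers one per line.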

Answer 1
1 2019-06-17 01:26:39

Answer 2
0 accepted 2019-06-17 01:41:04

Answer 3
0 2019-06-17 01:44:42
