Generate pandas dataframe from txt file
I have a large txt file with the following format:
0
1
2
3
4
La situacion es preocupante
5
6
Radio es parte de la vida
7
Dejare de querer muy pronto
I need to generate a pandas dataframe similar to:
Texto
0 NaN
1 NaN
2 NaN
3 NaN
4 La situacion es preocupante
5 NaN
6 Radio es parte de la vida
7 Dejare de querer muy pronto
With the following code I get an incorrect output:
import pandas as pd
data = pd.read_csv("nohup.out",sep="\\n")
0
0 1
1 2
2 3
3 4
4 La situacion es preocupante
5 5
6 6
7 Radio es parte de la vida
8 7
9 Dejare de querer muy pronto
Thank you for your time.
You can use Series.replace with a regex on the column (named '0' here because the first line of the file was read as the header), like so:
import numpy as np
df['0'].replace(to_replace=r'^\d*$', value=np.nan, regex=True)
0 NaN
1 NaN
2 NaN
3 NaN
4 La situacion es preocupante
5 NaN
6 NaN
7 Radio es parte de la vida
8 NaN
9 Dejare de querer muy pronto
Though you may need to tidy up your input file to get exactly what you want.
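If the goal is the exact frame from the question, where each number becomes the index and a text line is attached to the number just before it, one option is to parse the lines by hand before building the frame. The following is only a minimal sketch under that assumption; the helper name `lines_to_frame` and the inline sample are illustrative and not part of the original post:

```python
import io

import numpy as np
import pandas as pd


def lines_to_frame(lines):
    """Build a frame indexed by the numeric lines; text lines fill 'Texto'."""
    index, textos = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.isdigit():
            # A numeric line starts a new row that has no text yet.
            index.append(int(line))
            textos.append(np.nan)
        elif index:
            # A text line is assumed to belong to the most recent numeric line.
            textos[-1] = line
    return pd.DataFrame({"Texto": textos}, index=index)


# Inline stand-in for the file shown in the question.
sample = io.StringIO(
    "0\n1\n2\n3\n4\nLa situacion es preocupante\n"
    "5\n6\nRadio es parte de la vida\n7\nDejare de querer muy pronto\n"
)
df = lines_to_frame(sample)
print(df)
```

For the real file, an open file handle can be passed in the same way, for example `with open("nohup.out") as fh: df = lines_to_frame(fh)`. If several text lines follow the same number, this sketch keeps only the last one.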
You are reading a CSV that does not have a header. In this case, you can specify the column name while importing the dataframe.
Also, I guess you need to replace the numeric values with null. Try the following:
import pandas as pd
data = pd.read_csv("C:/Test/list.txt", names=['Texto']) # the file has no header row, so name the single column 'Texto'; the default separator works as long as the text contains no commas
print(data)
Out[74]:
Texto
0 0
1 1
2 2
3 3
4 4
5 La situacion es preocupante
6 5
7 6
8 Radio es parte de la vida
9 7
10 Dejare de querer muy pronto
This is the default result. Now, to replace the digits with NaN, try:
# regex=True is needed on pandas 2.0+, where str.replace matches literally by default
data['Texto'] = data['Texto'].str.replace(r'\d+', 'NaN', regex=True)
print(data)
Out[76]:
Texto
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 La situacion es preocupante
6 NaN
7 NaN
8 Radio es parte de la vida
9 NaN
10 Dejare de querer muy pronto
Edit: As hinted by @jezrael, changed '\d' to '\d+' to match multiple digits in the code below:
data['Texto'] = data['Texto'].str.replace(r'\d+', 'NaN', regex=True)
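Note that the call above writes the literal string 'NaN', so `isna()` and `dropna()` will not treat those rows as missing, and digits inside a sentence would be rewritten as well. If real missing values are wanted, one possible variant is to mask only the cells that consist entirely of digits. A minimal sketch, assuming the column holds strings as in the frame above; the small inline sample (including the "Radio 3" row) is made up for illustration:

```python
import pandas as pd

# Illustrative stand-in for the frame read from the file.
data = pd.DataFrame(
    {"Texto": ["0", "1", "La situacion es preocupante", "5", "Radio 3 es parte de la vida"]}
)

# True only where the whole cell is made of digits.
only_digits = data["Texto"].str.fullmatch(r"\d+")

# mask() replaces the flagged cells with a real missing value.
data["Texto"] = data["Texto"].mask(only_digits)
print(data)
```

Because the pattern has to match the whole cell, a sentence that merely contains a number is left untouched.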