
How to merge multiple CSV files with different languages into one CSV file?

I have a lot of CSV files and I want to merge them into one CSV file. The problem is that the CSV files contain data in different languages, such as Russian, English, Croatian, and Spanish. Some of the files even contain data in multiple languages.

When I open the CSV files, the data looks perfectly fine, written properly in each language. I want to read all the files as they are and write them into one big CSV file.

The code I use is this:

import glob
import os

import pandas as pd

directory_path = os.getcwd()
all_files = glob.glob(os.path.join(directory_path, "DR_BigData_*.csv"))
print(all_files)

merge_file = 'data_5.csv'
df_from_each_file = (pd.read_csv(f, encoding='latin1') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(merge_file, index=False)

If I use encoding='latin1', it successfully writes all the CSV files into one, but as you might guess, the characters come out completely garbled. Here is part of the output as an example:

(Screenshot of the garbled output; judging by the screenshot, the OP appears to be viewing the data in Excel.)

I also tried writing them to .xlsx using encoding='latin1', but I encountered the same issue. In addition, I tried many different encodings, but those gave me decoding errors.

When you force the input encoding to Latin-1, you are basically wrecking any input file that is not actually Latin-1. For example, a Russian text file containing the text привет in code page 1251 will silently be mangled into ïðèâåò. (The same text in the UTF-8 encoding would map to the similarly bogus but completely different string пÑÐ¸Ð²ÐµÑ.)
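You can reproduce this mojibake directly in Python by round-tripping the bytes through the wrong codec (a small standalone sketch using the same sample text as above):

```python
# Encode Russian text as code page 1251, then decode the raw bytes
# as Latin-1 -- the wrong codec -- to reproduce the mojibake.
text = "привет"
cp1251_bytes = text.encode("cp1251")
print(cp1251_bytes.decode("latin-1"))  # ïðèâåò

# The same text encoded as UTF-8 mangles differently under Latin-1,
# because each Cyrillic letter becomes two bytes in UTF-8.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("latin-1"))
```

Note that both decodes succeed without an error: Latin-1 maps every possible byte to some character, which is exactly why the corruption is silent.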

The sustainable solution is to, first, correctly identify the input encoding of each file, and second, choose an output encoding which can accommodate all of the input encodings.

I would choose UTF-8 for output, but any Unicode encoding will technically work. If you need to pass the result to something more or less braindead (cough Microsoft cough Java), maybe UTF-16 will be more convenient for your use case.
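Since the screenshot suggests the merged file is being inspected in Excel, one pragmatic variant (a sketch, not the only option) is the 'utf-8-sig' codec, which is plain UTF-8 prefixed with a byte-order mark that Excel uses to detect the encoding; the frame below is hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical merged frame standing in for the real merged data.
df_merged = pd.DataFrame({"greeting": ["привет", "hola", "bok"]})

# 'utf-8-sig' writes a UTF-8 byte-order mark (BOM) at the start of
# the file; Excel uses the BOM to recognize the file as UTF-8.
df_merged.to_csv("data_5.csv", index=False, encoding="utf-8-sig")
```

Other tools generally ignore the BOM, and pandas strips it automatically when reading the file back with the same codec.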

import glob

import pandas as pd

data = dict()
for file in glob.glob("DR_BigData_*.csv"):
    if 'ru' in file:
        enc = 'cp1251'
    elif 'it' in file:
        enc = 'latin-1'
    # ... add more here
    else:
        raise KeyError("I don't know the encoding for %s" % file)
    data[file] = pd.read_csv(file, encoding=enc)
# ... then merge the frames in data as previously

The if statement is really just a placeholder for something more useful; without access to your files, I have no idea how they are named or which encodings to use for which ones. It simplistically assumes that files in Russian all have the substring "ru" in their names, and that you want to use a specific encoding for all of those.

If you only have two encodings, and one of them is UTF-8, this is actually quite easy; try to decode as UTF-8, and if that doesn't work, fall back to the other encoding:

import glob

import pandas as pd

data = dict()
for file in glob.glob("DR_BigData_*.csv"):
    try:
        data[file] = pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        data[file] = pd.read_csv(file, encoding='latin-1')

This is likely to work simply because text which is not valid UTF-8 will typically raise a UnicodeDecodeError very quickly: the encoding is designed so that bytes with the eighth bit set have to adhere to a very specific pattern. This is a useful feature, not something to feel frustrated about; silently reading wrong data from the file would be much worse.
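To illustrate (a standalone sketch reusing the Russian sample text from above): the very first byte pair of the CP-1251 data already violates the UTF-8 continuation-byte pattern, so the decoder bails out immediately rather than reading the whole file:

```python
# Bytes of "привет" in CP-1251; this byte sequence is not valid UTF-8.
data = "привет".encode("cp1251")

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # The error is raised at the first offending byte, not at the end
    # of the input, so the fallback path is reached almost instantly.
    print("failed at byte offset", exc.start, "-", exc.reason)
```

This is why the try/except fallback above is cheap in practice: a mis-declared file fails fast instead of being scanned to the end.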

If you don't know what encodings are, now would be a good time to finally read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

As an aside, your computer already knows which directory it's in; you basically never need to call os.getcwd() unless you need the absolute path of the current directory.

If I understood your question correctly, you can easily merge all your CSV files (as they are) using the cat command:

cat file1.csv file2.csv file3.csv ... > Merged.csv

(Note that this simply concatenates the files byte for byte: if each file has a header row, every header is repeated in Merged.csv, and any differences in encoding between the files are carried over untouched.)

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.

 