Removing string characters after the first instance in Python

Question

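I think this should be really simple, but it's a Friday afternoon and my brain clearly isn't up to it.

I'm writing a small file parser. The code below converts a set of strings into a DataFrame by splitting each string.

Here are some example strings: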

1. NC_002523_1  Serratia entomophila plasmid pADAP, complete sequence.

2. NZ_CM003366_0    Pantoea ananatis strain CFH 7-1 plasmid CFH1-7plasmid2, whole genome shotgun sequence.

3. NZ_CP014491_0    Escherichia coli strain G749 plasmid pG749_3, complete sequence.

4. NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete sequence.

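I hadn't anticipated the . after sp in the 4th entry. As you can see in the code below, I split on . to get the leading integer for the rank. As a result I get a ValueError because there are more columns than expected.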

# Define the column headers for the section, since the file's own headers are too verbose and ambiguous
SigHit.Columns = ["Rank", "ID", "Description"]

# Store the table of loci and associated data (tab separated, removing the last blank column).

# Use StringIO object to imitate a file, which means that we can use read_table and have the dtypes
# assigned automatically (necessary for functions like min() to work correctly on integers)

SigHit.Table = pd.read_table(
               io.StringIO(u'\n'.join([row.rstrip('.') for row in sighits_section])),
               sep='\.|\t',
               engine='python',
               names=SigHit.Columns)

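The simplest solution I can think of (until some other edge case breaks it) is to replace every . except the first occurrence. How can this be done?

I see that .replace has a maxreplace argument, but that does the opposite of what I want: it replaces only the first instance.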
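For reference, a minimal sketch of the behaviour described above. The optional third argument of str.replace (named count in the Python 3 documentation) limits replacement to the leftmost occurrences, so it removes exactly the dot that should be kept. The sample row is the 4th entry with its trailing dot already stripped.

```python
line = "4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence"

# count=1 replaces only the leftmost dot, i.e. the one after the rank,
# and leaves the dot after "sp" in place.
first_removed = line.replace(".", "", 1)
print(first_removed)
```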

Any suggestions? (A more robust parsing approach is also a valid option, but I need to change the code as little as possible.)

Answer 1

Use a positive lookbehind to make sure the dot is preceded by a digit: sep=r'(?<=\d)\.|\t'

For example:

import pandas as pd
import io

columns = ["Rank", "ID", "Description"]

sighits_section = '''1. NC_002523_1\tSerratia entomophila plasmid pADAP, complete sequence.
2. NZ_CM003366_0\tPantoea ananatis strain CFH 7-1 plasmid CFH1-7plasmid2, whole genome shotgun sequence.
3. NZ_CP014491_0\tEscherichia coli strain G749 plasmid pG749_3, complete sequence.
4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence.'''.splitlines()

tab = pd.read_table(io.StringIO(u'\n'.join([row.rstrip('.') for row in sighits_section])),
                    sep=r'(?<=\d)\.|\t',  # raw string, so \d and \. reach the regex engine unchanged
                    engine='python',
                    names=columns)

print(tab)

This prints:

   Rank              ID                                        Description
0     1     NC_002523_1  Serratia entomophila plasmid pADAP, complete s...
1     2   NZ_CM003366_0  Pantoea ananatis strain CFH 7-1 plasmid CFH1-7...
2     3   NZ_CP014491_0  Escherichia coli strain G749 plasmid pG749_3, ...
3     4     NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete ...

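To be safer, you may want to require whitespace after the dot as part of the separator, sep=r'(?<=\d)\.\s|\t', to guard against cases where the description contains something like 10.1. This is by no means bulletproof.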
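A small sketch of the difference, using re.split directly instead of pandas. The row below is made up for illustration (it is not from the original file) and contains a decimal number in the description.

```python
import re

# Hypothetical row whose description contains a decimal number.
row = "5. NC_000000_0\tExample plasmid version 10.1 draft"

# Dot preceded by a digit: also splits inside "10.1".
loose = re.split(r"(?<=\d)\.|\t", row)

# Dot preceded by a digit AND followed by whitespace: "10.1" stays intact.
strict = re.split(r"(?<=\d)\.\s|\t", row)

print(len(loose), loose)
print(len(strict), strict)
```

The stricter pattern yields the expected three fields here, but a description containing a digit followed by a dot and a space would still be split.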

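Safer still: since the data is processed one line at a time, you can also assert that the digit is the first character of the string, sep=r'(?<=^\d)\.\s|\t'. However, this breaks as soon as the rank has more than one digit, that is, from 10 upwards.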
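The limitation can be seen with a quick check, again sketched with re.split rather than pandas. The two-digit row is a made-up example.

```python
import re

pattern = r"(?<=^\d)\.\s|\t"

# Single-digit rank: the dot sits right after the first character, so it matches.
single = re.split(pattern, "4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence")

# Hypothetical two-digit rank: the character before the dot is not at the start
# of the string, so the lookbehind fails and only the tab splits the row.
double = re.split(pattern, "10. NC_000000_0\tHypothetical entry with a two-digit rank")

print(len(single), single)
print(len(double), double)
```

With the two-digit rank the row comes back as two fields instead of three, so the rank and ID end up in the same column.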

Answer 2

The naive approach

Replace every . except the first occurrence:

line = "4. NC_015062_0  Rahnella sp. Y9602 plasmid pRAHAQ01, complete sequence."
count = line.count(".")
line = line[::-1].replace(".", "", count-1)[::-1]

As a one-liner:

row[::-1].replace(".","",row.count(".")-1)[::-1]
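The same result can be reached without reversing the string. The sketch below uses str.partition to cut at the first dot and then strips the remaining dots from the tail; the helper name keep_first_dot is made up for illustration.

```python
def keep_first_dot(row):
    """Remove every '.' from row except the first one."""
    # partition splits at the first '.', returning (before, '.', after);
    # if there is no dot, dot and tail are empty strings.
    head, dot, tail = row.partition(".")
    return head + dot + tail.replace(".", "")


row = "4. NC_015062_0\tRahnella sp. Y9602 plasmid pRAHAQ01, complete sequence."

# The reversing one-liner, for comparison.
reversed_version = row[::-1].replace(".", "", row.count(".") - 1)[::-1]

cleaned = keep_first_dot(row)
print(cleaned)
print(cleaned == reversed_version)
```

Either version can replace row.rstrip('.') in the list comprehension passed to io.StringIO, which keeps the rest of the parsing code unchanged.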
