How to split CSV content by punctuation marks

I have a CSV file with two columns: one holds a person's name and the other holds words defined by that person. The problem is that the second column contains many words separated by punctuation marks. I need to separate these words so that each row holds one name and one word. For example:

name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"

This is an example of the CSV data. As you can see, the words are separated by different punctuation marks (there are more than the ones shown here), sometimes with surrounding spaces and sometimes without. The result I would like to achieve is:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics

I have opened the file with pandas and created a DataFrame from the data, and this is where I need to separate the words. What I have tried is:

def splitter(df):

    df['word'] = df['word'].str.split(",")
    df = df.explode("word")

    df['word'] = df['word'].str.split(", ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" , ")
    df = df.explode("word")

    df['word'] = df['word'].str.split("- ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" -")
    df = df.explode("word")

    df['word'] = df['word'].str.split(r"\. ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(";")
    df = df.explode("word")

    df['word'] = df['word'].str.split("; ")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" ;")
    df = df.explode("word")

    df['word'] = df['word'].str.split(" ; ")
    df = df.explode("word")

    df['word'] = df['word'].str.split("/ ")
    df = df.explode("word")

    return df

The result I get is almost what I want, but some words have leading spaces that should not be there:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom, striker
Tom, ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara, nba
Sara, ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian, medicine
Sergi,runner
Sergi, athletics

How can I solve this problem and improve my code? I do not know how to modify it so that everything works correctly.

Simply use

df['word'] = df['word'].str.strip()

and it should remove all spaces, tabs and newlines from both ends of each string.
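
As a small illustration of this, the sketch below splits two rows from the question on a bare comma and then strips the leftover whitespace. The two-row DataFrame is just sample data for the demonstration.

```python
import pandas as pd

# Two sample rows borrowed from the question
df = pd.DataFrame({
    "name": ["Oliver", "Tom"],
    "word": ["water,surf,windsurf", "football, striker, ball"],
})

# Split on the bare comma only, then put one word per row
df["word"] = df["word"].str.split(",")
df = df.explode("word")

# Remove the spaces that were left around each word
df["word"] = df["word"].str.strip()

print(df["word"].tolist())
```

Without the last step, the words from the second row would still carry their leading space.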


BTW:

You could probably use just split(";") without split("; "), split(" ;"), split(" ; "), etc., because strip() will remove these spaces.


If you want to use variants like split(";"), split("; "), split(" ;"), split(" ; "), then you should start with the longest, split(" ; "), then use the shorter split("; ") and split(" ;"), and finish with the shortest, split(";"). This way you may be able to remove the spaces without strip().
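
The ordering advice can be sketched in plain Python. Here split_longest_first is a hypothetical helper written only for this illustration (it is not part of pandas). It applies each separator in turn, longest first, so that a short separator never cuts a longer one in half and leaves a stray space behind.

```python
def split_longest_first(text, separators):
    """Split text on each literal separator in turn, longest separator first."""
    parts = [text]
    for sep in sorted(separators, key=len, reverse=True):
        # Re-split every piece produced so far on the current separator
        parts = [piece for part in parts for piece in part.split(sep)]
    return parts


variants = [";", "; ", " ;", " ; "]
clean = split_longest_first("mountain ; wind; sun ;sea;sky", variants)
print(clean)

# For comparison: splitting on the shortest separator alone keeps the spaces
naive = "mountain ; wind".split(";")
print(naive)
```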


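You could even try a single regex split('[;,./-]') instead of all the separate split() calls. Be aware that this also splits on the . in web3.0 and on the - in e-learning, so the pattern needs refining for this data.

df = df.assign(word=df['word'].str.split('[;,./-]')).explode('word')
df['word'] = df['word'].str.strip()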

Alternatively, you could use | as OR to combine several separators in one pattern.
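
A short sketch of what this means, using Python's re module directly (pandas' str.split should accept the same syntax when the pattern is treated as a regex). The sample strings are made up for the illustration.

```python
import re

# A character class and an alternation are two spellings of the same separators
as_class = re.split(r"[;,/]", "a;b,c/d")
as_alternation = re.split(r";|,|/", "a;b,c/d")
print(as_class == as_alternation)

# Alternation also allows multi-character separators such as " - ",
# which leaves the hyphen inside "e-learning" untouched
mixed = re.split(r";|,|/| - ", "doctor - medicine;e-learning")
print(mixed)
```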


EDIT:

Minimal working example with the data directly in the code, so everyone can test it.

import pandas as pd
import io

text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"'''

# text to dataframe
df = pd.read_csv(io.StringIO(text))

df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()

# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read() 

print(text)

Result:

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics

EDIT:

The same without strip().

I use ' ?' to allow an optional space after the characters ;,/ and before the . character.

I also put ' - ' before '- ' and ' -' so that the longest version is tried first.

df['word'] = df['word'].str.split(r'[;,/] ?| ?\. | - |- | -')
df = df.explode('word')
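
To check that the two patterns agree on the sample data, here is a small sketch using re.split from the standard library instead of pandas (an assumption made to keep the check self-contained; the patterns are the ones used above). It compares "split, then strip" with the pattern that consumes the spaces itself.

```python
import re

samples = [
    "water,surf,windsurf",
    "football, striker, ball",
    "mountain;wind;sun",
    "basketball; nba; ball",
    "informatic/web3.0/e-learning",
    "doctor - medicine",
    "runner . athletics",
]

with_strip = r"[;,/]|\. |- | -"
without_strip = r"[;,/] ?| ?\. | - |- | -"

results = []
for text in samples:
    # First pattern needs strip() afterwards, second one does not
    stripped = [w.strip() for w in re.split(with_strip, text)]
    direct = re.split(without_strip, text)
    results.append((direct, stripped == direct))
    print(direct, stripped == direct)
```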

EDIT:

Example which uses replacements to keep a parenthesized group such as (data, science) as one string without splitting it.

import pandas as pd
import io

text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball; (date1, time1)"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine - (date2, time2) - date3, time3"
Sergi,"runner . athletics"'''

# text to dataframe
df = pd.read_csv(io.StringIO(text))


# Find all `(...)`
found = df['word'].str.findall(r'\(.*?\)')
print(found)

# Flatten it
found = sum(found, [])
print(found)

# Create dict to put pattern in place of `(...)`.
# Because later I will use `regex=True` so I have to use `\(...\)` instead of `(...)`
patterns = {rf'\({value[1:-1]}\)': f'XXX{i}' for i, value in enumerate(found)}
print(patterns)

df['word'] = df['word'].replace(patterns, regex=True)

# --- normal splitting ---


df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()

# Create dict to put later `(...)` in place of pattern.
patterns_back = {f'XXX{i}':value for i, value in enumerate(found)}
print(patterns_back)

df['word'] = df['word'].replace(patterns_back, regex=True)

# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read() 

print(text)

Result:

0                  []
1                  []
2                  []
3    [(date1, time1)]
4                  []
5    [(date2, time2)]
6                  []
Name: word, dtype: object

['(date1, time1)', '(date2, time2)']

{'\\(date1, time1\\)': 'XXX0', '\\(date2, time2\\)': 'XXX1'}

{'XXX0': '(date1, time1)', 'XXX1': '(date2, time2)'}

name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Sara,"(date1, time1)"
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Christian,"(date2, time2)"
Christian,date3
Christian,time3
Sergi,runner
Sergi,athletics
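
As a possible alternative to the placeholder approach (a different technique, not the one used above), the separators could be followed by a negative lookahead that refuses to split inside parentheses. This is only a sketch: split_outside_parentheses is a hypothetical helper, and it assumes the parentheses are balanced and never nested.

```python
import re

# Same separators as above, followed by a lookahead that rejects a match
# when a ")" comes next with no "(" in between, i.e. inside parentheses
pattern = re.compile(r"(?:[;,/]|\. |- | -)(?![^()]*\))")


def split_outside_parentheses(text):
    """Hypothetical helper: split on the separators, ignoring those inside (...)."""
    return [part.strip() for part in pattern.split(text)]


sara = split_outside_parentheses("basketball; nba; ball; (date1, time1)")
christian = split_outside_parentheses(
    "doctor - medicine - (date2, time2) - date3, time3"
)
print(sara)
print(christian)
```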

I do not know much about pandas, but perhaps the following code is helpful for you.

import re

# [name,word]
data = [["Oliver", "water,surf,windsurf"],
        ["Tom", "football, striker, ball"],
        ["Anna", "mountain;wind;sun"],
        ["Sara", "basketball; nba; ball"],
        ["Mark", "informatic/web3.0/e-learning"],
        ["Christian", "doctor - medicine"],
        ["Sergi", "runner . athletics"]]

result = []

for item in data:
    words = re.split(r'\s*;\s*|\s*,\s*|/|\s+-\s+|\s+\.\s+', item[1])
    result.extend([(item[0], w) for w in words])

You can split the words with the re module. You then get the result as a list of tuples.
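
If the list of tuples then needs to go back into CSV form, one way is the standard csv module. The sketch below is self-contained: it uses two sample rows, escapes the dot in the pattern so that it matches a literal period, and writes to an in-memory buffer (an assumption for the demonstration; a real file opened with newline="" would work the same way).

```python
import csv
import io
import re

data = [["Tom", "football, striker, ball"],
        ["Sergi", "runner . athletics"]]

# Build the (name, word) tuples as in the loop above
result = []
for name, words in data:
    for w in re.split(r'\s*;\s*|\s*,\s*|/|\s+-\s+|\s+\.\s+', words):
        result.append((name, w))

# Write a header row followed by one row per tuple
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "word"])
writer.writerows(result)

print(buffer.getvalue())
```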
