
Insert quotation marks around all words in a string in Python

I have a string in a column that looks like this:

x = pd.DataFrame(data={'a': 
                       [({"Name":"any","all":[{"First":"True","Second":False},{"Last":True,"Second":False}],"Entry":0})]}).applymap(str)

Some random fields are not wrapped in quotation marks, but they should be. I tried stripping all quotes and then reinserting quotes around every word in the string with the following, but it's not quite working correctly, as shown below:

x['a'][0].strip(' \" ')

' '.join('"{}"'.format(w) for w in x['a'][0].split(' '))

which gives the following output:

"{'Name':" "'any'," "'all':" "[{'First':" "'True'," "'Second':" "False}," "{'Last':" "True," "'Second':" "False}]," "'Entry':" "0}"

The expected output would be this:

{"Name":"any","all":[{"First":"True","Second":"False"},{"Last":"True","Second":"False"}],"Entry":"0"}

Any advice would be great. Thanks so much!

This will do the trick:

import pandas as pd
import re
import json

x = pd.DataFrame(data={'a': 
                       [({"Name":"any","all":[{"First":"True","Second":False},{"Last":True,"Second":False}],"Entry":0})]})

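# Raw string so that \w reaches the regex engine unchanged.
replacer = re.compile(r"(\w+)")
# Note: json.dumps writes booleans as lowercase true/false.
x['a'] = replacer.sub(r'"\1"', json.dumps(x['a'][0]).replace('"', ''))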

Explanation (see the Python regex docs for more information):
(\w+): the parentheses denote a capturing group that matches whatever pattern is inside them. \w matches Unicode word characters, and + means one or more occurrences.

The sub method has the signature Pattern.sub(repl, string, count=0). It returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string with the replacement repl.
In the replacement '"\1"', \1 refers to the first captured group, and the surrounding double quotation marks are inserted literally.

We use json.dumps because we want to serialise the dict stored in the DataFrame cell into a JSON string. This gives the expected double quotes, whereas casting to a string in pandas produces the single quotes (') you experienced.
The .replace('"', '') removes the existing double quotes so that the regex sub method doesn't add a second pair around values that were already quoted.
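The steps explained above can be tried without pandas. The following is only a minimal sketch, assuming the cell holds the dict from the question; the names record, serialised, unquoted and quoted are illustrative. It passes separators=(",", ":") to json.dumps to avoid the spaces that json.dumps inserts by default.

```python
import json
import re

# The dict from the question (assumed to be what the DataFrame cell holds).
record = {"Name": "any",
          "all": [{"First": "True", "Second": False},
                  {"Last": True, "Second": False}],
          "Entry": 0}

# Serialise compactly; json.dumps writes booleans as lowercase true/false.
serialised = json.dumps(record, separators=(",", ":"))

# Drop the existing double quotes, then wrap every run of word characters.
unquoted = serialised.replace('"', '')
quoted = re.sub(r"(\w+)", r'"\1"', unquoted)

print(quoted)
```

Because JSON spells booleans in lowercase, the values that were Python bools come out as "true" and "false" rather than "True" and "False", so this differs slightly from the output requested in the question.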

Trying to insert quotation marks via regular expressions has a number of disadvantages:

  1. It disregards the fact that the given string is machine-readable.
  2. It will fail (or at least produce strange results) as soon as any key or value contains something that's not matched by \w, e.g., "First-one" instead of "First".
  3. Using regular expressions is generally not very fast.
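The second point can be seen with a small sketch. The input string here is a made-up example, not data from the question:

```python
import re

# A hypothetical key containing a hyphen, which \w does not match.
broken = re.sub(r"(\w+)", r'"\1"', '{First-one:True}')

print(broken)  # {"First"-"one":"True"}
```

The hyphen splits the key into two separately quoted words, so the result is no longer valid JSON.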

If these concerns are irrelevant for your task, you can certainly do it that way, but it's quite hacky and not guaranteed to work all the time, so here's a cleaner approach.

The thing is that

s = """{"Name":"any","all":[{"First":"True","Second":False},{"Last":True,"Second":False}],"Entry":0}"""

is a structured string. From the looks of it, it's a stringified Python dict that can be turned back into a proper data structure with eval():

d = eval(s)

(I wonder where you got that string, though. If it was a Python dict in the first place, turning it into a string first and trying to "fix" it by messing with it later on is generally not a good idea.)
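If the string might come from an untrusted source, one possible alternative to eval() is ast.literal_eval from the standard library, which only accepts Python literals (strings, numbers, booleans, lists, dicts and so on) and does not execute arbitrary code. This is a swapped-in suggestion rather than part of the approach above; a minimal sketch using the same string:

```python
import ast

s = """{"Name":"any","all":[{"First":"True","Second":False},{"Last":True,"Second":False}],"Entry":0}"""

# Parses the literal without running arbitrary expressions.
d = ast.literal_eval(s)

print(type(d))
print(d["all"][0]["Second"], d["Entry"])
```

The resulting d is an ordinary dict whose bool and int values keep their original types.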

The values that are missing quotation marks are not "random" values, but bools and an int, i.e., they lack quotation marks because they aren't strings. However, they can be stringified by calling str() on each of them individually. So, a clean way to turn those values into strings is something like the following, which might look a bit different depending on how the complete data set is structured:

for i,elem in enumerate(d['all']):
    for k,v in elem.items():
        d['all'][i][k] = str(v)
        
d['Entry'] = str(d['Entry'])

Result:

{'Name': 'any', 'all': [{'First': 'True', 'Second': 'False'}, {'Last': 'True', 'Second': 'False'}], 'Entry': '0'}
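If the nesting is not known in advance, the same idea can be written recursively. The following is a sketch under that assumption; stringify_leaves is a hypothetical helper name, and json.dumps with separators=(",", ":") is used at the end only to get the compact double-quoted form shown in the question.

```python
import json

def stringify_leaves(value):
    """Return a copy of value with every non-container leaf passed through str()."""
    if isinstance(value, dict):
        return {str(k): stringify_leaves(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify_leaves(v) for v in value]
    return str(value)

d = {"Name": "any",
     "all": [{"First": "True", "Second": False},
             {"Last": True, "Second": False}],
     "Entry": 0}

# Stringify first, then serialise with double quotes and no extra spaces.
result = json.dumps(stringify_leaves(d), separators=(",", ":"))
print(result)
```

Since str() is applied before serialisation, the booleans keep their Python spelling ("True", "False") in the output.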
