Weird Behavior When Slicing a List in Python

I have some data in pandas that I want to use for named entity recognition. A sample of the data is below.

text
['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.']

tags
['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']

I ran train_test_split from sklearn.model_selection on the data:

# split data
train_texts, test_texts, train_tags, test_tags = train_test_split(dataset["text"].tolist(),
                                                                dataset["tags"].tolist(),
                                                                test_size=0.20,
                                                                random_state=15)

However, when I try to print the list, I get some weird behavior: it counts the square brackets [] and quotes '' around the text and tags as part of the text and tags. For example, when I write

print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')

output
['Angie',
['B-ORG',

My question is, why is it counting the brackets and quote characters as part of the string? How can I fix it?

I used a DataFrame to declare the data and performed the same split into train_texts, test_texts, train_tags, and test_tags. Please refer to the solution below; after that, we will come back to the issue of [] and '' in your scenario.

# Import all the important libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Store all String data into the 'data' variable
data = {
'text' : ['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.'],
'tags' : ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']}

# Store above Initialized Data into DataFrame
dataset = pd.DataFrame(data)

NOTE:- Always print a few records of the dataset before moving ahead, because an issue in your dataset can skew your expected result.

# Print a few records of 'dataset'
dataset

    text        tags
0   Angie       B-ORG
1   ’s          I-ORG
2   is          O
3   my          O
4   favorite    O
5   but         O
6   the         O
7   prices      O
8   at          O
9   little      B-ORG
10  Tonys       I-ORG
11  are         O
12  better      O
13  .           O

Now we can move on to the splitting part. I used the same method as in your question.

# split data
train_texts, test_texts, train_tags, test_tags = train_test_split(
    dataset["text"].tolist(),
    dataset["tags"].tolist(),
    test_size=0.20,
    random_state=15)

After splitting, we can print a slice of the first item of train_texts and train_tags:

print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')

The output of the above cell is:

favorite
O

As you can see, no [] or '' characters are printed in the output.
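A note on reading this result: in the DataFrame used here each row holds a single token, so train_texts[0] is one string and [0:9] slices its characters rather than a list of tokens. The minimal sketch below shows the difference between slicing a str and slicing a list; the names `token` and `tokens` are hypothetical and only serve the illustration.

```python
# Slicing a str returns characters; slicing a list returns elements.
token = "favorite"                    # one cell from a one-token-per-row DataFrame
tokens = ["Angie", "’s", "is", "my"]  # a real list of tokens

char_slice = token[0:9]   # at most the first 9 characters of the string
item_slice = tokens[0:2]  # the first 2 elements of the list

print(char_slice)
print(item_slice)
```

So the absence of brackets in this output says only that the cell is a short plain string, not that the cell is a list.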

Your Question:-

Q.) Why is it counting the brackets and quote characters as part of the string? How can I fix it?

A.) I don't know the exact reason behind this issue. It can happen if your data was not declared properly or because of some other declaration issue. Printing the dataset before moving ahead is good practice, because it lets you see how the data actually behaves.
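One plausible cause, offered as an assumption rather than something the question confirms, is that the list-valued cells were saved to a CSV file and loaded back. CSV has no list type, so each list is written as its text representation and comes back as a string. The sketch below reproduces this with the standard-library csv module and an in-memory buffer; the sample row is hypothetical.

```python
import csv
import io

# Hypothetical one-row dataset whose single cell is a list of tokens.
tokens = ["Angie", "is", "my", "favorite"]

# Write the row to an in-memory CSV; csv.writer converts non-string cells with str().
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([tokens])

# Read it back: the cell is now text that only looks like a list.
buffer.seek(0)
cell = next(csv.reader(buffer))[0]

print(type(cell).__name__)
print(cell[0:9])
```

Slicing such a cell with [0:9] returns its first nine characters, brackets and quotes included, which is the pattern shown in the question.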

Solution:- Using a DataFrame worked perfectly for me. You can use that.

Hope this solution helps you. If you are still facing an issue, kindly upload the full code so that we can find a solution accordingly.

Try:

text
Angie ’s is my favorite but the prices at little Tonys are better.

tags
B-ORG I-ORG O O O O O O O B-ORG I-ORG O O O

It looks like you are trying to turn a string that is formatted to look like a list into a list. Python doesn't know any better: it treats the value as a plain string, so the brackets and apostrophes go along for the ride when you slice it.
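If the cells really are list-formatted strings, one way to convert them back (a technique swapped in here, not something the answer above spells out) is ast.literal_eval from the standard library, which safely parses Python literals. The helper name `to_list` and the sample value are hypothetical.

```python
import ast

# Hypothetical cell value: a string that only looks like a list.
raw = "['B-ORG', 'I-ORG', 'O', 'O']"

def to_list(value):
    """Parse list-like strings; pass values that are not strings through unchanged."""
    return ast.literal_eval(value) if isinstance(value, str) else value

tags = to_list(raw)

print(raw[0:2])   # slices characters of the string
print(tags[0:2])  # slices elements of the parsed list
```

With pandas, the same helper could be applied to a whole column before splitting, for example dataset["tags"].apply(to_list), assuming every cell in that column holds a valid Python literal.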
