
Extracting a dataframe or table from text in R

This is a challenging problem because of how much the data varies from one element to the next. Let's start with the example:

example <- list(c("Birth Centenary of K.S.Stanislavsky.Series:Birth CentenariesCatalog codes:Mi:SU 2710, Sn:SU 2695, Yt:SU 2626, Sg:SU 2797, AFA:SU 2698Variants:Click to see variantsThemes:Actors | Anniversaries and Jubilees | Famous People | MenIssued on:1963-01-15Size:30 x 42 mmColors:Blackish grey greenFormat:StampEmission:CommemorativePerforation:line 12½Printing:RecessPaper:hard thick whiteWatermark:UnwmkFace value:4 Russian kopekPrint run:2,000,000Score:29%\tAccuracy: Very HighBuy Now:2 sale offers from US$ 0.16", 
"Birth Centenary of A.S.Serafimovich.Series:Birth CentenariesCatalog codes:Mi:SU 2711, Sn:SU 2696, Yt:SU 2627, Sg:SU 2800, AFA:SU 2699Themes:Anniversaries and Jubilees | Authors | Famous People | Literary People (Poets and Writers) | Literature | MenIssued on:1963-01-19Size:28 x 40 mmFormat:StampEmission:CommemorativePerforation:frame 11½Printing:PhotogravurePaper:ordinaryFace value:4 Russian kopekPrint run:2,500,000Score:26%\tAccuracy: Very HighBuy Now:3 sale offers from US$ 0.11", 
"Children in nurserySeries:Soviet ChildrenCatalog codes:Mi:SU 2712, Sn:SU 2697, Yt:SU 2629, Sg:SU 2806, AFA:SU 2700Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmColors:MulticolorFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:27%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:2 sale offers from US$ 0.08", 
"Children with nurseSeries:Soviet ChildrenCatalog codes:Mi:SU 2713, Sn:SU 2698, Yt:SU 2628, Sg:SU 2807, AFA:SU 2701Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:25%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:3 sale offers from US$ 0.08", 
"Pioneer campSeries:Soviet ChildrenCatalog codes:Mi:SU 2714, Sn:SU 2699, Yt:SU 2630, Sg:SU 2808, AFA:SU 2702Themes:ChildrenIssued on:1963-01-31Size:42 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravureFace value:4 Russian kopekPrint run:3,000,000Score:22%\tAccuracy: Very HighDescription:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:4 sale offers from US$ 0.11", 
"Soviet Children.Series:Soviet ChildrenCatalog codes:Mi:SU 2715, Sn:SU 2700, Yt:SU 2631, Sg:SU 2809, AFA:SU 2703Themes:ChildrenIssued on:1963-01-31Size:40 x 28 mmFormat:StampEmission:CommemorativePerforation:comb 11½Printing:PhotogravurePaper:ordinaryFace value:4 Russian kopekPrint run:3,000,000Score:25%\tAccuracy: Very HighBuy Now:2 sale offers from US$ 0.08", 
"Dymkov's and Zagorsk toysSeries:Decorative ArtsCatalog codes:Mi:SU 2716, Sn:SU 2701, Yt:SU 2632, Sg:SU 2810, AFA:SU 2704Themes:Art | ToysIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:4 Russian kopekPrint run:3,000,000Score:22%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:2 sale offers from US$ 0.11", 
"Oposhnya potterySeries:Decorative ArtsCatalog codes:Mi:SU 2717, Sn:SU 2702, Yt:SU 2633, Sg:SU 2811, AFA:SU 2705Themes:ArtIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:6 Russian kopekPrint run:3,000,000Score:24%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:3 sale offers from US$ 0.08", 
"Embossing booksSeries:Decorative ArtsCatalog codes:Mi:SU 2718, Sn:SU 2703, Yt:SU 2634, Sg:SU 2812, AFA:SU 2706Themes:Art | BooksIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyFace value:10 Russian kopekPrint run:3,000,000Score:27%\tAccuracy: Very HighDescription:Designer: E. Komarov. Paper: ordinary.Buy Now:2 sale offers from US$ 0.44", 
"Decorative Arts.Series:Decorative ArtsCatalog codes:Mi:SU 2719, Sn:SU 2704, Yt:SU 2635, Sg:SU 2813, AFA:SU 2707Themes:ArtIssued on:1963-01-31Size:30 x 42 mmFormat:StampEmission:CommemorativePerforation:comb 12 x 12½Printing:Offset lithographyPaper:ordinaryFace value:12 Russian kopekPrint run:3,000,000Score:26%\tAccuracy: Very HighBuy Now:3 sale offers from US$ 0.16"
), NULL, NULL, NULL)

As you can see, it is a list containing 4 objects. We can flatten it into a single character vector with unlist(), or leave it as a list. It's up to you.
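As a quick check of the flattening step: unlist() drops NULL entries, so a list holding one character vector and three NULLs flattens to just the strings. A minimal sketch, using a shortened stand-in list (the `demo` values are made up for illustration and are not part of the real data):

```r
# Shortened stand-in for `example`: one character vector plus three NULLs
demo <- list(c("first stamp text", "second stamp text"), NULL, NULL, NULL)

length(demo)            # number of top-level objects in the list
flat <- unlist(demo)    # NULL entries contribute nothing to the result
length(flat)            # number of strings that remain
is.character(flat)      # the result is a plain character vector
```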

The key point is that each element comes from a table with its own row headers, like this:

[Image: the source table for one element (not reproduced here)]

I would like to get the same table, or a data frame, from the text. I have noticed a few things about the structure of the info:

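  • Some words are run together with a capital letter in the middle (for example "CentenariesCatalog"): the capital marks where one row's value ends and the next row's label begins.
  • Some variables (such as Catalog codes and Themes) are made up of several elements.
  • Sometimes a row is present in one element but missing from the others. In the image above, the row Variants appears in that element but not in the rest.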
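To illustrate the second point, here is a minimal base R sketch of how the multi-valued fields might be split once they have been isolated. The two input strings are copied from the first element of `example`; the variable names are only for illustration.

```r
# Field values copied from the first element of `example`
codes_raw  <- "Mi:SU 2710, Sn:SU 2695, Yt:SU 2626, Sg:SU 2797, AFA:SU 2698"
themes_raw <- "Actors | Anniversaries and Jubilees | Famous People | Men"

# Catalog codes: split on ", ", then separate each piece at its first ":"
pieces <- strsplit(codes_raw, ", ", fixed = TRUE)[[1]]
codes  <- setNames(sub("^[^:]+:", "", pieces),   # part after the first ":"
                   sub(":.*$", "", pieces))      # part before the first ":"

# Themes: split on the " | " separator
themes <- strsplit(themes_raw, " | ", fixed = TRUE)[[1]]

print(codes)
print(themes)
```

`codes` ends up as a named character vector (one entry per catalog), so a single code can be looked up with, say, `codes[["Mi"]]`.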

I tried some functions from the tidyverse, but this case is beyond my abilities.

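Your data looks like it came from web scraping. I suggest looking at rvest::html_table() to try to get a better-formatted result in the first place. Otherwise it is going to be very messy (i.e. regular expressions).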
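A rough sketch of what the html_table() route could look like. The HTML below is a hypothetical stand-in, not the real page: it assumes each stamp is rendered as a two-column label/value table, which may not match the actual markup. It also assumes rvest 1.0 or later (for html_element()).

```r
library(rvest)

# Hypothetical HTML, NOT the real page: a two-column label/value table
html <- '<table>
  <tr><td>Series:</td><td>Soviet Children</td></tr>
  <tr><td>Issued on:</td><td>1963-01-31</td></tr>
  <tr><td>Face value:</td><td>4 Russian kopek</td></tr>
</table>'

page <- read_html(html)

# html_table() keeps labels and values in separate columns,
# so no regex is needed to find the boundaries
tbl <- html_table(html_element(page, "table"))

# First column holds the labels, second column the values
values <- setNames(as.character(tbl[[2]]), sub(":$", "", tbl[[1]]))
print(values)
```

If the real page is structured this way, each stamp becomes a named vector whose names are the row labels, and a missing row such as Variants is simply absent rather than something to be guessed from the text.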

Very, very messy example code:

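untangle <- function(element) {
  # Title: everything before the "Series:" label
  Title <- sub("^(.*?)Series:.*$", "\\1", element, perl = TRUE)
  # Series: text between "Series:" and "Catalog codes:"
  Series <- sub("^.*?Series:(.*?)Catalog codes:.*$", "\\1", element, perl = TRUE)
  # Catalog codes: stop at "Variants:" or "Themes:", whichever comes first
  # (a greedy match would swallow the Variants row into the codes)
  CatalogCodes <- sub("^.*?Catalog codes:(.*?)(Variants:|Themes:).*$", "\\1",
                      element, perl = TRUE)
  data.frame(Title, Series, CatalogCodes, stringsAsFactors = FALSE)
}

# sub() is vectorised, so one call returns one row per stamp
result <- untangle(unlist(example))
print(result)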
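The per-field regexes above could be generalised by listing the row labels once and cutting the string at every label found, which also copes with rows that are missing from some elements. This is only a sketch under a few assumptions: the `labels` vector is read off the sample data and may be incomplete; a label is only recognised when its colon is not followed by a space (so the "Paper: ordinary." inside a Description is left alone); and "Accuracy" is therefore left inside the Score value. `parse_stamp`, `el1` and `el3` are illustrative names, and the two inputs are shortened copies of elements 1 and 3 of `example`.

```r
# Row labels seen in the sample data (an assumption: extend as needed)
labels <- c("Series", "Catalog codes", "Variants", "Themes", "Issued on",
            "Size", "Colors", "Format", "Emission", "Perforation", "Printing",
            "Paper", "Watermark", "Face value", "Print run", "Score",
            "Description", "Buy Now")

# A label counts only when its colon is NOT followed by a space
pattern <- paste0("(", paste(labels, collapse = "|"), "):(?! )")

parse_stamp <- function(element) {
  out <- setNames(rep(NA_character_, length(labels) + 1L), c("Title", labels))
  g <- gregexpr(pattern, element, perl = TRUE)
  starts <- as.integer(g[[1]])
  if (starts[1] == -1L) {            # no label found: keep the text as title
    out["Title"] <- element
    return(out)
  }
  ends  <- starts + attr(g[[1]], "match.length")   # first char of each value
  found <- sub(":$", "", regmatches(element, g)[[1]])
  # Each value runs up to the next label, or to the end of the string
  values <- substring(element, ends, c(starts[-1] - 1L, nchar(element)))
  out["Title"] <- substring(element, 1L, starts[1] - 1L)
  out[found] <- trimws(values)
  out
}

# Shortened copies of elements 1 and 3 of `example`
el1 <- "Birth Centenary of K.S.Stanislavsky.Series:Birth CentenariesCatalog codes:Mi:SU 2710, Sn:SU 2695Variants:Click to see variantsThemes:Actors | MenIssued on:1963-01-15"
el3 <- "Children in nurserySeries:Soviet ChildrenCatalog codes:Mi:SU 2712, Sn:SU 2697Themes:ChildrenIssued on:1963-01-31Description:Designer: A. Shmidshtein. Paper: ordinary.Buy Now:2 sale offers from US$ 0.08"

# One row per stamp; rows that are absent from an element stay NA
stamps <- as.data.frame(do.call(rbind, lapply(c(el1, el3), parse_stamp)),
                        stringsAsFactors = FALSE)
print(stamps[, c("Title", "Series", "Variants", "Themes", "Description")])
```

With the full data, `lapply(unlist(example), parse_stamp)` would take the place of the two shortened strings. Matching on a fixed label list is more predictable than splitting at every lowercase-to-uppercase boundary, since titles and values can contain capitals of their own.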

