
Extracting tables from a JPEG into a data frame in R

I have the following two links:

https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large

https://pbs.twimg.com/media/Dv3lKfjV4AAkIpY.jpg:large

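The data is laid out as a table, but it is a JPEG image. I would like to capture this information and convert it into a data frame or tibble.

I tried using tesseract, but the results were poor. My code is below: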

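library(tesseract)
eng <- tesseract("eng")  # English OCR engine; `eng` was undefined in the original snippet
# input_1 is not defined in the question; presumably it holds the first image link
text <- ocr_data(input_1, engine = eng)
text <- tesseract::ocr_data("https://pbs.twimg.com/media/Dv3lKfjV4AAkIpY.jpg:large", engine = eng)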

Any ideas?

Try some preprocessing, such as converting the image to black and white and removing the grid lines. This should get you started:

library(magrittr)
library(magick)
#> Linking to ImageMagick 6.9.9.38
#> Enabled features: cairo, fontconfig, freetype, fftw, ghostscript, lcms, pango, rsvg, webp, x11
#> Disabled features:

# download file
url <- "https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large"
download.file(url, destfile = "table.jpg", mode = "wb")  # binary mode keeps the JPEG intact on Windows

# convert to black and white
convert_bw <- 'convert table.jpg -fill white -fuzz 20% +opaque "#000000" table_bw.jpg'
system(convert_bw)

# remove grid
remove_grid <- "convert table_bw.jpg -negate -define morphology:compose=darken -morphology Thinning 'Rectangle:1x80+0+0<' -negate table_wo_grid.jpg"
system(remove_grid)

# read img and ocr
data <- image_read("table_wo_grid.jpg") %>%
  image_crop(geometry_area(0, 0, 80, 25)) %>%
  image_ocr() %>%
  stringi::stri_split(fixed = "\n")

head(data[[1]])
#> [1] "10/3/2013 112.32 -0.12 0.11 0.04 0.55 0.05 0.45 555 155 5.55 143,115 23,439 505"         
#> [2] "10/5/2013 112.94 -0.44 0.15 0.04 0.53 0.05 0.45 1,572 2,255 0.75 143,091 23,335 504"     
#> [3] "10/4/2013 115.53 -0.47 0.10 0.04 0.55 0.05 0.45 27,212 4,955 775,473 142,357 27,334 5 22"
#> [4] "10/5/2013 115.35 -0.57 0.00 0.04 0.51 0.05 0.29 25,522 5,312 4.05 131,320 25,340 513"    
#> [5] "10/2/2013 114.42 -0.51 0.01 0.04 0.44 0.05 0.19 470 994 0.47 121,250 25,901 74.53"       
#> [6] "9/23/2013 11495 -0.03 0.07 0.04 0.57 0.05 0.11 20,075 594 50 55 121,437 25,341 774773"

Created on 2019-01-02 by the reprex package (v0.2.1)
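The OCR result above is still a list of text lines, not a table. A minimal sketch of one way to turn such lines into a data frame follows, using only base R and three of the lines printed above. It assumes that a clean row has 14 space-separated fields (the count seen in the first two lines) and drops rows where OCR split or merged a cell. The column names V2 to V14 are placeholders, since the table headers are not part of the OCR output shown here.

```r
# OCR lines copied from the head(data[[1]]) output above
ocr_lines <- c(
  "10/3/2013 112.32 -0.12 0.11 0.04 0.55 0.05 0.45 555 155 5.55 143,115 23,439 505",
  "10/5/2013 112.94 -0.44 0.15 0.04 0.53 0.05 0.45 1,572 2,255 0.75 143,091 23,335 504",
  "10/4/2013 115.53 -0.47 0.10 0.04 0.55 0.05 0.45 27,212 4,955 775,473 142,357 27,334 5 22"
)

n_cols <- 14  # assumption: number of fields in a correctly recognised row

# split each line on single spaces and keep only rows with the expected width
fields <- strsplit(ocr_lines, " ", fixed = TRUE)
ok <- lengths(fields) == n_cols
df <- as.data.frame(do.call(rbind, fields[ok]), stringsAsFactors = FALSE)
names(df) <- c("Date", paste0("V", 2:n_cols))  # placeholder column names

# convert types: dates are month/day/year, numbers use "," as thousands separator
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
df[-1] <- lapply(df[-1], function(x) as.numeric(gsub(",", "", x, fixed = TRUE)))

str(df)
```

Rows rejected by the width check (here the third line, which has 15 fields) can be inspected with `ocr_lines[!ok]` and corrected by hand or by further preprocessing.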

Edit: conversion without system calls

library(magrittr)
library(magick)
#> Linking to ImageMagick 6.9.9.38
#> Enabled features: cairo, fontconfig, freetype, fftw, ghostscript, lcms, pango, rsvg, webp, x11
#> Disabled features:

# download file
url <- "https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large"
download.file(url, destfile = "table.jpg", mode = "wb")  # binary mode keeps the JPEG intact on Windows

# preprocessing
img <- image_read("table.jpg") %>% 
  image_transparent("white", fuzz=82) %>% 
  image_background("white") %>%
  image_negate() %>%
  image_morphology(method = "Thinning", kernel = "Rectangle:20x1+0+0^<") %>%
  image_negate() %>%
  image_crop(geometry_area(0, 0, 80, 25)) 

img

# read img and ocr
data <- img %>%
  image_ocr() 

# some wrangling
data %>%
  stringi::stri_split(fixed = "\n") %>%
  purrr::map(~ stringi::stri_split(str = ., fixed = "‘")) %>%
  .[[1]] %>%
  purrr::map_df(~ tibble::tibble(Date = .[1], Price = .[2], Change = .[3])) %>%
  dplyr::glimpse()
#> Observations: 61
#> Variables: 3
#> $ Date   <chr> "10/3/2013", "10/5/2013", "10/4/2013", "10/5/2013", "10...
#> $ Price  <chr> "11232", "11294", "11553", "11535", "114.42", "11495", ...
#> $ Change <chr> " -0.12", " -0.44", " -0.47", " -0.57", " -0.51", " -0....

Created on 2019-01-03 by the reprex package (v0.2.1)
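The columns in the glimpse output above are still character strings, and OCR dropped the decimal point from several prices ("11232" where the first attempt read 112.32). A rough sketch of a cleanup step in base R follows, using values copied from that output. It rests on a loud assumption: every price has exactly two decimals, so a value without a point can be divided by 100. Check this against the image before relying on it.

```r
# values copied from the glimpse() output above
date_chr   <- c("10/3/2013", "10/5/2013", "10/4/2013", "10/5/2013")
price_chr  <- c("11232", "11294", "11553", "11535", "114.42")
change_chr <- c(" -0.12", " -0.44", " -0.47", " -0.57", " -0.51")

# dates are month/day/year
date <- as.Date(date_chr, format = "%m/%d/%Y")

# ASSUMPTION: prices always carry two decimals, so a missing "." means
# the OCR dropped it and the value is 100 times too large
price <- as.numeric(price_chr)
no_point <- !grepl(".", price_chr, fixed = TRUE)
price[no_point] <- price[no_point] / 100

# strip the leading space left over from splitting, then convert
change <- as.numeric(trimws(change_chr))

price
change
```

A safer variant would set the suspicious prices to `NA` instead of rescaling them, so they can be reviewed manually.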

