INPUT, then OUTPUT!

I'm forgetful, so I'm writing things down

Trying text summarization with {lexRankr}

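I was supposed to work through my pile of unread books over the long holiday, but I haven't made a dent in it. The reason is that I try to read everything seriously, yet apparently only 7 to 11% of a book is actually worth reading in the first place. *1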


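So I'd like to use text summarization to extract the parts of a book that are worth reading. The following article is a very helpful introduction to text summarization.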

qiita.com

LexRank, a graph-based algorithm that takes the extractive approach, is available as an R package, so I'll give it a try.


Summarizing English text

stackoverflow.com

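That post covers the basic usage, but my English is not good enough to judge whether an unfamiliar text has been summarized properly, so I'll try it on a story I know: "Alice's Adventures in Wonderland".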


The text is fetched from Project Gutenberg, the overseas counterpart of Aozora Bunko.

library(dplyr)
library(xml2)
library(rvest)
library(stringr)
library(lexRankr)

html <- xml2::read_html("http://www.gutenberg.org/files/11/11-h/11-h.htm")
text <- html %>% 
  # Extract the body text (one element per <p> paragraph)
  rvest::html_nodes(xpath = "//p") %>% 
  rvest::html_text() %>% 
  # Replace line breaks with a space
  stringr::str_replace_all(pattern = "\r\n", replacement = " ") %>% 
  # Collapse runs of spaces into a single space
  stringr::str_replace_all(pattern = " +", replacement = " ") %>% 
  stringr::str_trim() %>% 
  # Drop empty elements
  .[. != ""]

# Check the result
head(text)
[1] "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’"                                                                                                                                                                                                                                                                                                                                                                                                                                             
[2] "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her."                                                                                                                                                                                                                                                                                                                                                                                                                                                              
[3] "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be late!’ (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."
[4] "In another moment down went Alice after it, never once considering how in the world she was to get out again."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[5] "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
[6] "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled ‘ORANGE MARMALADE’, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it."    

length(text)
[1] 752


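Each element looks rather long for a single sentence (each one is a whole paragraph taken from a p tag), but let's feed them to lexRankr::lexRank() anyway. There are 752 elements in total, so at 10% about 75 sentences might be the right amount, but I'll start with 10.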

top_10 <- lexRankr::lexRank(text,
                            docId = rep(1, length(text)),
                            n = 10,
                            continuous = TRUE)
Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...DONE
Applying LexRank...DONE
Formatting Output...DONE


The result looks like this (only the first 3 rows are shown, since the full output is long).

top_10 %>% 
  head(3) %>% 
  knitr::kable()
| docId | sentenceId | sentence | value |
|:------|:-----------|:---------|------:|
| 1 | 1_1077 | ‘Stupid things!’ Alice began in a loud, indignant voice, but she stopped hastily, for the White Rabbit cried out, ‘Silence in the court!’ and the King put on his spectacles and looked anxiously round, to make out who was talking. | 0.0016609 |
| 1 | 1_798 | ‘I don’t think they play at all fairly,’ Alice began, in rather a complaining tone, ‘and they all quarrel so dreadfully one can’t hear oneself speak—and they don’t seem to have any rules in particular; at least, if there are, nobody attends to them—and you’ve no idea how confusing it is all the things being alive; for instance, there’s the arch I’ve got to go through next walking about at the other end of the ground—and I should have croqueted the Queen’s hedgehog just now, only it ran away when it saw mine coming!’ | 0.0016006 |
| 1 | 1_878 | All the time they were playing the Queen never left off quarrelling with the other players, and shouting ‘Off with his head!’ or ‘Off with her head!’ Those whom she sentenced were taken into custody by the soldiers, who of course had to leave off being arches to do this, so that by the end of half an hour or so there were no arches left, and all the players, except the King, the Queen, and Alice, were in custody and under sentence of execution. | 0.0015953 |


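By default, value is the PageRank score; passing usePageRank = FALSE to lexRankr::lexRank() makes it degree centrality instead. Out of the 752 input elements, the PageRank top 3 jump straight to what feels like the climax, but extracting around 10% and reading the result in ascending sentenceId order might give something closer to a proper summary.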
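One thing to watch when reordering: the sentence IDs are strings such as "1_1077", so sorting them as text would not follow the story. A minimal sketch in base R, using the three IDs and scores from the table above and a hypothetical helper column sentNum:

```r
# The three rows shown above, as a plain data frame.
top <- data.frame(
  sentenceId = c("1_1077", "1_798", "1_878"),
  value = c(0.0016609, 0.0016006, 0.0015953),
  stringsAsFactors = FALSE
)

# Sorted as text, "1_1077" would come before "1_798".
# Pull out the number after the underscore and sort on that instead.
top$sentNum <- as.integer(sub("^.*_", "", top$sentenceId))
in_story_order <- top[order(top$sentNum), ]
in_story_order$sentenceId
```

The same idea should carry over to a larger n: rank by value first, then reorder the selected rows by the numeric part of sentenceId.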


Summarizing Japanese text

Trying the same thing on Japanese text, using "走れメロス" (Run, Melos!) from Aozora Bunko, fails with an error as shown below.

html <- xml2::read_html("https://www.aozora.gr.jp/cards/000035/files/1567_14913.html")
text <- html %>% 
  # Extract the body text
  rvest::html_nodes(xpath = "//div[@class='main_text']") %>% 
  rvest::html_text() %>% 
  # Split into sentences at line breaks and at the Japanese full stop
  stringr::str_split(pattern = "\r\n|。") %>% 
  unlist() %>% 
  stringr::str_trim() %>% 
  # Drop empty elements
  .[. != ""]

head(text)
[1] "メロスは激怒した"                                                        
[2] "必ず、かの邪智暴虐(じゃちぼうぎゃく)の王を除かなければならぬと決意した"
[3] "メロスには政治がわからぬ"                                                
[4] "メロスは、村の牧人である"                                                
[5] "笛を吹き、羊と遊んで暮して来た"                                          
[6] "けれども邪悪に対しては、人一倍に敏感であった"

length(text)
[1] 506

top_10 <- lexRankr::lexRank(text,
                            docId = rep(1, length(text)),
                            n = 10,
                            continuous = TRUE)
Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token,  : 
  token must be at least length 1

In other words, token must have a length of at least 1.


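Looking at what lexRankr::lexRank() does internally:

  1. The text argument is parsed with sentenceTokenParse(), producing a sentence data frame sentDf and a token data frame tokenDf
  2. sentenceSimil() computes the pairwise sentence similarities similDf from the tokens
  3. lexRankFromSimil() computes PageRank or degree centrality from similDf and returns the top n sentences

With Japanese text, tokenDf ends up empty, which is what triggers the error above.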
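To make the three steps concrete, here is a minimal base R sketch of the same idea on toy data. It is not lexRankr's implementation: it assumes whitespace tokenization, uses plain cosine similarity instead of the idf-modified cosine from the LexRank paper, and the function name pagerank and all variable names are made up for illustration.

```r
# Step 1: sentences and their tokens (a real parser would also stem
# and drop stop words).
sentences <- c(
  "the cat sat on the mat",
  "the cat lay on the rug",
  "stock prices fell sharply today"
)
tokens <- strsplit(sentences, " ")

# Step 2: cosine similarity between term-frequency vectors.
vocab <- unique(unlist(tokens))
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
norms <- sqrt(rowSums(tf^2))
simil <- (tf %*% t(tf)) / (norms %o% norms)
diag(simil) <- 0  # ignore self-similarity

# Step 3: PageRank by power iteration on the weighted similarity graph.
pagerank <- function(w, damping = 0.85, iters = 100) {
  n <- nrow(w)
  rs <- rowSums(w)
  p <- w / ifelse(rs > 0, rs, 1)  # row-normalise the edge weights
  p[rs == 0, ] <- 1 / n           # isolated sentences jump uniformly
  r <- rep(1 / n, n)
  for (i in seq_len(iters)) {
    r <- (1 - damping) / n + damping * as.vector(r %*% p)
  }
  r
}

scores <- pagerank(simil)
top <- order(scores, decreasing = TRUE)[1:2]
sentences[sort(top)]  # top sentences, back in original order
```

The two cat sentences share most of their words, so they reinforce each other and outrank the unrelated third sentence. An empty token table leaves step 2 with nothing to compare, which is consistent with the error seen for Japanese input.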


sentDf and tokenDf are simple data frames like the following (the output shown here is for the English text).

sentTokList <- lexRankr::sentenceTokenParse(text, docId = rep(1, length(text)))
sentDf <- sentTokList$sentences
tokenDf <- sentTokList$tokens

glimpse(sentDf)
Observations: 1,254
Variables: 3
$ docId      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…"
$ sentenceId <chr> "1_1", "1_2", "1_3", "1_4", "1_5", "1_6", "1_7", "1_8", "1_9", "1_10", "1_11…"
$ sentence   <chr> "Alice was beginning to get very tired of sitting by her sister on the bank,…"

glimpse(tokenDf)
Observations: 8,398
Variables: 3
$ docId      <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…"
$ sentenceId <chr> "1_1", "1_1", "1_1", "1_1", "1_1", "1_1", "1_1", "1_1", "1_1", "1_1", "1_1",
$ token      <chr> "alic", "begin", "tire", "sit", "sister", "bank", "peep", "book", "sister",


So for Japanese text it should be enough to build a sentence data frame and a token data frame of the same shape and pass them to sentenceSimil(). They can be created as follows.

sentDf <- data.frame(
  docId = "1",
  sentenceId = glue::glue("1_{formatC(n, width=3, flag='0')}", n = seq(1, length(text))),
  sentence = text,
  stringsAsFactors = FALSE
)
sentDf %>% 
  head(n = 3) %>% 
  knitr::kable()
| docId | sentenceId | sentence |
|:------|:-----------|:---------|
| 1 | 1_001 | メロスは激怒した |
| 1 | 1_002 | 必ず、かの邪智暴虐(じゃちぼうぎゃく)の王を除かなければならぬと決意した |
| 1 | 1_003 | メロスには政治がわからぬ |
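The zero-padding in sentenceId (width = 3) matters later, when the selected sentences are sorted by ID to restore story order. A small base R sketch of the difference, with illustrative variable names:

```r
nums <- c(2L, 10L, 100L)

# Unpadded IDs sort as text, so "1_10" and "1_100" come before "1_2".
ids_plain <- paste0("1_", nums)
sort(ids_plain)

# Zero-padded IDs sort in the same order as the underlying numbers.
ids_padded <- sprintf("1_%03d", nums)
sort(ids_padded)
```

A width of 3 is enough here because the text has 506 sentences; a longer text would need a wider pad.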


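tokenDf <- sentDf %>% 
  # Morphological analysis with MeCab; 1 = return base forms (e.g. する)
  RMeCab::RMeCabDF("sentence", 1) %>% 
  purrr::set_names(nm = sentDf$sentenceId) %>% 
  # One row per token, keeping the part of speech in hinshi
  purrr::map2_dfr(.x = ., .y = names(.), .f = function(x, y) {
    tibble::tibble(docId = "1",
                   sentenceId = y,
                   token = x,
                   hinshi = names(x))
  }) %>% 
  dplyr::distinct_all() %>% 
  # Optionally restrict to certain parts of speech (noun, adjective, verb)
  # filter(hinshi %in% c("名詞", "形容詞", "動詞")) %>% 
  dplyr::select(docId, sentenceId, token)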

tokenDf %>% 
  head() %>% 
  knitr::kable()
| docId | sentenceId | token |
|:------|:-----------|:------|
| 1 | 1_001 | メロス |
| 1 | 1_001 | |
| 1 | 1_001 | 激怒 |
| 1 | 1_001 | する |
| 1 | 1_001 | |
| 1 | 1_002 | 必ず |

All that remains is to compute the sentence similarities with sentenceSimil() and take the top n with lexRankFromSimil().

similDf <-
  lexRankr::sentenceSimil(
    sentenceId = tokenDf$sentenceId,
    token = tokenDf$token,
    docId = tokenDf$docId
  )
topNSents <-
  lexRankr::lexRankFromSimil(
    s1 = similDf$sent1,
    s2 = similDf$sent2,
    simil = similDf$similVal,
    n = 10,
    continuous = TRUE
  )
returnDf <- topNSents %>% 
  dplyr::inner_join(sentDf, by = "sentenceId") %>% 
  # This is a novel, so sort by sentence ID to follow the order of the story
  dplyr::arrange(sentenceId) %>% 
  dplyr::select(docId, sentenceId, sentence, value)

returnDf %>% 
  knitr::kable()
| docId | sentenceId | sentence | value |
|:------|:-----------|:---------|------:|
| 1 | 1_011 | この妹は、村の或る律気な一牧人を、近々、花婿(はなむこ)として迎える事になっていた | 0.0027484 |
| 1 | 1_024 | 路で逢った若い衆をつかまえて、何かあったのか、二年まえに此の市に来たときは、夜でも皆が歌をうたって、まちは賑やかであった筈(はず)だが、と質問した | 0.0027436 |
| 1 | 1_098 | ただ、――」と言いかけて、メロスは足もとに視線を落し瞬時ためらい、「ただ、私に情をかけたいつもりなら、処刑までに三日間の日限を与えて下さい | 0.0027991 |
| 1 | 1_177 | 祝宴に列席していた村人たちは、何か不吉なものを感じたが、それでも、めいめい気持を引きたて、狭い家の中で、むんむん蒸し暑いのも怺(こら)え、陽気に歌をうたい、手を拍(う)った | 0.0027699 |
| 1 | 1_241 | あちこちと眺めまわし、また、声を限りに呼びたててみたが、繋舟(けいしゅう)は残らず浪に浚(さら)われて影なく、渡守りの姿も見えない | 0.0028263 |
| 1 | 1_394 | 」ああ、その男、その男のために私は、いまこんなに走っているのだ | 0.0027287 |
| 1 | 1_424 | 」メロスは胸の張り裂ける思いで、赤く大きい夕陽ばかりを見つめていた | 0.0027916 |
| 1 | 1_457 | 」と大声で刑場の群衆にむかって叫んだつもりであったが、喉(のど)がつぶれて嗄(しわが)れた声が幽(かす)かに出たばかり、群衆は、ひとりとして彼の到着に気がつかない | 0.0028022 |
| 1 | 1_462 | 彼を人質にした私は、ここにいる!」と、かすれた声で精一ぱいに叫びながら、ついに磔台に昇り、釣り上げられてゆく友の両足に、齧(かじ)りついた | 0.0028045 |
| 1 | 1_472 | 君が若(も)し私を殴ってくれなかったら、私は君と抱擁する資格さえ無いのだ | 0.0027501 |


Looks pretty plausible, doesn't it!?