INPUT, then OUTPUT!

I'm forgetful, so I'm jotting things down

Converting a GeoJSON file for the elasticsearch Bulk API with R

This is a follow-up to estrellita.hatenablog.com.


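I converted the data to GeoJSON, but it can't be loaded into elasticsearch as is. I could fetch and index the features one line at a time, but every request carries communication overhead, so I convert the file into a format that the Bulk API can ingest.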


With something like purrr I could probably write the nested for loops more neatly, but I still don't really understand how to use it...

Converting a GeoJSON file for the elasticsearch Bulk API
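The R script embedded at this point is not reproduced in this text. Purely as an illustration of the idea, and not as the R code from this post, here is a minimal Python sketch that turns a GeoJSON FeatureCollection into Bulk API lines: one action line, then one feature, each on its own line. The function name `to_bulk_lines` and the tiny sample collection are made up for the example; the index name `towns` and type name `town` are the ones that appear in the output further down.

```python
import json


def to_bulk_lines(feature_collection, index="towns", doc_type="town"):
    """Return Bulk API lines: for each feature, an action line then the feature."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    lines = []
    for feature in feature_collection["features"]:
        lines.append(action)
        # No indentation and compact separators keep each feature on one line,
        # which the Bulk API requires. ensure_ascii=False keeps Japanese readable.
        lines.append(json.dumps(feature, ensure_ascii=False, separators=(",", ":")))
    return lines


# Hypothetical two-feature sample, far smaller than the real town data.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": 0,
         "properties": {"MOJI": "元赤坂2丁目"},
         "geometry": {"type": "Point", "coordinates": [139.7279, 35.6826]}},
        {"type": "Feature", "id": 1,
         "properties": {"MOJI": "北青山1丁目"},
         "geometry": {"type": "Point", "coordinates": [139.7198, 35.6791]}},
    ],
}

bulk_lines = to_bulk_lines(sample)
# The Bulk API expects the body to end with a newline.
bulk_body = "\n".join(bulk_lines) + "\n"
```

Writing `bulk_body` to a file with UTF-8 encoding would give something that can be passed to curl with `--data-binary`.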


The files created under 4bulkapi look like the following.

{"index":{"_index":"towns", "_type":"town"}}
{"type":"Feature","id":0,"properties":{"KEN_NAME":"東京都","GST_NAME":"港区","MOJI":"元赤坂2丁目"},"geometry":{"type":"MultiPolygon","coordinates":[[[[139.7279,35.6826],[139.728,35.6826],[139.7286,35.6825],[139.7289,35.6825],[139.73,35.6824],[139.7306,35.6812],[139.7315,35.6815],[139.7315,35.6816],[139.732,35.6811],[139.7324,35.6808],[139.7326,35.6806],[139.7327,35.6805],[139.7329,35.6803],[139.733,35.6799],[139.7331,35.6798],[139.7332,35.6795],[139.7333,35.6792],[139.7334,35.6791],[139.7335,35.6791],[139.7329,35.6789],[139.7328,35.6789],[139.7327,35.6788],[139.7325,35.6788],[139.7324,35.6787],[139.7323,35.6783],[139.732,35.6777],[139.7321,35.6777],[139.7323,35.6775],[139.7325,35.6773],[139.7325,35.6769],[139.7324,35.6756],[139.7318,35.6753],[139.7317,35.6752],[139.7316,35.6752],[139.7313,35.675],[139.7309,35.6748],[139.7308,35.6748],[139.7298,35.6746],[139.7295,35.6745],[139.7293,35.6745],[139.7293,35.6745],[139.7291,35.6744],[139.7289,35.6743],[139.7287,35.6743],[139.728,35.674],[139.7277,35.6739],[139.7277,35.6739],[139.7274,35.6738],[139.7269,35.6737],[139.7264,35.6735],[139.7256,35.6733],[139.7252,35.6732],[139.7248,35.673],[139.724,35.6728],[139.7237,35.6735],[139.7232,35.6743],[139.7232,35.6744],[139.7229,35.6745],[139.7226,35.6746],[139.7224,35.6747],[139.7222,35.6749],[139.7221,35.6751],[139.7219,35.6754],[139.7217,35.6759],[139.7214,35.6763],[139.7212,35.6767],[139.721,35.6771],[139.7207,35.6776],[139.7205,35.678],[139.7203,35.6783],[139.7201,35.6786],[139.7202,35.6787],[139.72,35.6789],[139.7201,35.6789],[139.7204,35.6791],[139.7207,35.6792],[139.7208,35.679],[139.7211,35.6791],[139.7211,35.6791],[139.7211,35.6789],[139.7216,35.6789],[139.7216,35.6788],[139.7218,35.6788],[139.7218,35.679],[139.7224,35.679],[139.7227,35.679],[139.7233,35.6789],[139.7236,35.6789],[139.7237,35.6789],[139.7239,35.6789],[139.724,35.6787],[139.7243,35.6788],[139.7249,35.6792],[139.7251,35.6793],[139.7255,35.6794],[139.7258,35.6794],[139.726,35.6795],[139.7263,35.6797],[139.7264,35.6798],[139.7265,35.6799],[139.7266,35.6801],[139.7268,35.6805],[139.7271,35.6812],[139.7275,35.6819],[139.7277,35.6822],[139.7279,35.6826]]]]}}
{"index":{"_index":"towns", "_type":"town"}}
{"type":"Feature","id":1,"properties":{"KEN_NAME":"東京都","GST_NAME":"港区","MOJI":"北青山1丁目"},"geometry":{"type":"MultiPolygon","coordinates":[[[[139.7198,35.6791],[139.7201,35.6789],[139.72,35.6789],[139.7202,35.6787],[139.7201,35.6786],[139.7203,35.6783],[139.7205,35.678],[139.7207,35.6776],[139.721,35.6771],[139.7212,35.6767],[139.7214,35.6763],[139.7217,35.6759],[139.7219,35.6754],[139.7221,35.6751],[139.7222,35.6749],[139.7224,35.6747],[139.7226,35.6746],[139.7229,35.6745],[139.7232,35.6744],[139.7232,35.6743],[139.7237,35.6735],[139.724,35.6728],[139.7234,35.6726],[139.7232,35.6725],[139.7224,35.6722],[139.7223,35.6722],[139.7216,35.672],[139.7215,35.6719],[139.7214,35.6719],[139.7211,35.6723],[139.721,35.6724],[139.7208,35.6724],[139.7195,35.675],[139.7199,35.6753],[139.7199,35.6753],[139.7197,35.6759],[139.7195,35.6767],[139.7197,35.6768],[139.7196,35.677],[139.7194,35.6776],[139.7193,35.678],[139.7192,35.6783],[139.7191,35.6784],[139.719,35.6786],[139.719,35.6786],[139.7192,35.6786],[139.7197,35.6785],[139.7198,35.6791]]]]}}
{"index":{"_index":"towns", "_type":"town"}}
{"type":"Feature","id":2,"properties":{"KEN_NAME":"東京都","GST_NAME":"港区","MOJI":"元赤坂1丁目"},"geometry":{"type":"MultiPolygon","coordinates":[[[[139.737,35.6785],[139.737,35.6785],[139.7371,35.6781],[139.7371,35.678],[139.7363,35.678],[139.7362,35.6779],[139.7358,35.6776],[139.7352,35.6773],[139.735,35.6771],[139.7342,35.6767],[139.7336,35.6764],[139.7327,35.6758],[139.7324,35.6757],[139.7324,35.6756],[139.7325,35.6769],[139.7325,35.6773],[139.7323,35.6775],[139.7321,35.6777],[139.732,35.6777],[139.7323,35.6783],[139.7324,35.6787],[139.7325,35.6788],[139.7327,35.6788],[139.7328,35.6789],[139.7329,35.6789],[139.7335,35.6791],[139.7336,35.679],[139.734,35.6789],[139.7344,35.6789],[139.7347,35.6788],[139.7351,35.6788],[139.7354,35.6788],[139.736,35.6788],[139.7363,35.6788],[139.7369,35.6789],[139.7369,35.6787],[139.737,35.6785]]]]}}
{"index":{"_index":"towns", "_type":"town"}}
...
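Since the Bulk API reads this format line by line, every action line and every document has to sit on exactly one line. As a rough sanity check before loading, something like the following Python sketch could be used. The function name `check_bulk_lines` is hypothetical, and it only knows about `index` actions because those are the only ones this post uses.

```python
import json


def check_bulk_lines(lines):
    """Return a list of problems found in a sequence of Bulk API lines."""
    problems = []
    if len(lines) % 2 != 0:
        problems.append("odd number of lines: action and source lines must come in pairs")
    for number, line in enumerate(lines, start=1):
        try:
            parsed = json.loads(line)
        except ValueError:
            # A document split across two lines ends up here.
            problems.append(f"line {number}: not a complete JSON document")
            continue
        # Odd-numbered lines are action lines and should look like {"index": {...}}.
        if number % 2 == 1 and not (isinstance(parsed, dict) and "index" in parsed):
            problems.append(f"line {number}: expected an index action line")
    return problems


good = ['{"index":{"_index":"towns", "_type":"town"}}', '{"type":"Feature","id":0}']
# Same content, but the document is accidentally broken across two lines.
broken = [good[0], '{"type":"Feature",', '"id":0}']

good_problems = check_bulk_lines(good)
broken_problems = check_bulk_lines(broken)
```

With a real file, the lines could come from `open(path, encoding="utf-8").read().splitlines()`.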


To load a file through the Bulk API, run the following.

$ curl -XPUT 'http://localhost:9200/towns/town/_bulk' --data-binary @h22ka13103.json
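For reference, the same request could be put together from a script. The Python sketch below only builds the request and does not send it. The function name `build_bulk_request` is hypothetical, the URL is the one from the curl command above, and the `Content-Type` header is my own addition that is not part of that command.

```python
import urllib.request


def build_bulk_request(bulk_body, url="http://localhost:9200/towns/town/_bulk"):
    """Build (but do not send) an HTTP request carrying a Bulk API body."""
    if not bulk_body.endswith("\n"):
        bulk_body += "\n"  # the last line of a bulk body must end with a newline
    return urllib.request.Request(
        url,
        data=bulk_body.encode("utf-8"),  # raw bytes, like curl --data-binary
        method="PUT",
        headers={"Content-Type": "application/x-ndjson"},  # assumption, see above
    )


# Tiny placeholder body: one action line and one document.
request = build_bulk_request('{"index":{}}\n{"type":"Feature"}')
# To actually send it against a running node:
#   urllib.request.urlopen(request).read()
```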


As a test, let's reverse-geocode the latitude and longitude of Tokyo Tower.

$ curl -XPOST 'http://localhost:9200/towns/town/_search' -d '{
>   "query": {
>     "filtered" : {
>       "query" : {
>         "match_all" : {}
>       },
>       "filter" : {
>         "geo_shape": {
>           "town.geometry": {
>             "shape": {
>               "type" : "envelope",
>               "coordinates" : [[139.745433, 35.658581], [139.745433, 35.658581]] 
>             }
>           }
>         }
>       }
>     }
>   }
> }'
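The query above treats a single point as an envelope whose two corners are the same coordinate, and the coordinates are given in GeoJSON order (longitude first, then latitude). A small Python sketch that builds the same request body for any point might look like this. The function name `point_query` is hypothetical, and the `filtered` query it produces matches the query DSL used in this post, which newer elasticsearch versions no longer accept.

```python
import json


def point_query(lon, lat, field="town.geometry"):
    """Build the geo_shape search body for one point, as a zero-size envelope."""
    corner = [lon, lat]  # GeoJSON order: longitude first, then latitude
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "geo_shape": {
                        field: {
                            "shape": {
                                "type": "envelope",
                                "coordinates": [corner, corner],
                            }
                        }
                    }
                },
            }
        }
    }


# Tokyo Tower, using the coordinates from the curl example.
body = json.dumps(point_query(139.745433, 35.658581))
```

The resulting `body` string is what would go after `-d` in the curl command.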


The result is shown below. The correct address is "東京都港区芝公園4丁目2-8", and the reverse geocoding gets it right down to the chome (district block) level.

{"took":7,"timed_out":false,"_shards":{"total":5, "successful":5, "failed":0},
  "hits":{"total":1, "max_score":1.0, "hits":[{
      "_index":"towns",
      "_type":"town",
      "_id":"AVKafQkU3G81jtSmYSOa",
      "_score":1.0,"_source":{
        "type":"Feature","id":56,"properties":{
          "KEN_NAME":"東京都",
          "GST_NAME":"港区",
          "MOJI":"芝公園4丁目"
        },
       "geometry":{
         "type":"MultiPolygon",
         "coordinates":[[[
           [139.7433,35.6599],
           [139.7434,35.6599],
(rest omitted)



When I looked at the log more carefully, the following error had occurred several times, and the affected documents were not indexed.

MapperParsingException[failed to parse [geometry]]; nested: InvalidShapeException[Self-intersection at or near point 


Also, with elasticsearch 2.1, the following error occurs and almost nothing gets indexed.

MapperParsingException[failed to parse [catchment_mpoly]]; nested: InvalidShapeException[Provided shape has duplicate consecutive coordinates at: 

If you know of a solution to these problems, please let me know...
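For the duplicate-coordinates error, one thing that might be worth trying is removing repeated consecutive points before building the bulk file. The sample output earlier in this post does contain such repeats, for example [139.7293,35.6745] twice in a row in the first feature. The Python sketch below is only an idea and has not been verified against elasticsearch 2.1. The function names are hypothetical, it does nothing about the self-intersection error, and a ring that shrinks below four points after cleaning would need separate handling.

```python
def drop_consecutive_duplicates(ring):
    """Return the ring without points that repeat the point just before them."""
    cleaned = []
    for point in ring:
        if not cleaned or point != cleaned[-1]:
            cleaned.append(point)
    return cleaned


def clean_multipolygon(geometry):
    """Apply drop_consecutive_duplicates to every ring of a MultiPolygon."""
    return {
        "type": geometry["type"],
        "coordinates": [
            [drop_consecutive_duplicates(ring) for ring in polygon]
            for polygon in geometry["coordinates"]
        ],
    }


# Hypothetical ring with one repeated point; the closing point equals the
# first point but is not adjacent to it, so it is kept.
ring = [[139.7293, 35.6745], [139.7293, 35.6745], [139.7291, 35.6744], [139.7293, 35.6745]]
cleaned = drop_consecutive_duplicates(ring)

geometry = {"type": "MultiPolygon", "coordinates": [[ring]]}
cleaned_geometry = clean_multipolygon(geometry)
```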