# Correct Natural Earth Shape Files

There are some tiny errors in the data provided by [Natural Earth
Data](https://www.naturalearthdata.com/), which affected a project I was
working on. I was attempting to import [the
world](https://www.naturalearthdata.com/downloads/10m-cultural-vectors/)
into Elastic, but Elastic has a bug where you can't upload GeoJSON
through the web form, so I had to do it manually, like this:

```bash
$ fields=NAME,ECONOMY,FORMAL_EN,GDP_MD,ISO_A2
$ ogr2ogr -f ElasticSearch  -progress \
    -select $fields \
    -lco NOT_ANALYZED_FIELDS=$fields \
    -lco INDEX_NAME=countries \
    -lco OVERWRITE_INDEX=YES \
    ES:http://localhost:9200 \
    /vsizip/./ne_10m_admin_0_countries.zip/ne_10m_admin_0_countries.shp
```

But ogr2ogr yells at you after processing about 170 countries or so. If
you run the same with the `-skipfailures` option, you'll see every
country gets indexed *except* Egypt! Why?

A look at the JSON output from ogr2ogr (which I will spare you for the
moment) ultimately led me to:

```code
"Self-intersection at or near point [35.621087106,23.139292914]"
```
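If you saved the ogr2ogr output to a file (the `error.json` in this repo is assumed here), a quick way to pull out just the offending coordinates is to grep for that message. This is a minimal sketch: `self_intersections` is a hypothetical helper name, and it assumes the message text appears verbatim as quoted above.

```bash
# Print each distinct "Self-intersection" message found in a saved error file.
# $1 = path to the saved output (e.g. error.json).
# Assumes the message wording matches the one quoted above.
self_intersections() {
    grep -o 'Self-intersection at or near point \[[^]]*]' "$1" | sort -u
}
```

Usage would look like `self_intersections error.json`.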

So, I opened it in QGIS and well...

![img](./oops.png)

Funnily enough, the lines in the middle aren't a problem, just this
one point sitting on the border.

Fortunately, QGIS has a Geometry Checker Plugin, but unfortunately, it's
a bit complicated and was a pain to use. If you don't tune it right, you
end up having to sort through lots of "mistakes" which aren't mistakes.
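As a command-line alternative to the plugin, GDAL's SQLite dialect can ask GEOS which features it considers invalid. This is only a sketch: it assumes your GDAL build includes SpatiaLite (for `ST_IsValid` and `ST_IsValidReason`), `find_invalid` is a hypothetical helper name, and GEOS may not flag exactly the same problems that Elastic complains about.

```bash
# List features whose geometry GEOS considers invalid, with the reason.
# $1 = path to the .shp file, $2 = layer name (the file name minus .shp).
# Assumes GDAL was built with SpatiaLite support.
find_invalid() {
    ogrinfo -q -dialect sqlite \
        -sql "SELECT NAME, ST_IsValidReason(geometry) AS reason
              FROM \"$2\" WHERE ST_IsValid(geometry) = 0" \
        "$1"
}
```

For the unzipped download, that would be `find_invalid ne/ne_10m_admin_0_countries.shp ne_10m_admin_0_countries`.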

For anyone else who might be down this rabbit hole, Egypt is Object ID
161--I promise that will save you time. Or you could just download my
copy of the file here.
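To look at just that feature from the command line, filtering on the `NAME` attribute is a safe bet, since the object ID QGIS shows may not match the feature ID that `ogrinfo` reports. A small sketch, with `show_country` as a hypothetical helper name:

```bash
# Print the attributes and a geometry summary for one country.
# $1 = value of the NAME field, $2 = path to the .shp file.
show_country() {
    ogrinfo -al -geom=SUMMARY -where "NAME = '$1'" "$2"
}
```

For example: `show_country Egypt ne/ne_10m_admin_0_countries.shp`.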

I'm hoping to use this git repo as part of a bug report, once I've read
up on their process for that.

Included here is an [ESRI shape
file](https://www.loc.gov/preservation/digital/formats/fdd/fdd000280.shtml)
in the `ne_10m_admin_0_countries` directory as well as the same output
in GeoJSON, since I think that format is a bit easier to work with.


## Steps to reproduce the bug

1. Download the original file from Natural Earth

```bash
wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_0_countries.zip
```

2. Try to import the file into Elastic with the series of bash
   commands given earlier. Or alternatively, just:

```bash
$ ogr2ogr -f ElasticSearch  -progress \
    -lco NOT_ANALYZED_FIELDS={ALL} \
    -lco INDEX_NAME=countries \
    -lco OVERWRITE_INDEX=YES \
    ES:http://localhost:9200 \
    /vsizip/./ne_10m_admin_0_countries.zip/ne_10m_admin_0_countries.shp
```

3. Observe that you receive an error similar to the one in `error.json`.

As a sanity check, you can re-run the same command without the fancy zip
syntax by manually unzipping:

```bash
$ mkdir -p ne && unzip ne_10m_admin_0_countries.zip -d ne/
$ ogr2ogr -f ElasticSearch  -progress \
    -lco NOT_ANALYZED_FIELDS={ALL} \
    -lco INDEX_NAME=countries \
    -lco OVERWRITE_INDEX=YES \
    ES:http://localhost:9200 \
    ne/ne_10m_admin_0_countries.shp
```

You will get the same error.

### Notes

Oddly enough, converting to other formats *will not* yield the same
error. I suspect the Elastic driver performs some check that the GeoJSON
driver (and maybe others) does not.


```bash
$ ogr2ogr -progress -f GeoJSON test.geojson /vsizip/./ne_10m_admin_0_countries.zip/ne_10m_admin_0_countries.shp
```

That runs just fine.
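Since the GeoJSON conversion goes through without complaint, it is also a convenient place to try an automatic repair instead of fixing the geometry by hand. This is a sketch, not how the files in this repo were produced: it assumes GDAL 3.0 or newer built against GEOS (for `-makevalid`), `repair_to_geojson` is a hypothetical helper name, and whether the repaired output satisfies Elastic is untested here.

```bash
# Convert a shapefile to GeoJSON, asking GDAL to repair invalid geometries.
# $1 = source .shp, $2 = destination .geojson
# Assumes GDAL >= 3.0 with GEOS, which provides -makevalid.
repair_to_geojson() {
    ogr2ogr -f GeoJSON -makevalid "$2" "$1"
}
```

Something like `repair_to_geojson ne/ne_10m_admin_0_countries.shp fixed.geojson` would write the repaired copy, which you could then feed to the Elastic import in place of the original shapefile.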