Image layers with deleted files are not handled correctly

3 months ago

Title. Here's a reproduction Dockerfile:

FROM alpine:latest

RUN echo "test" > /coolfile
RUN echo "exist" > /awesomefile
RUN rm -f /coolfile

Deploy this image on Railway (from ghcr.io; the issue may not occur when the Dockerfile is built on the platform), then SSH into the service and observe that both /coolfile and /awesomefile exist. /coolfile should not exist because it was deleted in a later layer.

The Dockerfile above is pushed to ghcr.io/6ixfalls/railway-test:latest

36 Replies

3 months ago

I noticed this with rabbitmq because the management image deletes a file from the regular image. The file was not deleted.


3 months ago

We use standard BuildKit as far as I know. If you can reproduce this locally with BuildKit, you would need to open a bug report for BuildKit itself.


3 months ago

The issue isn't with the image build; it's with the runtime


3 months ago

it's deleted properly in the layer, but not when the container runs


3 months ago

i presume you're unpacking the layers one by one to create the "final" container fs (or whatever runtime you use does this)
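For reference, the OCI image spec requires exactly that: each layer is applied to the rootfs whiteouts-first. Below is a toy, self-contained sketch of a correct unpacking step (the layer file names are made up for the demo, and it only handles plain `.wh.<name>` markers; real extractors must also handle opaque-directory markers like `.wh..wh..opq`):

```shell
# Toy demonstration of whiteout-aware layer unpacking.
set -eu
work=$(mktemp -d); cd "$work"

# Layer 1 adds two files.
mkdir l1
echo test > l1/coolfile
echo keep > l1/keepfile
tar -cf layer1.tar -C l1 .

# Layer 2 contains only a whiteout marker deleting /coolfile.
mkdir l2
: > l2/.wh.coolfile
tar -cf layer2.tar -C l2 .

mkdir rootfs
for layer in layer1.tar layer2.tar; do
  # First, apply this layer's whiteouts to the rootfs built so far:
  # an entry named .wh.<name> means <name> was deleted in this layer.
  tar -tf "$layer" | while read -r entry; do
    base=$(basename "$entry")
    case "$base" in
      .wh.*) rm -rf "rootfs/$(dirname "$entry")/${base#.wh.}" ;;
    esac
  done
  # Then extract the layer itself, skipping the marker files.
  tar -xf "$layer" -C rootfs --exclude '.wh.*'
done

ls rootfs  # keepfile only; coolfile was whited out
```

If an unpacker skips the whiteout-application step, you get exactly the behavior reported in this thread: the deleted file survives into the final container filesystem.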


3 months ago

Podman. Can you reproduce this with podman locally?


3 months ago

not reproducible with podman


3 months ago

5.7.0


3 months ago

could this be looked into? this is extremely easy to reproduce and has led to issues with the rabbitmq template (and may also be a source of confusion for other users).


3 months ago

you can deploy a service with the image ghcr.io/6ixfalls/railway-test:latest and a start command of sleep infinity - after using railway ssh, you'll notice that both files exist even though one was deleted


3 months ago

this issue does not exist on any other container runtime


3 months ago

I will ticket this once I'm back from winter break, but full disclosure: we would need further reports in order to look into it, and it would at least need to be reproducible with the metal builder.


3 months ago

the builder isn't the issue though, it's the runtime. however, the same behavior likely exists with the metal builder because the final image is the same


3 months ago

The runtime is stock podman.


mioi
PRO

3 months ago

hi, i just wanted to report i'm seeing this same exact behavior and it's pretty frustrating! i'm new to railway and am a big fan. i can't say when this started happening, but i noticed it only today. here's an example Dockerfile that illustrates the behavior i am seeing:

# Minimal reproduction case for Railway overlay filesystem bug
# This demonstrates that files deleted with rm -rf still appear in the running container

FROM alpine:latest

# The base alpine image comes with /etc/apk directory
# Let's try to delete it in the same RUN command where we do other setup
RUN set -eux; \
    rm -rf /etc/apk; \
    ls -la /etc/ || true

# Create test files that we'll try to delete
RUN echo "test" > /should-be-deleted.txt && \
    mkdir -p /should-be-deleted-dir && \
    echo "test" > /should-be-deleted-dir/file.txt

# Now try to delete them in a separate layer
RUN rm -rf /should-be-deleted.txt /should-be-deleted-dir

# Expected result: /should-be-deleted.txt and /should-be-deleted-dir should NOT exist
# Expected result: /etc/apk should NOT exist
# Actual result on Railway: They still exist!

# To test this bug, build and deploy to Railway.
# Then `railway ssh` and check:
# ls -la /should-be-deleted.txt (should give "No such file")
# ls -la /should-be-deleted-dir (should give "No such file")
# ls -la /etc/apk (should give "No such file")

CMD ["sh", "-c", "echo 'Checking if deleted files exist:'; ls -la /should-be-deleted.txt 2>&1; ls -la /should-be-deleted-dir 2>&1; ls -la /etc/apk 2>&1; echo 'If you see the files above, the overlay FS is broken'; echo 'Container staying alive for inspection...'; tail -f /dev/null"]

3 months ago

what version? could this potentially be a bug in whichever version you're using?


3 months ago

I can't disclose the version.


3 months ago

https://github.com/6ixfalls/railway-image-layers
The same issue occurs with the metal builder; I deployed this image and /coolfile exists when it shouldn't.


3 months ago

Have you tried with podman locally?


2 months ago

mhm, not reproducible on this version


2 months ago

still seeing this, still affecting the rabbitmq template


2 months ago

I'm sorry, but we would need further reports in order for us to be able to prioritize looking into this.


mioi
PRO

2 months ago

still happening here too. it's easily reproducible, so it would be great if you all could possibly take a look at this! thanks!


mioi

still happening here too. it's easily reproducible, so it would be great if you all could possibly take a look at this! thanks!

2 months ago

Please also share your reproducible example.


brody

Please also share your reproducible example.

mioi
PRO

2 months ago


2 months ago

Thank you, I'll ticket this.


brody

Thank you, I'll ticket this.

mioi
PRO

2 months ago

Excellent! Thanks, brody!


vedmaka
HOBBY

2 months ago

Linking a related topic, https://station.railway.com/questions/files-that-never-existed-on-the-image-ge-5f629342, where this also reproduces. This is super frustrating, as it's completely non-reproducible on any other platform that runs containers, but is reproducible on the Railway runtime.


2 months ago

This also applies to mv.

With:

FROM alpine:latest

RUN CACHE_BUST=2 
RUN echo "test" > test.txt
RUN ls -la . && stat test.txt
RUN mv test.txt test2.txt
RUN ls -la . && stat test2.txt || echo "gone"

ENTRYPOINT ["sleep", "infinity"]

We get:

PS C:\Development\railway-rm-investigaton> railway ssh
  ✓ Connected to interactive shell                                                                                                                           
/ # ls
bin        etc        lib        mnt        proc       run        srv        test.txt   tmp        var
dev        home       media      opt        root       sbin       sys        test2.txt  usr
/ #

vedmaka
HOBBY

2 months ago

So far it seems the whiteout files are being ignored by the tooling that extracts the image layers on Railway. Here's a demonstration:

FROM debian:bookworm
RUN touch /repro-file
RUN rm -f /repro-file
CMD sleep infinity

docker build -t whiteout-repro .

docker run --rm -it whiteout-repro ls -l /repro-file
ls: cannot access '/repro-file': No such file or directory

docker save whiteout-repro -o repro.tar

mkdir repro && tar -xf repro.tar -C repro

jq -r '.manifests[0].digest' repro/index.json
sha256:5f5246ed027b82fb1bbe00bd016e3b15dec5b5b4d6e0dc218e1af61a05220c5a

jq . repro/blobs/sha256/0cfaa8515320c857097036846c1c0e92846e2bc5452eebccb346fd25ffa38e2c

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:d596c3844275eb4764b9c23547e1723eb4dfe841781b2227356524842cde00ba",
    "size": 1071
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:1029f5ddc0d24726f1cefbb8def7a88f8ec819a1fdc4c05ce523011b4b73c72d",
      "size": 48366072
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:e11c9e95aab1c1fd8e0353abb5a14d36ce3a7d85d3c368834554e20a77836f73",
      "size": 98
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9f67ea6e4b03671446cf49a43744772dbcc605cdd4c9777895a798bd8b346833",
      "size": 80
    }
  ]
}

tar -tf repro/blobs/sha256/9f67ea6e4b03671446cf49a43744772dbcc605cdd4c9777895a798bd8b346833 | grep '\.wh\.'

.wh.repro-file

So the .wh.repro-file whiteout entry is there, but when the image is deployed to Railway it's ignored, and repro-file can be found in the container FS.

AFAIK this was not happening earlier; at least a few months ago this was not the case. So if any upgrades or changes were made to the container tooling at Railway, that should be the first thing to investigate for this bug.

It would be great for this to be escalated with high priority, as it can easily break many images deployed to Railway.
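Building on the manual jq/tar inspection above, here's a small helper (the function name is made up) that lists the whiteout entries in every layer blob of an extracted image layout, so any suspect image can be checked the same way:

```shell
# scan_whiteouts: given a directory produced by
# `mkdir repro && tar -xf repro.tar -C repro`, list every whiteout
# entry found in each blob (non-tar blobs like configs are skipped).
scan_whiteouts() {
  for blob in "$1"/blobs/sha256/*; do
    tar -tf "$blob" 2>/dev/null | grep '\.wh\.' | sed "s|^|${blob##*/}: |" || true
  done
}

scan_whiteouts repro
```

This simply greps every blob's tar listing rather than walking the manifest, which is enough for a quick check; the pattern also catches opaque-directory markers (`.wh..wh..opq`).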


Railway
BOT

2 months ago

Hello!

We've escalated your issue to our engineering team.

We aim to provide an update within 1 business day.

Please reply to this thread if you have any questions!

Status changed to Awaiting User Response Railway about 2 months ago


Railway
BOT

2 months ago

🛠️ The ticket Build caching issue has been marked as todo.


a month ago

any update on this? reproduction steps still show this bug


a month ago

looks like the railway bot messages weren't forwarded


vedmaka
HOBBY

a month ago

I am also looking for any update on this


Status changed to Awaiting Railway Response Railway about 1 month ago


mioi
PRO

17 days ago

Any updates on this? I can confirm the issue is still there.


16 days ago

Seems like this is still an issue. For reference, we were deploying Six's RabbitMQ template, and it caused some issues with missing analytics due to this problem.

