Skip to content
  1. Jan 21, 2016
    • Filipe Manana's avatar
      Btrfs: fix race when listing an inode's xattrs · c0cdea62
      Filipe Manana authored
      [ Upstream commit f1cd1f0b
      
       ]
      
      When listing a inode's xattrs we have a time window where we race against
      a concurrent operation for adding a new hard link for our inode that makes
      us not return any xattr to user space. In order for this to happen, the
      first xattr of our inode needs to be at slot 0 of a leaf and the previous
      leaf must still have room for an inode ref (or extref) item, and this can
      happen because an inode's listxattrs callback does not lock the inode's
      i_mutex (nor does the VFS does it for us), but adding a hard link to an
      inode makes the VFS lock the inode's i_mutex before calling the inode's
      link callback.
      
      If we have the following leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
                 slot N - 2         slot N - 1              slot 0
      
      The race illustrated by the following sequence diagram is possible:
      
             CPU 1                                               CPU 2
      
        btrfs_listxattr()
      
          searches for key (257 XATTR_ITEM 0)
      
          gets path with path->nodes[0] == leaf X
          and path->slots[0] == N
      
          because path->slots[0] is >=
          btrfs_header_nritems(leaf X), it calls
          btrfs_next_leaf()
      
          btrfs_next_leaf()
            releases the path
      
                                                         adds key (257 INODE_REF 666)
                                                         to the end of leaf X (slot N),
                                                         and leaf X now has N + 1 items
      
            searches for the key (257 INODE_REF 256),
            with path->keep_locks == 1, because that
            is the last key it saw in leaf X before
            releasing the path
      
            ends up at leaf X again and it verifies
            that the key (257 INODE_REF 256) is no
            longer the last key in leaf X, so it
            returns with path->nodes[0] == leaf X
            and path->slots[0] == N, pointing to
            the new item with key (257 INODE_REF 666)
      
          btrfs_listxattr's loop iteration sees that
          the type of the key pointed by the path is
          different from the type BTRFS_XATTR_ITEM_KEY
          and so it breaks the loop and stops looking
          for more xattr items
            --> the application doesn't get any xattr
                listed for our inode
      
      So fix this by breaking the loop only if the key's type is greater than
      BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      c0cdea62
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 938165ad
      Filipe Manana authored
      [ Upstream commit 1d512cb7
      
       ]
      
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      938165ad
    • Filipe Manana's avatar
      Btrfs: fix race leading to incorrect item deletion when dropping extents · c03844ae
      Filipe Manana authored
      [ Upstream commit aeafbf84
      
       ]
      
      While running a stress test I got the following warning triggered:
      
        [191627.672810] ------------[ cut here ]------------
        [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
        (...)
        [191627.701485] Call Trace:
        [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
        [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
        [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
        [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
        [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
        [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
        [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
        [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
        [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
        [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
        [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
        [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
        [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
        [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
      
        $ cat -n fs/btrfs/file.c
        691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
        (...)
        758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
        759                  if (key.objectid > ino ||
        760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
        761                          break;
        762
        763                  fi = btrfs_item_ptr(leaf, path->slots[0],
        764                                      struct btrfs_file_extent_item);
        765                  extent_type = btrfs_file_extent_type(leaf, fi);
        766
        767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
        768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
        (...)
        774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        (...)
        778                  } else {
        779                          WARN_ON(1);
        780                          extent_end = search_start;
        781                  }
        (...)
      
      This happened because the item we were processing did not match a file
      extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
      case we cast the item to a struct btrfs_file_extent_item pointer and
      then find a type field value that does not match any of the expected
      values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
      due to a tiny time window where a race can happen as exemplified below.
      For example, consider the following scenario where we're using the
      NO_HOLES feature and we have the following two neighbour leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
      [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                slot N - 2         slot N - 1              slot 0
      
      Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
      than explicit because NO_HOLES is enabled). Now if our inode has an
      ordered extent for the range [4K, 8K[ that is finishing, the following
      can happen:
      
                CPU 1                                       CPU 2
      
        btrfs_finish_ordered_io()
          insert_reserved_file_extent()
            __btrfs_drop_extents()
               Searches for the key
                (257 EXTENT_DATA 4096) through
                btrfs_lookup_file_extent()
      
               Key not found and we get a path where
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
               Because path->slots[0] is >=
               btrfs_header_nritems(leaf X), we call
               btrfs_next_leaf()
      
               btrfs_next_leaf() releases the path
      
                                                        inserts key
                                                        (257 INODE_REF 4096)
                                                        at the end of leaf X,
                                                        leaf X now has N + 1 keys,
                                                        and the new key is at
                                                        slot N
      
               btrfs_next_leaf() searches for
               key (257 INODE_REF 256), with
               path->keep_locks set to 1,
               because it was the last key it
               saw in leaf X
      
                 finds it in leaf X again and
                 notices it's no longer the last
                 key of the leaf, so it returns 0
                 with path->nodes[0] == leaf X and
                 path->slots[0] == N (which is now
                 < btrfs_header_nritems(leaf X)),
                 pointing to the new key
                 (257 INODE_REF 4096)
      
               __btrfs_drop_extents() casts the
               item at path->nodes[0], slot
               path->slots[0], to a struct
               btrfs_file_extent_item - it does
               not skip keys for the target
               inode with a type less than
               BTRFS_EXTENT_DATA_KEY
               (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
      
               sees a bogus value for the type
               field triggering the WARN_ON in
               the trace shown above, and sets
               extent_end = search_start (4096)
      
               does the if-then-else logic to
               fixup 0 length extent items created
               by a past bug from hole punching:
      
                 if (extent_end == key.offset &&
                     extent_end >= search_start)
                     goto delete_extent_item;
      
               that evaluates to true and it ends
               up deleting the key pointed to by
               path->slots[0], (257 INODE_REF 4096),
               from leaf X
      
      The same could happen for example for a xattr that ends up having a key
      with an offset value that matches search_start (very unlikely but not
      impossible).
      
      So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
      skipped, never casted to struct btrfs_file_extent_item and never deleted
      by accident. Also protect against the unexpected case of getting a key
      for a lower inode number by skipping that key and issuing a warning.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      c03844ae
    • Filipe Manana's avatar
      Btrfs: fix file corruption and data loss after cloning inline extents · d7ca88a4
      Filipe Manana authored
      [ Upstream commit 8039d87d
      
       ]
      
      Currently the clone ioctl allows to clone an inline extent from one file
      to another that already has other (non-inlined) extents. This is a problem
      because btrfs is not designed to deal with files having inline and regular
      extents, if a file has an inline extent then it must be the only extent
      in the file and must start at file offset 0. Having a file with an inline
      extent followed by regular extents results in EIO errors when doing reads
      or writes against the first 4K of the file.
      
      Also, the clone ioctl allows one to lose data if the source file consists
      of a single inline extent, with a size of N bytes, and the destination
      file consists of a single inline extent with a size of M bytes, where we
      have M > N. In this case the clone operation removes the inline extent
      from the destination file and then copies the inline extent from the
      source file into the destination file - we lose the M - N bytes from the
      destination file, a read operation will get the value 0x00 for any bytes
      in the the range [N, M] (the destination inode's i_size remained as M,
      that's why we can read past N bytes).
      
      So fix this by not allowing such destructive operations to happen and
      return errno EOPNOTSUPP to user space.
      
      Currently the fstest btrfs/035 tests the data loss case but it totally
      ignores this - i.e. expects the operation to succeed and does not check
      the we got data loss.
      
      The following test case for fstests exercises all these cases that result
      in file corruption and data loss:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
        _require_btrfs_fs_feature "no_holes"
        _require_btrfs_mkfs_feature "no-holes"
      
        rm -f $seqres.full
      
        test_cloning_inline_extents()
        {
            local mkfs_opts=$1
            local mount_opts=$2
      
            _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # File bar, the source for all the following clone operations, consists
            # of a single inline extent (50 bytes).
            $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
                | _filter_xfs_io
      
            # Test cloning into a file with an extent (non-inlined) where the
            # destination offset overlaps that extent. It should not be possible to
            # clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo data after clone operation:"
            # All bytes should have the value 0xaa (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo
            $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a hole in its
            # first 4K followed by a non-inlined extent. It should not be possible
            # as well to clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo2 data after clone operation:"
            # All bytes should have the value 0x00 (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo2
            $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a size of zero
            # but has a prealloc extent. It should not be possible as well to clone
            # the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "First 50 bytes of foo3 after clone operation:"
            # Should not be able to read any bytes, file has 0 bytes i_size (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo3
            $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size not greater than the size of
            # bar's inline extent (40 < 50).
            # It should be possible to do the extent cloning from bar to this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
      
            # Doing IO against any range in the first 4K of the file should work.
            echo "File foo4 data after clone operation:"
            # Must match file bar's content.
            od -t x1 $SCRATCH_MNT/foo4
            $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size greater than the size of bar's
            # inline extent (60 > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
      
            # Reading the file should not fail.
            echo "File foo5 data after clone operation:"
            # Must have a size of 60 bytes, with all bytes having a value of 0x03
            # (the clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo5
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size greater than bar's inline extent (16K > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
      
            # Reading the file should not fail.
            echo "File foo6 data after clone operation:"
            # Must have a size of 16K, with all bytes having a value of 0x00 (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo6
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size not greater than bar's inline extent (30 < 50).
            # It should be possible to clone the inline extent from file bar into
            # this file.
            $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
      
            # Reading the file should not fail.
            echo "File foo7 data after clone operation:"
            # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
            od -t x1 $SCRATCH_MNT/foo7
      
            # Test cloning the inline extent against a file which has a size not
            # greater than the size of bar's inline extent (20 < 50) but has
            # a prealloc extent that goes beyond the file's size. It should not be
            # possible to clone the inline extent from bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                            -c "pwrite -S 0x88 0 20" \
                            $SCRATCH_MNT/foo8 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
      
            echo "File foo8 data after clone operation:"
            # Must have a size of 20 bytes, with all bytes having a value of 0x88
            # (the clone operation did not modify our file).
            od -t x1 $SCRATCH_MNT/foo8
      
            _scratch_unmount
        }
      
        echo -e "\nTesting without compression and without the no-holes feature...\n"
        test_cloning_inline_extents
      
        echo -e "\nTesting with compression and without the no-holes feature...\n"
        test_cloning_inline_extents "" "-o compress"
      
        echo -e "\nTesting without compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" ""
      
        echo -e "\nTesting with compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" "-o compress"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d7ca88a4
    • Quentin Casasnovas's avatar
      RDS: fix race condition when sending a message on unbound socket · 96c7b10c
      Quentin Casasnovas authored
      [ Upstream commit 8c7188b2 ]
      
      Sasha's found a NULL pointer dereference in the RDS connection code when
      sending a message to an apparently unbound socket.  The problem is caused
      by the code checking if the socket is bound in rds_sendmsg(), which checks
      the rs_bound_addr field without taking a lock on the socket.  This opens a
      race where rs_bound_addr is temporarily set but where the transport is not
      in rds_bind(), leading to a NULL pointer dereference when trying to
      dereference 'trans' in __rds_conn_create().
      
      Vegard wrote a reproducer for this issue, so kindly ask him to share if
      you're interested.
      
      I cannot reproduce the NULL pointer dereference using Vegard's reproducer
      with this patch, whereas I could without.
      
      Complete earlier incomplete fix to CVE-2015-6937:
      
        74e98eb0
      
       ("RDS: verify the underlying transport exists before creating a connection")
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: stable@vger.kernel.org
      
      Reviewed-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Reviewed-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      96c7b10c
  2. Jan 16, 2016
  3. Dec 15, 2015
  4. Dec 14, 2015
    • Eric Dumazet's avatar
      ipv6: sctp: implement sctp_v6_destroy_sock() · 1ced2353
      Eric Dumazet authored
      [ Upstream commit 602dd62d
      
       ]
      
      Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.
      
      We need to call inet6_destroy_sock() to properly release
      inet6 specific fields.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      1ced2353
    • Konstantin Khlebnikov's avatar
      net/neighbour: fix crash at dumping device-agnostic proxy entries · 18c103ad
      Konstantin Khlebnikov authored
      [ Upstream commit 6adc5fd6
      
       ]
      
      Proxy entries could have null pointer to net-device.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Fixes: 84920c14
      
       ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      18c103ad
    • Eric Dumazet's avatar
      ipv6: add complete rcu protection around np->opt · 46ddb98e
      Eric Dumazet authored
      [ Upstream commit 45f6fad8 ]
      
      This patch addresses multiple problems :
      
      UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
      while socket is not locked : Other threads can change np->opt
      concurrently. Dmitry posted a syzkaller
      (http://github.com/google/syzkaller
      
      ) program desmonstrating
      use-after-free.
      
      Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
      and dccp_v6_request_recv_sock() also need to use RCU protection
      to dereference np->opt once (before calling ipv6_dup_options())
      
      This patch adds full RCU protection to np->opt
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      46ddb98e
    • Michal Kubeček's avatar
      ipv6: distinguish frag queues by device for multicast and link-local packets · e33c9be7
      Michal Kubeček authored
      [ Upstream commit 264640fc ]
      
      If a fragmented multicast packet is received on an ethernet device which
      has an active macvlan on top of it, each fragment is duplicated and
      received both on the underlying device and the macvlan. If some
      fragments for macvlan are processed before the whole packet for the
      underlying device is reassembled, the "overlapping fragments" test in
      ip6_frag_queue() discards the whole fragment queue.
      
      To resolve this, add device ifindex to the search key and require it to
      match reassembling multicast packets and packets to link-local
      addresses.
      
      Note: similar patch has been already submitted by Yoshifuji Hideaki in
      
        http://patchwork.ozlabs.org/patch/220979/
      
      
      
      but got lost and forgotten for some reason.
      
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e33c9be7
    • Aaro Koskinen's avatar
      broadcom: fix PHY_ID_BCM5481 entry in the id table · fbf44baf
      Aaro Koskinen authored
      [ Upstream commit 3c25a860 ]
      
      Commit fcb26ec5 ("broadcom: move all PHY_ID's to header")
      updated broadcom_tbl to use PHY_IDs, but incorrectly replaced 0x0143bca0
      with PHY_ID_BCM5482 (making a duplicate entry, and completely omitting
      the original). Fix that.
      
      Fixes: fcb26ec5
      
       ("broadcom: move all PHY_ID's to header")
      Signed-off-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      fbf44baf
    • Nikolay Aleksandrov's avatar
      net: ip6mr: fix static mfc/dev leaks on table destruction · 357dc4e6
      Nikolay Aleksandrov authored
      [ Upstream commit 4c698046 ]
      
      Similar to ipv4, when destroying an mrt table the static mfc entries and
      the static devices are kept, which leads to devices that can never be
      destroyed (because of refcnt taken) and leaked memory. Make sure that
      everything is cleaned up on netns destruction.
      
      Fixes: 8229efda
      
       ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
      CC: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      357dc4e6
    • Nikolay Aleksandrov's avatar
      net: ipmr: fix static mfc/dev leaks on table destruction · 691e3dcb
      Nikolay Aleksandrov authored
      [ Upstream commit 0e615e96
      
       ]
      
      When destroying an mrt table the static mfc entries and the static
      devices are kept, which leads to devices that can never be destroyed
      (because of refcnt taken) and leaked memory, for example:
      unreferenced object 0xffff880034c144c0 (size 192):
        comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
        hex dump (first 32 bytes):
          98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
          ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
        backtrace:
          [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
          [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
          [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
          [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
          [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
          [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
          [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
          [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Make sure that everything is cleaned on netns destruction.
      
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      691e3dcb
    • Daniel Borkmann's avatar
      net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds · 306c5357
      Daniel Borkmann authored
      [ Upstream commit 6900317f ]
      
      David and HacKurx reported a following/similar size overflow triggered
      in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:
      
      (Already fixed in later grsecurity versions by Brad and PaX Team.)
      
      [ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
                     cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
      [ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
      [ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
      [ 1002.296153]  ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
      [ 1002.296162]  ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
      [ 1002.296169]  ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
      [ 1002.296176] Call Trace:
      [ 1002.296190]  [<ffffffff818129ba>] dump_stack+0x45/0x57
      [ 1002.296200]  [<ffffffff8121f838>] report_size_overflow+0x38/0x60
      [ 1002.296209]  [<ffffffff816a979e>] scm_detach_fds+0x2ce/0x300
      [ 1002.296220]  [<ffffffff81791899>] unix_stream_read_generic+0x609/0x930
      [ 1002.296228]  [<ffffffff81791c9f>] unix_stream_recvmsg+0x4f/0x60
      [ 1002.296236]  [<ffffffff8178dc00>] ? unix_set_peek_off+0x50/0x50
      [ 1002.296243]  [<ffffffff8168fac7>] sock_recvmsg+0x47/0x60
      [ 1002.296248]  [<ffffffff81691522>] ___sys_recvmsg+0xe2/0x1e0
      [ 1002.296257]  [<ffffffff81693496>] __sys_recvmsg+0x46/0x80
      [ 1002.296263]  [<ffffffff816934fc>] SyS_recvmsg+0x2c/0x40
      [ 1002.296271]  [<ffffffff8181a3ab>] entry_SYSCALL_64_fastpath+0x12/0x85
      
      Further investigation showed that this can happen when an *odd* number of
      fds are being passed over AF_UNIX sockets.
      
      In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
      where i is the number of successfully passed fds, differ by 4 bytes due
      to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
      on 64 bit. The padding is used to align subsequent cmsg headers in the
      control buffer.
      
      When the control buffer passed in from the receiver side *lacks* these 4
      bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
      overflow in scm_detach_fds():
      
        int cmlen = CMSG_LEN(i * sizeof(int));  <--- cmlen w/o tail-padding
        err = put_user(SOL_SOCKET, &cm->cmsg_level);
        if (!err)
          err = put_user(SCM_RIGHTS, &cm->cmsg_type);
        if (!err)
          err = put_user(cmlen, &cm->cmsg_len);
        if (!err) {
          cmlen = CMSG_SPACE(i * sizeof(int));  <--- cmlen w/ 4 byte extra tail-padding
          msg->msg_control += cmlen;
          msg->msg_controllen -= cmlen;         <--- iff no tail-padding space here ...
        }                                            ... wrap-around
      
      F.e. it will wrap to a length of 18446744073709551612 bytes in case the
      receiver passed in msg->msg_controllen of 20 bytes, and the sender
      properly transferred 1 fd to the receiver, so that its CMSG_LEN results
      in 20 bytes and CMSG_SPACE in 24 bytes.
      
      In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
      issue in my tests as alignment seems always on 4 byte boundary. Same
      should be in case of native 32 bit, where we end up with 4 byte boundaries
      as well.
      
      In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
      a single fd would mean that on successful return, msg->msg_controllen is
      being set by the kernel to 24 bytes instead, thus more than the input
      buffer advertised. It could f.e. become an issue if such application later
      on zeroes or copies the control buffer based on the returned msg->msg_controllen
      elsewhere.
      
      Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).
      
      Going over the code, it seems like msg->msg_controllen is not being read
      after scm_detach_fds() in scm_recv() anymore by the kernel, good!
      
      Relevant recvmsg() handler are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
      and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
      and ___sys_recvmsg() places the updated length, that is, new msg_control -
      old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
      in the example).
      
      Long time ago, Wei Yongjun fixed something related in commit 1ac70e7a
      ("[NET]: Fix function put_cmsg() which may cause usr application memory
      overflow").
      
      RFC3542, section 20.2. says:
      
        The fields shown as "XX" are possible padding, between the cmsghdr
        structure and the data, and between the data and the next cmsghdr
        structure, if required by the implementation. While sending an
        application may or may not include padding at the end of last
        ancillary data in msg_controllen and implementations must accept both
        as valid. On receiving a portable application must provide space for
        padding at the end of the last ancillary data as implementations may
        copy out the padding at the end of the control message buffer and
        include it in the received msg_controllen. When recvmsg() is called
        if msg_controllen is too small for all the ancillary data items
        including any trailing padding after the last item an implementation
        may set MSG_CTRUNC.
      
      Since we didn't place MSG_CTRUNC for already quite a long time, just do
      the same as in 1ac70e7a
      
       to avoid an overflow.
      
      Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
      error in SCM_RIGHTS code sample"). Some people must have copied this (?),
      thus it got triggered in the wild (reported several times during boot by
      David and HacKurx).
      
      No Fixes tag this time as pre 2002 (that is, pre history tree).
      
      Reported-by: default avatarDavid Sterba <dave@jikos.cz>
      Reported-by: default avatarHacKurx <hackurx@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      306c5357
    • Eric Dumazet's avatar
      tcp: initialize tp->copied_seq in case of cross SYN connection · 7d76c6f7
      Eric Dumazet authored
      [ Upstream commit 142a2e7e ]
      
      Dmitry provided a syzkaller (http://github.com/google/syzkaller
      
      )
      generated program that triggers the WARNING at
      net/ipv4/tcp.c:1729 in tcp_recvmsg() :
      
      WARN_ON(tp->copied_seq != tp->rcv_nxt &&
              !(flags & (MSG_PEEK | MSG_TRUNC)));
      
      His program is specifically attempting a Cross SYN TCP exchange,
      that we support (for the pleasure of hackers ?), but it looks we
      lack proper tcp->copied_seq initialization.
      
      Thanks again Dmitry for your report and testings.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      7d76c6f7
    • Eric Dumazet's avatar
      tcp: fix potential huge kmalloc() calls in TCP_REPAIR · ea5ef9fc
      Eric Dumazet authored
      [ Upstream commit 5d4c9bfb ]
      
      tcp_send_rcvq() is used for re-injecting data into tcp receive queue.
      
      Problems :
      
      - No check against size is performed, allowed user to fool kernel in
        attempting very large memory allocations, eventually triggering
        OOM when memory is fragmented.
      
      - In case of fault during the copy we do not return correct errno.
      
      Lets use alloc_skb_with_frags() to cook optimal skbs.
      
      Fixes: 292e8d8c ("tcp: Move rcvq sending to tcp_input.c")
      Fixes: c0e88ff0
      
       ("tcp: Repair socket queues")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      ea5ef9fc
    • Eric Dumazet's avatar
      tcp: md5: fix lockdep annotation · 7af83230
      Eric Dumazet authored
      [ Upstream commit 1b8e6a01 ]
      
      When a passive TCP is created, we eventually call tcp_md5_do_add()
      with sk pointing to the child. It is not owner by the user yet (we
      will add this socket into listener accept queue a bit later anyway)
      
      But we do own the spinlock, so amend the lockdep annotation to avoid
      following splat :
      
      [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
      [ 8451.090932]
      [ 8451.090932] other info that might help us debug this:
      [ 8451.090932]
      [ 8451.090934]
      [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
      [ 8451.090936] 3 locks held by socket_sockopt_/214795:
      [ 8451.090936]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
      [ 8451.090947]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
      [ 8451.090952]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
      [ 8451.090958]
      [ 8451.090958] stack backtrace:
      [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_
      
      [ 8451.091215] Call Trace:
      [ 8451.091216]  <IRQ>  [<ffffffff856fb29c>] dump_stack+0x55/0x76
      [ 8451.091229]  [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
      [ 8451.091235]  [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
      [ 8451.091239]  [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
      [ 8451.091242]  [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
      [ 8451.091246]  [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
      [ 8451.091249]  [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
      [ 8451.091253]  [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
      [ 8451.091256]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091260]  [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
      [ 8451.091263]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091267]  [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
      [ 8451.091270]  [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
      [ 8451.091273]  [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
      [ 8451.091277]  [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90
      
      Fixes: a8afca03
      
       ("tcp: md5: protects md5sig_info with RCU")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      7af83230
    • Bjørn Mork's avatar
      net: qmi_wwan: add XS Stick W100-2 from 4G Systems · 922e1fd3
      Bjørn Mork authored
      [ Upstream commit 68242a5a
      
       ]
      
      Thomas reports
      "
      4gsystems sells two total different LTE-surfsticks under the same name.
      ..
      The newer version of XS Stick W100 is from "omega"
      ..
      Under windows the driver switches to the same ID, and uses MI03\6 for
      network and MI01\6 for modem.
      ..
      echo "1c9e 9b01" > /sys/bus/usb/drivers/qmi_wwan/new_id
      echo "1c9e 9b01" > /sys/bus/usb-serial/drivers/option1/new_id
      
      T:  Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  4 Spd=480 MxCh= 0
      D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
      P:  Vendor=1c9e ProdID=9b01 Rev=02.32
      S:  Manufacturer=USB Modem
      S:  Product=USB Modem
      S:  SerialNumber=
      C:  #Ifs= 5 Cfg#= 1 Atr=80 MxPwr=500mA
      I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
      I:  If#= 4 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage
      
      Now all important things are there:
      
      wwp0s29f7u2i3 (net), ttyUSB2 (at), cdc-wdm0 (qmi), ttyUSB1 (at)
      
      There is also ttyUSB0, but it is not usable, at least not for at.
      
      The device works well with qmi and ModemManager-NetworkManager.
      "
      
      Reported-by: default avatarThomas Schäfer <tschaefer@t-online.de>
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      922e1fd3
    • Neil Horman's avatar
      snmp: Remove duplicate OUTMCAST stat increment · a3fd7012
      Neil Horman authored
      [ Upstream commit 41033f02
      
       ]
      
      the OUTMCAST stat is double incremented, getting bumped once in the mcast code
      itself, and again in the common ip output path.  Remove the mcast bump, as its
      not needed
      
      Validated by the reporter, with good results
      
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Reported-by: default avatarClaus Jensen <claus.jensen@microsemi.com>
      CC: Claus Jensen <claus.jensen@microsemi.com>
      CC: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      a3fd7012
    • Jason A. Donenfeld's avatar
      ip_tunnel: disable preemption when updating per-cpu tstats · 6cc44e69
      Jason A. Donenfeld authored
      [ Upstream commit b4fe85f9
      
       ]
      
      Drivers like vxlan use the recently introduced
      udp_tunnel_xmit_skb/udp_tunnel6_xmit_skb APIs. udp_tunnel6_xmit_skb
      makes use of ip6tunnel_xmit, and ip6tunnel_xmit, after sending the
      packet, updates the struct stats using the usual
      u64_stats_update_begin/end calls on this_cpu_ptr(dev->tstats).
      udp_tunnel_xmit_skb makes use of iptunnel_xmit, which doesn't touch
      tstats, so drivers like vxlan, immediately after, call
      iptunnel_xmit_stats, which does the same thing - calls
      u64_stats_update_begin/end on this_cpu_ptr(dev->tstats).
      
      While vxlan is probably fine (I don't know?), calling a similar function
      from, say, an unbound workqueue, on a fully preemptable kernel causes
      real issues:
      
      [  188.434537] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u8:0/6
      [  188.435579] caller is debug_smp_processor_id+0x17/0x20
      [  188.435583] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.6 #2
      [  188.435607] Call Trace:
      [  188.435611]  [<ffffffff8234e936>] dump_stack+0x4f/0x7b
      [  188.435615]  [<ffffffff81915f3d>] check_preemption_disabled+0x19d/0x1c0
      [  188.435619]  [<ffffffff81915f77>] debug_smp_processor_id+0x17/0x20
      
      The solution would be to protect the whole
      this_cpu_ptr(dev->tstats)/u64_stats_update_begin/end blocks with
      disabling preemption and then reenabling it.
      
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      6cc44e69
    • lucien's avatar
      sctp: translate host order to network order when setting a hmacid · 84564ff4
      lucien authored
      [ Upstream commit ed5a377d ]
      
      now sctp auth cannot work well when setting a hmacid manually, which
      is caused by that we didn't use the network order for hmacid, so fix
      it by adding the transformation in sctp_auth_ep_set_hmacs.
      
      even we set hmacid with the network order in userspace, it still
      can't work, because of this condition in sctp_auth_ep_set_hmacs():
      
      		if (id > SCTP_AUTH_HMAC_ID_MAX)
      			return -EOPNOTSUPP;
      
      so this wasn't working before and thus it won't break compatibility.
      
      Fixes: 65b07e5d
      
       ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      84564ff4
    • Daniel Borkmann's avatar
      packet: fix tpacket_snd max frame len · 29ac1e1e
      Daniel Borkmann authored
      [ Upstream commit 5cfb4c8d ]
      
      Since it's introduction in commit 69e3c75f ("net: TX_RING and
      packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
      side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
      reserve check should have reserve as 0, but currently, this is
      unconditionally set (in it's original form as dev->hard_header_len).
      
      I think this is not correct since tpacket_fill_skb() would then
      take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
      the extra VLAN_HLEN could be possible in both cases. Presumably, the
      reserve code was copied from packet_snd(), but later on missed the
      check. Make it similar as we have it in packet_snd().
      
      Fixes: 69e3c75f
      
       ("net: TX_RING and packet mmap")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      29ac1e1e
    • Daniel Borkmann's avatar
      packet: infer protocol from ethernet header if unset · e0bc2c24
      Daniel Borkmann authored
      [ Upstream commit c72219b7
      
       ]
      
      In case no struct sockaddr_ll has been passed to packet
      socket's sendmsg() when doing a TX_RING flush run, then
      skb->protocol is set to po->num instead, which is the protocol
      passed via socket(2)/bind(2).
      
      Applications only xmitting can go the path of allocating the
      socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
      TX_RING with sll_protocol of 0. That way, register_prot_hook()
      is neither called on creation nor on bind time, which saves
      cycles when there's no interest in capturing anyway.
      
      That leaves us however with po->num 0 instead and therefore
      the TX_RING flush run sets skb->protocol to 0 as well. Eric
      reported that this leads to problems when using tools like
      trafgen over bonding device. I.e. the bonding's hash function
      could invoke the kernel's flow dissector, which depends on
      skb->protocol being properly set. In the current situation, all
      the traffic is then directed to a single slave.
      
      Fix it up by inferring skb->protocol from the Ethernet header
      when not set and we have ARPHRD_ETHER device type. This is only
      done in case of SOCK_RAW and where we have a dev->hard_header_len
      length. In case of ARPHRD_ETHER devices, this is guaranteed to
      cover ETH_HLEN, and therefore being accessed on the skb after
      the skb_store_bits().
      
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e0bc2c24